Automatic Diacritics Restoration in Croatian Texts (CROSBI ID 556323)
Conference contribution in proceedings | original scientific paper | international peer review
Responsibility details
Šantić, Nikola ; Šnajder, Jan ; Dalbelo Bašić, Bojana
English
Automatic Diacritics Restoration in Croatian Texts
The absence of diacritics in digitally encoded text is a common problem for languages whose writing systems are not covered by the standard ASCII character set. It is a deterioration of language in its own right, but also a serious impediment to automated text processing and information retrieval. Restoration of diacritics, if performed manually, is a tedious and time-consuming process. In this paper we describe a robust system for automatic diacritics restoration in Croatian texts. The system combines dictionary look-up and statistical language modelling. Diacritics restoration is evaluated on a corpus of newspaper articles and discussion forum posts. Our experiments show that high levels of accuracy can be achieved with fairly simple and computationally inexpensive methods.
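The abstract names two components, dictionary look-up and statistical language modelling, without detailing how they are combined. The following is only a minimal sketch of one way such a combination could work, not the authors' system. It assumes a small diacritized training corpus, a greedy left-to-right bigram choice, and a one-to-one letter mapping (so "đ" is stripped to "d", not "dj"). The names `strip_diacritics` and `Restorer` are hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: each Croatian diacritic letter maps to a single ASCII letter.
STRIP = str.maketrans("čćšžđČĆŠŽĐ", "ccszdCCSZD")


def strip_diacritics(word: str) -> str:
    """Return the word with Croatian diacritics replaced by ASCII letters."""
    return word.translate(STRIP)


class Restorer:
    """Toy restorer: a dictionary of candidates plus bigram/unigram counts."""

    def __init__(self, corpus_sentences):
        # Dictionary: stripped form -> set of diacritized forms seen in training.
        self.candidates = defaultdict(set)
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus_sentences:
            prev = "<s>"  # sentence-start marker
            for tok in sentence.lower().split():
                self.candidates[strip_diacritics(tok)].add(tok)
                self.unigrams[tok] += 1
                self.bigrams[(prev, tok)] += 1
                prev = tok

    def restore(self, text: str) -> str:
        """Greedily pick, for each token, the candidate that best fits the left context."""
        out = []
        prev = "<s>"
        for tok in text.lower().split():
            options = self.candidates.get(strip_diacritics(tok))
            if not options:
                best = tok  # unknown word: leave it unchanged
            else:
                # Prefer the higher bigram count, then the higher unigram count.
                best = max(
                    sorted(options),
                    key=lambda w: (self.bigrams[(prev, w)], self.unigrams[w]),
                )
            out.append(best)
            prev = best
        return " ".join(out)


corpus = ["velika kuća", "ona kuca na vrata", "velika kuća je stara"]
restorer = Restorer(corpus)
print(restorer.restore("velika kuca"))  # "kuća" (house) follows "velika" in the corpus
print(restorer.restore("ona kuca"))     # "kuca" (knocks) follows "ona" in the corpus
```

The sketch lowercases its input and uses raw counts without smoothing. A practical system would need to preserve case, handle punctuation, and back off more carefully for unseen contexts.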
natural language processing; diacritics restoration; statistical language
Contribution details
309-318.
2009.
published
Host publication details
Stančić, Hrvoje ; Seljan, Sanja ; Bawden, David ; Lasić-Lazić, Jadranka ; Slavić, Aida
Zagreb: Odsjek za informacijske i komunikacijske znanosti Filozofskog fakulteta Sveučilišta u Zagrebu
978-953-175-355-5
Conference details
2nd International Conference The Future of Information Sciences INFuture2009
lecture
04.11.2009-06.11.2009
Zagreb, Croatia