Bibliographic record no. 433102
Automatic Diacritics Restoration in Croatian Texts
Automatic Diacritics Restoration in Croatian Texts // The Future of Information Sciences, Digital Resources and Knowledge Sharing / Stančić, Hrvoje ; Seljan, Sanja ; Bawden, David ; Lasić-Lazić, Jadranka ; Slavić, Aida (eds.).
Zagreb: Odsjek za informacijske i komunikacijske znanosti Filozofskog fakulteta Sveučilišta u Zagrebu, 2009, pp. 309-318 (lecture, international peer review, full paper (in extenso), scientific)
CROSBI ID: 433102
Title
Automatic Diacritics Restoration in Croatian Texts
Authors
Šantić, Nikola ; Šnajder, Jan ; Dalbelo Bašić, Bojana
Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific
Source
The Future of Information Sciences, Digital Resources and Knowledge Sharing
/ Stančić, Hrvoje ; Seljan, Sanja ; Bawden, David ; Lasić-Lazić, Jadranka ; Slavić, Aida - Zagreb : Odsjek za informacijske i komunikacijske znanosti Filozofskog fakulteta Sveučilišta u Zagrebu, 2009, 309-318
ISBN
978-953-175-355-5
Conference
2nd International Conference The Future of Information Sciences (INFuture 2009)
Place and date
Zagreb, Croatia, 4-6 November 2009
Type of participation
Lecture
Type of review
International peer review
Keywords
natural language processing; diacritics restoration; statistical language
Abstract
The absence of diacritics in digitally encoded text is a common problem for languages whose writing systems are not covered by the standard ASCII character set. It is a deterioration of language in its own right, but also a serious impediment to automated text processing and information retrieval. Restoration of diacritics, if performed manually, is a tedious and time-consuming process. In this paper we describe a robust system for automatic diacritics restoration in Croatian texts. The system combines dictionary look-up and statistical language modelling. Diacritics restoration is evaluated on a corpus of newspaper articles and discussion forum posts. Our experiments show that high levels of accuracy can be achieved with fairly simple and computationally inexpensive methods.
Original language
English
Scientific fields
Computing
RELATED PROJECTS AND INSTITUTIONS
Projects:
036-1300646-1986 - Otkrivanje znanja u tekstnim podacima [Knowledge Discovery in Textual Data] (Dalbelo-Bašić, Bojana, MZO) (CroRIS)
Institutions:
Fakultet elektrotehnike i računarstva, Zagreb
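
Illustrative note: the abstract above says the system combines dictionary look-up with statistical language modelling, but this record gives no further detail. The following Python sketch shows one simple way such a combination could work; it is not the authors' implementation. It assumes pre-tokenized, lower-cased input, a greedy left-to-right bigram choice, and a simplified mapping in which "đ" is stripped to "d" (real texts often use "dj"). All function names and the toy training sentences are hypothetical.

```python
from collections import Counter, defaultdict

# Croatian letters with diacritics mapped to ASCII fallbacks (simplified: "đ" -> "d").
STRIP = str.maketrans({"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "d",
                       "Č": "C", "Ć": "C", "Š": "S", "Ž": "Z", "Đ": "D"})


def strip_diacritics(text):
    """Return the text with Croatian diacritics removed."""
    return text.translate(STRIP)


def train(sentences):
    """Build a candidate dictionary and n-gram counts from diacritized sentences.

    `sentences` is a list of token lists that already carry correct diacritics.
    """
    candidates = defaultdict(set)  # stripped form -> set of diacritized forms
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        prev = "<s>"  # sentence-start marker
        for tok in tokens:
            candidates[strip_diacritics(tok)].add(tok)
            unigrams[tok] += 1
            bigrams[(prev, tok)] += 1
            prev = tok
    return candidates, unigrams, bigrams


def restore(tokens, candidates, unigrams, bigrams):
    """Greedily restore diacritics, one token at a time, left to right."""
    restored = []
    prev = "<s>"
    for tok in tokens:
        # Dictionary look-up; words not in the dictionary are left unchanged.
        options = candidates.get(tok, {tok})
        # Prefer the form seen most often after the previous word,
        # breaking ties by overall frequency.
        best = max(sorted(options),
                   key=lambda c: (bigrams[(prev, c)], unigrams[c]))
        restored.append(best)
        prev = best
    return restored


# Toy training data (hypothetical), just enough to make "sto"/"što" ambiguous.
TOY_CORPUS = [
    ["kuća", "je", "velika"],
    ["sto", "ljudi"],
    ["što", "je", "to"],
    ["što", "radiš"],
]
```

With the toy corpus, `restore(["sto", "je", "to"], *train(TOY_CORPUS))` resolves the ambiguous "sto" from the bigram counts, while unambiguous words such as "kuca" are resolved by look-up alone. A fuller system would typically search over whole sentences and smooth the counts rather than choose greedily.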