Bibliographic record number: 486604
Correcting Word Merge Errors in Croatian Texts
Correcting Word Merge Errors in Croatian Texts // Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages / Tadić, Marko ; Dimitrova-Vulchanova, Mila ; Koeva, Svetla (eds.).
Zagreb: Hrvatsko društvo za jezične tehnologije, 2010. pp. 67-75 (lecture, international peer review, full paper (in extenso), scientific)
CROSBI ID: 486604
Title
Correcting Word Merge Errors in Croatian Texts
Authors
Mikša, Mladen ; Šnajder, Jan ; Dalbelo Bašić, Bojana
Type, subtype, and category of work
Papers in conference proceedings, full paper (in extenso), scientific
Source
Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages
/ Tadić, Marko ; Dimitrova-Vulchanova, Mila ; Koeva, Svetla - Zagreb : Hrvatsko društvo za jezične tehnologije, 2010, 67-75
ISBN
978-953-55375-2-6
Conference
Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages
Place and date
Dubrovnik, Croatia, 04.10.2010 - 06.10.2010
Type of participation
Lecture
Type of review
International peer review
Keywords
word merge errors; OCR errors; combinatorial optimization; language modeling; natural language processing; Croatian language
Abstract
In many text processing tasks, character-level errors (due to mistyping, OCR, etc.) typically degrade performance. Most approaches to error correction are dictionary-based and cannot correct word boundary errors. Word boundary errors, especially word merge errors, are quite common in OCR-generated texts. In this paper we describe an approach to correcting word merge errors in Croatian texts. The approach is based on combinatorial optimization with a beam search strategy that determines the most plausible segmentation of the input token. The plausibility of a segmentation is assessed using a statistical language model and several heuristics. We evaluate the approach on a sample of artificially generated word merge errors. The results are comparable to those of approaches reported in the literature.
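The segmentation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a unigram model in place of the paper's statistical language model, omits the heuristics, and uses made-up counts. The names `TOY_COUNTS`, `make_unigram_scorer`, and `segment` are hypothetical and introduced only for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical toy counts; a real system would estimate them from a large corpus.
TOY_COUNTS: Dict[str, int] = {"dobar": 5, "dan": 12, "je": 80, "u": 50, "kući": 10}


def make_unigram_scorer(
    counts: Dict[str, int], unknown_penalty: float = -20.0
) -> Callable[[str], float]:
    """Return a function mapping a word to a log-score (assumed unigram model)."""
    total = sum(counts.values())

    def score(word: str) -> float:
        count = counts.get(word.lower(), 0)
        if count > 0:
            return math.log(count / total)
        # Unknown words pay a per-character cost plus a fixed per-word cost,
        # so splitting an unknown string into smaller pieces never helps.
        return unknown_penalty * (1 + len(word))

    return score


def segment(
    token: str,
    score_word: Callable[[str], float],
    beam_width: int = 5,
    max_word_len: int = 20,
) -> List[str]:
    """Split a merged token into words, keeping beam_width hypotheses per position."""
    n = len(token)
    # beams[i] holds (score, words) hypotheses that cover token[:i] exactly.
    beams: List[List[Tuple[float, List[str]]]] = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, [])]
    for start in range(n):
        # Prune: keep only the best-scoring hypotheses ending at this position.
        beams[start].sort(key=lambda hyp: hyp[0], reverse=True)
        beams[start] = beams[start][:beam_width]
        for score, words in beams[start]:
            # Extend each surviving hypothesis with every candidate next word.
            for end in range(start + 1, min(n, start + max_word_len) + 1):
                word = token[start:end]
                beams[end].append((score + score_word(word), words + [word]))
    # Pick the best hypothesis covering the whole token.
    return max(beams[n], key=lambda hyp: hyp[0])[1]


scorer = make_unigram_scorer(TOY_COUNTS)
print(segment("dobardan", scorer))
print(segment("ukući", scorer))
```

With these toy counts, the merged token "dobardan" is split into "dobar" and "dan", while a token containing no known words is returned unsplit. A higher-order language model would score each word in the context of its predecessors, which is closer to what the abstract describes.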
Original language
English
Scientific fields
Computing
RELATED RECORDS
Projects:
036-1300646-1986 - Otkrivanje znanja u tekstnim podacima (Dalbelo-Bašić, Bojana, MZO) (CroRIS)
Institutions:
Faculty of Electrical Engineering and Computing, Zagreb
Profiles:
Jan Šnajder
(author)