Correcting Word Merge Errors in Croatian Texts

Mikša, Mladen; Šnajder, Jan; Dalbelo Bašić, Bojana

Bibliographic record number: 486604

Correcting Word Merge Errors in Croatian Texts // Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages / Tadić, Marko ; Dimitrova-Vulchanova, Mila ; Koeva, Svetla (eds.).
Zagreb: Hrvatsko društvo za jezične tehnologije, 2010, pp. 67-75 (lecture, international peer review, full paper (in extenso), scientific)

CROSBI ID: 486604

Title
Correcting Word Merge Errors in Croatian Texts

Authors
Mikša, Mladen ; Šnajder, Jan ; Dalbelo Bašić, Bojana

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages / Tadić, Marko ; Dimitrova-Vulchanova, Mila ; Koeva, Svetla - Zagreb : Hrvatsko društvo za jezične tehnologije, 2010, 67-75

ISBN
978-953-55375-2-6

Conference
Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages

Place and date
Dubrovnik, Croatia, 04.10.2010 - 06.10.2010

Type of participation
Lecture

Type of review
International peer review

Keywords
word merge errors; OCR errors; combinatorial optimization; language modeling; natural language processing; Croatian language

Abstract
In many text processing tasks, character-level errors (due to mistyping, OCR, etc.) typically degrade performance. Most approaches to error correction are dictionary-based and cannot be used to correct word boundary errors. Word boundary errors are quite common in OCR-generated texts, especially word merge errors. In this paper we describe an approach to correcting word merge errors in texts written in the Croatian language. The approach is based on combinatorial optimization with a beam search strategy that determines the most plausible segmentation of the input token. The plausibility of a segmentation is assessed using a statistical language model and several heuristics. We evaluate the performance of our approach on a sample of artificially generated word merge errors. The achieved results are comparable to those of approaches found in the literature.
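The general idea of segmenting a merged token with a beam-limited search can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give the details of their language model or heuristics, so the sketch assumes a toy unigram model with made-up counts (`UNIGRAM_COUNTS`), a fixed out-of-vocabulary penalty (`UNKNOWN_LOGPROB`), and hypothetical parameter names (`beam_width`, `max_word_len`). The paper's additional heuristics are omitted.

```python
import math

# Toy unigram counts standing in for a real statistical language model.
# These values are invented for illustration only.
UNIGRAM_COUNTS = {"dobar": 50, "dan": 40, "do": 30, "bar": 5, "da": 60, "n": 1}
TOTAL = sum(UNIGRAM_COUNTS.values())
UNKNOWN_LOGPROB = -20.0  # assumed penalty for pieces not in the vocabulary


def word_logprob(word):
    """Log-probability of a single word under the toy unigram model."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        return UNKNOWN_LOGPROB
    return math.log(count / TOTAL)


def segment(token, beam_width=3, max_word_len=12):
    """Return the highest-scoring split of `token` found by the beam search."""
    # beams[i] keeps at most `beam_width` (score, words) hypotheses
    # that cover exactly the prefix token[:i].
    beams = [[] for _ in range(len(token) + 1)]
    beams[0] = [(0.0, [])]
    for end in range(1, len(token) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            piece = token[start:end]
            for score, words in beams[start]:
                candidates.append((score + word_logprob(piece), words + [piece]))
        # Prune: keep only the best-scoring hypotheses for this prefix.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[end] = candidates[:beam_width]
    return beams[-1][0][1]


print(segment("dobardan"))
```

With these toy counts, the merged token "dobardan" is split into "dobar" and "dan", because that split has a higher total log-probability than the alternatives or the unsplit, unknown token. A larger `beam_width` explores more hypotheses at a higher cost.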

Original language
English

Scientific fields
Computing

LINKED TO THIS WORK

Projects:
036-1300646-1986 - Otkrivanje znanja u tekstnim podacima (Knowledge Discovery in Textual Data) (Dalbelo-Bašić, Bojana, MZO)

Institutions:
Faculty of Electrical Engineering and Computing, Zagreb

Profiles:

Jan Šnajder (author)

hnk.ffzg.hr