Bibliographic record number: 484273

Croatian OCR Error Correction Using Character Confusions and Language Modelling


Marović, Mladen; Mikša, Mladen; Šnajder, Jan; Dalbelo Bašić, Bojana
Croatian OCR Error Correction Using Character Confusions and Language Modelling // Proceedings of the 21st Central European Conference on Information and Intelligent Systems / Auer, Boris ; Bača, Miroslav ; Schatten, Markus (eds.).
Varaždin: Fakultet organizacije i informatike Sveučilišta u Zagrebu, 2010, pp. 281-288 (lecture, international peer review, full paper (in extenso), scientific)


CROSBI ID: 484273

Title
Croatian OCR Error Correction Using Character Confusions and Language Modelling

Authors
Marović, Mladen ; Mikša, Mladen ; Šnajder, Jan ; Dalbelo Bašić, Bojana

Type, subtype, and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the 21st Central European Conference on Information and Intelligent Systems / Auer, Boris ; Bača, Miroslav ; Schatten, Markus - Varaždin : Fakultet organizacije i informatike Sveučilišta u Zagrebu, 2010, 281-288

Conference
Central European Conference on Information and Intelligent Systems - CECIIS 2010

Place and date
Varaždin, Croatia, 22-24 September 2010

Type of participation
Lecture

Type of review
International peer review

Keywords
Natural language processing; OCR; character confusions; character n-grams; word merge errors; language model; Croatian language

Abstract
Manual correction of errors produced by optical character recognition (OCR) is a time-consuming task. This paper presents an automatic post-processing system that uses several methods to improve the OCR results of Croatian language texts. The system relies on knowledge of general characteristics of OCR errors, as well as language-specific knowledge. The methods used include character confusions, a character n-gram model, and word-splitting. A statistical language model is used for ranking the generated candidates depending on the sentential context. Experimental evaluation, performed on newspaper texts supplied by the Croatian News Agency, shows an error rate reduction of above 20%. These results amount to about 36% of the performance of manual correction.
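
The pipeline outlined in the abstract (candidate generation from character confusions and word-splitting, followed by language-model ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the confusion table, the lexicon, the bigram counts, and all function names below are hypothetical toy values chosen for illustration, add-one smoothing is an assumed stand-in for whatever the paper's statistical language model uses, and the character n-gram component is omitted.

```python
import math
from itertools import product

# Hypothetical confusion table: OCR output substring -> plausible intended substrings.
CONFUSIONS = {"c": ["č", "ć"], "s": ["š"], "z": ["ž"], "rn": ["m"], "1": ["l", "i"]}

# Toy lexicon and toy n-gram counts (assumptions, not data from the paper).
LEXICON = {"kuća", "kuca", "je", "u", "gradu", "velika"}
UNIGRAMS = {"<s>": 10, "kuća": 8, "kuca": 2, "je": 12, "u": 15, "gradu": 7, "velika": 3}
BIGRAMS = {("<s>", "kuća"): 5, ("kuća", "je"): 6, ("kuca", "je"): 1,
           ("je", "u"): 4, ("u", "gradu"): 7}


def confusion_candidates(token):
    """Return the token plus every variant with one confusion substitution applied."""
    out = {token}
    for i in range(len(token)):
        for wrong, rights in CONFUSIONS.items():
            if token.startswith(wrong, i):
                for right in rights:
                    out.add(token[:i] + right + token[i + len(wrong):])
    return out


def split_candidates(token):
    """Return two-word splits of a merged token whose halves are both in the lexicon."""
    return [(token[:i], token[i:]) for i in range(1, len(token))
            if token[:i] in LEXICON and token[i:] in LEXICON]


def expand(token):
    """Return candidate word sequences (tuples of words) for one OCR token."""
    cands = [(c,) for c in confusion_candidates(token) if c in LEXICON]
    cands += split_candidates(token)
    return cands or [(token,)]  # unknown tokens are left unchanged


def log_prob(words):
    """Score a word sequence with an add-one smoothed bigram model."""
    vocab = len(UNIGRAMS)
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log((BIGRAMS.get((prev, w), 0) + 1) / (UNIGRAMS.get(prev, 0) + vocab))
        prev = w
    return score


def correct(tokens):
    """Pick the candidate combination with the highest sentence-level score."""
    combos = product(*(expand(t) for t in tokens))
    best = max(combos, key=lambda combo: log_prob([w for part in combo for w in part]))
    return [w for part in best for w in part]


print(correct(["kuca", "je", "ugradu"]))
```

With these toy counts, the token "kuca" is replaced by "kuća" because the bigram context favours it, and the merged token "ugradu" is split into "u" and "gradu". The exhaustive search over candidate combinations is only practical for short sentences; a realistic system would likely use a beam or Viterbi-style search instead.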

Original language
English

Scientific fields
Computing



RELATED ENTITIES


Projects:
036-1300646-1986 - Otkrivanje znanja u tekstnim podacima (Knowledge Discovery in Textual Data) (Dalbelo-Bašić, Bojana, MZO) (CroRIS)

Institutions:
Fakultet elektrotehnike i računarstva, Zagreb

Profiles:

Jan Šnajder (author)

Bojana Dalbelo Bašić (author)

Cite this publication:

Marović, M., Mikša, M., Šnajder, J. & Dalbelo Bašić, B. (2010) Croatian OCR Error Correction Using Character Confusions and Language Modelling. In: Auer, B., Bača, M. & Schatten, M. (eds.) Proceedings of the 21st Central European Conference on Information and Intelligent Systems.
@article{article,
  author = {Marovi\'{c}, Mladen and Mik\v{s}a, Mladen and \v{S}najder, Jan and Dalbelo Ba\v{s}i\'{c}, Bojana},
  year = {2010},
  pages = {281-288},
  keywords = {Natural language processing, OCR, character confusions, character n-grams, word merge errors, language model, Croatian language},
  title = {Croatian OCR Error Correction Using Character Confusions and Language Modelling},
  publisher = {Fakultet organizacije i informatike Sveu\v{c}ili\v{s}ta u Zagrebu},
  publisherplace = {Vara\v{z}din, Hrvatska}
}