Croatian OCR Error Correction Using Character Confusions and Language Modelling (CROSBI ID 566711)
Conference contribution in a journal | original scientific paper | international peer review
Responsibility details
Marović, Mladen ; Mikša, Mladen ; Šnajder, Jan ; Dalbelo Bašić, Bojana
English
Croatian OCR Error Correction Using Character Confusions and Language Modelling
Manual correction of errors produced by optical character recognition (OCR) is a time-consuming task. This paper presents an automatic post-processing system that uses several methods to improve the OCR results of Croatian language texts. The system relies on knowledge of the general characteristics of OCR errors, as well as on language-specific knowledge. The methods used include character confusions, a character n-gram model, and word-splitting. A statistical language model ranks the generated candidates according to the sentential context. Experimental evaluation, performed on newspaper texts supplied by the Croatian News Agency, shows an error rate reduction of more than 20%. These results amount to about 36% of the performance of manual correction.
Natural language processing; OCR; character confusions; character n-grams; word merge errors; language model; Croatian language
Contribution details
281-288.
2010.
published
Host publication details
Central European conference on information and intelligent systems
Auer, Boris ; Bača, Miroslav ; Schatten, Markus
Varaždin: Fakultet organizacije i informatike Sveučilišta u Zagrebu
1847-2001
Conference details
Central European Conference on Information and Intelligent Systems, CECIIS 2010
lecture
22.09.2010-24.09.2010
Varaždin, Croatia