Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus (CROSBI ID 126863)
Journal contribution | original scientific paper | international peer review
Statement of responsibility
Malenica, Mislav ; Šmuc, Tomislav ; Šnajder, Jan ; Dalbelo Bašić, Bojana
English
Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus
We investigate how, and to what extent, the morphological complexity of a language influences text classification using Support Vector Machines (SVM). The Croatian-English parallel corpus provides the basis for a direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance in a series of parallel experiments on both languages, carried out over a wide range of feature subset sizes obtained by different feature selection methods, and applying different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments show that the improvements in SVM classifier performance are statistically significant; they are greater for small and medium numbers of features, especially for Croatian, whereas for large numbers of features the improvements are rather small and may be negligible in practice for both languages.
text classification ; SVM ; Croatian ; English ; morphological normalisation ; stemming ; lemmatization ; feature selection
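The link between morphological normalisation and feature space size mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the suffix list, the `toy_normalise` and `feature_space` helpers, and the three sample sentences are all hypothetical, and a real Croatian stemmer or lemmatiser is far more elaborate. The sketch only shows how conflating inflected forms shrinks a bag-of-words vocabulary.

```python
from collections import Counter

# Hypothetical toy suffix list (an assumption for illustration only);
# real stemmers and lemmatisers use much richer rules or lexicons.
TOY_SUFFIXES = ("ama", "ima", "om", "em", "a", "e", "i", "u", "s")


def toy_normalise(token: str, min_stem: int = 3) -> str:
    """Strip the longest matching toy suffix, keeping at least min_stem characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def feature_space(docs, normalise=None):
    """Count bag-of-words features over all documents, optionally normalised."""
    counts = Counter()
    for doc in docs:
        for token in doc.lower().split():
            counts[normalise(token) if normalise else token] += 1
    return counts


# Made-up sample sentences with several inflected forms of the same words.
docs = [
    "vlada je odobrila zakon",
    "vlade su odobrile zakone",
    "o zakonima vlada raspravlja",
]

raw = feature_space(docs)
norm = feature_space(docs, toy_normalise)
print(len(raw), len(norm))  # prints: 11 7
```

In this toy data the forms "zakon", "zakone" and "zakonima" collapse into one feature, so the same evidence is carried by fewer dimensions. That is the kind of effect one would expect to matter most when only a small number of features is retained.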
Publication details
44 (1)
2008.
325-339
published
0306-4573
1873-5371
10.1016/j.ipm.2006.12.007
Related fields
Information and communication sciences, Computing