
Models for Predicting the Inflectional Paradigm of Unknown Croatian Words (CROSBI ID 208181)

Journal article | original scientific paper | international peer review

Šnajder, Jan. Models for Predicting the Inflectional Paradigm of Unknown Croatian Words // Slovenščina 2.0, 1 (2013), 2; 1-34

Responsibility details

Šnajder, Jan

English

Models for Predicting the Inflectional Paradigm of Unknown Croatian Words

Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also handle out-of-vocabulary words properly. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
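The pipeline outlined in the abstract (generate candidate lemma-paradigm pairs, then describe each pair with string- and corpus-based features for a classifier) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two toy paradigms, the `generate_candidates` and `extract_features` helpers, and the four features are hypothetical and are not the paper's grammar or feature set. The classifier itself (an SVM in the paper) is omitted; the feature dictionaries are what such a classifier would consume.

```python
from collections import Counter

# Hypothetical toy paradigms (NOT the paper's handcrafted grammar): each entry
# maps a paradigm name to (lemma ending, endings of the inflected forms).
PARADIGMS = {
    "N-fem-a": ("a", ["a", "e", "i", "u", "om", "ama"]),
    "N-masc-0": ("", ["", "a", "u", "om", "i", "e", "ima"]),
}


def generate_candidates(word):
    """Return every (lemma, paradigm) pair that could have produced `word`."""
    candidates = set()
    for name, (lemma_ending, form_endings) in PARADIGMS.items():
        for ending in form_endings:
            # The word must end in this ending and leave a non-empty stem.
            if word.endswith(ending) and len(word) > len(ending):
                stem = word[: len(word) - len(ending)]
                candidates.add((stem + lemma_ending, name))
    return sorted(candidates)


def extract_features(lemma, paradigm, corpus_counts):
    """Describe one candidate pair with string- and corpus-based features."""
    lemma_ending, form_endings = PARADIGMS[paradigm]
    stem = lemma[: len(lemma) - len(lemma_ending)]
    forms = {stem + ending for ending in form_endings}
    attested = [form for form in forms if corpus_counts[form] > 0]
    return {
        "lemma_suffix": lemma[-2:],                      # string-based
        "stem_length": len(stem),                        # string-based
        "attested_ratio": len(attested) / len(forms),    # corpus-based
        "lemma_attested": corpus_counts[lemma] > 0,      # corpus-based
    }


if __name__ == "__main__":
    # Tiny stand-in corpus; a Counter returns 0 for unseen word forms.
    corpus = Counter(["knjiga", "knjige", "knjigu", "knjigom"])
    for lemma, paradigm in generate_candidates("knjigu"):
        print(lemma, paradigm, extract_features(lemma, paradigm, corpus))
```

In a supervised setup of the kind the abstract describes, each feature dictionary would be paired with a correct/incorrect label taken from a lexicon and used to train the classifier.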

computational morphology ; paradigm prediction ; machine learning ; feature selection ; Croatian language

Publication details

1 (2)

2013

1-34

published

2335-2736

Related fields

Computing
