Models for Predicting the Inflectional Paradigm of Unknown Croatian Words (CROSBI ID 208181)
Journal article | original scientific paper | international peer review
Statement of responsibility
Šnajder, Jan
English
Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists lemmas and their paradigms. However, a real-life morphological analyzer must also properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct, based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
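The pipeline described in the abstract (generate candidate lemma-paradigm pairs, then describe each pair with string- and corpus-based features) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy paradigms, the `Paradigm` class, the functions `candidates` and `features`, the three feature names, and the corpus counts are hypothetical and are not taken from the paper, whose grammar and feature set are far richer. The classifier itself is omitted; in the setting described above, feature vectors like these would be passed to a trained SVM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    """Hypothetical, heavily simplified paradigm: a lemma ending plus form endings."""
    name: str
    lemma_ending: str
    form_endings: tuple


# Toy stand-in for a handcrafted morphology grammar (NOT the grammar from the paper).
PARADIGMS = (
    Paradigm("N-fem-a", "a", ("a", "e", "u", "om", "ama")),
    Paradigm("N-masc-0", "", ("", "a", "u", "om", "e", "i", "ima")),
)

# Invented corpus frequencies of word forms, for illustration only.
COUNTS = {"knjiga": 120, "knjige": 95, "knjigu": 60, "knjigom": 12, "knjigama": 7}


def candidates(word):
    """Return all (lemma, paradigm) pairs whose paradigm could generate `word`."""
    found = set()
    for paradigm in PARADIGMS:
        for ending in paradigm.form_endings:
            if word.endswith(ending) and len(word) > len(ending):
                stem = word[:len(word) - len(ending)]
                found.add((stem + paradigm.lemma_ending, paradigm))
    return sorted(found, key=lambda pair: (pair[0], pair[1].name))


def features(lemma, paradigm, counts):
    """Compute one string-based and two corpus-based features for a candidate pair."""
    stem = lemma[:len(lemma) - len(paradigm.lemma_ending)]
    forms = {stem + ending for ending in paradigm.form_endings}
    attested = [form for form in forms if counts.get(form, 0) > 0]
    return {
        "lemma_length": len(lemma),                        # string-based
        "lemma_attested": int(counts.get(lemma, 0) > 0),   # corpus-based
        "attested_ratio": len(attested) / len(forms),      # corpus-based
    }


for lemma, paradigm in candidates("knjigom"):
    print(lemma, paradigm.name, features(lemma, paradigm, COUNTS))
```

For the unknown form "knjigom", the toy grammar proposes several competing pairs. The corpus-based features separate them because only the correct lemma has all of its generated forms attested in the invented counts, which is the kind of signal a classifier could learn from.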
computational morphology ; paradigm prediction ; machine learning ; feature selection ; Croatian language