Bibliographic record no. 711759
Models for Predicting the Inflectional Paradigm of Unknown Croatian Words
Models for Predicting the Inflectional Paradigm of Unknown Croatian Words // Slovenscina 2.0, 1 (2013), 2; 1-34 (international peer review, article, scientific)
CROSBI ID: 711759
Title
Models for Predicting the Inflectional Paradigm of Unknown Croatian Words
Authors
Šnajder, Jan
Source
Slovenscina 2.0 (2335-2736) 1 (2013), 2; 1-34
Type, subtype, and category of work
Journal papers, article, scientific
Keywords
computational morphology ; paradigm prediction ; machine learning ; feature selection ; Croatian language
Abstract
Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists lemmas and their paradigms. However, a real-life morphological analyzer must also be able to properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
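The setup described in the abstract (generate candidate lemma-paradigm pairs, then describe each pair with string- and corpus-based features) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two toy paradigms in `PARADIGMS`, the function names, the three features, and the toy corpus counts. The paper's actual morphology grammar and feature set are far richer.

```python
from typing import NamedTuple

# Hypothetical, heavily simplified paradigms (not the paper's grammar):
# paradigm name -> (ending of the lemma, endings of all word forms).
PARADIGMS = {
    "N-fem-a": ("a", ("a", "e", "i", "u", "om")),
    "N-masc-0": ("", ("", "a", "u", "om", "e")),
}


class Candidate(NamedTuple):
    lemma: str
    paradigm: str


def generate_candidates(word):
    """Return every (lemma, paradigm) pair that could have produced `word`."""
    candidates = set()
    for name, (lemma_ending, endings) in PARADIGMS.items():
        for ending in endings:
            # The word must end in this ending and leave a non-empty stem.
            if word.endswith(ending) and len(word) > len(ending):
                stem = word[: len(word) - len(ending)]
                candidates.add(Candidate(stem + lemma_ending, name))
    return sorted(candidates)


def word_forms(candidate):
    """Generate all word forms of a candidate under its paradigm."""
    lemma_ending, endings = PARADIGMS[candidate.paradigm]
    stem = candidate.lemma[: len(candidate.lemma) - len(lemma_ending)]
    return {stem + ending for ending in endings}


def features(candidate, corpus_counts):
    """Compute illustrative string- and corpus-based features for a candidate."""
    forms = word_forms(candidate)
    attested = [f for f in forms if corpus_counts.get(f, 0) > 0]
    return {
        # String-based feature: last character of the proposed lemma.
        "lemma_last_char": candidate.lemma[-1],
        # Corpus-based feature: is the proposed lemma itself seen in the corpus?
        "lemma_attested": int(corpus_counts.get(candidate.lemma, 0) > 0),
        # Corpus-based feature: share of generated forms seen in the corpus.
        "attested_ratio": len(attested) / len(forms),
    }


# Toy corpus counts (invented for the example).
corpus = {"žena": 5, "žene": 3, "ženi": 2, "ženu": 2, "ženom": 1}

for cand in generate_candidates("žene"):
    print(cand, features(cand, corpus))
```

In a supervised setting such as the one the abstract describes, feature vectors like these would be labeled as correct or incorrect and passed to a classifier (the abstract reports results with an SVM); the classifier itself is left out of this sketch.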
Original language
English
Scientific fields
Computer science
CONNECTIONS OF THE WORK
Projects:
MZO-ZP-036-1300646-1986 - Knowledge Discovery in Textual Data (Otkrivanje znanja u tekstnim podacima) (Dalbelo-Bašić, Bojana, MZO) (CroRIS)
Institutions:
Faculty of Electrical Engineering and Computing, Zagreb
Profiles:
Jan Šnajder
(author)
Indexed in other bibliographic databases:
- Directory of Open Access Journals