Models for Predicting the Inflectional Paradigm of Unknown Croatian Words

Šnajder, Jan

Data source: CROSBI

Models for Predicting the Inflectional Paradigm of Unknown Croatian Words (CROSBI ID 208181)

Journal article | original scientific paper | international peer review

Šnajder, Jan. Models for Predicting the Inflectional Paradigm of Unknown Croatian Words // Slovenscina 2.0, 1 (2013), 2; 1-34

Responsibility data

Authors

Šnajder, Jan

Basic data in the original language
Basic data in other languages

Language

English

Title

Models for Predicting the Inflectional Paradigm of Unknown Croatian Words

Abstract

Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
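
The pipeline described in the abstract (generate candidate lemma-paradigm pairs, then describe each pair with string- and corpus-based features) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two toy paradigms in `PARADIGMS`, the feature names, and the `pick_by_attestation` heuristic are hypothetical and are neither the handcrafted grammar, the feature set, the SVM classifier, nor the frequency-based baseline used in the paper.

```python
from collections import Counter

# Hypothetical toy paradigms: name -> (lemma suffix, inflected-form suffixes).
# Real Croatian paradigms are far richer; these two are illustrative only.
PARADIGMS = {
    "N-fem-a": ("a", ["a", "e", "i", "u", "om", "ama"]),
    "N-masc-0": ("", ["", "a", "u", "om", "e", "i", "ima"]),
}


def candidates(word):
    """Return every (lemma, paradigm) pair that could have produced `word`."""
    found = set()
    for name, (lemma_suffix, form_suffixes) in PARADIGMS.items():
        for suffix in form_suffixes:
            stem = word[: len(word) - len(suffix)]
            if word.endswith(suffix) and stem:
                found.add((stem + lemma_suffix, name))
    return sorted(found)


def expand(lemma, paradigm):
    """Generate all word forms of `lemma` under `paradigm`."""
    lemma_suffix, form_suffixes = PARADIGMS[paradigm]
    stem = lemma[: len(lemma) - len(lemma_suffix)]
    return [stem + s for s in form_suffixes]


def features(lemma, paradigm, corpus):
    """Compute a few string- and corpus-based features for one candidate.

    `corpus` is a Counter of word-form frequencies (missing forms count as 0).
    """
    forms = expand(lemma, paradigm)
    attested = [f for f in forms if corpus[f] > 0]
    return {
        # String-based features (assumed examples).
        "lemma_ending": lemma[-1],
        "stem_length": len(lemma) - len(PARADIGMS[paradigm][0]),
        # Corpus-based features (assumed examples).
        "lemma_attested": corpus[lemma] > 0,
        "attested_ratio": len(attested) / len(forms),
        "attested_freq": sum(corpus[f] for f in attested),
    }


def pick_by_attestation(word, corpus):
    """Simple heuristic: choose the candidate with the most attested forms.

    A trained classifier would replace this step; it is shown only to make
    the role of the features concrete.
    """
    return max(
        candidates(word),
        key=lambda c: features(*c, corpus)["attested_ratio"],
    )
```

With a toy frequency list such as `Counter({"žena": 10, "žene": 7, "ženu": 3, "ženom": 2})`, the unknown form `ženu` yields three candidates, and the heuristic selects `("žena", "N-fem-a")` because four of its six generated forms are attested, against four of seven for the competing masculine lemma `žen`. The close margin shows why string-based features and a trained classifier are useful on top of corpus counts alone.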

Keywords

computational morphology ; paradigm prediction ; machine learning ; feature selection ; Croatian language

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication data

Journal

Slovenscina 2.0

Volume (issue)

1 (2)

Year

2013

Pages

1-34

Publication status

published

e-ISSN

2335-2736

Related entities

Related persons

Jan Šnajder (author)

Related institutions

Fakultet elektrotehnike i računarstva (036) (author's institution)

Related projects

Otkrivanje znanja u tekstnim podacima (Knowledge Discovery in Textual Data; result of work on the project)

Field

Computing

Links

trojina.org