Bibliographic record no. 711759

Models for Predicting the Inflectional Paradigm of Unknown Croatian Words


Šnajder, Jan
Models for Predicting the Inflectional Paradigm of Unknown Croatian Words // Slovenscina 2.0, 1 (2013), 2; 1-34 (international peer review, article, scientific)


CROSBI ID: 711759

Title
Models for Predicting the Inflectional Paradigm of Unknown Croatian Words

Authors
Šnajder, Jan

Source
Slovenscina 2.0 (2335-2736) 1 (2013), 2; 1-34

Type, subtype, and category of work
Journal papers, article, scientific

Keywords
computational morphology ; paradigm prediction ; machine learning ; feature selection ; Croatian language

Abstract
Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also be able to properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
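
The pipeline outlined in the abstract (generate candidate lemma-paradigm pairs, describe each with string- and corpus-based features, then decide which pair is correct) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy paradigms, the feature names `lemma_ending` and `attested_ratio`, and the helper functions are hypothetical and are not taken from the article's grammar or feature set. In particular, a simple arg-max over one corpus feature stands in for the trained SVM classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    name: str
    lemma_suffix: str      # suffix carried by the lemma form
    form_suffixes: tuple   # suffixes of all inflected forms (toy subset)


# Hypothetical toy paradigms, NOT the handcrafted grammar from the article.
PARADIGMS = (
    Paradigm("N-fem-a", "a", ("a", "e", "i", "u", "o", "om", "ama")),
    Paradigm("N-masc-0", "", ("", "a", "u", "e", "om", "i", "ima")),
)


def candidates(word):
    """Return all (lemma, paradigm) pairs that could have produced `word`."""
    pairs = []
    for p in PARADIGMS:
        for suffix in p.form_suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem = word[:len(word) - len(suffix)]
                pair = (stem + p.lemma_suffix, p)
                if pair not in pairs:
                    pairs.append(pair)
    return pairs


def forms(lemma, p):
    """Generate the word-forms of `lemma` under paradigm `p`."""
    stem = lemma[:len(lemma) - len(p.lemma_suffix)]
    return {stem + suffix for suffix in p.form_suffixes}


def features(lemma, p, corpus):
    """One string-based and one corpus-based feature (illustrative only)."""
    generated = forms(lemma, p)
    return {
        "lemma_ending": lemma[-2:],
        "attested_ratio": len(generated & corpus) / len(generated),
    }


def predict(word, corpus):
    """Stand-in for the classifier: pick the pair with the most attested forms."""
    scored = [
        (features(lemma, p, corpus)["attested_ratio"], lemma, p.name)
        for lemma, p in candidates(word)
    ]
    _, lemma, paradigm_name = max(scored)
    return lemma, paradigm_name


# Toy corpus of attested word-forms (assumed for the example).
CORPUS = {"sestra", "sestre", "sestri", "sestru", "sestrom", "sestrama"}
prediction = predict("sestru", CORPUS)
```

In a supervised setting, the feature dictionaries of correct and incorrect pairs would instead serve as training examples for a binary classifier, and the per-word decision would take the highest-scoring candidate.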

Original language
English

Scientific fields
Computer science



RELATED TO THIS WORK


Projects:
MZO-ZP-036-1300646-1986 - Otkrivanje znanja u tekstnim podacima [Knowledge Discovery in Textual Data] (Dalbelo-Bašić, Bojana, MZO) (CroRIS)

Institutions:
Faculty of Electrical Engineering and Computing, Zagreb

Profiles:

Jan Šnajder (author)

Cite this publication:

Šnajder, J. (2013) Models for Predicting the Inflectional Paradigm of Unknown Croatian Words. Slovenscina 2.0, 1 (2), 1-34.
@article{article,
  author   = {\v{S}najder, Jan},
  title    = {Models for Predicting the Inflectional Paradigm of Unknown Croatian Words},
  journal  = {Slovenscina 2.0},
  year     = {2013},
  volume   = {1},
  number   = {2},
  pages    = {1-34},
  issn     = {2335-2736},
  keywords = {computational morphology, paradigm prediction, machine learning, feature selection, Croatian language}
}

Indexed in other bibliographic databases:


  • Directory of Open Access Journals




