Bibliographic record number: 390143

Evaluating Full Lemmatization of Croatian Texts


Agić, Željko; Tadić, Marko; Dovedan, Zdravko
Evaluating Full Lemmatization of Croatian Texts // Recent Advances in Intelligent Information Systems / Klopotek, Mieczyslaw ; Przepiorkowski, Adam ; Wierzchon, Slawomir ; Trojanowski, Krzysztof (eds.).
Warsaw: Academic Publishing House EXIT, 2009. pp. 175-184


Title
Evaluating Full Lemmatization of Croatian Texts

Authors
Agić, Željko ; Tadić, Marko ; Dovedan, Zdravko

Type, subtype and category of work
Book chapter, scientific

Book
Recent Advances in Intelligent Information Systems

Editor(s)
Klopotek, Mieczyslaw ; Przepiorkowski, Adam ; Wierzchon, Slawomir ; Trojanowski, Krzysztof

Publisher
Academic Publishing House EXIT

City
Warsaw

Year
2009

Page range
175-184

ISBN
978-83-60434-59-8

Keywords
Full lemmatization, morphosyntactic tagging, Croatian language

Abstract
The paper presents the implementation and evaluation of a module for full lemmatization of Croatian texts. The module implements several lemmatization procedures, all of them based on merging outputs of the previously developed stochastic morphosyntactic tagger CroTag and the inflectional lexicon of Croatian. Evaluation of the lemmatization module on two test cases, simulating realistic and ideal operating conditions, provided full lemmatization accuracy scores of 96.96 and 98.15 percent, respectively. It is also shown that a majority of errors in this framework, 57.14 percent in the realistic testing scenario, occur on word forms with external homography. Moreover, approximately 80 percent of all lemmatization errors occur on nouns, adjectives and adverbs, in that order. Language resources, testing environment and procedure descriptions are provided in the paper along with a discussion of obtained results and possible future research directions.
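
The abstract does not spell out the individual lemmatization procedures, so the following is only a minimal sketch of one plausible way to merge a tagger's output with an inflectional lexicon. It assumes the lexicon maps each word form to a list of (lemma, morphosyntactic tag) readings and that the tagger assigns one tag per token. The names `LEXICON`, `lemmatize` and `shared_prefix_len`, the toy entries and the tag strings are all hypothetical illustrations; none of them come from CroTag, the lexicon or the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word form -> list of (lemma, tag) readings.
# "kose" illustrates external homography: one form, several lemmas.
# The tag strings are illustrative only.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "kose": [("kosa", "Ncfsg"), ("kos", "Ncmpa"), ("kositi", "Vmr3p")],
}


def shared_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that two tags share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lemmatize(form: str, tag: str) -> str:
    """Pick a lemma for `form` given the tag proposed by a tagger.

    Assumed strategy (not taken from the paper):
    1. unknown form -> return the lowercased form itself;
    2. a lexicon reading with exactly the tagger's tag -> its lemma;
    3. otherwise -> the reading whose tag shares the longest prefix
       with the tagger's tag (first such reading on ties).
    """
    readings = LEXICON.get(form.lower(), [])
    if not readings:
        return form.lower()
    for lemma, lexicon_tag in readings:
        if lexicon_tag == tag:
            return lemma
    best = max(readings, key=lambda r: shared_prefix_len(r[1], tag))
    return best[0]
```

In this sketch a tagging error on a homographic form such as "kose" directly selects the wrong lemma, which is consistent with the abstract's observation that most errors fall on forms with external homography.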

Original language
English

Scientific fields
Computing, Information and Communication Sciences, Philology



WORK AFFILIATIONS


Project / topic
130-1300646-0645 - Hrvatski jezični resursi i njihovo obilježavanje [Croatian Language Resources and Their Annotation] (Marko Tadić)
130-1300646-1776 - Računalna sintaksa hrvatskoga jezika [Computational Syntax of Croatian] (Zdravko Dovedan Han)

Institutions
Faculty of Humanities and Social Sciences (Filozofski fakultet), Zagreb