Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

Agić, Željko; Ljubešić, Nikola; Merkler, Danijela

Bibliographic record number: 638909

Lemmatization and Morphosyntactic Tagging of Croatian and Serbian // Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
Sofia: Association for Computational Linguistics (ACL), 2013, pp. 48-57 (lecture, international peer review, full paper (in extenso), scientific)

CROSBI ID: 638909

Title
Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

Authors
Agić, Željko; Ljubešić, Nikola; Merkler, Danijela

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing. Sofia: Association for Computational Linguistics (ACL), 2013, pp. 48-57

Conference
4th Biennial International Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013)

Place and date
Sofia, Bulgaria, 8-9 August 2013

Type of participation
Lecture

Type of review
International peer review

Keywords
lemmatization; tagging; Croatian; Serbian

Abstract
We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for the two languages reaches 97.87% and 96.30%, while full morphosyntactic tagging accuracy using a 600-tag tagset peaks at 87.72% and 85.56%, respectively. Part-of-speech tagging accuracies reach 97.13% and 96.46%. Results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required on such dataset sizes for these particular tasks. The SETIMES.HR corpus, its resulting models and test sets are all made freely available.

Original language
English

Scientific fields
Information and communication sciences

RELATED TO THIS WORK

Projects:
130-1300646-1776 - Računalna sintaksa hrvatskoga jezika (Dovedan Han, Zdravko, MZOS) (CroRIS)

Institutions:
Faculty of Humanities and Social Sciences (Filozofski fakultet), Zagreb

Profiles:

Željko Agić (author)