Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene

Ljubešić, Nikola; Stupar, Marija; Jurić, Tereza; Agić, Željko

Data source: CROSBI

Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene (CROSBI ID 201799)

Journal contribution | original scientific paper

Ljubešić, Nikola ; Stupar, Marija ; Jurić, Tereza ; Agić, Željko Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene // Slovenscina 2.0, 2 (2013), 35-57

Responsibility data

Authors

Ljubešić, Nikola ; Stupar, Marija ; Jurić, Tereza ; Agić, Željko

Basic data in the original language
Basic data in other languages

Language

English

Title

Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene

Abstract

The paper presents efforts in developing freely available models for named entity recognition and classification in Croatian and Slovene text. Our experiments focus on the most informative set of linguistic features, taking into account the availability of language tools and resources for the languages in question. Besides the classic linguistic features, distributional similarity features calculated from large unannotated monolingual corpora are exploited as well. We performed two batches of experiments: the first on a self-built dataset, on which the optimal set of features is sought, and the second on additional, much larger datasets obtained at a later point, on which we verify the findings from the first batch. On the initial dataset, using distributional information improves the results by 7-8 points in F1, while adding morphological information improves the results by an additional 3-4 points in both languages. The second batch of experiments shows that morphosyntactic and distributional information lose importance as the dataset size increases significantly. The best-performing models that use distributional information only, along with test sets for comparison with existing and future systems, are made publicly available for both academic and non-academic use.

Keywords

named entity recognition; distributional similarity; Croatian language; Slovene language

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication data

Journal

Slovenscina 2.0

Volume (issue)

Year

2013

Pages

35-57

Publication status

published

e-ISSN

2335-2736

Related entities

Related persons

Željko Agić (author)

Nikola Ljubešić (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Computer science, Information and communication sciences

Links

trojina.org