
Bibliographic record no. 675915

Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene


Ljubešić, Nikola; Stupar, Marija; Jurić, Tereza; Agić, Željko
Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene // Slovenscina 2.0, 2 (2013), 35-57 (peer-review information not available, article, scientific)


CROSBI ID: 675915

Title
Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene

Authors
Ljubešić, Nikola ; Stupar, Marija ; Jurić, Tereza ; Agić, Željko

Source
Slovenscina 2.0 (2335-2736) 2 (2013); 35-57

Type, subtype and category of work
Journal papers, article, scientific

Keywords
named entity recognition; distributional similarity; Croatian language; Slovene language

Abstract
The paper presents efforts in developing freely available models for named entity recognition and classification in Croatian and Slovene text. Our experiments focus on the most informative set of linguistic features, taking into account the availability of language tools and resources for the languages in question. Besides the classic linguistic features, distributional similarity features calculated from large unannotated monolingual corpora are exploited as well. We performed two batches of experiments: the first on a self-built dataset, on which the optimal set of features is sought, and the second on additional, much larger datasets obtained at a later point, on which we verify the findings from the first batch. On the initial dataset, using distributional information improves the results by 7-8 points in F1, while adding morphological information improves the results by an additional 3-4 points in both languages. The second batch of experiments shows that morphosyntactic and distributional information lose importance as the dataset size significantly increases. The best-performing models that use distributional information only, along with test sets for comparison with existing and future systems, are made publicly available for both academic and non-academic use.

Original language
English

Scientific fields
Computer science, Information and communication sciences



ASSOCIATED ENTITIES


Institutions:
Filozofski fakultet, Zagreb

Profiles:

Nikola Ljubešić (author)

Željko Agić (author)

Cite this publication:

Ljubešić, Nikola; Stupar, Marija; Jurić, Tereza; Agić, Željko
Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene // Slovenscina 2.0, 2 (2013), 35-57 (peer-review information not available, article, scientific)
Ljubešić, N., Stupar, M., Jurić, T. & Agić, Ž. (2013) Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene. Slovenscina 2.0, 2, 35-57.
@article{article, author = {Ljube\v{s}i\'{c}, Nikola and Stupar, Marija and Juri\'{c}, Tereza and Agi\'{c}, \v{Z}eljko}, year = {2013}, pages = {35-57}, keywords = {named entity recognition, distributional similarity, Croatian language, Slovene language}, journal = {Slovenscina 2.0}, volume = {2}, issn = {2335-2736}, title = {Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene}, keyword = {named entity recognition, distributional similarity, Croatian language, Slovene language} }

Indexed in other bibliographic databases:


  • DOAJ




