Data source: CROSBI

Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources (CROSBI ID 719615)

Conference paper in proceedings | original scientific paper | international peer review

Váradi, Tamás ; Nyéki, Bence ; Koeva, Svetla ; Tadić, Marko ; Štefanec, Vanja ; Ogrodniczuk, Maciej ; Nitoń, Bartłomiej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Irimia, Elena et al. Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources // Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) / Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe et al. (eds.). Marseille: European Language Resources Association (ELRA), 2022. pp. 100-108

Responsibility details

Váradi, Tamás ; Nyéki, Bence ; Koeva, Svetla ; Tadić, Marko ; Štefanec, Vanja ; Ogrodniczuk, Maciej ; Nitoń, Bartłomiej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Irimia, Elena ; Mitrofan, Maria ; Tufiș, Dan ; Garabík, Radovan ; Krek, Simon ; Repar, Andraž

English

Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources

This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and their named entities are annotated. The annotations are uniformly provided for each language-specific corpus, while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
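To illustrate the file layout the abstract describes (ten CoNLL-U columns plus three corpus-specific ones), here is a minimal sketch of reading CoNLL-U Plus data in Python. It relies on the `# global.columns` header line that CoNLL-U Plus files use to declare their columns. The extra column names `EX:NE`, `EX:NP` and `EX:TERM`, the function `read_conllup` and the sample tokens are hypothetical placeholders for illustration, not the names or data actually used in the CURLICAT corpora.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, used as a fallback when no header is present.
STANDARD_COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {column name: value} dicts."""
    columns = list(STANDARD_COLUMNS)
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header declares the column names, separated by spaces.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, text, ...) are skipped
        elif not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            # Token lines are tab-separated, one value per declared column.
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Hypothetical two-token sample; the last three columns are placeholder names.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "EX:NE EX:NP EX:TERM",
    "# sent_id = 1",
    "1\tMarseille\tMarseille\tPROPN\t_\t_\t_\t_\t_\t_\tB-LOC\t_\t_",
    "2\tcorpora\tcorpus\tNOUN\t_\t_\t_\t_\t_\t_\tO\t_\t_",
    "",
])

sentences = list(read_conllup(SAMPLE.splitlines()))
```

Each token then exposes its annotations by column name, for example `sentences[0][0]["LEMMA"]` or `sentences[0][0]["EX:NE"]`, so code that consumes the data does not depend on column positions.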

national corpora ; comparable corpora ; domain corpora

Contribution details

100-108.

2022.

published

Host publication details

Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe ; Choukri, Khalid ; Cieri, Christopher ; Declerck, Thierry ; Goggi, Sara ; Isahara, Hitoshi ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Hélène ; Odijk, Jan ; Piperidis, Stelios

Marseille: European Language Resources Association (ELRA)

979-10-95546-72-6

Conference details

13th Language Resources and Evaluation Conference (LREC2022)

lecture

20.06.2022-25.06.2022

Marseille, France

Related research fields

Philology, Information and Communication Sciences