Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources

Váradi, Tamás; Nyéki, Bence; Koeva, Svetla; Tadić, Marko; Štefanec, Vanja; Ogrodniczuk, Maciej; Nitoń, Bartłomiej; Pęzik, Piotr; Barbu Mititelu, Verginica; Irimia, Elena; Mitrofan, Maria; Tufiș, Dan; Garabík, Radovan; Krek, Simon; Repar, Andraž

Data source: CROSBI

Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources (CROSBI ID 719615)

Conference paper in proceedings | original scientific paper | international peer review

Váradi, Tamás ; Nyéki, Bence ; Koeva, Svetla ; Tadić, Marko ; Štefanec, Vanja ; Ogrodniczuk, Maciej ; Nitoń, Bartłomiej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Irimia, Elena et al. Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources // Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) / Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe et al. (eds.). Marseille: European Language Resources Association (ELRA), 2022. pp. 100-108

Responsibility details

Authors

Váradi, Tamás ; Nyéki, Bence ; Koeva, Svetla ; Tadić, Marko ; Štefanec, Vanja ; Ogrodniczuk, Maciej ; Nitoń, Bartłomiej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Irimia, Elena ; Mitrofan, Maria ; Tufiș, Dan ; Garabík, Radovan ; Krek, Simon ; Repar, Andraž

Basic data in the original language

Language

English

Title

Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources

Abstract

This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and their named entities are annotated. The annotations are uniformly provided for each language-specific corpus, while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The files are in the CoNLL-U Plus format, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora, as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
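As an illustration of the file layout the abstract describes, the following is a minimal sketch of reading CoNLL-U Plus data in Python. It relies on the `# global.columns = ...` header line that CoNLL-U Plus files use to declare their columns. The three extra column names (`EX:NE`, `EX:NP`, `EX:TERM`), the function name `read_conllup` and the sample sentence are hypothetical placeholders, since this record does not name the actual CURLICAT columns.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, followed by three HYPOTHETICAL extra
# columns. The real CURLICAT column names are not given in this record.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "EX:NE EX:NP EX:TERM\n"
    "# text = Zagreb grows\n"
    "1\tZagreb\tZagreb\tPROPN\t_\t_\t_\t_\t_\t_\tB-LOC\t_\t_\n"
    "2\tgrows\tgrow\tVERB\t_\t_\t_\t_\t_\t_\tO\t_\t_\n"
    "\n"
)


def read_conllup(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {column name: value} dicts."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # The header declares the column names, separated by spaces.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Other comment lines (sentence-level metadata) are skipped here.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            # Token lines are tab-separated, one value per declared column.
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


sentences = list(read_conllup(SAMPLE))
for token in sentences[0]:
    print(token["FORM"], token["LEMMA"], token["UPOS"], token["EX:NE"])
```

Reading the column names from the header rather than hard-coding them means the same loop would work whatever the three corpus-specific columns are actually called.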

Keywords

national corpora ; comparable corpora ; domain corpora

Note

not recorded

Basic data in other languages

not recorded

Paper details

Pages

100-108.

Year of publication

2022.

Publication status

published

Host publication details

Title

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

Editors

Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe ; Choukri, Khalid ; Cieri, Christopher ; Declerck, Thierry ; Goggi, Sara ; Isahara, Hitoshi ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Hélène ; Odijk, Jan ; Piperidis, Stelios

Publisher

Marseille: European Language Resources Association (ELRA)

ISBN

979-10-95546-72-6

Conference details

Conference

13th Language Resources and Evaluation Conference (LREC2022)

Type of participation

lecture

Conference dates

20.06.2022-25.06.2022

Conference venue

Marseille, France

Related records

Related persons

Marko Tadić (CroRIS ID: 12084; MBZ: 157043) (author)

Vanja Štefanec (CroRIS ID: 44743; MBZ: 404580) (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Philology, Information and Communication Sciences

Links

aclanthology.org

lrec-conf.org