Bibliographic record no. 1201399

Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources


Váradi, Tamás; Nyéki, Bence; Koeva, Svetla; Tadić, Marko; Štefanec, Vanja; Ogrodniczuk, Maciej; Nitoń, Bartłomiej; Pęzik, Piotr; Barbu Mititelu, Verginica; Irimia, Elena et al.
Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources // Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) / Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe ; Choukri, Khalid ; Cieri, Christopher ; Declerck, Thierry ; Goggi, Sara ; Isahara, Hitoshi ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Hélène ; Odijk, Jan ; Piperidis, Stelios (eds.).
Marseille: European Language Resources Association (ELRA), 2022. pp. 100-108 (lecture, international peer review, full paper (in extenso), scientific)


CROSBI ID: 1201399

Title
Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources

Authors
Váradi, Tamás ; Nyéki, Bence ; Koeva, Svetla ; Tadić, Marko ; Štefanec, Vanja ; Ogrodniczuk, Maciej ; Nitoń, Bartłomiej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Irimia, Elena ; Mitrofan, Maria ; Tufiș, Dan ; Garabík, Radovan ; Krek, Simon ; Repar, Andraž

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022) / Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe ; Choukri, Khalid ; Cieri, Christopher ; Declerck, Thierry ; Goggi, Sara ; Isahara, Hitoshi ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Hélène ; Odijk, Jan ; Piperidis, Stelios - Marseille : European Language Resources Association (ELRA), 2022, 100-108

ISBN
979-10-95546-72-6

Conference
13th Language Resources and Evaluation Conference (LREC2022)

Place and date
Marseille, France, 20-25 June 2022

Type of participation
Lecture

Type of review
International peer review

Keywords
national corpora ; comparable corpora ; domain corpora

Abstract
This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and named entities are annotated. The annotations are provided uniformly for each language-specific corpus, while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The files are in the CoNLL-U Plus format, which contains the ten columns of the CoNLL-U format and three extra columns specific to our corpora, as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
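As an illustration of the file layout described in the abstract, the following is a minimal sketch of reading a CoNLL-U Plus file in Python. It relies on the CoNLL-U Plus convention that the first comment line, `# global.columns = ...`, lists the column names. The sample text and the three extra column names (`EX:NE`, `EX:NP`, `EX:TERM`) are hypothetical placeholders and are not taken from the actual CURLICAT files; the function name `read_conllup` is likewise an assumption.

```python
from typing import Dict, Iterator, List

# Hypothetical sample: ten standard CoNLL-U columns plus three extra ones.
# The extra column names are placeholders, not the real CURLICAT names.
SAMPLE = """\
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC EX:NE EX:NP EX:TERM
# sent_id = demo-1
1\tZagreb\tZagreb\tPROPN\t_\t_\t0\troot\t_\t_\tB-LOC\t_\t_
2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\tO\t_\t_

"""


def read_conllup(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {column name: value} dicts."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # Column names are declared once, in the file header.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, metadata)
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError("missing header or unexpected field count")
            sentence.append(dict(zip(columns, values)))
    if sentence:
        yield sentence


sentences = list(read_conllup(SAMPLE))
```

Reading the column names from the header, rather than hard-coding them, keeps the sketch independent of the exact names of the three extra columns.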

Original language
English

Scientific fields
Information and communication sciences, Philology



WORK AFFILIATIONS



Institutions:
Filozofski fakultet, Zagreb

Profiles:

Marko Tadić (author)

Vanja Štefanec (author)

Links to the full text:

www.lrec-conf.org aclanthology.org

Cite this publication:

Váradi, T., Nyéki, B., Koeva, S., Tadić, M., Štefanec, V., Ogrodniczuk, M., Nitoń, B., Pęzik, P., Barbu Mititelu, V. & Irimia, E. (2022) Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources. In: Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J. & Piperidis, S. (eds.) Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022).
@inproceedings{article,
  author = {V\'{a}radi, Tam\'{a}s and Ny\'{e}ki, Bence and Koeva, Svetla and Tadi\'{c}, Marko and \v{S}tefanec, Vanja and Ogrodniczuk, Maciej and Nito\'{n}, Bart{\l}omiej and P\k{e}zik, Piotr and Barbu Mititelu, Verginica and Irimia, Elena and Mitrofan, Maria and Tufiș, Dan and Garab\'{\i}k, Radovan and Krek, Simon and Repar, Andra\v{z}},
  title = {Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources},
  booktitle = {Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)},
  year = {2022},
  pages = {100--108},
  keywords = {national corpora, comparable corpora, domain corpora},
  isbn = {979-10-95546-72-6},
  publisher = {European Language Resources Association (ELRA)},
  address = {Marseille, France}
}