The MARCELL Legislative Corpus

Váradi, Tamás; Koeva, Svetla; Yamalov, Martin; Tadić, Marko; Sass, Bálint; Nitoń, Bartłomiej; Ogrodniczuk, Maciej; Pęzik, Piotr; Barbu Mititelu, Verginica; Ion, Radu; Irimia, Elena; Mitrofan, Maria; Păiș, Vasile; Tufiș, Dan; Garabík, Radovan; Krek, Simon; Repar, Andraz; Rihtar, Matjaž; Brank, Janez

Data source: CROSBI

The MARCELL Legislative Corpus (CROSBI ID 690826)

Conference paper in proceedings | other | international peer review

Váradi, Tamás ; Koeva, Svetla ; Yamalov, Martin ; Tadić, Marko ; Sass, Bálint ; Nitoń, Bartłomiej ; Ogrodniczuk, Maciej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Ion, Radu et al. The MARCELL Legislative Corpus // Proceedings of The 12th Language Resources and Evaluation Conference / Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe et al. (eds.). Marseille: European Language Resources Association (ELRA), 2020. pp. 3761-3768

Responsibility data

Authors

Váradi, Tamás ; Koeva, Svetla ; Yamalov, Martin ; Tadić, Marko ; Sass, Bálint ; Nitoń, Bartłomiej ; Ogrodniczuk, Maciej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Ion, Radu ; Irimia, Elena ; Mitrofan, Maria ; Păiș, Vasile ; Tufiș, Dan ; Garabík, Radovan ; Krek, Simon ; Repar, Andraz ; Rihtar, Matjaž ; Brank, Janez

Basic data in the original language

Language

English

Title

The MARCELL Legislative Corpus

Abstract

This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency and/or noun phrase annotation, the corpus is enriched with the IATE and EuroVoc labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
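The file layout described in the abstract (ten CoNLL-U columns plus four corpus-specific ones) can be illustrated with a minimal reader sketch. The ten standard column names follow the CoNLL-U format; the four extra column names below are assumptions made for illustration, inferred from the annotation layers the abstract lists, and may not match the names used in the actual files. The sample tokens are invented, and the function name `read_conllup` is hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# ASSUMPTION: illustrative names for the four corpus-specific columns
# (named entities, noun phrases, IATE and EuroVoc labels).
EXTRA_COLUMNS = ["MARCELL:NE", "MARCELL:NP", "MARCELL:IATE", "MARCELL:EUROVOC"]


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dictionaries."""
    columns = CONLLU_COLUMNS + EXTRA_COLUMNS  # fallback if no header is present
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # CoNLL-U Plus declares its column names in this header line.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence id, text, ...) are skipped
        elif not line.strip():
            if sentence:  # a blank line ends the current sentence
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Invented two-token sample; all annotation values are placeholders.
SAMPLE = (
    "# global.columns = " + " ".join(CONLLU_COLUMNS + EXTRA_COLUMNS) + "\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\t_\t_\t_\t_\n"
    "2\tlaw\tlaw\tNOUN\t_\t_\t0\troot\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllup(SAMPLE.splitlines()))
```

Reading the column names from the `# global.columns` header rather than hard-coding them keeps the sketch usable even if the real extra columns are named differently.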

Keywords

law corpus ; comparable corpus ; under-resourced languages

Note

Because of the coronavirus pandemic, the conference was not held, but the proceedings were published on 2020-05-15.

Contribution data

Pages

3761-3768.

Year of publication

2020.

Publication status

published

Host publication data

Title

Proceedings of The 12th Language Resources and Evaluation Conference

Editors

Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe ; Choukri, Khalid ; Cieri, Christopher ; Declerck, Thierry ; Goggi, Sara ; Isahara, Hitoshi ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Hélène ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios

Publisher

Marseille: European Language Resources Association (ELRA)

Conference data

Conference

The 12th Language Resources and Evaluation Conference (LREC2020)

Type of participation

lecture

Conference dates

11.05.2020-16.05.2020

Conference venue

Marseille, France

Related records

Related persons

Marko Tadić (CroRIS ID: 12084; MBZ: 157043) (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Philology, Information and communication sciences

Links

lrec-conf.org

Indexing

Web of Science Core Collection, Conference Proceedings Citation Index - Science (WoSCC-CPCI-S)

Web of Science Core Collection, Conference Proceedings Citation Index - Social Science & Humanities (WoSCC-CPCI-SSH)