Bibliographic record no. 1062869
The MARCELL Legislative Corpus
The MARCELL Legislative Corpus // Proceedings of The 12th Language Resources and Evaluation Conference / Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe ; Choukri, Khalid ; Cieri, Christopher ; Declerck, Thierry ; Goggi, Sara ; Isahara, Hitoshi ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Hélène ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios (eds.).
Marseille: European Language Resources Association (ELRA), 2020. pp. 3761-3768 (lecture, international peer review, full paper (in extenso), other)
CROSBI ID: 1062869
Title
The MARCELL Legislative Corpus
Authors
Váradi, Tamás ; Koeva, Svetla ; Yamalov, Martin ; Tadić, Marko ; Sass, Bálint ; Nitoń, Bartłomiej ; Ogrodniczuk, Maciej ; Pęzik, Piotr ; Barbu Mititelu, Verginica ; Ion, Radu ; Irimia, Elena ; Mitrofan, Maria ; Păiș, Vasile ; Tufiș, Dan ; Garabík, Radovan ; Krek, Simon ; Repar, Andraz ; Rihtar, Matjaž ; Brank, Janez
Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), other
Source
Proceedings of The 12th Language Resources and Evaluation Conference
/ Calzolari, Nicoletta ; Béchet, Frédéric ; Blache, Philippe ; Choukri, Khalid ; Cieri, Christopher ; Declerck, Thierry ; Goggi, Sara ; Isahara, Hitoshi ; Maegaard, Bente ; Mariani, Joseph ; Mazo, Hélène ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios - Marseille : European Language Resources Association (ELRA), 2020, 3761-3768
Conference
The 12th Language Resources and Evaluation Conference (LREC2020)
Place and date
Marseille, France, 11 May 2020 - 16 May 2020
Type of participation
Lecture
Type of review
International peer review
Keywords
law corpus ; comparable corpus ; under-resourced languages
Abstract
This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency and/or noun phrase annotation, the corpus is enriched with IATE and EuroVoc labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
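As a rough illustration of the file format the abstract describes, the following minimal Python sketch reads CoNLL-U Plus text, in which a leading `# global.columns = ...` comment declares the column names. The four extra column names used here (`MARCELL:NE`, `MARCELL:NP`, `MARCELL:IATE`, `MARCELL:EUROVOC`) are assumptions inferred from the annotation layers listed in the abstract, the function name `read_conllup` is hypothetical, and the sample sentence is invented; none of these is taken from the corpus files themselves.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, used if no "# global.columns" line is present.
STANDARD_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_conllup(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    columns = STANDARD_COLUMNS
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # CoNLL-U Plus declares its column layout in this comment line.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, raw text) are skipped
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError(f"expected {len(columns)} columns, got {len(values)}")
            sentence.append(dict(zip(columns, values)))
    if sentence:
        yield sentence


# Invented sample; the extra column names are assumed, not confirmed.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "MARCELL:NE MARCELL:NP MARCELL:IATE MARCELL:EUROVOC\n"
    "# text = Zakon vrijedi.\n"
    "1\tZakon\tzakon\tNOUN\t_\t_\t2\tnsubj\t_\t_\t_\t_\t_\t_\n"
    "2\tvrijedi\tvrijediti\tVERB\t_\t_\t0\troot\t_\t_\t_\t_\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

for sent in read_conllup(SAMPLE):
    print([token["FORM"] for token in sent])
```

Reading the column names from the header rather than hard-coding fourteen positions keeps the sketch usable for any CoNLL-U Plus file, whatever its extra columns are actually called.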
Original language
English
Scientific fields
Information and communication sciences, Philology
Note
Because of the coronavirus pandemic, the conference was not held, but the proceedings were published on 2020-05-15.
Indexed in:
- Web of Science Core Collection (WoSCC)
- Conference Proceedings Citation Index - Science (CPCI-S)
- Conference Proceedings Citation Index - Social Sciences & Humanities (CPCI-SSH)