Bibliographic record no. 1242327

Challenges of Building Domain-Specific Parallel Corpora from Public Administration Documents


Klubička, Filip; Kasunić, Lorena; Blazsetin, Danijel; Bago, Petra
Challenges of Building Domain-Specific Parallel Corpora from Public Administration Documents // Proceedings of the LREC 2022 15th Workshop on Building and Using Comparable Corpora (BUCC 2022) / Rapp, Reinhard ; Zweigenbaum, Pierre ; Sharoff, Serge (eds.).
Marseille: European Language Resources Association (ELRA), 2022, pp. 50-55 (lecture, international peer review, full paper (in extenso), scientific)


CROSBI ID: 1242327

Title
Challenges of Building Domain-Specific Parallel Corpora from Public Administration Documents

Authors
Klubička, Filip ; Kasunić, Lorena ; Blazsetin, Danijel ; Bago, Petra

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the LREC 2022 15th Workshop on Building and Using Comparable Corpora (BUCC 2022) / Rapp, Reinhard ; Zweigenbaum, Pierre ; Sharoff, Serge - Marseille : European Language Resources Association (ELRA), 2022, 50-55

ISBN
979-10-95546-94-8

Conference
15th Workshop on Building and Using Comparable Corpora (BUCC 2022)

Place and date
Marseille, France, 25 June 2022

Type of participation
Lecture

Type of peer review
International peer review

Keywords
language resources ; parallel corpora ; machine translation ; Connecting Europe Facility ; eTranslation ; PRINCIPLE

Abstract
PRINCIPLE was a Connecting Europe Facility (CEF)-funded project that focused on the identification, collection and processing of language resources (LRs) for four under-resourced European languages (Croatian, Icelandic, Irish and Norwegian) in order to improve the translation quality of eTranslation, an online machine translation (MT) tool provided by the European Commission. The collected LRs were used to develop neural MT engines in order to verify the quality of the resources. For all four languages, a total of 66 LRs were collected and made available on the ELRC-SHARE repository under various licenses. For Croatian, we collected and published 20 LRs: 19 parallel corpora and 1 glossary. The majority of the data is in the general domain (72% of translation units), while the rest is in the eJustice (23%), eHealth (3%) and eProcurement (2%) Digital Service Infrastructure (DSI) domains. The majority of the resources are for the Croatian-English language pair. The data was donated by six contributors from both the public and private sectors. In this paper we present a subset of 13 Croatian LRs developed from public administration documents, all of which are freely available, as well as the challenges associated with data collection, cleaning and processing.

Original language
English

Scientific fields
Information and communication sciences



RELATED PROJECTS, INSTITUTIONS AND PROFILES


Projects:
EK-CEF Telecom-INEA/CEF/ICT/A2018/1761837 - Providing Resources in Irish, Norwegian, Croatian and Icelandic for Purposes of Language Engineering (PRINCIPLE) (Bago, Petra, EK - 2018-EU-IA-0050) (CroRIS)

Institutions:
Filozofski fakultet, Zagreb

Profiles:

Lorena Ninčević (author)

Petra Bago (author)

Links to the full text of the paper:

comparable.limsi.fr

Cite this publication:

Klubička, F., Kasunić, L., Blazsetin, D. & Bago, P. (2022) Challenges of Building Domain-Specific Parallel Corpora from Public Administration Documents. In: Rapp, R., Zweigenbaum, P. & Sharoff, S. (eds.) Proceedings of the LREC 2022 15th Workshop on Building and Using Comparable Corpora (BUCC 2022).
@article{article,
  author = {Klubi\v{c}ka, Filip and Kasuni\'{c}, Lorena and Blazsetin, Danijel and Bago, Petra},
  year = {2022},
  pages = {50-55},
  keywords = {language resources, parallel corpora, machine translation, Connecting Europe Facility, eTranslation, PRINCIPLE},
  isbn = {979-10-95546-94-8},
  title = {Challenges of Building Domain-Specific Parallel Corpora from Public Administration Documents},
  publisher = {European Language Resources Association (ELRA)},
  publisherplace = {Marseille, France}
}



