Bibliographic record no. 1113170
Enhancing eTranslation in legislative domain for under-resourced languages
Enhancing eTranslation in legislative domain for under-resourced languages // 7th International Conference The Future of Information Sciences (INFuture). INFuture2019: Knowledge in the Digital Age
Zagreb, Croatia, 2019. (lecture, peer-review information not available, unpublished work, other)
CROSBI ID: 1113170
Title
Enhancing eTranslation in legislative domain for under-resourced languages
Authors
Štefanec, Vanja ; Filko, Matea ; Tadić, Marko
Type, subtype and category of work
Conference abstracts, unpublished work, other
Conference
7th International Conference The Future of Information Sciences (INFuture). INFuture2019: Knowledge in the Digital Age
Place and date
Zagreb, Croatia, 21-22 November 2019
Type of participation
Lecture
Type of peer review
Peer-review information not available
Keywords
Croatian-English parallel corpus; EUROVOC; IATE; legislative corpus
Abstract
In this presentation, we describe a joint project that aims to enhance the European Commission's eTranslation system by preparing corpora of national legislation (laws, decrees, regulations) in seven under-resourced languages (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak, Slovenian) as training data. The data are prepared in three steps: 1) collection of raw national legislative texts, followed by pre-processing and encoding of the collected data in a uniform representation format; 2) domain-specific classification of the national legislative corpora obtained in step 1, using EUROVOC top-level domains/descriptors and IATE terms; 3) semantic alignment of the multilingual corpora through micro-alignment of semantically equal or related text segments. These steps will yield 1) seven large-scale tokenized and morphologically tagged monolingual corpora of national legislation documents, classified into EUROVOC top-level domains and enriched with identified EUROVOC and IATE terms; 2) a comparable multilingual corpus of the seven languages, aligned at the level of the top-level domains identified by the EUROVOC descriptors; and 3) a cross-lingual, semantically aligned comparable corpus. We believe that these results will improve the eTranslation system in the legislative domain for seven official languages of the EU.
In the second part of the presentation, we present the Croatian-English parallel corpus of legislative documents. The parallel corpus consists of ca. 1800 documents of Croatian national legislation (ca. 500 laws, ca. 800 ordinances, ca. 200 decisions, ca. 130 regulations, etc.) and their official English translations. All texts included in the corpus have previously been annotated with EUROVOC descriptors. The corpus has been sentence-aligned and is provided in two formats, a collection of aligned documents and a translation memory, usable for training MT systems. Since Croatian is one of the official languages of the EU and a large number of texts are translated between Croatian and English on a daily basis, it is reasonable to believe that there is a need for translation memories of this kind, limited to narrower domains.
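The abstract does not say which translation-memory format is used. As an illustration only, the sketch below assumes TMX 1.4 (a common XML exchange format for translation memories) and shows how sentence-aligned Croatian-English pairs could be serialized with the Python standard library. The function name `build_tmx`, the header values, and the sample sentence pair are hypothetical and are not taken from the project.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute; ElementTree serializes it as "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="hr", tgt_lang="en"):
    """Build a TMX 1.4 tree from (source, target) sentence pairs.

    Assumption: the pairs are already sentence-aligned; no alignment is done here.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Header attributes required by TMX 1.4; the values are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Made-up sample pair, for illustration only.
pairs = [
    ("Ovaj Zakon stupa na snagu osmoga dana od dana objave.",
     "This Act shall enter into force on the eighth day after its publication."),
]
tmx = build_tmx(pairs)
tmx_text = ET.tostring(tmx, encoding="unicode")
```

The resulting `tmx_text` string can be written to a `.tmx` file; each `tu` element holds one aligned sentence pair, with one `tuv` per language.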
Source language
Croatian
Scientific fields
Information and communication sciences, Philology
AFFILIATIONS
Institutions:
Filozofski fakultet (Faculty of Humanities and Social Sciences), Zagreb