Enhancing eTranslation in legislative domain for under-resourced languages

Štefanec, Vanja; Filko, Matea; Tadić, Marko

Bibliographic record number: 1113170

Enhancing eTranslation in legislative domain for under-resourced languages // 7th International Conference The Future of Information Sciences (INFuture). INFuture2019: Knowledge in the Digital Age
Zagreb, Croatia, 2019 (lecture, peer-review information not available, unpublished work, other)

CROSBI ID: 1113170

Title
Enhancing eTranslation in legislative domain for under-resourced languages

Authors
Štefanec, Vanja ; Filko, Matea ; Tadić, Marko

Type, subtype and category of work
Conference abstracts, unpublished work, other

Conference
7th International Conference The Future of Information Sciences (INFuture). INFuture2019: Knowledge in the Digital Age

Place and date
Zagreb, Croatia, 21-22 November 2019

Type of participation
Lecture

Type of review
Peer-review information not available

Keywords
Croatian-English parallel corpus ; EUROVOC ; IATE ; legislative corpus

Abstract
This presentation introduces a joint project that aims to enhance the European Commission's eTranslation system by preparing corpora of national legislation (laws, decrees, regulations) in seven under-resourced languages (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak, Slovenian) as training data. The data are prepared in three steps: 1) collection of raw national legislative texts, followed by pre-processing and encoding of the collected data in a uniform representation format; 2) domain-specific classification of the national legislative corpora obtained in step 1, using EUROVOC top-level domains/descriptors and IATE terms; 3) semantic alignment of the multilingual corpora through micro-alignment of semantically equivalent or related text segments. These steps will result in 1) seven large-scale tokenized and morphologically tagged monolingual corpora of national legislation documents, classified into EUROVOC top-level domains and enriched with identified EUROVOC and IATE terms; 2) a comparable multilingual corpus of the seven languages, aligned at the top-level domains identified by the EUROVOC descriptors; and 3) a cross-lingual, semantically aligned comparable corpus. We believe that these results will improve the eTranslation system in the legislative domain for seven official languages of the EU.

The second part of the presentation introduces the Croatian-English parallel corpus of legislative documents. The parallel corpus consists of ca. 1,800 documents of Croatian national legislation (ca. 500 laws, ca. 800 ordinances, ca. 200 decisions, ca. 130 regulations, etc.) and their official English translations. All texts included in the corpus have previously been annotated with EUROVOC descriptors. The corpus has been sentence-aligned and is provided both as a collection of aligned documents and as a translation memory usable for training MT systems. Since Croatian is one of the official languages of the EU and a large number of texts are translated between Croatian and English on a daily basis, it is reasonable to believe that there is a need for translation memories of this kind, limited to narrower domains.

Original language
Croatian

Scientific fields
Information and communication sciences, Philology

RELATED ENTITIES

Institutions:
Faculty of Humanities and Social Sciences (Filozofski fakultet), Zagreb

Profiles:

Marko Tadić (author)

Vanja Štefanec (author)

Matea Filko (author)
