Neural Machine Translation for translating into Croatian and Serbian

Popovic, Maja; Poncelas, Alberto; Brkic, Marija; Way, Andy

Data source: CROSBI

Neural Machine Translation for translating into Croatian and Serbian (CROSBI ID 697828)

Conference paper in proceedings | original scientific paper | international peer review

Popovic, Maja ; Poncelas, Alberto ; Brkic, Marija ; Way, Andy. Neural Machine Translation for translating into Croatian and Serbian // Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects / Zampieri, Marcos ; Nakov, Preslav ; Ljubešić, Nikola et al. (eds.). Barcelona: International Committee on Computational Linguistics (ICCL), 2020. pp. 102-113

Responsibility data

Authors

Popovic, Maja ; Poncelas, Alberto ; Brkic, Marija ; Way, Andy

Basic data in the original language

Language

English

Title

Neural Machine Translation for translating into Croatian and Serbian

Abstract

In this work, we systematically investigate different set-ups for training of neural machine translation (NMT) systems for translation into Croatian and Serbian, two closely related South Slavic languages. We explore English and German as source languages, different sizes and types of training corpora, as well as bilingual and multilingual systems. We also explore translation of English IMDb user movie reviews, a domain/genre where only monolingual data are available. First, our results confirm that multilingual systems with joint target languages perform better. Furthermore, translation performance from English is much better than from German, partly because German is morphologically more complex and partly because the corpus consists mostly of parallel human translations instead of original text and its human translation. The translation from German should be further investigated systematically. For translating user reviews, creating synthetic in-domain parallel data through back- and forward-translation and adding them to a small out-of-domain parallel corpus can yield performance comparable with a system trained on a full out-of-domain corpus. However, it is still not clear what is the optimal size of synthetic in-domain data, especially for forward-translated data where the target language is machine translated. More detailed research including manual evaluation and analysis is needed in this direction.

Keywords

neural machine translation ; South-Slavic languages ; domain ; genre ; synthetic in-domain corpus ; out-of-domain corpus

Note

not recorded

Contribution data

Pages

102-113

Year of publication

2020

Publication status

published

Host publication data

Title

Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Editors

Zampieri, Marcos ; Nakov, Preslav ; Ljubešić, Nikola ; Tiedemann, Jörg ; Scherrer, Yves

Publisher

Barcelona: International Committee on Computational Linguistics (ICCL)

Conference data

Conference

VarDial 2020

Type of participation

poster

Conference date

13.12.2020-13.12.2020

Conference location

Barcelona, Spain

Related entities

Related persons

Marija Brkić Bakarić (author)

Related institutions

Sveučilište u Rijeci, Fakultet informatike i digitalnih tehnologija (318) (author's institution)

Field

Information and communication sciences

Links

aclweb.org