Quality Estimation for Synthetic Parallel Data Generation

Rubino, Raphael; Toral, Antonio; Ljubešić, Nikola; Ramírez-Sánchez, Gema

Bibliographic record number: 713086

Quality Estimation for Synthetic Parallel Data Generation // Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Reykjavík, Iceland: European Language Resources Association (ELRA), 2014. (poster, international peer review, full paper (in extenso), other)

CROSBI ID: 713086

Title
Quality Estimation for Synthetic Parallel Data Generation

Authors
Rubino, Raphael ; Toral, Antonio ; Ljubešić, Nikola ; Ramírez-Sánchez, Gema

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), other

Source
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), 2014

ISBN
978-2-9517408-8-4

Conference
Language Resources and Evaluation Conference

Place and date
Reykjavík, Iceland, 25.05.2014 - 31.05.2014

Type of participation
Poster

Type of review
International peer review

Keywords
machine translation; quality estimation; synthetic corpora

Abstract
This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an English-Croatian version of the Europarl parallel corpus based on the English-Slovene Europarl corpus and the Apertium rule-based translation system for Slovene-Croatian. These experiments are to be considered as a first step towards the generation of reliable synthetic parallel data for under-resourced languages. We first collect small amounts of aligned parallel data for the Slovene-Croatian language pair in order to build a quality estimation system for sentence-level Translation Edit Rate (TER) estimation. We then infer TER scores on automatically translated Slovene to Croatian sentences and use the best translations to build an English-Croatian statistical MT system. We show significant improvement in terms of automatic metrics obtained on two test sets using our approach compared to a random selection of synthetic parallel data.
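
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that each synthetic sentence pair already has a predicted TER score from some quality estimation model (lower TER meaning fewer expected edits), and it contrasts keeping the lowest-scoring pairs with the random-selection baseline. The function names, the sample sentences and the scores are hypothetical and are not taken from the paper.

```python
import random


def select_by_predicted_ter(pairs, predicted_ter, keep):
    """Return the `keep` sentence pairs with the lowest predicted TER.

    `pairs` is a list of (source, synthetic_target) tuples and
    `predicted_ter` holds one predicted score per pair (hypothetical
    output of a quality estimation model; lower is better).
    """
    if len(pairs) != len(predicted_ter):
        raise ValueError("need exactly one predicted TER score per pair")
    # Sort indices by score; ties keep their original corpus order.
    order = sorted(range(len(pairs)), key=lambda i: predicted_ter[i])
    return [pairs[i] for i in order[:keep]]


def select_random(pairs, keep, seed=0):
    """Baseline: a random sample of `keep` pairs, seeded for repeatability."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(keep, len(pairs)))


if __name__ == "__main__":
    # Toy data for illustration only (English source, Croatian MT output).
    corpus = [
        ("Thank you.", "Hvala."),
        ("The sitting is closed.", "Sjednica je zatvorena."),
        ("I declare the session resumed.", "Proglašavam nastavljenom sjednica."),
    ]
    scores = [0.05, 0.10, 0.60]  # made-up predicted TER values
    print(select_by_predicted_ter(corpus, scores, keep=2))
    print(select_random(corpus, keep=2))
```

How many pairs to keep is a free parameter in this sketch; the record does not state the thresholds or selection sizes used in the paper.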

Original language
English

Scientific fields
Information and communication sciences

AFFILIATIONS OF THE WORK

Institutions:
Faculty of Humanities and Social Sciences (Filozofski fakultet), Zagreb

Profiles:

Nikola Ljubešić (author)

CROSBI: Croatian Scientific Bibliography