Bibliographic record number: 852076

Crawl and crowd to bring machine translation to under-resourced languages


Toral, Antonio; Espla-Gomis, Miquel; Klubička, Filip; Ljubešić, Nikola; Papavassiliou, Vassilis; Prokopidis, Prokopis; Rubino, Raphael; Way, Andy
Crawl and crowd to bring machine translation to under-resourced languages // Language resources and evaluation, 51 (2016), 4; 1019-1051 doi:10.1007/s10579-016-9363-6 (international peer review, article, scientific)


CROSBI ID: 852076

Title
Crawl and crowd to bring machine translation to under-resourced languages

Authors
Toral, Antonio ; Espla-Gomis, Miquel ; Klubička, Filip ; Ljubešić, Nikola ; Papavassiliou, Vassilis ; Prokopidis, Prokopis ; Rubino, Raphael ; Way, Andy

Source
Language resources and evaluation (1574-020X) 51 (2016), 4; 1019-1051

Type, subtype and category of work
Journal papers, article, scientific

Keywords
statistical machine translation ; web crawling ; crowdsourcing

Abstract
We present a widely applicable methodology to bring machine translation (MT) to under-resourced languages in a cost-effective and rapid manner. Our proposal relies on web crawling to automatically acquire parallel data to train statistical MT systems if any such data can be found for the language pair and domain of interest. If that is not the case, we resort to (1) crowdsourcing to translate small amounts of text (hundreds of sentences), which are then used to tune statistical MT models, and (2) web crawling of vast amounts of monolingual data (millions of sentences), which are then used to build language models for MT. We apply these to two respective use-cases for Croatian, an under-resourced language that has gained relevance since it recently attained official status in the European Union. The first use-case regards tourism, given the importance of this sector to Croatia’s economy, while the second has to do with tweets, due to the growing importance of social media. For tourism, we crawl parallel data from 20 web domains using two state-of-the-art crawlers and explore how to combine the crawled data with bigger amounts of general-domain data. Our domain-adapted system is evaluated on a set of three additional tourism web domains and it outperforms the baseline in terms of automatic metrics and/or vocabulary coverage. In the social media use-case, we deal with tweets from the 2014 edition of the soccer World Cup. We build domain-adapted systems by (1) translating small amounts of tweets to be used for tuning by means of crowdsourcing and (2) crawling vast amounts of monolingual tweets. These systems outperform the baseline (Microsoft Bing) by 7.94 BLEU points (5.11 TER) for Croatian-to-English and by 2.17 points (1.94 TER) for English-to-Croatian on a test set translated by means of crowdsourcing. A complementary manual analysis sheds further light on these results.

Original language
English

Scientific fields
Information and communication sciences



RELATED TO THIS WORK


Institutions:
Filozofski fakultet, Zagreb

Profiles:

Nikola Ljubešić (author)

Links to the full text:

doi link.springer.com

Cite this publication:

Toral, Antonio; Espla-Gomis, Miquel; Klubička, Filip; Ljubešić, Nikola; Papavassiliou, Vassilis; Prokopidis, Prokopis; Rubino, Raphael; Way, Andy
Crawl and crowd to bring machine translation to under-resourced languages // Language resources and evaluation, 51 (2016), 4; 1019-1051 doi:10.1007/s10579-016-9363-6 (international peer review, article, scientific)
Toral, A., Espla-Gomis, M., Klubička, F., Ljubešić, N., Papavassiliou, V., Prokopidis, P., Rubino, R. & Way, A. (2016) Crawl and crowd to bring machine translation to under-resourced languages. Language resources and evaluation, 51 (4), 1019-1051 doi:10.1007/s10579-016-9363-6.
@article{article,
  author = {Toral, Antonio and Espla-Gomis, Miquel and Klubi\v{c}ka, Filip and Ljube\v{s}i\'{c}, Nikola and Papavassiliou, Vassilis and Prokopidis, Prokopis and Rubino, Raphael and Way, Andy},
  title = {Crawl and crowd to bring machine translation to under-resourced languages},
  journal = {Language resources and evaluation},
  year = {2016},
  volume = {51},
  number = {4},
  pages = {1019-1051},
  issn = {1574-020X},
  doi = {10.1007/s10579-016-9363-6},
  keywords = {statistical machine translation, web crawling, crowdsourcing}
}

Journal indexed in:


  • Web of Science Core Collection (WoSCC)
    • Science Citation Index Expanded (SCI-EXP)
    • SCI-EXP, SSCI and/or A&HCI
  • Scopus


Included in other bibliographic databases:


  • Education Research Abstracts Online

