Bibliographic record number: 713087

caWaC – a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation


Ljubešić, Nikola; Toral, Antonio
caWaC – a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation // Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Reykjavik, Iceland: ELRA, 2014. (poster, international peer review, full paper (in extenso), scientific)


Title
caWaC – a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation

Authors
Ljubešić, Nikola ; Toral, Antonio

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) / - : ELRA, 2014

ISBN
978-2-9517408-8-4

Conference
Language Resources and Evaluation Conference

Place and date
Reykjavik, Iceland, 25-31.05.2014.

Type of participation
Poster

Type of review
International peer review

Keywords
Web corpus; Catalan language; language modeling; machine translation

Abstract
In this paper we present the construction process of a web corpus of Catalan built from the content of the .cat top-level domain. For collecting and processing data we use the Brno pipeline with the SpiderLing crawler and its accompanying tools. To the best of our knowledge, the corpus is the largest existing corpus of Catalan: it contains 687 million words, a significant increase given that the biggest corpus of Catalan until now, CuCWeb, counts 166 million words. We evaluate the resulting resource on the tasks of language modeling and statistical machine translation (SMT) by calculating language model (LM) perplexity and incorporating the LM in the SMT pipeline. We compare language models trained on different subsets of the resource with those trained on the Catalan Wikipedia and on the target side of the parallel data used to train the SMT system.

Original language
English

Scientific fields
Information and communication sciences



PAPER AFFILIATIONS


Institutions
Filozofski fakultet, Zagreb

Author with registry number:
Nikola Ljubešić, (272820)