caWaC – a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation

Ljubešić, Nikola; Toral, Antonio

Data source: CROSBI

caWaC – a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation (CROSBI ID 613774)

Conference paper in proceedings | original scientific paper | international peer review

Ljubešić, Nikola ; Toral, Antonio caWaC – a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation // Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), 2014

Responsibility details

Authors

Ljubešić, Nikola ; Toral, Antonio

Basic data in the original language
Basic data in other languages

Language

English

Title

caWaC – a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation

Abstract

In this paper we present the construction of a web corpus of Catalan built from the content of the .cat top-level domain. To collect and process the data we use the Brno pipeline, consisting of the SpiderLing crawler and its accompanying tools. To the best of our knowledge, the corpus is the largest existing corpus of Catalan: it contains 687 million words, a significant increase over the previously largest Catalan corpus, CuCWeb, which contains 166 million words. We evaluate the resulting resource on language modeling and statistical machine translation (SMT) by calculating language model (LM) perplexity and by incorporating the LM into the SMT pipeline. We compare language models trained on different subsets of the resource with models trained on the Catalan Wikipedia and on the target side of the parallel data used to train the SMT system.
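As an illustration of the perplexity measure mentioned in the abstract, the following is a minimal sketch only. It assumes a toy add-one-smoothed unigram model; the record does not state which toolkit, n-gram order, or smoothing method the authors used, and the names `train_unigram` and `perplexity` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return a probability function for an add-one-smoothed unigram model.

    Assumption for illustration only: one extra vocabulary slot is reserved
    so that unseen words receive non-zero probability.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 slot shared by all unseen words

    def prob(word):
        return (counts.get(word, 0) + 1) / (total + vocab_size)

    return prob


def perplexity(prob, tokens):
    """Perplexity = exp of the negative mean log-probability per token."""
    log_sum = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))


if __name__ == "__main__":
    # Toy Catalan-like training and held-out token lists (invented data).
    train = "el gat menja el peix".split()
    held_out = "el gat dorm".split()
    model = train_unigram(train)
    print(perplexity(model, held_out))
```

Lower perplexity on held-out text indicates that the model assigns higher probability to that text, which is the sense in which models trained on different corpora can be compared.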

Keywords

web corpus; Catalan language; language modeling; machine translation

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Contribution details

Year of publication

2014

Publication status

published

Host publication details

Title

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Publisher

European Language Resources Association (ELRA)

ISBN

978-2-9517408-8-4

Conference details

Conference

Language Resources and Evaluation Conference

Type of participation

poster

Conference dates

25 May 2014 to 31 May 2014

Conference venue

Reykjavík, Iceland

Related entities

Related persons

Nikola Ljubešić (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Information and communication sciences