Data source: CROSBI

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian (CROSBI ID 656246)

Conference paper in proceedings | original scientific paper | international peer review

Ljubešić, Nikola ; Klubička, Filip. {bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian // Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ EACL 2014 / Bildhauer, Felix ; Schäfer, Roland (eds.). Gothenburg: Association for Computational Linguistics (ACL), 2014, pp. 29-35

Responsibility details

Ljubešić, Nikola ; Klubička, Filip

English

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

In this paper, we present the construction process of top-level-domain web corpora of Bosnian, Croatian and Serbian. For constructing the corpora we use the SpiderLing crawler with its associated tools adapted for simultaneous crawling and processing of text written in two scripts, Latin and Cyrillic. In addition to the modified collection process we focus on two sources of noise in the resulting corpora: (1) they contain documents written in the other, closely related languages that cannot be identified with standard language identification methods, and (2) like most web corpora, they partially contain low-quality data not suitable for the specific research and application objectives. We approach both problems by using language modeling on the crawled data only, omitting the need for manually validated language samples for training. On the task of discriminating between closely related languages, we outperform the state-of-the-art Blacklist classifier, reducing its error to a fourth.

Bosnian ; Croatian ; Serbian ; web corpus

Contribution details

29-35

2014

published

978-1-937284-83-1

Host publication details

Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ EACL 2014

Bildhauer, Felix ; Schäfer, Roland

Gothenburg: Association for Computational Linguistics (ACL)

Conference details

9th Web as Corpus Workshop (WaC-9)

oral presentation

26 April 2014

Gothenburg, Sweden

Related records

not recorded