{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

Ljubešić, Nikola; Klubička, Filip

Bibliographic record number: 912128

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian // Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ EACL 2014 / Bildhauer, Felix ; Schäfer, Roland (eds.).
Gothenburg: Association for Computational Linguistics (ACL), 2014. pp. 29-35 (lecture, international peer review, full paper (in extenso), scientific)

CROSBI ID: 912128

Title
{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

Authors
Ljubešić, Nikola ; Klubička, Filip

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ EACL 2014 / Bildhauer, Felix ; Schäfer, Roland - Gothenburg : Association for Computational Linguistics (ACL), 2014, 29-35

ISBN
978-1-937284-83-1

Conference
9th Web as Corpus Workshop (WaC-9)

Place and date
Gothenburg, Sweden, 26 April 2014

Type of participation
Lecture

Type of review
International peer review

Keywords
Bosnian ; Croatian ; Serbian ; web corpus

Abstract
In this paper, we present the construction process of top-level-domain web corpora of Bosnian, Croatian and Serbian. To construct the corpora, we use the SpiderLing crawler with its associated tools, adapted for simultaneous crawling and processing of text written in two scripts, Latin and Cyrillic. In addition to the modified collection process, we focus on two sources of noise in the resulting corpora: (1) they contain documents written in the other, closely related languages that cannot be identified with standard language identification methods, and (2) like most web corpora, they partially contain low-quality data not suitable for specific research and application objectives. We approach both problems by using language modeling on the crawled data only, omitting the need for manually validated language samples for training. On the task of discriminating between closely related languages, we outperform the state-of-the-art Blacklist classifier, reducing its error to a fourth.
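
The idea of discriminating between closely related languages with language models can be illustrated with a minimal sketch. This record does not describe the paper's actual models or features, so everything here is an assumption made for illustration: the smoothed character-trigram frequency models, the add-one smoothing, the names `train_models` and `classify`, and the tiny `TOY_SAMPLES` sentences (ijekavian forms labeled `hr`, ekavian forms labeled `sr`).

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Add-one smoothed frequency distribution over character n-grams."""

    def __init__(self, texts, vocab_size, n=3):
        self.n = n
        self.counts = Counter()
        for text in texts:
            self.counts.update(char_ngrams(text, n))
        self.total = sum(self.counts.values())
        # Shared vocabulary size keeps scores comparable across models.
        self.vocab_size = vocab_size

    def avg_logprob(self, text):
        """Average smoothed log-probability per n-gram of the text."""
        grams = char_ngrams(text, self.n)
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def train_models(samples, n=3):
    """Train one model per language label from a {label: [texts]} mapping."""
    vocab = set()
    for texts in samples.values():
        for text in texts:
            vocab.update(char_ngrams(text, n))
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    return {lang: NgramModel(texts, vocab_size, n) for lang, texts in samples.items()}


def classify(text, models):
    """Pick the label whose model assigns the highest average log-probability."""
    return max(models, key=lambda lang: models[lang].avg_logprob(text))


# Hypothetical toy training data, far too small for real use.
TOY_SAMPLES = {
    "hr": ["tko je što rekao", "lijepo vrijeme danas", "mlijeko je bijelo"],
    "sr": ["ko je šta rekao", "lepo vreme danas", "mleko je belo"],
}
```

Under the same assumptions, the `avg_logprob` score could also serve as a rough quality signal: a document that scores poorly under every model is a candidate for the low-quality noise the abstract mentions.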

Original language
English

AFFILIATIONS

Institutions:
Faculty of Humanities and Social Sciences, Zagreb (Filozofski fakultet)

Profiles:

Nikola Ljubešić (author)
