Data source: CROSBI

Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair (CROSBI ID 643372)

Conference paper in proceedings | original scientific paper | international peer review

Ljubešić, Nikola ; Esplà-Gomis, Miquel ; Toral, Antonio ; Ortiz Rojas, Sergio ; Klubička, Filip Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair // Proceedings of the Tenth International conference on language resources and evaluation (LREC 2016) / Calzolari, N. [et al.] (eds.). Portorož: European Language Resources Association (ELRA), 2016. pp. 2949-2956

Authorship details

Ljubešić, Nikola ; Esplà-Gomis, Miquel ; Toral, Antonio ; Ortiz Rojas, Sergio ; Klubička, Filip

English

Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair

Abstract: This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain ".hr" and the Slovene top-level domain ".si", and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.

crawling ; top-level domain ; monolingual corpus ; parallel corpus


Contribution details

2949-2956.

2016.


published

978-2-9517408-9-1

Host publication details

Proceedings of the Tenth International conference on language resources and evaluation (LREC 2016)

Calzolari, N. [et al.]

Portorož: European Language Resources Association (ELRA)

Conference details

Tenth International conference on language resources and evaluation - LREC'16

poster

23.05.2016-28.05.2016

Portorož, Slovenia

Research field

Information and communication sciences
