Bibliographic record no. 852069

Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair


Ljubešić, Nikola; Esplà-Gomis, Miquel; Toral, Antonio; Ortiz Rojas, Sergio; Klubička, Filip
Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair // Proceedings of the Tenth International conference on language resources and evaluation (LREC 2016) / Calzolari, N. [et al] (eds.).
Portorož: European Language Resources Association (ELRA), 2016, pp. 2949-2956 (poster, international peer review, full paper (in extenso), scientific)


CROSBI ID: 852069

Title
Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair

Authors
Ljubešić, Nikola ; Esplà-Gomis, Miquel ; Toral, Antonio ; Ortiz Rojas, Sergio ; Klubička, Filip

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the Tenth International conference on language resources and evaluation (LREC 2016) / Calzolari, N. [et al] - Portorož : European Language Resources Association (ELRA), 2016, 2949-2956

ISBN
978-2-9517408-9-1

Conference
Tenth International conference on language resources and evaluation - LREC'16

Place and date
Portorož, Slovenia, 23-28 May 2016

Type of participation
Poster

Type of peer review
International peer review

Keywords
crawling ; top-level domain ; monolingual corpus ; parallel corpus

Abstract
This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain ".hr" and the Slovene top-level domain ".si", and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.

Original language
English

Scientific fields
Information and communication sciences



ASSOCIATIONS OF THIS WORK


Institutions:
Filozofski fakultet, Zagreb

Profiles:

Nikola Ljubešić (author)

Links to the full text of the paper:

www.aclweb.org

Cite this publication:

Ljubešić, Nikola; Esplà-Gomis, Miquel; Toral, Antonio; Ortiz Rojas, Sergio; Klubička, Filip
Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair // Proceedings of the Tenth International conference on language resources and evaluation (LREC 2016) / Calzolari, N. [et al] (eds.).
Portorož: European Language Resources Association (ELRA), 2016, pp. 2949-2956 (poster, international peer review, full paper (in extenso), scientific)
Ljubešić, N., Esplà-Gomis, M., Toral, A., Ortiz Rojas, S. & Klubička, F. (2016) Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair. In: Calzolari, N. (ed.) Proceedings of the Tenth International conference on language resources and evaluation (LREC 2016).
@article{article,
  author = {Ljube\v{s}i\'{c}, Nikola and Espl\`{a}-Gomis, Miquel and Toral, Antonio and Ortiz Rojas, Sergio and Klubi\v{c}ka, Filip},
  editor = {Calzolari, N.},
  year = {2016},
  pages = {2949-2956},
  keywords = {crawling, top-level domain, monolingual corpus, parallel corpus},
  isbn = {978-2-9517408-9-1},
  title = {Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair},
  keyword = {crawling, top-level domain, monolingual corpus, parallel corpus},
  publisher = {European Language Resources Association (ELRA)},
  publisherplace = {Portoro\v{z}, Slovenija}
}