Comparing Two Acquisition Systems for Automatically Building an English-Croatian Parallel Corpus from Multilingual Websites (CROSBI ID 613766)
Conference paper in proceedings | original scientific paper | international peer review
Authorship details
Espla-Gomis, Miquel ; Klubička, Filip ; Ljubešić, Nikola ; Ortiz-Rojas, Sergio ; Papavassiliou, Vassilis ; Prokopidis, Prokopis
English
Comparing Two Acquisition Systems for Automatically Building an English-Croatian Parallel Corpus from Multilingual Websites
In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English-Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manually examined and the success rate was computed on the collection of pairs of documents detected by each setting. We compare the performance of the settings and the amount of different corpora detected by each setting. In addition, we describe the resource obtained, both by the settings and through the human evaluation, which has been released as a high-quality parallel corpus.
parallel data acquisition; focused crawling; system comparison
Contribution details
2014.
published
Host publication details
Conference details
Language Resources and Evaluation Conference 2014
poster
26.05.2014-31.05.2014
Reykjavík, Iceland