Data source: CROSBI

Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform (CROSBI ID 325973)

Journal article | original scientific paper | international peer review

Jaworski, Rafal ; Seljan, Sanja ; Dunđer, Ivan Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform // Information, 14 (2023), 4; 226-244. doi: 10.3390/info14040226

Authorship details

Jaworski, Rafal ; Seljan, Sanja ; Dunđer, Ivan

English

Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform

Parallel corpora have been widely used in the fields of natural language processing and translation as they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, or generate inter-language word embeddings. Many corpora are publicly available; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it proved to work very well for the collection of high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated by means of gamification. The rules of the game motivate the participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.

parallel corpus ; data acquisition ; gamification ; crowdsourcing ; machine translation ; natural language processing

Publication details

14 (4)

2023

226-244

published

2078-2489

10.3390/info14040226

Open-access publication cost

Related fields

Information and communication sciences, Computing
