Bibliographic record no.: 1274164

Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform


Jaworski, Rafal; Seljan, Sanja; Dunđer, Ivan
Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform // Information, 14 (2023), 4; 226-244 doi:10.3390/info14040226 (international peer review, article, scientific)


CROSBI ID: 1274164

Title
Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform

Authors
Jaworski, Rafal ; Seljan, Sanja ; Dunđer, Ivan

Source
Information (2078-2489) 14 (2023), 4; 226-244

Type, subtype, and category of work
Journal papers, article, scientific

Keywords
parallel corpus ; data acquisition ; gamification ; crowdsourcing ; machine translation ; natural language processing

Abstract
Parallel corpora have been widely used in the fields of natural language processing and translation as they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, or generate inter-language word embeddings. Many corpora are publicly available; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it proved to work very well for the collection of high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated by means of gamification. The rules of the game motivate the participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.

Original language
English

Scientific fields
Computing, Information and Communication Sciences



RELATED ENTITIES


Projects:
--11-933-1053 - Machine Learning and Natural Language Processing in the Domain of Computer Security – Part II (Seljan, Sanja) (CroRIS)

Institutions:
Filozofski fakultet, Zagreb (Faculty of Humanities and Social Sciences)

Profiles:

Ivan Dunđer (author)

Sanja Seljan (author)

Links to the full text of the work:

Full-text access: doi, www.mdpi.com

Cite this publication:

Jaworski, Rafal; Seljan, Sanja; Dunđer, Ivan
Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform // Information, 14 (2023), 4; 226-244 doi:10.3390/info14040226 (international peer review, article, scientific)
Jaworski, R., Seljan, S. & Dunđer, I. (2023) Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform. Information, 14 (4), 226-244 doi:10.3390/info14040226.
@article{article,
  author = {Jaworski, Rafal and Seljan, Sanja and Dun{\dj}er, Ivan},
  title = {Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform},
  journal = {Information},
  year = {2023},
  volume = {14},
  number = {4},
  pages = {226-244},
  issn = {2078-2489},
  doi = {10.3390/info14040226},
  keywords = {parallel corpus, data acquisition, gamification, crowdsourcing, machine translation, natural language processing}
}

The journal is indexed in:


  • Web of Science Core Collection (WoSCC)
    • Emerging Sources Citation Index (ESCI)
  • Scopus

