An economic approach to big data in a minority language

Dembitz, Šandor; Gledec, Gordan; Sokele, Mladen

Data source: CROSBI

An economic approach to big data in a minority language (CROSBI ID 209244)

Journal article | original scientific paper | international peer review

Dembitz, Šandor ; Gledec, Gordan ; Sokele, Mladen An economic approach to big data in a minority language // Procedia computer science, 35 (2014), 427-436. doi: 10.1016/j.procs.2014.08.123

Responsibility details

Authors

Dembitz, Šandor ; Gledec, Gordan ; Sokele, Mladen

Basic data in the original language
Basic data in other languages

Language

English

Title

An economic approach to big data in a minority language

Abstract

Google's n-gram project recently brought the benefits of big data to several major world languages, such as English and Chinese. Any attempt to derive such systems for the world's minority languages in the same manner, with the aim of accelerating the development of NLP applications, encounters many obstacles. This paper presents an innovative and economic approach to large-scale n-gram system creation, applied to the case of the Croatian language. Instead of using the Web as the world's biggest text repository, our n-gram collection process relies on the Croatian academic online spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. The service has already processed a corpus larger than the Croatian web corpus created in recent years. In contrast to the Google n-gram systems, where cutoff criteria were applied, our n-gram filtering is based on dictionary criteria. This resulted in a system comparable in size to the largest n-gram systems of today. Because it relies on a service in constant use, the Croatian n-gram system is dynamic, which makes it unique among the systems compared. The paper also illustrates the importance of an n-gram infrastructure for rapid breakthroughs in new application areas.
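The contrast between cutoff-based and dictionary-based n-gram filtering can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the whitespace tokenizer, the toy sentences, the cutoff value, and the small word list are hypothetical and are not taken from the paper or from Hascheck.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(sentences, n):
    """Count n-grams over lowercased, whitespace-split sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def filter_by_cutoff(counts, min_count):
    """Cutoff criterion: keep n-grams seen at least min_count times."""
    return {g: c for g, c in counts.items() if c >= min_count}


def filter_by_dictionary(counts, dictionary):
    """Dictionary criterion: keep n-grams whose tokens are all known words."""
    return {g: c for g, c in counts.items() if all(t in dictionary for t in g)}


# Hypothetical toy input; "dsn" stands in for a misspelling of "dan".
sentences = [
    "dobar dan svima",
    "dobar dan svima",
    "dobar dsn svima",
    "laku noć svima",
]
# Hypothetical word list standing in for a spellchecker dictionary.
dictionary = {"dobar", "dan", "svima", "laku", "noć"}

bigram_counts = count_ngrams(sentences, 2)
by_cutoff = filter_by_cutoff(bigram_counts, min_count=2)
by_dictionary = filter_by_dictionary(bigram_counts, dictionary)
```

In this toy case the cutoff keeps only the two bigrams that occur twice and discards the valid but rare "laku noć", while the dictionary criterion keeps every bigram made of known words regardless of frequency and discards only those containing the misspelling.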

Keywords

Croatian language; language modeling; lexical n-gram; n-gram count comparison; traffic modeling

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication details

Journal

Procedia computer science

Volume (issue)

35

Year

2014

Pages

427-436

Publication status

published

e-ISSN

1877-0509

DOI

10.1016/j.procs.2014.08.123

Related entities

Related persons

Gordan Gledec (CroRIS ID: 4564; MBZ: 226354) (author)

Mladen Sokele (CroRIS ID: 17909; MBZ: 89740) (author)

Šandor Dembitz (CroRIS ID: 25549; MBZ: 9263) (author)

Related institutions

Fakultet elektrotehnike i računarstva (036) (author's institution)

Field

Electrical Engineering, Computing, Information and Communication Sciences

Links

doi.org

sciencedirect.com

ac.els-cdn.com

Indexing

Web of Science Core Collection, SCI-Exp, SSCI & A&HCI (WoSCC-SCI-Exp, SSCI, A&HCI)