Bibliographic record no. 716944

An economic approach to big data in a minority language


Dembitz, Šandor; Gledec, Gordan; Sokele, Mladen
An economic approach to big data in a minority language // Procedia Computer Science, 35 (2014), 427-436 doi:10.1016/j.procs.2014.08.123 (international peer review, article, scientific)


CROSBI ID: 716944

Title
An economic approach to big data in a minority language

Authors
Dembitz, Šandor ; Gledec, Gordan ; Sokele, Mladen

Source
Procedia Computer Science (1877-0509) 35 (2014); 427-436

Type, subtype and category of work
Journal papers, article, scientific

Keywords
Croatian language; language modeling; lexical n-gram; n-gram count comparison; traffic modeling

Abstract
Google's n-gram project recently brought the benefits of big data to several major world languages, such as English and Chinese. Any attempt to derive such systems for the world's minority languages in the same manner, with the aim of accelerating the development of NLP applications, encounters many obstacles. This paper presents an innovative and economic approach to large-scale n-gram system creation, applied to the case of the Croatian language. Instead of using the Web as the world's biggest text repository, our n-gram collection process relies on the Croatian academic online spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. The service has already processed a corpus larger than the Croatian web corpus created in recent years. In contrast to the Google n-gram systems, where cutoff criteria were applied, our n-gram filtering is based on dictionary criteria. This resulted in a system comparable in size to the largest n-gram systems of today. Because it relies on a service in constant use, the Croatian n-gram system is dynamic, which makes it unique among the systems compared. The paper also exemplifies the importance of an n-gram infrastructure for rapid breakthroughs in new application areas.
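
The contrast the abstract draws between cutoff criteria and dictionary criteria can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy token stream and the three-word dictionary below are all hypothetical, and the real Hascheck pipeline is not described in this record. The sketch only shows the general idea that a cutoff filter discards rare n-grams, whereas a dictionary filter keeps every n-gram whose words are all known, regardless of frequency.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cutoff_filter(counts, min_count):
    """Frequency-based filtering: keep n-grams seen at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


def dictionary_filter(counts, dictionary):
    """Dictionary-based filtering: keep n-grams whose every word is in the dictionary."""
    return {gram: c for gram, c in counts.items()
            if all(word in dictionary for word in gram)}


# Hypothetical toy data: "xyzq" stands in for a misspelled or unknown token.
tokens = "dobar dan svima dobar dan xyzq".split()
dictionary = {"dobar", "dan", "svima"}

bigrams = count_ngrams(tokens, 2)
by_cutoff = cutoff_filter(bigrams, min_count=2)
by_dictionary = dictionary_filter(bigrams, dictionary)
```

On this toy input the cutoff filter keeps only the repeated bigram ("dobar", "dan"), while the dictionary filter also keeps the bigrams that occur once and drops only the one containing the unknown token. That difference in behaviour is one plausible reason a dictionary-filtered system can stay large even when built from a smaller corpus.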

Original language
English

Scientific fields
Electrical Engineering, Computing, Information and Communication Sciences



RELATED ENTITIES


Institutions:
Fakultet elektrotehnike i računarstva (Faculty of Electrical Engineering and Computing), Zagreb

Profiles:

Mladen Sokele (author)

Šandor Dembitz (author)

Gordan Gledec (author)

Cite this publication:

Dembitz, Šandor; Gledec, Gordan; Sokele, Mladen
An economic approach to big data in a minority language // Procedia Computer Science, 35 (2014), 427-436 doi:10.1016/j.procs.2014.08.123 (international peer review, article, scientific)
Dembitz, Š., Gledec, G. & Sokele, M. (2014) An economic approach to big data in a minority language. Procedia Computer Science, 35, 427-436 doi:10.1016/j.procs.2014.08.123.
@article{article,
  author   = {Dembitz, \v{S}andor and Gledec, Gordan and Sokele, Mladen},
  title    = {An economic approach to big data in a minority language},
  journal  = {Procedia Computer Science},
  volume   = {35},
  year     = {2014},
  pages    = {427-436},
  issn     = {1877-0509},
  doi      = {10.1016/j.procs.2014.08.123},
  keywords = {Croatian language, language modeling, lexical n-gram, n-gram count comparison, traffic modeling}
}

Journal indexed in:


  • Web of Science Core Collection (WoSCC)
    • SCI-EXP, SSCI and/or A&HCI

