Dynamic N-Gram System Based on an Online Croatian Spellchecking Service

Gledec, Gordan; Šoić, Renato; Dembitz, Šandor

Data source: CROSBI

Dynamic N-Gram System Based on an Online Croatian Spellchecking Service (CROSBI ID 270445)

Journal article | original scientific paper | international peer review

Gledec, Gordan; Šoić, Renato; Dembitz, Šandor. Dynamic N-Gram System Based on an Online Croatian Spellchecking Service // IEEE Access, 7 (2019), 149988-149995. doi: 10.1109/access.2019.2947898

Responsibility details

Authors

Gledec, Gordan; Šoić, Renato; Dembitz, Šandor

Basic information in the original language
Basic information in other languages

Language

English

Title

Dynamic N-Gram System Based on an Online Croatian Spellchecking Service

Abstract

As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for world minority languages as it was done in the Google n-gram project leads to many obstacles. This paper presents an innovative approach to large-scale n-gram system creation applied to the Croatian language. Instead of using the Web as the world’s largest text repository, our process of n-gram collection relies on the Croatian online academic spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, contrary to the publicly available Google n-gram systems in which cutoff criteria were applied. After 12 years of collecting, the size of the Croatian n-gram system reached the size of the largest Google Version 1 n-gram systems. Due to reliance on a service in constant use, the Croatian n-gram system is a dynamic one. System dynamics allowed modeling of n-gram count behavior through Heaps’ law, which led to interesting results. Like many minority languages, the Croatian language suffers from a lack of sophisticated language processing systems in many application areas. The importance of a rich lexical n-gram infrastructure for rapid breakthroughs in new application areas is also exemplified in the paper.
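The abstract mentions modeling n-gram count behavior through Heaps' law. As a rough illustration only, Heaps' law is commonly written as V = K * N^beta, where N is the number of tokens processed and V is the number of distinct items observed. The sketch below fits K and beta by least squares in log-log space. The function name `fit_heaps` and the snapshot values are hypothetical and synthetic; this record does not state how the paper performed its fit, and no figures here come from the paper.

```python
import math


def fit_heaps(total_counts, distinct_counts):
    """Fit V = K * N**beta by ordinary least squares on (log N, log V)."""
    xs = [math.log(n) for n in total_counts]
    ys = [math.log(v) for v in distinct_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope of the regression line in log-log space is the exponent beta.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    # The intercept is log K, so exponentiate to recover K.
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Synthetic snapshots (assumption, not data from the paper): distinct n-gram
# counts generated exactly from V = 50 * N**0.6 at four collection sizes.
snapshots_total = [10**6, 10**7, 10**8, 10**9]
snapshots_distinct = [50.0 * n ** 0.6 for n in snapshots_total]

k, beta = fit_heaps(snapshots_total, snapshots_distinct)
print(f"K = {k:.3f}, beta = {beta:.3f}")
```

In a dynamic system of the kind described, each periodic snapshot of the database would contribute one (N, V) pair, so the fit can be repeated as collection continues.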

Keywords

Croatian language, Heaps’ law, language modeling, lexical n-gram, n-gram system comparison

Note

Not recorded

Language

Not recorded

Title

Not recorded

Abstract

Not recorded

Keywords

Not recorded

Note

Not recorded

Publication details

Journal

IEEE Access

Volume (issue)

7

Year

2019

Pages

149988-149995

Publication status

Published

e-ISSN

2169-3536

DOI

10.1109/access.2019.2947898

Open-access publication cost

APC

1,487.00 USD

Related entities

Related persons

Gordan Gledec (author)

Šandor Dembitz (author)

Renato Šoić (author)

Related institutions

Fakultet elektrotehnike i računarstva (036) (author's institution)

Sveučilište u Zagrebu (312) (author's institution)

Field

Electrical engineering, Information and communication sciences, Computing

Links

doi.org

researchgate.net

ieeexplore.ieee.org

Indexing

Scopus

Current Contents Connect (CCC)

Web of Science Core Collection, Science Citation Index Expanded (WoSCC-SCI-Exp)

Web of Science Core Collection, SCI-Exp, SSCI & A&HCI (WoSCC-SCI-Exp, SSCI, A&HCI)