25 godina Hašeka (CROSBI ID 272258)
Journal contribution | original scientific paper
Authorship
Dembitz, Šandor
Croatian
25 godina Hašeka
Hašek is a Croatian on-line spellchecker that has operated continuously since March 21, 1994, and is now available at https://ispravi.me/. In its 25 years of operation, Hašek has processed nearly 30 million texts, which form a corpus of more than 7 billion tokens. By comparison, all books ever published in Croatian form a corpus of fewer than 20 billion tokens. As a WWW-embedded tool, Hašek has taken advantage of many web-based services, including learning. Thanks to Hašek’s learning capability, its dictionary has grown from an initial 100 thousand to more than 2 million word types. Another aspect of learning was the creation and regular updating of the Croatian n-gram system. Unlike Google, whose n-gram systems are based on the WaC (Web as Corpus) approach and cut-off criteria, Croatian n-grams were extracted from processed texts by a lexical criterion: each n-gram constituent must be confirmed by the spellchecker as valid in Croatian spelling. This difference in approach made the Croatian n-gram system comparable in size to the largest Google n-gram systems. Unfortunately, the advantages of on-line spellchecking for rapid breakthroughs into much more sophisticated areas of language technology were not recognized by Croatian decision makers, with some consequences mentioned in the paper.
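The lexical criterion described in the abstract can be illustrated with a minimal sketch. This is not Hašek's actual implementation: the function name `extract_ngrams`, the toy `lexicon`, and the simple regex tokenizer are all assumptions made for illustration. The sketch keeps an n-gram only if every constituent token passes a validity check, regardless of how often the n-gram occurs, whereas a cut-off approach would instead discard n-grams below a frequency threshold.

```python
import re
from collections import Counter


def extract_ngrams(text, is_valid_word, n=2):
    """Count n-grams whose every token passes the validity check.

    `is_valid_word` stands in for a spellchecker lookup (hypothetical).
    """
    # Naive tokenizer: lowercase runs of word characters (an assumption).
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        # Lexical criterion: keep the n-gram only if all constituents are valid.
        if all(is_valid_word(tok) for tok in gram):
            counts[gram] += 1
    return counts


# Toy lexicon standing in for the spellchecker dictionary (hypothetical).
lexicon = {"dobar", "dan", "svima"}
counts = extract_ngrams("Dobar dan svima, dobar dsn svima",
                        lexicon.__contains__, n=2)
```

In this toy input the misspelled token "dsn" disqualifies both bigrams that contain it, while every bigram built only from valid words is kept even though each occurs just once.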
Hašek, strojna provjera teksta, učenje, Google, n-gramski sustavi
not recorded
English
25 years of Hašek
not recorded
Hašek, spellchecking, learning, Google, n-gram systems
not recorded
Publication details
66 (4-5)
2019.
138-150
published
0021-6925
1849-174X