CroRIS - CROSBI

Data source: CROSBI

Language identification: how to distinguish similar languages? (CROSBI ID 136992)

Journal contribution | original scientific paper

Ljubešić, Nikola ; Mikelić, Nives ; Boras, Damir. Language identification: how to distinguish similar languages? // ITI ..., 1 (2007), 541-546

Responsibility data

Authors

Ljubešić, Nikola ; Mikelić, Nives ; Boras, Damir

Basic data in the original language
Basic data in other languages

Language

English

Title

Language identification: how to distinguish similar languages?

Abstract

The goal of this paper is to discuss the problem of identifying Croatian, a language that even state-of-the-art language identification tools find hard to distinguish from similar languages such as Serbian, Slovenian or Slovak. We developed a tool that uses a list of the most frequent Croatian words, with a threshold that each document must satisfy; we then added a specific-character elimination rule, applied second-order Markov model classification, and added a forbidden-words rule. The resulting tool outperforms current tools in discriminating between these similar languages.
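The four components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word lists, the character set, the threshold value, the order in which the rules are applied, the add-one smoothing, and all names (`identify`, `CharTrigramModel`, `frequent_word_ratio`) are assumptions made for illustration. The second-order Markov model is taken here to be a character model in which each character is conditioned on the two preceding ones.

```python
import math
import re
from collections import Counter

# All lists below are hypothetical placeholders, not the lists used in the paper.
FREQUENT_WORDS = {"i", "je", "u", "se", "na", "da", "za", "su", "od", "što"}
FORBIDDEN_WORDS = {"hleb", "mleko"}  # assumed examples of non-Croatian word forms
FOREIGN_CHARS = set("äôľĺŕ")         # assumed examples of letters outside the Croatian alphabet


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequent_word_ratio(text, frequent_words):
    """Share of tokens that appear in the frequent-word list (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in frequent_words for t in tokens) / len(tokens)


class CharTrigramModel:
    """Second-order Markov model over characters with add-one smoothing (an assumption)."""

    def __init__(self, training_text):
        padded = "  " + training_text.lower()  # two spaces give the first character a context
        n = len(padded) - 2
        self.trigrams = Counter(padded[i:i + 3] for i in range(n))
        self.contexts = Counter(padded[i:i + 2] for i in range(n))
        self.alphabet_size = len(set(padded))

    def log_likelihood(self, text):
        """Sum of log P(c_i | c_{i-2}, c_{i-1}) over the characters of the text."""
        padded = "  " + text.lower()
        total = 0.0
        for i in range(len(padded) - 2):
            trigram = padded[i:i + 3]
            numerator = self.trigrams[trigram] + 1
            denominator = self.contexts[trigram[:2]] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


def identify(text, models, frequent_words=FREQUENT_WORDS,
             forbidden_words=FORBIDDEN_WORDS, foreign_chars=FOREIGN_CHARS,
             threshold=0.2, target="hr"):
    """Return a language label from `models`, or "other" if a rule rejects the text."""
    # Rule 1: the document must contain enough frequent words of the target language.
    if frequent_word_ratio(text, frequent_words) < threshold:
        return "other"
    # Rule 2: reject documents containing characters foreign to the target language.
    if any(ch in foreign_chars for ch in text.lower()):
        return "other"
    # Rule 3: pick the language whose character model gives the highest likelihood.
    best = max(models, key=lambda lang: models[lang].log_likelihood(text))
    # Rule 4: a forbidden word overrides a decision in favour of the target language.
    if best == target and any(t in forbidden_words for t in tokenize(text)):
        return "other"
    return best
```

In use, `models` would be a dictionary such as `{"hr": CharTrigramModel(croatian_text), "sr": CharTrigramModel(serbian_text)}` built from training corpora, which this sketch does not supply. The cheap rule-based filters run before and after the statistical model so that documents that clearly fail a rule are rejected regardless of how close the likelihoods are.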

Keywords

Written language identification; Croatian language; second-order Markov model; web-corpus; most frequent words method; forbidden words method

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication data

Journal

ITI ...

Volume (issue)

1

Year

2007

Pages

541-546

Publication status

published

ISSN

1330-1012

Related records

Related persons

Damir Boras (author)

Nives Mikelić Preradović (author)

Nikola Ljubešić (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Related projects

Hrvatska rječnička baština i hrvatski europski identitet (result of work on the project)

Field

Information and communication sciences
