Bibliographic record no. 427399
String Distance-Based Stemming of the Highly Inflected Croatian Language
String Distance-Based Stemming of the Highly Inflected Croatian Language // Proceedings of Recent Advances in Natural Language Processing (RANLP-2009) / Angelova, Galia ; Bontcheva, Kalina ; Mitkov, Ruslan ; Nicolov, Nicolas ; Nikolov, Nikolai (eds.).
Shumen: Incoma, 2009. pp. 411-415 (poster, international peer review, full paper (in extenso), scientific)
CROSBI ID: 427399
Title
String Distance-Based Stemming of the Highly Inflected Croatian Language
Authors
Šnajder, Jan ; Dalbelo Bašić, Bojana
Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific
Source
Proceedings of Recent Advances in Natural Language Processing (RANLP-2009)
/ Angelova, Galia ; Bontcheva, Kalina ; Mitkov, Ruslan ; Nicolov, Nicolas ; Nikolov, Nikolai - Shumen : Incoma, 2009, 411-415
Conference
International Conference Recent Advances in Natural Language Processing'2009 (RANLP-2009)
Place and date
Borovets, Bulgaria, 14-16 September 2009
Type of participation
Poster
Type of review
International peer review
Keywords
Stemming; morphology; string distance; Croatian language
Abstract
Stemming refers to the grouping of morphologically related words into so-called stem classes for the purpose of improving information retrieval performance. Traditional approaches to stemming are language-specific and require a substantial amount of linguistic knowledge. A viable alternative is string distance-based stemming, in which stem classes are obtained by clustering word-forms from a corpus. In this paper, we apply string distance-based stemming to the highly inflected Croatian language using a number of string distance measures proposed in the literature. We focus on evaluating the stemming performance at both the inflectional and the derivational level, and investigate how this performance relates to the choice of the distance threshold value. Although our focus is on the Croatian language, we believe our results transfer well to languages of similar morphological complexity.
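The general idea described in the abstract, clustering word-forms by string distance and cutting at a threshold, can be illustrated with a minimal sketch. The abstract does not name the distance measures or the clustering procedure the authors used, so everything here is an assumption chosen for illustration: the bigram-based Dice distance, the single-link merging rule, the function names `bigrams`, `dice_distance` and `stem_classes`, the threshold values, and the six-word sample list. It is not the paper's implementation.

```python
from itertools import combinations


def bigrams(word):
    """Return the set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_distance(a, b):
    """1 - Dice coefficient over bigram sets (illustrative measure only)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Words shorter than two characters have no bigrams.
        return 0.0 if a == b else 1.0
    return 1.0 - 2.0 * len(x & y) / (len(x) + len(y))


def stem_classes(words, threshold):
    """Single-link clustering: join any two words within the threshold."""
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if dice_distance(a, b) <= threshold:
            parent[find(a)] = find(b)

    classes = {}
    for w in words:
        classes.setdefault(find(w), set()).add(w)
    return sorted(sorted(c) for c in classes.values())


# Hypothetical sample: inflected forms of "knjiga" (book) and "kuća" (house).
words = ["knjiga", "knjige", "knjigom", "kuća", "kuće", "kućom"]

print(stem_classes(words, threshold=0.5))
# [['knjiga', 'knjige', 'knjigom'], ['kuća', 'kuće', 'kućom']]
```

On this toy list the threshold behaves as the abstract suggests it should matter: a very low value (for example 0.1) leaves every word-form in its own class, while a very high value (for example 0.9) merges both lemmas into a single class because "knjigom" and "kućom" share the bigram "om".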
Original language
English
Scientific fields
Computing
RELATED TO THIS WORK
Projects:
036-1300646-1986 - Knowledge Discovery in Textual Data (Dalbelo-Bašić, Bojana, MZO) (CroRIS)
Institutions:
Faculty of Electrical Engineering and Computing, Zagreb