Retrieving Information in Croatian: Building a Simple and Efficient Rule-based Stemmer

Ljubešić, Nikola; Boras, Damir; Kubelka, Ozren

Bibliographic record no.: 348540

Retrieving Information in Croatian: Building a Simple and Efficient Rule-based Stemmer // 1st International Scientific Conference "The Future of Information Sciences" (INFuture 2007): Digital information and heritage: proceedings / Seljan, Sanja; Stančić, Hrvoje (eds.).
Zagreb: Odsjek za informacijske znanosti Filozofskog fakulteta, 2007, pp. 313-320 (poster, not peer-reviewed, full paper (in extenso), scientific)

CROSBI ID: 348540

Title
Retrieving Information in Croatian: Building a Simple and Efficient Rule-based Stemmer

Authors
Ljubešić, Nikola; Boras, Damir; Kubelka, Ozren

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
1st International Scientific Conference "The Future of Information Sciences" (INFuture 2007): Digital information and heritage: proceedings / Seljan, Sanja; Stančić, Hrvoje - Zagreb: Odsjek za informacijske znanosti Filozofskog fakulteta, 2007, 313-320

ISBN
978-953-175-305-0

Conference
International Scientific Conference "The Future of Information Sciences": Digital information and heritage (1; 2007)

Place and date
Zagreb, Croatia, 7-9 November 2007

Type of participation
Poster

Type of review
Not peer-reviewed

Keywords
Information retrieval; Croatian language; rule-based stemming; hill climbing optimization; industry awareness

Abstract
Because Croatian is a highly inflected language, natural-language information needs morphological normalization before it can be retrieved efficiently. Although this topic has been researched in Croatia for more than two decades, the vast majority of information systems that store text written in Croatian still have not solved this problem. The primary cause is the high price of existing systems. The aim of this paper is to analyze the current situation in the industry regarding this problem and to build a rule-based stemmer consisting of a minimal set of rules for expanding queries to the whole possible paradigm. Such a system could make expensive morphological databases obsolete in information retrieval. We used a corpus sample, a morphological lexicon and a query sample of the 1,000 most frequent nouns in base form to build a rule-based stemmer optimized with the steepest-ascent hill climbing algorithm. With this method we built a stemmer that performs almost as well as the noun lexicon, with F1 measures of 97.82% without the rules for adjectives and 97.64% with them.
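
As a rough illustration of the approach the abstract describes, the Python sketch below selects a subset of suffix-stripping rules by steepest-ascent hill climbing, scoring each candidate rule set by F1 against a gold lexicon. Everything in it is an assumption made for illustration: the toy lexicon and corpus, the candidate suffix list, the minimum stem length, and the function names stem, f1_score and hill_climb. This record does not describe the paper's actual rule format or data, and the sketch conflates words by stripped stem rather than expanding queries as the paper does.

```python
# Toy sketch only: all data, names and thresholds are hypothetical,
# not taken from the paper.

# Stand-in for a morphological lexicon: base form -> gold set of word forms.
LEXICON = {
    "kuća": {"kuća", "kuće", "kući", "kuću", "kućom", "kućama"},
    "grad": {"grad", "grada", "gradu", "gradom", "gradovi", "gradova"},
}

# Stand-in for a corpus sample: word types a query could match,
# including some unrelated words that a greedy rule would wrongly conflate.
CORPUS = set().union(*LEXICON.values()) | {"gradnja", "kućica", "gradski"}

# Candidate suffix rules the optimizer may switch on or off.
CANDIDATE_SUFFIXES = ["a", "e", "i", "u", "om", "ama", "ovi", "ova",
                      "nja", "ski", "ica"]


def stem(word, suffixes):
    """Strip the longest enabled suffix, keeping at least 3 characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def f1_score(suffixes):
    """Micro-averaged F1 of stem-based retrieval against the gold lexicon."""
    tp = fp = fn = 0
    for base, gold in LEXICON.items():
        target = stem(base, suffixes)
        retrieved = {w for w in CORPUS if stem(w, suffixes) == target}
        tp += len(retrieved & gold)
        fp += len(retrieved - gold)
        fn += len(gold - retrieved)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def hill_climb(candidates):
    """Steepest ascent: take the best single-rule toggle until none improves."""
    current = frozenset()
    best = f1_score(current)
    while True:
        # Neighbours differ from the current rule set by exactly one suffix.
        neighbours = [current ^ {s} for s in candidates]
        top = max(neighbours, key=f1_score)
        top_score = f1_score(top)
        if top_score <= best:
            return current, best
        current, best = top, top_score


if __name__ == "__main__":
    rules, score = hill_climb(CANDIDATE_SUFFIXES)
    print(sorted(rules), round(score, 4))
```

Each step evaluates every rule set that differs from the current one by a single suffix and moves to the best of them, stopping when no single change raises F1, so the result is a local optimum rather than a guaranteed global one.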

Original language
English

Scientific fields
Information and communication sciences

RELATED PROJECTS AND INSTITUTIONS

Projects:
130-1301679-1380 - Hrvatska rječnička baština i hrvatski europski identitet (Boras, Damir, MZOS) (CroRIS)

Institutions:
Filozofski fakultet, Zagreb

Profiles:

Nikola Ljubešić (author)

Ozren Kubelka (author)

Damir Boras (author)

CROSBI: Croatian Scientific Bibliography
