
Bibliographic record number: 69525

Problems of free text searching in flective languages, The Third International Conference: Information Technology and Journalism - Interactive publishing in Central Europe, Dubrovnik, Inter University Center, 25-29 May 1998 (invited lecture)


Boras, Damir; Lauc, Tomislava; Ristov, Strahil
Problems of free text searching in flective languages, The Third International Conference: Information Technology and Journalism - Interactive publishing in Central Europe, Dubrovnik, Inter University Center, 25-29 May 1998, 1998 (invited lecture).


Title
Problems of free text searching in flective languages, The Third International Conference: Information Technology and Journalism - Interactive publishing in Central Europe, Dubrovnik, Inter University Center, 25-29 May 1998 (invited lecture)

Authors
Boras, Damir; Lauc, Tomislava; Ristov, Strahil

Source
Problems of free text searching in flective languages, The Third International Conference: Informati

Type, subtype
Other types of work, invited lecture

Year
1998

Keywords
Free text searching; flective languages

Abstract
Preparing natural-language text for information retrieval involves three basic problems: (1) segmenting the text into smaller units (discourses or sentences); (2) word recognition; and (3) preparing an index that is sufficiently compact and at the same time offers very fast access. These seemingly trivial tasks can become extremely complex in a free-word-order (or flective) language such as Croatian. Croatian has many specific features that make it impossible to use English-based algorithms for processing Croatian texts. Because it is a free-word-order language, it is very difficult to determine which elements of a sentence or discourse are connected to each other. To solve this problem, a Croatian text segmentation model was designed. On the other hand, in flective languages every word has several word forms, and it is not always obvious to which base word (lemma) a given token (word form) belongs. In written Croatian, nouns can have up to ten and verbs up to twenty different forms. To solve the word-form recognition and lemmatization problems, several tools were designed: a lexical database for standard Croatian, a corpus-based Croatian word-tagging system, and a robust Croatian proper-name recognition system. To prepare indexes that are compact yet fast to search, an original LZ (Lempel and Ziv) compression method on static tries for searchable data was developed, which proved viable for very large sets of static natural-language data of any sort.
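
As a rough illustration of the word-form problem described in the abstract, the sketch below stores a few inflected forms in a plain character trie and looks up their lemma. It is not the authors' method: the trie is uncompressed (no LZ step), the class and method names are hypothetical, the toy lexicon is invented for the example, and each form maps to a single lemma, whereas real Croatian data would have to handle forms that belong to more than one lemma.

```python
from typing import Dict, Optional


class TrieNode:
    """One node of a character trie; `lemma` is set only where a word form ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.lemma: Optional[str] = None


class FormIndex:
    """Plain, uncompressed trie mapping inflected word forms to one lemma (assumption)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, form: str, lemma: str) -> None:
        # Walk the trie character by character, creating missing nodes.
        node = self.root
        for ch in form.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.lemma = lemma

    def lemma_of(self, form: str) -> Optional[str]:
        # Return the stored lemma, or None if the form is not in the index.
        node = self.root
        for ch in form.lower():
            node = node.children.get(ch)
            if node is None:
                return None
        return node.lemma


# Hypothetical toy lexicon: some case forms of the Croatian noun "knjiga" (book).
index = FormIndex()
for form in ["knjiga", "knjige", "knjizi", "knjigu", "knjigom", "knjigama"]:
    index.add(form, "knjiga")
```

Forms that share a prefix ("knjig...") share trie nodes, which is one reason tries suit inflected vocabularies; a bare prefix such as "knj" is not a stored form, so looking it up returns `None`.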

Original language
English

Scientific fields
Information and communication sciences



WORK AFFILIATIONS


Project / topic
130743

Institutions
Filozofski fakultet, Zagreb