Bibliographic record number: 69525
Problems of free text searching in flective languages, The Third International Conference: Information Technology and Journalism - Interactive publishing in Central Europe, Dubrovnik, Inter University Center, 25-29 May 1998 (invited lecture)
Problems of free text searching in flective languages, The Third International Conference: Information Technology and Journalism - Interactive publishing in Central Europe, Dubrovnik, Inter University Center, 25-29 May 1998 (invited lecture), 1998. (other).
CROSBI ID: 69525
Title
Problems of free text searching in flective languages, The Third International Conference: Information Technology and Journalism - Interactive publishing in Central Europe, Dubrovnik, Inter University Center, 25-29 May 1998 (invited lecture)
Authors
Boras, Damir; Lauc, Tomislava; Ristov, Strahil
Source
Problems of free text searching in flective languages, The Third International Conference: Informati
Type, subtype
Other types of work, other
Year
1998
Keywords
free text searching; flective languages
Abstract
There are three basic problems in preparing natural language text for information retrieval: 1. segmenting the text into smaller units (discourses or sentences); 2. word recognition; and 3. index preparation, where the index should be sufficiently compact and at the same time offer very fast access. All these seemingly trivial tasks can become extremely complex when they deal with a free word order (or flective) language such as Croatian.
The Croatian language has many specific features that make it impossible to use English-based algorithms in the processing of Croatian texts. Since it is a free word order language, it is very difficult to determine which elements of a sentence or discourse are connected to each other. To solve this problem, the Croatian text segmentation model was designed.
On the other hand, in flective languages every word has several word forms, and it is not always obvious to which basic word (or lemma) each token (or word form) belongs. In written Croatian, nouns can have up to ten, and verbs up to twenty, different forms. To solve the word-form recognition and lemmatization problems, several tools have been designed: a lexical database for standard Croatian, a corpus-based Croatian word tagging system, and a robust Croatian proper name recognition system.
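The mapping from word forms to lemmas can be illustrated with a minimal sketch. It assumes a tiny hand-made table of Croatian word forms; the table contents and the names `FORM_LEMMA_PAIRS`, `build_form_index`, and `lemmas_for` are hypothetical and are not the lexical database or tagging system mentioned in the abstract.

```python
from collections import defaultdict

# Hypothetical mini-lexicon of (word form, lemma) pairs, for illustration only.
FORM_LEMMA_PAIRS = [
    ("kuća", "kuća"),    # "house", nominative singular
    ("kuće", "kuća"),
    ("kući", "kuća"),
    ("kuću", "kuća"),
    ("kućom", "kuća"),
    ("kućama", "kuća"),
    ("sam", "biti"),     # "(I) am", a form of the verb "biti"
    ("sam", "sam"),      # "alone", an adjective with the same spelling
]


def build_form_index(pairs):
    """Map each lowercased word form to the set of lemmas it may belong to."""
    index = defaultdict(set)
    for form, lemma in pairs:
        index[form.lower()].add(lemma)
    return index


def lemmas_for(token, index):
    """Return all candidate lemmas for a token, sorted; empty list if unknown."""
    return sorted(index.get(token.lower(), set()))


FORM_INDEX = build_form_index(FORM_LEMMA_PAIRS)
```

A token such as "sam" returns more than one candidate lemma here, which is the kind of ambiguity a tagger would have to resolve from context.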
To prepare sufficiently compact but quickly searchable indexes, an original LZ (Lempel and Ziv) compression method on static tries for searchable data has been developed, which proved viable for very large sets of static natural language data of any sort.
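As a rough illustration of why tries suit inflected word forms, the following sketch builds a plain, uncompressed trie over a few forms of one noun. It does not implement the LZ compression described above; the names `TrieNode`, `build_trie`, `contains`, and `count_nodes`, and the sample word list, are assumptions made for this example.

```python
class TrieNode:
    """One trie node: child links keyed by character, plus an end-of-word flag."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Build a plain (uncompressed) trie from an iterable of words."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True
    return root


def contains(root, word):
    """Return True only if the exact word was inserted (a bare prefix is not enough)."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


def count_nodes(root):
    """Count all nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in root.children.values())


# Hypothetical sample: several forms of the noun "kuća" ("house").
SAMPLE_FORMS = ["kuća", "kuće", "kući", "kuću", "kućom", "kućama"]
SAMPLE_TRIE = build_trie(SAMPLE_FORMS)
```

Because the forms share the stem "kuć", the trie stores the common prefix once, and lookup time depends on the length of the word rather than on the number of stored forms.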
Original language
English
Scientific fields
Information and communication sciences