Web indexing and search with local language support

Krstinić, Damir; Slapničar, Ivan
Web indexing and search with local language support // Proceedings of SoftCOM 2003 / D. Begušić, N. Rožić (eds.).
Split: FESB Split, 2003. pp. 488-492 (lecture, international peer review, full paper (in extenso), professional)

Papers in conference proceedings, full paper (in extenso), professional

Proceedings of SoftCOM 2003 / D. Begušić, N. Rožić - Split : FESB Split, 2003, 488-492

SoftCOM 2003.

Split, Croatia; Venice, Ancona, Italy, 07-10.10.2003

International peer review

WWW; Internet; information retrieval; text search; vector spaces; latent semantic indexing; LSI; singular value decomposition; SVD; grammar; web spider

Web search is becoming essential to everyday life, where a major need arises to extract relevant knowledge from the enormous amount of available data. In modern information retrieval systems, data is modeled as a term-by-document matrix. A user query is represented as a vector, and database search becomes a simple vector operation. The Latent Semantic Indexing (LSI) method reduces the size of the term-by-document matrix and improves the performance of the information retrieval system. The great majority of these systems are based on the English language. Although they are applicable to documents in other languages, they can suffer from incomplete term recognition. We focus on languages with a complex set of grammar rules, where improvement can be achieved by giving the indexing system basic knowledge of the language and the ability to recognize different forms of the same word. Using this technique, the original matrix can be reduced by an order of magnitude and important term-document connections strengthened. We are developing a web indexing engine with local language support using Ispell dictionary files. As part of this effort, Croatian language dictionary files have been developed.
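The effect described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' engine: the three toy documents, the function names, and the small BASE_FORMS table are assumptions made for illustration, and the hand-written table only stands in for a lookup that a real system would derive from Ispell dictionary and affix files. The sketch builds a term-by-document matrix, represents the query as a vector over the same terms, and ranks documents by cosine similarity. The LSI step (truncated SVD) is left out.

```python
import math
from collections import Counter

# Hypothetical table of inflected form -> base form (illustration only).
# A real engine would derive this from Ispell dictionary and affix files.
BASE_FORMS = {
    "knjige": "knjiga", "knjizi": "knjiga", "knjigu": "knjiga",
    "grada": "grad", "gradu": "grad", "gradom": "grad",
}

# Toy documents, assumed for this example.
DOCS = ["knjiga u gradu", "knjige grada", "more i sunce"]


def tokenize(text, use_base_forms):
    """Lowercase and split on whitespace; optionally map words to base forms."""
    words = text.lower().split()
    if use_base_forms:
        words = [BASE_FORMS.get(w, w) for w in words]
    return words


def build_matrix(docs, use_base_forms):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d, use_base_forms)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(docs, query, use_base_forms):
    """Score every document against the query vector, one score per document."""
    terms, matrix = build_matrix(docs, use_base_forms)
    q_counts = Counter(tokenize(query, use_base_forms))
    q = [q_counts[t] for t in terms]
    columns = zip(*matrix)  # each column is one document vector
    return [cosine(q, col) for col in columns]


if __name__ == "__main__":
    for flag in (False, True):
        terms, _ = build_matrix(DOCS, flag)
        print(flag, len(terms), search(DOCS, "knjizi", flag))
```

On this toy input the matrix has 8 rows without base forms and 6 with them. The query "knjizi" matches no document in the first case, because that exact form never occurs. With base forms it matches both documents that mention "knjiga" in some form. The reduction here is far smaller than the order of magnitude reported in the abstract, which refers to real document collections.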

