Bibliographic record number: 396419
Exploring String and Word Kernels on Croatian-English Parallel Corpus
Exploring String and Word Kernels on Croatian-English Parallel Corpus // Intelligent Systems MIPRO 2009
Rijeka: Hrvatska udruga za informacijsku i komunikacijsku tehnologiju, elektroniku i mikroelektroniku - MIPRO, 2009, pp. 308-311 (lecture, international peer review, full paper (in extenso), scientific)
CROSBI ID: 396419
Title
Exploring String and Word Kernels on Croatian-English Parallel Corpus
Authors
Jonke, Zeno ; Šilić, Artur ; Dalbelo Bašić, Bojana
Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific
Source
Intelligent Systems MIPRO 2009
Rijeka: Hrvatska udruga za informacijsku i komunikacijsku tehnologiju, elektroniku i mikroelektroniku - MIPRO, 2009, pp. 308-311
Conference
International Conference MIPRO 2009
Place and date
Opatija, Croatia, 25-29 May 2009
Type of participation
Lecture
Type of review
International peer review
Keywords
word kernels; string kernels; text classification
Abstract
In this paper we investigate the classification performance of kernel-based document representations, as well as the influence of kernel parameters on text classification in two morphologically different languages. We explore and compare two kernel functions that work at different levels of a sentence. The first is the gap-weighted kernel, a member of the string kernel family that operates at the character level and thus compares text documents by subsequences of characters. This removes the need for stemming or lemmatisation, since it captures word stems automatically, which is very important when tools for stemming or lemmatisation are not available. The second is the word sequence kernel, an extension of string kernels that works at the word level. This approach provides a more natural representation of the text and has the advantage of reducing the size of the document representation, which in turn reduces computation time. The two methods are compared by exploring their parameter dependency and by measuring their classification performance on the Croatian-English parallel corpus.
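The abstract gives no formulas, so the following is only a minimal sketch of how a gap-weighted subsequence kernel is commonly defined in the string-kernel literature; it is not taken from the paper and may differ from the authors' implementation. The function names and the parameters `n` (subsequence length) and `lam` (gap decay factor) are assumptions made for illustration. Passing a string compares documents at the character level; passing a list of tokens gives a word-level variant in the spirit of the word sequence kernel.

```python
from functools import lru_cache
from math import sqrt


def gap_weighted_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel of order n with decay lam.

    s and t are sequences: strings (character level) or lists of
    word tokens (word level). Hypothetical helper, not from the paper.
    """
    s, t = tuple(s), tuple(t)

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary kernel on the prefixes s[:ls] and t[:lt]; it weights
        # partial matches by their distance to the end of each prefix.
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(i - 1, ls - 1, j) * lam ** (lt - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # Kernel value on the prefixes s[:ls] and t[:lt].
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(n - 1, ls - 1, j) * lam ** 2
        return total

    return k(len(s), len(t))


def normalized_kernel(s, t, n, lam):
    """Length-normalised kernel value; 0.0 if either self-kernel is 0."""
    denom = sqrt(gap_weighted_kernel(s, s, n, lam) * gap_weighted_kernel(t, t, n, lam))
    return gap_weighted_kernel(s, t, n, lam) / denom if denom else 0.0


# Character level: "cat" and "car" share only the subsequence "ca".
char_value = gap_weighted_kernel("cat", "car", n=2, lam=0.5)

# Word level: the same function applied to token lists.
word_value = gap_weighted_kernel("string kernels work".split(),
                                 "string kernels help".split(), n=2, lam=0.5)
```

In this sketch a smaller `lam` penalises gaps more heavily, and a larger `n` demands longer shared subsequences; these are the kind of kernel parameters whose influence the abstract refers to. The recursion is memoised for clarity rather than speed, so it suits short inputs only.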
Original language
English
Scientific fields
Computer science
AFFILIATIONS OF THE WORK
Projects:
036-1300646-1986 - Otkrivanje znanja u tekstnim podacima (Knowledge Discovery in Textual Data) (Dalbelo-Bašić, Bojana, MZO) (CroRIS)
Institutions:
Fakultet elektrotehnike i računarstva, Zagreb