Data source: CROSBI

Non-Standard Words as Features for Text Categorization (CROSBI ID 611444)

Conference paper in proceedings | original scientific paper | international peer review

Beliga, Slobodan ; Martinčić-Ipšić, Sanda. Non-Standard Words as Features for Text Categorization // MIPRO-CIS / Ribarić, Slobodan ; Budin, Andrea (eds.). Opatija: Hrvatska udruga za informacijsku i komunikacijsku tehnologiju, elektroniku i mikroelektroniku - MIPRO, 2014. pp. 1415-1419

Responsibility details

Beliga, Slobodan ; Martinčić-Ipšić, Sanda

English

Non-Standard Words as Features for Text Categorization

This paper presents the categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-Standard Words include numbers, dates, acronyms, abbreviations, currency, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected to form the SKIPEZ collection with 6 classes: official, literary, informative, popular, educational and scientific. The text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second representation, statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; the third representation combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in the text categorization experiments. The best categorization results are achieved using the first feature set (NSW frequencies), with a categorization accuracy of 87%. This suggests that NSWs should be considered as features in highly inflectional languages such as Croatian. NSW-based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs should be considered for further Croatian text categorization experiments.
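
The three representations described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expressions are a simplified stand-in for the Croatian NSW taxonomy (which is not reproduced in this record), the function names are hypothetical, and the statistics are computed over the per-category counts of a single document using population variance, which may differ from how the paper defines them.

```python
import re
import statistics

# Hypothetical, simplified NSW categories. The order matters: more specific
# patterns run first so that a date is not also counted as a plain number.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    "currency": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:kn|EUR|USD)\b"),
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),
    "number": re.compile(r"\b\d+(?:[.,]\d+)?\b"),
}


def nsw_frequencies(text):
    """Representation 1 (sketch): count NSW occurrences per category."""
    counts = {}
    remaining = text
    for name, pattern in NSW_PATTERNS.items():
        # Blank out each match so later, more general patterns skip it.
        remaining, n_matches = pattern.subn(" ", remaining)
        counts[name] = n_matches
    return counts


def nsw_statistics(freqs):
    """Representation 2 (sketch): summary statistics of the NSW counts."""
    values = list(freqs.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; an assumption
    return {
        "variance": statistics.pvariance(values),
        "std": std,
        # Coefficient of variation is undefined for a zero mean; use 0.0.
        "coef_variation": std / mean if mean else 0.0,
    }


def combined_features(text):
    """Representation 3 (sketch): frequencies plus statistics."""
    freqs = nsw_frequencies(text)
    return {**freqs, **nsw_statistics(freqs)}
```

Calling `combined_features(document_text)` on each document would yield one feature dictionary per document, which could then be passed to any of the classifier families named in the abstract. Note how few features result: one per NSW category plus a handful of statistics, rather than one per word form, which is the dimensionality reduction the abstract refers to.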

text categorization; non-standard words; collection representation; features; accuracy

Contribution details

1415-1419.

2014.

published

Parent publication details

MIPRO-CIS

Ribarić, Slobodan ; Budin, Andrea

Opatija: Hrvatska udruga za informacijsku i komunikacijsku tehnologiju, elektroniku i mikroelektroniku - MIPRO

978-953-233-078-6

Conference details

MIPRO 2014

lecture

25.05.2014-29.05.2014

Opatija, Croatia

Research areas

Computing, Information and communication sciences