Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus

Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana

Data source: CROSBI

Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus (CROSBI ID 126863)

Journal article | original scientific paper | international peer review

Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information processing & management, 44 (2008), 1; 325-339. doi: 10.1016/j.ipm.2006.12.007

Responsibility details

Authors

Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana

Basic data in the original language
Basic data in other languages

Language

English

Title

Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus

Abstract

We investigate how, and to what extent, morphological complexity of the language influences text classification using Support Vector Machines (SVM). The Croatian-English parallel corpus provides the basis for direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance based on a series of parallel experiments on both languages, carried over a large scale of different feature subset sizes obtained by different feature selection methods, and applying different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments have shown that the improvements in SVM classifier performance are statistically significant; they are greater for small and medium number of features, especially for Croatian, whereas for large number of features the improvements are rather small and may be negligible in practice for both languages.

Keywords

text classification; SVM; Croatian; English; morphological normalisation; stemming; lemmatization; feature selection

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication details

Journal

Information processing & management

Volume (issue)

44 (1)

Year

2008

Pages

325-339

Publication status

published

ISSN

0306-4573

e-ISSN

1873-5371

DOI

10.1016/j.ipm.2006.12.007

Related records

Related persons

Mislav Malenica (author)

Tomislav Šmuc (author)

Jan Šnajder (author)

Bojana Dalbelo Bašić (author)

Related institutions

Fakultet elektrotehnike i računarstva (036) (author's institution)

Institut Ruđer Bošković (098) (author's institution)

Related projects

Otkrivanje znanja u tekstnim podacima [Knowledge Discovery in Textual Data] (result of work on the project)

Algoritmi strojnog učenja i njihova primjena [Machine Learning Algorithms and Their Application] (result of work on the project)

Field

Information and communication sciences, Computing

Links

doi.org

sciencedirect.com

Indexing

Scopus

Current Contents Connect (CCC)

Web of Science Core Collection, Science Citation Index Expanded (WoSCC-SCI-Exp)

Web of Science Core Collection, Social Science Citation Index (WoSCC-SSCI)

Web of Science Core Collection, SCI-Exp, SSCI & A&HCI (WoSCC-SCI-Exp, SSCI, A&HCI)