Bibliographic record number: 267479

Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus


Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana
Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information processing & management, 44 (2008), 1; 325-339 doi:10.1016/j.ipm.2006.12.007 (international peer review, article, scientific)


CROSBI ID: 267479

Title
Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus

Authors
Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana

Source
Information processing & management (0306-4573) 44 (2008), 1; 325-339

Type, subtype and category of work
Journal papers, article, scientific

Keywords
text classification; SVM; Croatian; English; morphological normalisation; stemming; lemmatization; feature selection

Abstract
We investigate how, and to what extent, the morphological complexity of a language influences text classification using Support Vector Machines (SVM). The Croatian-English parallel corpus provides the basis for a direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance in a series of parallel experiments on both languages, carried out over a wide range of feature subset sizes obtained by different feature selection methods and at different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments show that the improvements in SVM classifier performance are statistically significant; they are greater for small and medium numbers of features, especially for Croatian, whereas for large numbers of features the improvements are rather small and may be negligible in practice for both languages.
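
The link between morphological normalisation and feature space size mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the suffix list, the `toy_normalise` and `vocabulary` helpers, and the three tiny Croatian example documents are hypothetical and are not the normalisation procedure, corpus, or classifier setup used in the paper. The sketch only shows that collapsing inflected forms onto a shared stem reduces the number of distinct bag-of-words features.

```python
from collections import Counter

# Hypothetical toy suffix list for illustration only;
# this is NOT the morphological normalisation used in the paper.
SUFFIXES = ("ama", "ima", "om", "a", "e", "i", "u")


def toy_normalise(token, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def vocabulary(docs, normalise=None):
    """Count bag-of-words features, optionally normalising each token first."""
    counts = Counter()
    for doc in docs:
        for token in doc.lower().split():
            counts[normalise(token) if normalise else token] += 1
    return counts


# Made-up documents containing inflected forms of "knjiga" (book)
# and "student" (student).
docs = [
    "knjiga knjige knjigu",
    "knjigom knjigama student",
    "studenta studentu studenti",
]

raw = vocabulary(docs)
norm = vocabulary(docs, toy_normalise)

print(len(raw), "features without normalisation")
print(len(norm), "features with toy normalisation")
```

On this toy input the nine distinct word forms collapse to two stems. A real experiment would replace the toy stripper with a proper stemmer or lemmatiser and feed the resulting features to a feature selection step and an SVM classifier.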

Original language
English

Scientific fields
Computing, Information and communication sciences



RELATED RECORDS


Projects:
MZO-ZP-036-1300646-1986 - Otkrivanje znanja u tekstnim podacima [Knowledge discovery in textual data] (Dalbelo-Bašić, Bojana, MZO) (CroRIS)
MZOS-098-0982560-2563 - Algoritmi strojnog učenja i njihova primjena [Machine learning algorithms and their application] (Gamberger, Dragan, MZOS) (CroRIS)

Institutions:
Fakultet elektrotehnike i računarstva, Zagreb
Institut "Ruđer Bošković", Zagreb

Links to the full text of the paper:

doi:10.1016/j.ipm.2006.12.007 (www.sciencedirect.com)

Cite this publication:

Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana
Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information processing & management, 44 (2008), 1; 325-339 doi:10.1016/j.ipm.2006.12.007 (international peer review, article, scientific)
Malenica, M., Šmuc, T., Šnajder, J. & Dalbelo Bašić, B. (2008) Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus. Information processing & management, 44 (1), 325-339 doi:10.1016/j.ipm.2006.12.007.
@article{article,
  author = {Malenica, Mislav and \v{S}muc, Tomislav and \v{S}najder, Jan and Dalbelo Ba\v{s}i\'{c}, Bojana},
  title = {Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus},
  journal = {Information processing and management},
  year = {2008},
  volume = {44},
  number = {1},
  pages = {325-339},
  issn = {0306-4573},
  doi = {10.1016/j.ipm.2006.12.007},
  keywords = {text classification, SVM, Croatian, English, morphological normalisation, stemming, lemmatization, feature selection}
}

The journal is indexed in:


  • Current Contents Connect (CCC)
  • Web of Science Core Collection (WoSCC)
    • Science Citation Index Expanded (SCI-EXP)
    • Social Sciences Citation Index (SSCI)
    • SCI-EXP, SSCI and/or A&HCI
  • Scopus

