Bibliographic record number: 987762

The Influence of Feature Representation of Text on the Performance of Document Classification


Martinčić-Ipšić, Sanda; Miličić, Tanja; Todorovski, Ljupčo
The Influence of Feature Representation of Text on the Performance of Document Classification // Applied Sciences-Basel, 9 (2019), 4; 743-770 doi:10.3390/app9040743 (international peer review, article, scientific)


CROSBI ID: 987762

Title
The Influence of Feature Representation of Text on the Performance of Document Classification

Authors
Martinčić-Ipšić, Sanda ; Miličić, Tanja ; Todorovski, Ljupčo

Source
Applied Sciences-Basel (2076-3417) 9 (2019), 4; 743-770

Type, subtype and category of work
Journal papers, article, scientific

Keywords
document classification ; bag-of-words ; word2vec ; doc2vec ; graph-of-words ; complex networks

Abstract
In this paper, we perform a comparative analysis of three models for the feature representation of text documents in the context of document classification. In particular, we consider the most commonly used family of bag-of-words models, the recently proposed continuous-space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-words models have been extensively used for the document classification task, the performance of the other two models for the same task has not been well understood. This is especially true for the network-based models, which have rarely been considered for the representation of text documents for classification. In this study, we measure the performance of document classifiers trained using the method of random forests on features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. The results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to the one obtained by the emerging continuous-space model of doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. Finally, the results point out that doc2vec shows superior performance in the tasks of classifying large documents.
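
As a rough illustration of the bag-of-words representation mentioned in the abstract, the sketch below builds term-count feature vectors in plain Python. The function names `tokenize` and `bag_of_words`, the lowercase alphabetic tokenisation, and the toy documents are assumptions made for illustration only. They are not taken from the paper, which may use different preprocessing and term weighting.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one term-count vector per document."""
    token_lists = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical toy corpus, used only to show the shape of the output.
docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)
```

Each document becomes a fixed-length vector whose length equals the vocabulary size. Vectors of this kind could then be given to a classifier such as a random forest, a step that is omitted here.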

Original language
English

Scientific fields
Computing, Information and communication sciences



WORK AFFILIATIONS


Institutions:
Fakultet informatike i digitalnih tehnologija, Rijeka

Links to the full text of the work:

doi www.mdpi.com

Cite this publication:

Martinčić-Ipšić, Sanda; Miličić, Tanja; Todorovski, Ljupčo
The Influence of Feature Representation of Text on the Performance of Document Classification // Applied Sciences-Basel, 9 (2019), 4; 743-770 doi:10.3390/app9040743 (international peer review, article, scientific)
Martinčić-Ipšić, S., Miličić, T. & Todorovski, L. (2019) The Influence of Feature Representation of Text on the Performance of Document Classification. Applied Sciences-Basel, 9 (4), 743-770 doi:10.3390/app9040743.
@article{article, author = {Martin\v{c}i\'{c}-Ip\v{s}i\'{c}, Sanda and Mili\v{c}i\'{c}, Tanja and Todorovski, Ljup\v{c}o}, year = {2019}, pages = {743-770}, DOI = {10.3390/app9040743}, keywords = {document classification, bag-of-words, word2vec, doc2vec, graph-of-words, complex networks}, journal = {Applied Sciences-Basel}, doi = {10.3390/app9040743}, volume = {9}, number = {4}, issn = {2076-3417}, title = {The Influence of Feature Representation of Text on the Performance of Document Classification}, keyword = {document classification, bag-of-words, word2vec, doc2vec, graph-of-words, complex networks} }

The journal is indexed in:


  • Current Contents Connect (CCC)
  • Web of Science Core Collection (WoSCC)
    • Science Citation Index Expanded (SCI-EXP)
    • SCI-EXP, SSCI and/or A&HCI
  • Scopus

