CroRIS - CROSBI

Data source: CROSBI

Different Text Representation Models (CROSBI ID 417375)

Thesis | Master's thesis

Miličić, Tanja. Different Text Representation Models / Martinčić-Ipšić, Sanda (mentor); Rijeka, 2017

Responsibility details

Authors

Miličić, Tanja

Mentors

Martinčić-Ipšić, Sanda

Basic data in the original language
Basic data in other languages

Language

English

Title

Different Text Representation Models

Abstract

Many successful applications, such as automatic document classification, information retrieval, speech recognition and many more, depend on statistical language models. This thesis focuses on the task of automatic document classification, and more specifically on exploring different statistical language models that can be used to extract features from documents. State-of-the-art methods for feature construction are based on bag-of-words models and are widely used despite their known weaknesses. Their popularity rests on their simplicity and often very high accuracy. With the development of technology and machine learning algorithms, we are now able to explore more complex methods for document representation. The goal of this thesis is to present different document representation models that have emerged in recent years and to explore whether the computational complexity of these models can be justified by the improvement in performance. Namely, state-of-the-art bag-of-words models are used as a baseline for comparison with word2vec/doc2vec models and models based on complex networks. While the bag-of-words models have been extensively studied in the context of document classification, the other two kinds of models are not yet well understood on the same task. The study measures the performance of classifiers trained with the random forest algorithm on features generated by the specified models tuned with different parameters. Results show that a low-dimensional doc2vec model is comparable with the traditional bag-of-words model. In addition, graph-based models that use the selectivity measure as a feature show improvements over the bag-of-words model on a dataset with a higher number of classes.
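The abstract names two of the feature types without defining them, so the following is only a minimal sketch of what they might look like, not the thesis's actual implementation. It assumes that a bag-of-words feature is a vector of term counts, and that selectivity is a node's strength (sum of edge weights) divided by its degree in a word co-occurrence network where adjacent words are linked. Both function names and the toy document are hypothetical.

```python
from collections import Counter, defaultdict


def bag_of_words(tokens, vocabulary):
    """Return term counts for `tokens`, ordered by `vocabulary`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def selectivity(tokens):
    """Return node strength divided by node degree for each word.

    Assumption (not stated in the abstract): edges link adjacent words,
    are undirected, and are weighted by how often the pair co-occurs.
    """
    weights = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops
            weights[frozenset((left, right))] += 1

    strength = defaultdict(int)  # sum of weights on a node's edges
    degree = defaultdict(int)    # number of distinct neighbours
    for edge, weight in weights.items():
        for node in edge:
            strength[node] += weight
            degree[node] += 1
    return {node: strength[node] / degree[node] for node in degree}


# Hypothetical toy document, for illustration only.
doc = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(doc))
bow = bag_of_words(doc, vocab)  # one count per vocabulary term
sel = selectivity(doc)          # one selectivity value per word
```

Either feature vector could then be passed to a random forest classifier, which is the learning algorithm the abstract reports using.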

Keywords

Document classification, complex networks, bag-of-words, neural network language models, word2vec, doc2vec

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication details

Number of pages

Defense date

16 November 2017

Publication status

defended

Degree-granting institution details

Place

Rijeka

Related entities

Related persons

Sanda Martinčić-Ipšić (mentor)

Related institutions

Sveučilište u Rijeci, Fakultet informatike i digitalnih tehnologija (318) (author's institution)

Field

Information and Communication Sciences, Computing