Classification of Large-Scale Biological Annotations Using Word Embeddings Derived from Corpora of Biomedical Research Literature

Baćac, Adriano

Data source: CROSBI

Classification of Large-Scale Biological Annotations Using Word Embeddings Derived from Corpora of Biomedical Research Literature (CROSBI ID 411200)

Thesis | Master's thesis

Baćac, Adriano. Classification of Large-Scale Biological Annotations Using Word Embeddings Derived from Corpora of Biomedical Research Literature / Šikić, Mile (mentor). Zagreb: Fakultet elektrotehnike i računarstva, 2017.

Responsibility

Authors

Baćac, Adriano

Mentors

Šikić, Mile

Basic data in the original language
Basic data in other languages

Language

English

Title

Classification of Large-Scale Biological Annotations Using Word Embeddings Derived from Corpora of Biomedical Research Literature

Abstract

Custom Word2vec and GloVe embeddings were trained for scientific literature in the biomedical domain, along with three classification methods for discriminating phenotype traits: two based on aggregating word embeddings and one based on recurrent neural networks. The embeddings were trained on a large corpus of scientific articles and on its more subject-specific subsets. Classification performance was tested on six document sources. Word2vec performed better when trained on a subject-specific subset comprising 4.9% of the articles than when trained on the entire corpus. The recurrent neural network models overfitted, possibly because the documents were too long or the training set was too small. Although the proposed models did not outperform a support vector machine using bag-of-words features, using the aggregation methods alongside this baseline increased the number of correctly classified minority-class examples by around 10% for some phenotype traits.
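
The abstract does not state how the word embeddings were aggregated into document representations. As a minimal sketch, assuming simple mean pooling (one common choice, not necessarily the method used in the thesis), a document vector could be built as follows. The function name `average_embedding` and the toy vectors are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Sequence


def average_embedding(tokens: Sequence[str],
                      vectors: Dict[str, List[float]],
                      dim: int) -> List[float]:
    """Mean-pool the vectors of in-vocabulary tokens into one document vector."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        for i, x in enumerate(vec):
            total[i] += x
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is in the vocabulary
    return [x / count for x in total]


# Toy 2-dimensional vectors; hypothetical values, not trained embeddings.
toy_vectors = {
    "aerobic": [1.0, 0.0],
    "bacterium": [0.0, 1.0],
    "motile": [1.0, 1.0],
}

# "isolated" is out of vocabulary and is ignored by the pooling step.
doc = "aerobic motile bacterium isolated".split()
doc_vector = average_embedding(doc, toy_vectors, dim=2)
```

A fixed-length vector of this kind could then be passed to any standard classifier, either alone or alongside bag-of-words features.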

Keywords

word embedding, Word2vec, GloVe, RNN, LSTM, phenotype classification, corpus specificity

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication data

Number of pages

Defense date

10 July 2017

Publication status

defended

Degree-granting institution

Institution / Organization

Fakultet elektrotehnike i računarstva

Place

Zagreb

Related entities

Related persons

Mile Šikić (mentor)

Related institutions

Fakultet elektrotehnike i računarstva (036) (author's institution)

Field

Computer science