Classification and Information Extraction from Documents in the Domain of Culture (CROSBI ID 440652)
Thesis | master's thesis
Responsibility data
Petar Kristijan Bogović
Martinčić-Ipšić, Sanda
English
Classification and Information Extraction from Documents in the Domain of Culture
The main goal of this thesis is to develop procedures for the computational analysis of documents in the field of culture, cultural policies and activities. The collected documents first need to be preprocessed and prepared for further processing, for example by lemmatization, stemming, and other NLP procedures. Several NLP procedures will then be implemented: classification, automatic extraction of keywords and locations, and topic modeling. Automatic text classification will assign documents to predefined categories of cultural policy impacts on broader social aspects, using a standard bag-of-words model for document representation and machine learning algorithms such as Naive Bayes, Support Vector Machines and Random Forest. Automatic keyword and location extraction will be implemented using the MAUI keyword extraction method and Named Entity Recognition with available tools. Topic modeling will be performed using Latent Dirichlet Allocation (LDA) and evaluated by the coherence of the obtained topics.
Information extraction, classification, named entity recognition, topic modelling, keyphrase extraction, culture, policy, society
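The classification step named in the abstract can be illustrated with a minimal sketch, assuming a whitespace tokenizer, a multinomial Naive Bayes classifier with add-one smoothing, and two invented category labels ("economy", "society"). None of these details come from the thesis itself: the toy documents, the labels, and the function names bag_of_words, train_naive_bayes and predict are hypothetical, and the thesis may well have relied on an existing machine learning library instead.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens.

    Assumption: no lemmatization or stemming is applied in this sketch.
    """
    return Counter(text.lower().split())


def train_naive_bayes(docs, labels):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(bag_of_words(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word, count in bag_of_words(text).items():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
train_docs = [
    "museum exhibition attracts tourists and boosts local economy",
    "festival ticket sales increase city revenue",
    "theatre workshop strengthens community inclusion",
    "library program supports education and social inclusion",
]
train_labels = ["economy", "economy", "society", "society"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "festival boosts tourists revenue"))
print(predict(model, "community education inclusion"))
```

The same bag-of-words counts could be passed to a Support Vector Machine or Random Forest classifier; Naive Bayes is shown here only because it fits in a few lines without external dependencies.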
Publication data
56
30.03.2021.
defended
Data on the institution that awarded the academic degree
Rijeka