Classification and Information Extraction from Documents in the Domain of Culture

Petar Kristijan Bogović

Bibliographic record number: 1121912

Classification and Information Extraction from Documents in the Domain of Culture, 2021, master's thesis, graduate, Department of Informatics, Rijeka

CROSBI ID: 1121912

Title
Classification and Information Extraction from Documents in the Domain of Culture

Authors
Petar Kristijan Bogović

Type, subtype and category of work
Theses, master's thesis, graduate

Faculty
Department of Informatics

Place
Rijeka

Date
30.03

Year
2021

Pages
56

Mentor
Martinčić-Ipšić, Sanda

Keywords
Information extraction, classification, named entity recognition, topic modelling, keyphrase extraction, culture, policy, society

Abstract
The main goal of this thesis is to develop procedures for the computational analysis of documents in the field of culture, cultural policies and activities. The collected documents need to be preprocessed and prepared for further processing, e.g. by lemmatization, stemming, and other NLP procedures. In this thesis, several NLP procedures will be implemented: classification, automatic extraction of keywords and locations, and topic modeling. Automatic text classification will assign documents to predefined categories of cultural policy impacts on broader social aspects, using a standard bag-of-words model for document representation and machine learning algorithms such as Naive Bayes, Support Vector Machine and Random Forest. Automatic keyword and location extraction will be implemented using the MAUI keyword extraction method and Named Entity Recognition with available tools. Topic modeling will be performed using Latent Dirichlet Allocation (LDA) and evaluated using the coherence of the obtained topics.
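
As a rough illustration of the classification step described in the abstract, the following minimal sketch combines a bag-of-words representation with a multinomial Naive Bayes classifier, written with the Python standard library only. It is not the thesis implementation: the `NaiveBayes` class, the `tokenize` helper, the toy documents and the category names `health` and `urban` are all assumptions made for this example, and whitespace tokenization stands in for the lemmatization and stemming mentioned above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus summed log likelihoods of the known tokens.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for token in tokenize(doc):
                if token not in self.vocab:
                    continue  # tokens never seen in training are ignored
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data; texts and category names are invented for illustration.
train_docs = [
    "museum visits improve wellbeing and mental health",
    "art therapy supports patient health",
    "festival renews the old city district",
    "cultural centre drives urban renewal of the district",
]
train_labels = ["health", "health", "urban", "urban"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("theatre programme improves health of patients")
print(prediction)
```

In practice a library classifier would likely replace the hand-written class; the sketch only shows how word counts per category turn into a category decision.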

Original language
English

Scientific fields
Computer science, Information and communication sciences

RELATED RECORDS

Projects:
NadSve-Sveučilište u Rijeci-uniri-drustv-18-20 - Keyword extraction and text summarization based on language network representation-LangNet (LangNet) (Martinčić-Ipšić, Sanda, NadSve - Call for research support funding at the University of Rijeka for 2018 - projects of experienced scientists and artists) (CroRIS)
EK-H2020-870935 - Measuring the Social Dimension of Culture (MESOC) (EK - H2020-SC6-TRANSFORMATIONS-2019) (CroRIS)

Institutions:
Faculty of Informatics and Digital Technologies, Rijeka

Profiles:

Petar Kristijan Bogović (author)

Sanda Martinčić-Ipšić (mentor)

CROSBI Croatian Scientific Bibliography