CroRIS - CROSBI

Data source: CROSBI

Keyword Extraction Based on Structural Properties of Language Complex Networks (CROSBI ID 430438)

Thesis | doctoral dissertation

Beliga, Slobodan. Keyword Extraction Based on Structural Properties of Language Complex Networks / Martinčić-Ipšić, Sanda (mentor); Rijeka, 2019

Responsibility details

Authors

Beliga, Slobodan

Mentors

Martinčić-Ipšić, Sanda

Basic information in the original language
Basic information in other languages

Language

English

Title

Keyword Extraction Based on Structural Properties of Language Complex Networks

Abstract

Automatic keyword extraction is the initial step in a number of systems for natural language processing (NLP), text mining (TM), and information retrieval (IR). Keywords concisely and compactly describe the subject of a text. The doctoral thesis examines the issues of automatic keyword extraction and proposes a new method for this challenge. The proposed method is a graph-based unsupervised method based on the structural properties of language complex networks. The thesis employs the standard methodology from the fields of IR and NLP in both the development and evaluation phases of the research. Within the method, new centrality measures for the keyword extraction task are proposed and tested. The first is selectivity, and the second is the generalized selectivity measure. The node selectivity value is calculated from a weighted network as the average weight distributed on the links of a single node. The Selectivity-Based Keyword Extraction (SBKE) method does not require external linguistic knowledge since it is derived purely from the network structure, making it suitable for use in different natural languages and in a multilingual scenario. The SBKE method consists of two steps: keyword candidate extraction (based on selectivity values) and keyword expansion to longer sequences of keyword candidates. The proposed SBKE method is tested for different natural languages (Croatian, English, Serbian and Italian) and for various domains (scientific publications in the field of mining and geology, essays and critiques in architecture and design, news from politics, sports, culture and economy, and technical texts from Wikipedia in the field of computer science). For the purposes of the thesis, new multilingual datasets are created. The datasets contain comparable texts that are suitable for keyword extraction in general, allowing evaluation in fully controlled conditions. Specifically, a bilingual Serbian-English dataset and a trilingual Croatian-English-Italian dataset are created.
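The selectivity definition above (average weight on the links of a single node) can be illustrated with a minimal Python sketch. Several things here are assumptions and are not taken from the thesis: the network is an undirected co-occurrence network in which link weight counts how often two words are adjacent, tokenization is a plain whitespace split, and the function names `build_cooccurrence_network` and `node_selectivity` are hypothetical. The thesis may define the network and the measure differently (for example with directed links), and the keyword expansion step is not shown.

```python
from collections import defaultdict


def build_cooccurrence_network(tokens):
    """Assumed construction: undirected links between adjacent words,
    weighted by the number of times the two words occur next to each other."""
    weights = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # ignore self-loops
            weights[frozenset((left, right))] += 1
    return weights


def node_selectivity(weights):
    """Selectivity of a node = its strength (sum of incident link weights)
    divided by its degree (number of incident links)."""
    strength = defaultdict(int)
    degree = defaultdict(int)
    for link, weight in weights.items():
        for node in link:
            strength[node] += weight
            degree[node] += 1
    return {node: strength[node] / degree[node] for node in degree}


# Toy input; a real pipeline would need proper tokenization and preprocessing.
tokens = "complex network analysis of complex network structure".split()
selectivity = node_selectivity(build_cooccurrence_network(tokens))

# Rank keyword candidates by descending selectivity.
ranked = sorted(selectivity, key=selectivity.get, reverse=True)
print(ranked[0], selectivity[ranked[0]])
```

In this toy input, "complex" has two links with total weight 3, so its selectivity is 1.5, which is higher than that of "network" (three links, total weight 4).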
The performance of the SBKE method is assessed empirically in terms of precision, recall, F1 and F2 scores, and area under the precision-recall curve. The evaluation according to the IIC (inter-indexer consistency) measure and adjusted Kappa statistics (Fleiss' and Gwet's coefficients) allows for assessing the consistency of the method with human annotators. The area under the precision-recall curve and the Kappa statistics (Fleiss' and Gwet's coefficients) are novel evaluation principles for keyword extraction tasks. It is experimentally confirmed that the method, by using knowledge from the network structure alone, without any additional external (linguistic or semantic) knowledge, can successfully extract keywords from text and comes close to the level of human keyword annotation. Additionally, it is confirmed that the novel selectivity measure is appropriate for the extraction and ranking of keywords. The proposed SBKE method demonstrates its potential for keyword extraction from texts in different domains, from individual documents or collections of documents, and for portability to new languages. The portability and low-cost feasibility of SBKE characterize the method as a highly desirable candidate for unsupervised automatic keyword extraction, especially in the absence of human-annotated resources, for under-resourced languages (lacking natural language processing resources and tools), or for a multilingual keyword extraction task.
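The precision, recall, F1 and F2 scores named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of the thesis: keywords are compared by exact string match over sets, the example keyword sets are invented for the demonstration, and the function names `precision_recall` and `f_beta` are hypothetical. The F-beta formula itself is the standard one, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted keywords against a gold set,
    assuming exact string matching."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 gives F1, beta=2 weights recall higher."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Invented example sets, for illustration only.
extracted = ["network", "selectivity", "keyword", "language"]
gold = ["keyword", "selectivity", "extraction"]

p, r = precision_recall(extracted, gold)
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because F2 weights recall more heavily than precision, it exceeds F1 whenever recall is the higher of the two, as in this example (F1 = 4/7, F2 = 5/8).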

Keywords

keyword extraction, Selectivity-based Keyword Extraction (SBKE) method, portability, node selectivity, generalized selectivity, language complex network, multilingual keyword extraction, trilingual KE dataset TriKEDS

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication details

Number of pages

158

Defense date

7 October 2019

Publication status

defended

Degree-granting institution details

Place

Rijeka

Related entities

Related persons

Slobodan Beliga (author)

Sanda Martinčić-Ipšić (mentor)

Related institutions

Sveučilište u Rijeci, Fakultet informatike i digitalnih tehnologija (318) (author's institution)

Field

Information and communication sciences, Computer science