Bibliographic record number: 1026311

Keyword Extraction Based on Structural Properties of Language Complex Networks

Beliga, Slobodan
Keyword Extraction Based on Structural Properties of Language Complex Networks, 2019, doctoral dissertation, Department of Informatics, Rijeka

Type, subtype and category of work
Theses, doctoral dissertation

Department of Informatics

Martinčić-Ipšić, Sanda

Keywords
Keyword extraction, Selectivity-based Keyword Extraction (SBKE) method, portability, node selectivity, generalized selectivity, language complex network, multilingual keyword extraction, trilingual KE dataset TriKEDS

Automatic keyword extraction is the initial step in a number of systems for natural language processing (NLP), text mining (TM), and information retrieval (IR). Keywords concisely and compactly describe the subject of a text. The doctoral thesis examines the problem of automatic keyword extraction and proposes a new method for it. The proposed method is a graph-based, unsupervised method built on the structural properties of language complex networks. The thesis employs the standard methodology from the fields of IR and NLP in both the development and the evaluation phases of the research. Within the method, two new centrality measures for the keyword extraction task are proposed and tested: selectivity and generalized selectivity. The node selectivity value is calculated from a weighted network as the average weight distributed on the links of a single node. The Selectivity-Based Keyword Extraction (SBKE) method does not require external linguistic knowledge, since it is derived purely from the network structure, which makes it suitable for use in different natural languages and in a multilingual scenario. The SBKE method consists of two steps: keyword candidate extraction (based on selectivity values) and keyword expansion to longer sequences of keyword candidates. The proposed SBKE method is tested on different natural languages (Croatian, English, Serbian and Italian) and on various domains (scientific publications in the field of mining and geology, essays and critiques in architecture and design, news from politics, sports, culture and economy, and technical texts from Wikipedia in the field of computer science). For the purposes of the thesis, new multilingual datasets are created. The datasets contain comparable texts that are suitable for keyword extraction in general, allowing evaluation under fully controlled conditions. Specifically, a bilingual Serbian-English dataset and a trilingual Croatian-English-Italian dataset are created.
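The node selectivity described above (average link weight of a node, that is, node strength divided by node degree) can be illustrated with a minimal sketch. This is not the thesis implementation: the undirected network, the adjacent-word co-occurrence rule, the whitespace tokenization, and the function names `build_cooccurrence_network` and `node_selectivity` are all assumptions made for illustration, and the keyword expansion step is not covered.

```python
from collections import defaultdict


def build_cooccurrence_network(tokens):
    """Build an undirected weighted network (assumed construction):
    an edge links two adjacent tokens, and its weight counts how
    often that pair occurs side by side."""
    weights = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops (an assumption of this sketch)
            weights[frozenset((left, right))] += 1
    return weights


def node_selectivity(weights):
    """Selectivity = node strength / node degree, i.e. the average
    weight on the links of a single node."""
    strength = defaultdict(int)
    degree = defaultdict(int)
    for edge, weight in weights.items():
        for node in edge:
            strength[node] += weight
            degree[node] += 1
    return {node: strength[node] / degree[node] for node in strength}


# Toy input, not taken from the thesis datasets.
tokens = "complex network analysis of complex network structure".split()
selectivity = node_selectivity(build_cooccurrence_network(tokens))

# Rank keyword candidates by selectivity (ties broken alphabetically).
ranked = sorted(selectivity.items(), key=lambda kv: (-kv[1], kv[0]))
```

In this toy example, "complex" has two links with weights 2 and 1, so its selectivity is 1.5, the highest in the network.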
The performance of the SBKE method is assessed empirically in terms of precision, recall, F1 and F2 scores, and the area under the precision-recall curve. Evaluation with the inter-indexer consistency (IIC) measure and adjusted Kappa statistics (Fleiss' and Gwet's coefficients) makes it possible to assess the consistency of the method with human annotators. The area under the precision-recall curve and the Kappa statistics (Fleiss' and Gwet's coefficients) are novel evaluation measures for the keyword extraction task. It is experimentally confirmed that the method, using only knowledge from the network structure and no additional external (linguistic or semantic) knowledge, can successfully extract keywords from text, and that its output is close to the level of human keyword annotations. Additionally, it is confirmed that the novel selectivity measure is appropriate for the extraction and ranking of keywords. The proposed SBKE method demonstrates its potential for keyword extraction from texts in different domains, from individual documents or collections of documents, and for portability to new languages. The portability and low-cost feasibility of SBKE make the method a highly desirable candidate for unsupervised automatic keyword extraction, especially in the absence of human-annotated resources, for under-resourced languages (lacking natural language processing resources and tools), or for a multilingual keyword extraction task.
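As a rough illustration of the precision, recall, F1 and F2 scores mentioned above, the sketch below compares a set of extracted keywords with a set of human-assigned keywords. Exact string matching on sets, the function name `precision_recall_fbeta`, and the sample keyword sets are assumptions for this example; the matching rules used in the thesis may differ.

```python
def precision_recall_fbeta(extracted, gold, beta=1.0):
    """Return (precision, recall, F-beta) for two keyword collections,
    assuming exact-match comparison (an assumption of this sketch)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # F-beta weights recall beta times as much as precision.
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical keyword sets, not results from the thesis.
extracted = {"network", "selectivity", "keyword", "text"}
gold = {"network", "selectivity", "extraction"}

p, r, f1 = precision_recall_fbeta(extracted, gold, beta=1.0)
_, _, f2 = precision_recall_fbeta(extracted, gold, beta=2.0)
```

Here two of the four extracted keywords match the three gold keywords, so precision is 0.5 and recall is 2/3; F2 comes out higher than F1 because it weights recall more heavily.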

Original language

Scientific fields
Computing, Information and Communication Sciences


University of Rijeka - Department of Informatics

Author with researcher ID number:
Slobodan Beliga, (346100)