Bibliographic record number: 1026311
Keyword Extraction Based on Structural Properties of Language Complex Networks
Keyword Extraction Based on Structural Properties of Language Complex Networks, 2019, doctoral dissertation, Department of Informatics, Rijeka
CROSBI ID: 1026311
Title
Keyword Extraction Based on Structural Properties of Language Complex Networks
Authors
Beliga, Slobodan
Type, subtype and category of work
Theses, doctoral dissertation
Faculty
Department of Informatics
Place
Rijeka
Date
07.10
Year
2019
Pages
158
Mentor
Martinčić-Ipšić, Sanda
Keywords
keyword extraction, Selectivity-based Keyword Extraction (SBKE) method, portability, node selectivity, generalized selectivity, language complex network, multilingual keyword extraction, trilingual KE dataset TriKEDS
Abstract
The automatic keyword extraction task is the initial step in a number of systems for natural language processing (NLP), text mining (TM), and information retrieval (IR). Keywords concisely and compactly describe the subject of a text. This doctoral thesis examines the problem of automatic keyword extraction and proposes a new method for it. The proposed method is graph-based and unsupervised, and relies on the structural properties of language complex networks. The thesis employs the standard methodology from the fields of IR and NLP in both the development and evaluation phases of the research. Within the method, two new centrality measures for the keyword extraction task are proposed and tested: selectivity and generalized selectivity. The node selectivity value is calculated from a weighted network as the average weight distributed on the links of a single node. The Selectivity-Based Keyword Extraction (SBKE) method does not require external linguistic knowledge, since it is derived purely from the network structure, which makes it suitable for different natural languages and for multilingual scenarios. The SBKE method consists of two steps: keyword candidate extraction (based on selectivity values) and keyword expansion to longer sequences of keyword candidates. The proposed SBKE method is tested on different natural languages (Croatian, English, Serbian and Italian) and on various domains (scientific publications in the field of mining and geology, essays and critiques in architecture and design, news from politics, sports, culture and economy, and technical texts from Wikipedia in the field of computer science). For the purposes of the thesis, new multilingual datasets are created. The datasets contain comparable texts that are suitable for keyword extraction in general, allowing evaluation under fully controlled conditions. Specifically, a bilingual Serbian-English dataset and a trilingual Croatian-English-Italian dataset are created.
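The selectivity definition above (average weight on the links of a single node, i.e. node strength divided by node degree) can be illustrated with a minimal sketch. The network construction used here is an assumption made only for illustration: adjacent words are linked, and the link weight is the number of times the pair occurs side by side. The abstract does not specify how the thesis builds its networks, and the function names `build_cooccurrence_network` and `node_selectivity` are hypothetical.

```python
from collections import defaultdict


def build_cooccurrence_network(tokens):
    """Build an undirected weighted network from a token sequence.

    Assumed construction (not taken from the thesis): adjacent tokens are
    linked, and the link weight counts how often the pair is adjacent.
    """
    weights = defaultdict(float)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops
            weights[frozenset((left, right))] += 1.0
    return weights


def node_selectivity(weights):
    """Selectivity = node strength (sum of link weights) / node degree."""
    strength = defaultdict(float)
    degree = defaultdict(int)
    for pair, weight in weights.items():
        for node in pair:
            strength[node] += weight
            degree[node] += 1
    return {node: strength[node] / degree[node] for node in strength}


tokens = "complex network analysis of complex network structure".split()
selectivity = node_selectivity(build_cooccurrence_network(tokens))

# Rank keyword candidates by selectivity, highest first (ties by name).
ranked = sorted(selectivity.items(), key=lambda kv: (-kv[1], kv[0]))
for word, value in ranked:
    print(f"{word}: {value:.3f}")
```

In this toy input, "complex" has two links with total weight 3, so its selectivity is 1.5, while "network" has three links with total weight 4. A word that co-occurs repeatedly with the same few neighbours scores higher than one whose weight is spread thinly over many neighbours.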
The performance of the SBKE method is assessed empirically in terms of precision, recall, F1 and F2 scores, and the area under the precision-recall curve. Evaluation according to the IIC (inter-indexer consistency) measure and adjusted Kappa statistics (Fleiss' and Gwet's coefficients) allows the consistency of the method with human annotators to be assessed. The area under the precision-recall curve and the Kappa statistics (Fleiss' and Gwet's coefficients) are novel evaluation principles for the keyword extraction task. It is experimentally confirmed that the method, using only knowledge from the network structure and no additional external (linguistic or semantic) knowledge, can successfully extract keywords from text, and that its output is close to the level of human keyword annotation. Additionally, it is confirmed that the novel selectivity measure is appropriate for the extraction and ranking of keywords. The proposed SBKE method demonstrates its potential for keyword extraction from texts in different domains, from individual documents or collections of documents, and for portability to new languages. The portability and low-cost feasibility of SBKE make the method a highly desirable candidate for unsupervised automatic keyword extraction, especially in the absence of human-annotated resources, for under-resourced languages (those lacking natural language processing resources and tools), or for a multilingual keyword extraction task.
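For reference, the precision, recall, F1 and F2 scores named above can be computed as in the following sketch. It assumes exact string matching between extracted and human-assigned keywords, which may differ from the matching criterion used in the thesis. The keyword sets are invented toy data, not results from the thesis, and `precision_recall_fbeta` is a hypothetical helper name.

```python
def precision_recall_fbeta(extracted, gold, beta=1.0):
    """Return (precision, recall, F-beta) for two keyword collections.

    Assumes exact-match comparison; beta > 1 weights recall more heavily.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Toy data for illustration only.
extracted = {"keyword extraction", "complex network", "selectivity", "text"}
gold = {"keyword extraction", "complex network", "node selectivity"}

precision, recall, f1 = precision_recall_fbeta(extracted, gold, beta=1.0)
_, _, f2 = precision_recall_fbeta(extracted, gold, beta=2.0)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Here two of the four extracted keywords match the three gold keywords, giving a precision of 0.5 and a recall of 2/3. F2 is higher than F1 in this case because it favours recall, which is the larger of the two values.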
Original language
English
Scientific fields
Computing, Information and Communication Sciences
AFFILIATIONS
Institutions:
Faculty of Informatics and Digital Technologies, Rijeka