Croatian language networks

Martinčić-Ipšić, Sanda
2014 Adriatic Conference on Graph Theory and Complexity

MedILS Split, Croatia, 25.4-27.4

Complex networks; language networks; natural language processing

Written as well as spoken language can be modeled as a complex network in which linguistic units (words) are represented by nodes and their linguistic interactions by links. Such representations enable the analysis of language through varying linguistic units, the examination of language evolution, the modeling of language acquisition, and the assessment of text quality. Language networks can be constructed at the word level and at the subword level. Studying how networks interact across language levels can reveal presently unavailable structural properties of the Croatian language at the phonological, syllabic, morphological, co-occurrence and syntactic levels. Our research focuses on word and subword co-occurrence networks of Croatian. We first study the structure of Croatian word co-occurrence networks and how their structural properties change when the co-occurrence window size, the corpus size and the removal of stopwords are varied systematically. Below the word level, we constructed syllable networks. The results indicate that Croatian syllable networks exhibit certain properties of small-world networks. Furthermore, we compared Croatian syllable networks with Portuguese and Chinese syllable networks and showed that they have similar properties. The applied goal of this study is to derive, from complex network parameters, a model for assessing the quality of Croatian texts. Such a model could be used to develop software that consistently carries out a desired analysis of a given text, such as assessing the quality of a summary or estimating the quality of a machine translation.
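The construction of a word co-occurrence network with a variable window and optional stopword removal can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window definition (each word is linked to the `window` words that follow it), the toy sentence and the stopword list are all assumptions made for illustration, and the clustering coefficient is only one of several measures used when testing for small-world properties.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(tokens, window=1, stopwords=frozenset()):
    """Build an undirected, weighted co-occurrence network.

    Assumption: two words are linked when the second occurs within
    `window` positions after the first; the edge weight counts how
    often that happens. Stopwords are removed before linking.
    """
    words = [t.lower() for t in tokens if t.lower() not in stopwords]
    edges = defaultdict(int)
    for i, u in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            v = words[j]
            if u != v:  # ignore self-loops
                edges[tuple(sorted((u, v)))] += 1
    return dict(edges)


def adjacency(edges):
    """Turn the edge dictionary into a node -> set-of-neighbours map."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 count as 0)."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        linked = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adj)


# Toy input (hypothetical, not taken from the study's corpus).
tokens = "jezik je mreža i mreža je jezik".split()
toy_stopwords = frozenset({"je", "i"})

for w in (1, 2):
    net = cooccurrence_network(tokens, window=w)
    print(w, len(net), average_clustering(adjacency(net)))

print(cooccurrence_network(tokens, window=1, stopwords=toy_stopwords))
```

Varying `window`, the amount of input text and the `stopwords` set in this sketch mirrors, on a toy scale, the three factors the study varies; a real experiment would use a tokenized corpus and additional measures such as average shortest path length.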

