Combining morphological resources for Croatian

Šojat, Krešimir; Merkler, Danijela; Štefanec, Vanja; Srebačić, Matea; Tadić, Marko
Combining morphological resources for Croatian // 9th Mediterranean Morphology Meeting Book of Abstracts / Raffaelli, Ida ; Kerovec, Barbara ; Srebačić, Matea (eds.).
Zagreb: Sveučilište u Zagrebu, Filozofski fakultet, 2013. pp. 51-52 (poster, international peer review, abstract, scientific)

Conference abstracts, abstract, scientific

9th Mediterranean Morphology Meeting Book of Abstracts / Raffaelli, Ida ; Kerovec, Barbara ; Srebačić, Matea - Zagreb : Sveučilište u Zagrebu, Filozofski fakultet, 2013, 51-52

9th Mediterranean Morphology Meeting

Place and date
Dubrovnik, Croatia, 15-18 September 2013

International peer review

Keywords
Morphological processing ; Croatian ; Croatian Morphological Lexicon ; CroDeriV

Lexica with morphological information are central components of various NLP tools such as lemmatizers, stemmers and morphological analyzers. Over the past three decades, computational processing of Croatian morphology has focused primarily on inflectional phenomena. The Croatian Morphological Lexicon (HML) comprises approximately 120,000 lemmas and all their inflectional forms. In its on-line version, HML can be used both as a lemmatizer and as a generator of inflected forms. HML is also used as the basis for morphosyntactic tagging of texts compliant with the MULTEXT-East recommendations v4.0. The processing of derivational phenomena, however, has not been in focus until recently. The need to combine these two lines of work became obvious during the development of tools for morphological analysis beyond inflection. Such tools could be used for information extraction from annotated texts and similar tasks.

The development of the Derivational Database of Croatian Verbs (CroDeriV) has begun recently. It comprises more than 14,000 verbal lemmas analyzed for morphemes. All verbs with the same root are interconnected, which makes it possible to recognize their derivational spans; e.g. the verb hodati ‘to walk’ is connected in CroDeriV to 25 verbs with the root hod. Since HML contains no derivational data and CroDeriV does not include inflectional patterns, we believe that combining the two resources can benefit both, particularly if the procedure can be performed automatically.

In this paper we present the first attempts at automatically merging and expanding these two resources. The experiment was performed on the lexical category of verbs in Croatian, which exhibit extremely rich derivational morphology in terms of affixation. In the first step we examined the coverage of lemmas in both resources. In Table 1 the overall number of verbal lemmas in each resource is aligned with the number of lemmas that are found in one resource but not in the other, i.e. lemmas that exist in HML but are not listed in CroDeriV, and vice versa.

Table 1.
Resource    No. of verbal lemmas    Uncovered lemmas
HML         8,964                   5,716
CroDeriV    13,780                  391

The results show that a rather large set of lemmas from CroDeriV is not listed in HML. In order to include them in HML, we decided to extract them and assign their inflectional patterns automatically. This assignment is possible when the base verb is already included in HML. For example, if HML contains the verb hodati ‘to walk’ but not its derivative prehodati ‘to walk over’, prehodati is assigned an inflectional pattern matching that of hodati, based on the derivational relation between them via the shared root. The word forms of lemmas with assigned inflectional patterns can then be easily generated and incorporated into HML. When HML does not contain a particular base verb, the inflectional pattern has to be assigned manually. Conversely, CroDeriV can easily be enriched automatically with the inflectional patterns from HML via lemmas.

This procedure enriches both HML and CroDeriV, significantly extending their coverage and their usability for the further development of tools for morphological analysis of Croatian. The combined inflectional and derivational data can also be used for detailed research on Croatian morphology, especially the distribution and frequency of conjugational classes and verbal paradigms in corpora, an area of Croatian linguistics that has so far been based almost exclusively on the intuition of linguists.
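
The coverage comparison and the pattern transfer described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data structures, the function names `coverage` and `assign_patterns`, the pattern labels, and the toy entries are all hypothetical, and the sketch assumes that each CroDeriV entry records the base verb from which a lemma is derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivEntry:
    """Hypothetical, simplified CroDeriV record for one verbal lemma."""
    root: str  # lexical root, e.g. "hod"
    base: str  # base verb the lemma is derived from (itself if underived)


# Toy HML side: lemma -> inflectional pattern label (labels are invented).
hml_patterns = {
    "hodati": "PATTERN_A",
    "pisati": "PATTERN_B",
    "biti": "PATTERN_IRREG",
}

# Toy CroDeriV side: lemma -> derivational record.
croderiv = {
    "hodati": DerivEntry(root="hod", base="hodati"),
    "prehodati": DerivEntry(root="hod", base="hodati"),
    "pisati": DerivEntry(root="pis", base="pisati"),
    "skakati": DerivEntry(root="skak", base="skakati"),
}


def coverage(hml, deriv):
    """Return (lemmas only in HML, lemmas only in CroDeriV)."""
    return set(hml) - set(deriv), set(deriv) - set(hml)


def assign_patterns(hml, deriv):
    """Copy the base verb's pattern to CroDeriV lemmas missing from HML.

    Returns a dict of automatic assignments and a sorted list of lemmas
    whose base verb is not in HML and therefore needs manual treatment.
    """
    assigned, manual = {}, []
    for lemma in sorted(set(deriv) - set(hml)):
        base = deriv[lemma].base
        if base in hml:
            assigned[lemma] = hml[base]
        else:
            manual.append(lemma)
    return assigned, manual


hml_only, croderiv_only = coverage(hml_patterns, croderiv)
assigned, manual = assign_patterns(hml_patterns, croderiv)
```

In this toy data, `prehodati` inherits the pattern recorded for `hodati`, while `skakati` lands in the manual list because its base verb is absent from the toy HML. Enriching CroDeriV in the other direction would amount to a plain lookup of each shared lemma in `hml_patterns`.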

Information and communication sciences, Philology


130-1300646-0645 - Croatian Language Resources and Their Annotation (Marko Tadić)
130-1300646-1002 - Lexical Semantics in the Construction of the Croatian WordNet (Ida Raffaelli)

Filozofski fakultet, Zagreb