Combining morphological resources for Croatian

Šojat, Krešimir; Merkler, Danijela; Štefanec, Vanja; Srebačić, Matea; Tadić, Marko

Combining morphological resources for Croatian (CROSBI ID 599937)

Conference contribution in proceedings | conference presentation abstract | international peer review

Šojat, Krešimir ; Merkler, Danijela ; Štefanec, Vanja ; Srebačić, Matea ; Tadić, Marko. Combining morphological resources for Croatian // 9th Mediterranean Morphology Meeting Book of Abstracts / Raffaelli, Ida ; Kerovec, Barbara ; Srebačić, Matea (eds.). Zagreb: Filozofski fakultet Sveučilišta u Zagrebu, 2013. pp. 51-52

Responsibility details

Authors

Šojat, Krešimir ; Merkler, Danijela ; Štefanec, Vanja ; Srebačić, Matea ; Tadić, Marko

Basic data in the original language
Basic data in other languages

Language

English

Title

Combining morphological resources for Croatian

Abstract

Lexica with morphological information are central components of various NLP tools such as lemmatizers, stemmers and morphological analyzers. Over the past three decades, computational processing of Croatian morphology has focused primarily on inflectional phenomena. The Croatian Morphological Lexicon (HML) comprises approximately 120,000 lemmas and all their inflectional forms. In its on-line version (http://hml.ffzg.hr), HML can be used both as a lemmatizer and as a generator of inflected forms. HML is also used as the basis for morphosyntactic tagging of texts compliant with the MULTEXT-East recommendations v4.0. The processing of derivational phenomena, however, was not in focus until recently. The need to combine these two lines of work became obvious in the development of tools for morphological analysis beyond inflection. Such tools could be used for information extraction from annotated texts and similar tasks.

Recently the development of the Derivational Database of Croatian Verbs (CroDeriV) has begun. It comprises more than 14,000 verbal lemmas analyzed for morphemes. All verbs with the same root are interconnected, which makes it possible to recognize their derivational spans; e.g. the verb hodati ‘to walk’ is connected in CroDeriV to 25 verbs with the root hod. Since HML has not been provided with any derivational data and CroDeriV does not include inflectional patterns, we believe that combining these two resources can benefit both, particularly if the procedure can be performed automatically.

In this paper we present the first attempts at automatically merging and expanding these two resources. The experiment was performed on the lexical category of verbs, which in Croatian exhibit extremely rich derivational morphology in terms of affixation. In the first step we examined the coverage of lemmas in both resources. In Table 1, the overall number of verbal lemmas in each resource is set against the number of lemmas that are found in one resource but not in the other, i.e. lemmas that exist in HML but are not listed in CroDeriV, and vice versa.

Table 1.
            No. of verbal lemmas    Uncovered lemmas
HML         8964                    5716
CroDeriV    13,780                  391

The results show that a rather large set of lemmas from CroDeriV is not listed in HML. To include them in HML, we decided to extract them and automatically assign their inflectional patterns. Automatic assignment is possible when the base verb is already included in HML. For example, if HML contains the verb hodati ‘to walk’ but not its derivative prehodati ‘to walk over’, prehodati is assigned an inflectional pattern close to that of hodati, based on the derivational relation between them via the shared root. The word forms of lemmas with assigned inflectional patterns can then be easily generated and incorporated into HML. When HML does not contain a particular base verb, the inflectional pattern has to be assigned manually. Conversely, CroDeriV can easily be enriched automatically with the inflectional patterns from HML, matched via lemmas.

This procedure enriches both HML and CroDeriV, significantly extending their coverage and their usability for further development of tools for morphological analysis of Croatian. The combined inflectional and derivational data can also be used for detailed research on Croatian morphology, especially the distribution and frequency of conjugational classes and verbal paradigms in corpora, an area of Croatian linguistics that has so far relied almost exclusively on the intuition of linguists.
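The two steps described in the abstract (a coverage comparison, then pattern assignment via a shared root) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not describe the data formats of HML or CroDeriV, so the dictionaries below, the pattern labels `P1` and `P2`, the root given for doletjeti, and the function names `coverage_gaps` and `propose_pattern` are all hypothetical. The sketch also simplifies in two ways: it copies the base verb's pattern unchanged, whereas the abstract speaks of a pattern "close to" that of the base verb, and it picks the base verb by checking that the new lemma ends with it.

```python
from typing import Dict, Optional, Set, Tuple

# Toy stand-ins for the two resources (hypothetical layout and labels).
# HML-like data: lemma -> inflectional pattern label.
HML_PATTERNS: Dict[str, str] = {"hodati": "P1", "pisati": "P2"}
# CroDeriV-like data: lemma -> root (root values are illustrative).
CRODERIV_ROOTS: Dict[str, str] = {
    "hodati": "hod",
    "prehodati": "hod",
    "doletjeti": "let",
}


def coverage_gaps(
    hml_lemmas: Set[str], croderiv_lemmas: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (lemmas missing from HML, lemmas missing from CroDeriV)."""
    return croderiv_lemmas - hml_lemmas, hml_lemmas - croderiv_lemmas


def propose_pattern(
    lemma: str, roots: Dict[str, str], patterns: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Propose (base verb, pattern) for a lemma that has no pattern yet.

    Assumption: a usable base verb shares the lemma's root, already has a
    pattern, and is a suffix of the lemma (prehodati ends with hodati).
    Returns None when no such base exists, i.e. manual assignment is needed.
    """
    root = roots.get(lemma)
    if root is None:
        return None
    candidates = [
        base
        for base, base_root in roots.items()
        if base_root == root
        and base != lemma
        and base in patterns
        and lemma.endswith(base)
    ]
    if not candidates:
        return None
    base = max(candidates, key=len)  # prefer the longest matching base
    return base, patterns[base]


missing_from_hml, missing_from_croderiv = coverage_gaps(
    set(HML_PATTERNS), set(CRODERIV_ROOTS)
)
for lemma in sorted(missing_from_hml):
    print(lemma, propose_pattern(lemma, CRODERIV_ROOTS, HML_PATTERNS))
```

With this toy data, prehodati inherits the pattern recorded for hodati, while doletjeti has no base verb in the HML-like dictionary and is left for manual assignment, mirroring the fallback the abstract mentions.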

Keywords

morphological processing ; Croatian ; Croatian Morphological Lexicon ; CroDeriV

Note

not recorded

Other languages (language, title, abstract, keywords, note): not recorded

Contribution details

Pages

51-52.

Year of publication

2013.

Publication status

published

Host publication details

Title

9th Mediterranean Morphology Meeting Book of Abstracts

Editors

Raffaelli, Ida ; Kerovec, Barbara ; Srebačić, Matea

Publisher

Zagreb: Filozofski fakultet Sveučilišta u Zagrebu

Conference details

Conference

9th Mediterranean Morphology Meeting

Type of participation

poster

Conference dates

15.09.2013-18.09.2013

Conference venue

Dubrovnik, Croatia

Related records

Related persons

Matea Filko (CroRIS ID: 33706; MBZ: 361231) (author)

Marko Tadić (CroRIS ID: 12084; MBZ: 157043) (author)

Krešimir Šojat (CroRIS ID: 27039; MBZ: 255106) (author)

Vanja Štefanec (CroRIS ID: 44743; MBZ: 404580) (author)

Related publishers

Filozofski fakultet u Zagrebu

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Related projects

Hrvatski jezični resursi i njihovo obilježavanje (output of work on the project)

Leksička semantika u izradi Hrvatskog WordNeta (output of work on the project)

Field

Information and communication sciences, Philology