Bibliographic record number: 126566
Finding Multiword Term Candidates in Croatian
Finding Multiword Term Candidates in Croatian // Proceedings of Information Extraction for Slavic Languages 2003 Workshop (IESL2003)
Sofia: BAS, 2003, pp. 102-107 (lecture, international peer review, full paper (in extenso), scientific)
CROSBI ID: 126566
Title
Finding Multiword Term Candidates in Croatian
Authors
Tadić, Marko ; Šojat, Krešimir
Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific
Source
Proceedings of Information Extraction for Slavic Languages 2003 Workshop (IESL2003)
/ - Sofia: BAS, 2003, 102-107
Conference
Information Extraction for Slavic Languages 2003 Workshop
Place and date
Borovets, Bulgaria, 08.09.2003 - 09.09.2003
Type of participation
Lecture
Type of review
International peer review
Keywords
Croatian Language; multiword terms; term candidates; statistical processing; mutual information
Abstract
The paper presents research on the statistical processing of a corpus of Croatian texts, with the primary aim of finding statistically significant co-occurrences of n-grams of tokens (digrams, trigrams and tetragrams). The collocations found with this method form a list of candidates for multiword terminological units, which is submitted to terminologists for further processing, i.e. manual selection of the “real terms”. The statistical measure of co-occurrence used is mutual information (MI3), accompanied by linguistic filters: stop-words and POS. The results on non-lemmatized material of a highly inflected language such as Croatian show that the MI measure alone is not sufficient to find a satisfactory number of multiword term candidates. In this case, the use of absolute frequency combined with linguistic filtering techniques gives a broader list of candidates for real terms.
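As a rough illustration of the kind of scoring the abstract describes, the following minimal sketch ranks digrams by a cubic mutual information score and drops pairs containing stop-words. It is not the authors' implementation: the formula MI3 = log2(P(x,y)^3 / (P(x) P(y))) is one common formulation and is assumed here, the function name `mi3_bigrams`, the `min_freq` parameter and the tiny stop-word list are hypothetical, and the POS filter mentioned in the abstract is omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only (not the list used in the paper).
STOP_WORDS = {"i", "u", "na", "je", "se"}


def mi3_bigrams(tokens, stop_words=STOP_WORDS, min_freq=1):
    """Score adjacent token pairs with an assumed MI3 formulation.

    MI3(x, y) = log2( P(x, y)^3 / (P(x) * P(y)) ), with probabilities
    estimated as relative frequencies over the token count.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (x, y), f_xy in bigrams.items():
        # Frequency threshold: skip pairs seen fewer than min_freq times.
        if f_xy < min_freq:
            continue
        # Stop-word filter: skip pairs that contain a stop-word.
        if x in stop_words or y in stop_words:
            continue
        p_xy = f_xy / n
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy ** 3 / (p_x * p_y))
    return scores


if __name__ == "__main__":
    sample = ["a", "b", "a", "b", "c", "d"]
    ranked = sorted(mi3_bigrams(sample).items(), key=lambda kv: kv[1], reverse=True)
    for pair, score in ranked:
        print(pair, round(score, 3))
```

Cubing the joint probability gives more weight to frequent pairs than plain mutual information does, which counters the tendency of MI to favour rare co-occurrences. On non-lemmatized text, each inflected form is counted as a separate token, so counts for a single term are spread over several surface forms.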
Original language
English
Scientific fields
Philology