The applicability of lemmatisation in translation equivalents detection

Tadić, Marko; Fulgosi, Sanja; Šojat, Krešimir

Bibliographic record number: 125583

The applicability of lemmatisation in translation equivalents detection // Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora / Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela (eds.).
London : New York (NY): Continuum International Publishing Group, 2004. pp. 195-206

CROSBI ID: 125583

Title
The applicability of lemmatisation in translation equivalents detection

Authors
Tadić, Marko ; Fulgosi, Sanja ; Šojat, Krešimir

Type, subtype and category of work
Book chapters, scientific

Book
Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora

Editor(s)
Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela

Publisher
Continuum International Publishing Group

City
London : New York (NY)

Year
2004

Page range
195-206

ISBN
082647490X

Keywords
Croatian Language, English Language, Croatian-English Parallel Corpus, parallel corpus, lemmatization, translation equivalents, translation equivalents detection

Abstract
The aim of the research is to help identify translation equivalents (TEs) in 1:1 aligned sentences at the level of single-word units. The research is based on the Croatian-English parallel corpus compiled at the University of Zagreb. The method is entirely statistical, with no linguistic filter applied before or after the processing, which has three steps: 1) generation of all possible pairs of tokens from 1:1 aligned sentences (Cartesian product); 2) application of mutual information (MI) to the generated pairs in order to detect candidates for real TEs; 3) sorting the pairs by calculated MI and choosing real TEs for further use. The same method was applied to non-lemmatized and lemmatized material. The latter yielded 4.5% higher precision, which confirmed our hypothesis that for the Croatian-English pair (and possibly for other morphologically rich languages like Croatian) the lemmatized form of corpus data helps statistical methods of TE detection.
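
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which MI variant or counting scheme was used, so the sketch assumes pointwise mutual information estimated from sentence-level co-occurrence, and the function name `rank_candidate_pairs` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def rank_candidate_pairs(aligned_sentences):
    """Rank candidate translation pairs by pointwise mutual information.

    aligned_sentences: list of (source_tokens, target_tokens) tuples,
    one tuple per 1:1 aligned sentence pair.
    Assumption: probabilities are estimated as the share of aligned
    sentence pairs in which a word (or a word pair) occurs.
    """
    n = len(aligned_sentences)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()

    for src_tokens, tgt_tokens in aligned_sentences:
        src_types, tgt_types = set(src_tokens), set(tgt_tokens)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        # Step 1: Cartesian product of the two aligned sentences.
        pair_count.update(product(src_types, tgt_types))

    # Step 2: MI = log2( P(s, t) / (P(s) * P(t)) ) for each generated pair.
    scores = {
        (s, t): math.log2(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in pair_count.items()
    }

    # Step 3: sort by descending MI (ties broken alphabetically).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus, not taken from the Croatian-English parallel corpus.
toy_corpus = [
    (["pas", "spava"], ["dog", "sleeps"]),
    (["pas", "jede"], ["dog", "eats"]),
    (["mačka", "spava"], ["cat", "sleeps"]),
    (["mačka", "jede"], ["cat", "eats"]),
]

for pair, mi in rank_candidate_pairs(toy_corpus):
    print(pair, round(mi, 3))
```

To mirror the comparison described in the abstract, the same function would be run twice, once on word forms and once on lemmas, with the lemmas supplied by an external lemmatizer that this sketch does not include.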

Original language
English

Scientific fields
Philology

AFFILIATIONS OF THE WORK

Projects:
0130418

Institutions:
Filozofski fakultet, Zagreb

Profiles:

Marko Tadić (author)