Bibliographic record no.: 125583
The applicability of lemmatisation in translation equivalents detection
The applicability of lemmatisation in translation equivalents detection // Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora / Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela (eds.).
London : New York (NY): Continuum International Publishing Group, 2004. pp. 195-206
CROSBI ID: 125583
Title
The applicability of lemmatisation in translation equivalents detection
Authors
Tadić, Marko ; Fulgosi, Sanja ; Šojat, Krešimir
Type, subtype and category of work
Book chapters, scientific
Book
Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora
Editor(s)
Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela
Publisher
Continuum International Publishing Group
City
London : New York (NY)
Year
2004
Page range
195-206
ISBN
082647490X
Keywords
Croatian Language, English Language, Croatian-English Parallel Corpus, parallel corpus, lemmatization, translation equivalents, translation equivalents detection
Abstract
The aim of the research is to help identify translation equivalents (TEs) in 1:1 aligned sentences at the level of single-word units. The research is based on the Croatian-English parallel corpus compiled at the University of Zagreb. The method is entirely statistical, with no linguistic filter applied before or after processing, and has three steps: 1) generation of all possible pairs of tokens from 1:1 aligned sentences (Cartesian product); 2) application of mutual information (MI) to the generated pairs in order to detect candidates for real TEs; 3) sorting the pairs by calculated MI and choosing real TEs for further use. The same method was applied to non-lemmatized and lemmatized material. The latter yielded 4.5% higher precision, which confirms our hypothesis that for the Croatian-English pair (and possibly for other morphologically rich languages like Croatian) lemmatized corpus data helps statistical methods of TE detection.
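The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which MI formulation or counting scheme was used, so the sketch assumes pointwise mutual information with probabilities estimated as fractions of aligned sentence pairs. The function name `rank_candidate_pairs` and the toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter
from itertools import product


def rank_candidate_pairs(aligned_sentences):
    """Rank (source, target) token pairs by pointwise mutual information.

    aligned_sentences: list of (source_tokens, target_tokens) tuples,
    one tuple per 1:1 aligned sentence pair.
    Assumption: each token is counted at most once per sentence pair.
    """
    n = len(aligned_sentences)
    src_counts = Counter()
    tgt_counts = Counter()
    pair_counts = Counter()

    for src_tokens, tgt_tokens in aligned_sentences:
        src_types = set(src_tokens)
        tgt_types = set(tgt_tokens)
        src_counts.update(src_types)
        tgt_counts.update(tgt_types)
        # Step 1: Cartesian product of the tokens in the two aligned sentences.
        pair_counts.update(product(src_types, tgt_types))

    scored = []
    for (s, t), joint in pair_counts.items():
        # Step 2: PMI = log2(P(s, t) / (P(s) * P(t))), where each probability
        # is a count divided by the number of sentence pairs n.
        mi = math.log2(joint * n / (src_counts[s] * tgt_counts[t]))
        scored.append((s, t, mi))

    # Step 3: sort by descending MI; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[2], item[0], item[1]))
    return scored


# Hypothetical toy corpus of Croatian-English aligned sentences.
toy_corpus = [
    (["pas", "spava"], ["dog", "sleeps"]),
    (["pas", "jede"], ["dog", "eats"]),
    (["mačka", "spava"], ["cat", "sleeps"]),
]
ranked = rank_candidate_pairs(toy_corpus)
```

To compare non-lemmatized and lemmatized material in the spirit of the abstract, the same function would be called twice, once with word forms and once with their lemmas as tokens. On very small counts this kind of score tends to favour rare pairs, so a real experiment would likely need a frequency threshold, which is again an assumption and not something the abstract states.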
Original language
English
Scientific fields
Philology