
Bootstrapping a Part-of-Speech Tagger using Clustering and Classification methods (CROSBI ID 513840)

Conference paper in proceedings | original scientific paper | international peer review

Ćavar, Damir ; Ćavar, Malgorzata Ewa ; Jazbec, Ivo-Pavao ; Brozović Rončević, Dunja. Bootstrapping a Part-of-Speech Tagger using Clustering and Classification methods. 2005. pp. ---

Authorship information

Ćavar, Damir ; Ćavar, Malgorzata Ewa ; Jazbec, Ivo-Pavao ; Brozović Rončević, Dunja

English

Bootstrapping a Part-of-Speech Tagger using Clustering and Classification methods

Part-of-speech (POS) tagging is a necessity in virtually any NLP component, and it is also a crucial instrument in the creation of large-scale natural language corpora with part-of-speech annotation. On the other hand, one of the biggest problems in natural language processing is how to bootstrap a corpus in the first place, so that tools such as a POS tagger can be built at all. In an attempt to escape this paradox for languages such as Croatian, we describe our solution for bootstrapping a highly accurate POS tagger for the creation of tagged Croatian and Polish corpora, the guiding principle being minimal coding effort and maximal accuracy of the resulting POS annotation.

Various classical approaches to POS tagging require relatively large training sets, handcrafted rules or grammars, or large dictionaries. Classical n-gram-based models (Church, 1989; Charniak et al., 1993) require large tagged corpora, as do transformation-based approaches (Brill, 1993). A Hidden Markov Model (HMM) based tagger, as suggested in Jelinek (1985), Cutting et al. (1991), and Kupiec (1992), does not require a tagged training corpus, but it does require a lexicon that contains all words and their corresponding POS. All the most popular and practical POS taggers today, e.g. the HMM-based TnT tagger (Brants, 2000), the transformation-based tagger (Brill, 1995) and the Maximum Entropy based tagger (Ratnaparkhi, 1996), require training corpora. For English (and German) these taggers can reach an accuracy of 96-97%. However, these results depend on the type of language, the algorithm used and the training corpus: the more restrictive the word order in a language and the less ambiguous the morphological paradigms, the better the results with these taggers; likewise, the bigger the corpus and the more accurate its annotation, the better the resulting language model. Lacking the necessary corpora, alternative approaches are required. Intuitively, on the textual level, the distributional and contextual properties of lexical items, as well as their morphological properties, provide information about their part of speech. In various approaches, such information was used to derive or induce POS information using connectionist models (Elman, 1990), bigram statistics (Brill et al., 1990), and vector models with clustering algorithms (e.g. Finch & Chater, 1992; Finch, 1993; Schütze, 1995; Gimenez & Marquez, 2003).

In this paper we describe the results of purely unsupervised clustering methods and compare them with classification algorithms in which the target POS are known and a limited number of prototypical candidates is given. The clustering and classification algorithms operate on vector spaces built from inherent and distributional properties of individual lexical elements. The distributional properties are generated from collocations with the n most frequent words, where n is selected by approximating the number of most frequent words that co-occur with all other words in a given corpus. We use a set of 335 handcrafted morphological patterns for Croatian and 399 handcrafted morphological patterns for Polish, which correlate with certain POS, and map these onto the vector space as well. In one setting, the matrix is optimized by eliminating columns whose variance is close to 0 and by eliminating one of any two columns that co-vary significantly. In the clustering task, we use fuzzy agglomerative clustering strategies such as soft K-means and EM clustering (e.g. Arabie et al., 1996; Lee & Pereira, 1999), as well as density clustering with varying threshold values (e.g. Hader & Hamprecht, 2003; Böhm et al., 2004).
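As a rough illustration of the feature-matrix construction, column pruning, and EM-style clustering just described, the following Python sketch uses left/right collocations with the n most frequent words plus a few suffix indicators as morphological cues. It is not the authors' implementation: the corpus path, suffix list, thresholds, and cluster count are illustrative assumptions, and scikit-learn's GaussianMixture merely stands in for the EM clustering step.

```python
# Minimal sketch, not the authors' implementation: collocation + suffix features,
# variance/covariance column pruning, and EM clustering via GaussianMixture.
from collections import Counter

import numpy as np
from sklearn.mixture import GaussianMixture


def build_matrix(tokens, n_context=100, suffixes=("a", "e", "om", "ama", "ti")):
    """Word-by-feature matrix: left/right collocation counts with the n most
    frequent words, plus binary suffix indicators as morphological cues."""
    freq = Counter(tokens)
    context = [w for w, _ in freq.most_common(n_context)]
    n_context = len(context)
    ctx = {w: i for i, w in enumerate(context)}
    vocab = sorted(freq)
    row = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), 2 * n_context + len(suffixes)))
    for i, tok in enumerate(tokens):
        r = row[tok]
        if i > 0 and tokens[i - 1] in ctx:                 # left neighbour
            X[r, ctx[tokens[i - 1]]] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in ctx:   # right neighbour
            X[r, n_context + ctx[tokens[i + 1]]] += 1
    for w, r in row.items():                               # morphological patterns
        for j, suf in enumerate(suffixes):
            X[r, 2 * n_context + j] = float(w.endswith(suf))
    return vocab, X


def prune_columns(X, var_eps=1e-6, corr_max=0.95):
    """Drop near-constant columns, then one of each strongly co-varying pair."""
    X = X[:, [j for j in range(X.shape[1]) if np.var(X[:, j]) > var_eps]]
    if X.shape[1] < 2:
        return X
    corr = np.corrcoef(X, rowvar=False)
    drop = {b for a in range(X.shape[1])
            for b in range(a + 1, X.shape[1]) if abs(corr[a, b]) > corr_max}
    return X[:, [j for j in range(X.shape[1]) if j not in drop]]


tokens = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder path
vocab, X = build_matrix(tokens)
X = prune_columns(X)
labels = GaussianMixture(n_components=12, random_state=0).fit_predict(X)  # cluster id per word
```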
We compare these clustering methods with classification methods in which the target POS are given, together with a certain number of prototypical words for each class. The prototypical word examples are selected, on the one hand, from varying sets of words taken from a decreasing frequency profile, i.e. the topmost 100, 500, 1000, and 2000 words of the frequency profile of randomly chosen Polish and Croatian text. On the other hand, 100, 500, 1000, and 2000 randomly selected words with their corresponding tags are taken from the corpus as an initializer for classification. The POS tags correspond to the MULTEXT (Ide & Véronis, 1994) and MULTEXT-East (Erjavec, 2004) tagsets, restricted, however, to the main category only, i.e. N for noun, V for verb, etc.; for the evaluation purposes here, no further morpho-syntactic features were considered. The tagging procedure for the classification task follows the respective clustering algorithm procedures. We test two strategies: (a) classifying each word on the basis of its inherent and morphological properties plus a local context vector generated from the current utterance, and (b) classifying on the basis of the global context vector for each token, generated from the complete corpus. The clustering task is performed over complete context vectors for each token.

Initial results show that even a few vector elements are enough to derive reliable category information. Morphological cues together with detailed context cues increase the accuracy dramatically for the two Slavic languages. The classification task gives better results with a well-balanced number of prototypical elements for each class, as was intuitively expected. Overall, this method seems appropriate not only for bootstrapping a basic annotated corpus plus an initial tagger, but also for detecting potential tagging errors, as will be elaborated on in future work. Due to space limitations here, and due to ongoing complex experiments and evaluations, we will be able to present a complete overview only during the workshop in Split.
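The prototype-seeded classification variant can be sketched in the same feature space. The tiny seed lexicon below is a hypothetical placeholder for the frequency-based or randomly sampled seed sets described above, and nearest-centroid assignment under cosine similarity is only one simple way to realise the classification step, not necessarily the authors' choice.

```python
# Minimal sketch, not the authors' implementation: each word receives the coarse
# POS of the nearest seed-class centroid under cosine similarity. The seed words
# are hypothetical and assumed to occur in the corpus behind vocab and X above.
import numpy as np


def classify_by_prototypes(vocab, X, seeds):
    """seeds maps word -> coarse tag, e.g. {"kuća": "N", "biti": "V"}."""
    row = {w: i for i, w in enumerate(vocab)}
    tags = sorted(set(seeds.values()))
    centroids = np.stack([
        X[[row[w] for w, t in seeds.items() if t == tag and w in row]].mean(axis=0)
        for tag in tags
    ])
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    Cn = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    best = (Xn @ Cn.T).argmax(axis=1)          # most similar class centroid per word
    return {w: tags[best[i]] for w, i in row.items()}


# Hypothetical usage with a handful of seed words:
# pos = classify_by_prototypes(vocab, X, {"kuća": "N", "grad": "N", "biti": "V", "imati": "V"})
```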

Computational Linguistics ; Bootstrapping ; Parts-of-Speech Tagger ; Clustering

Contribution information

---.

2005.

published

Parent publication information

Conference information

Computational Modeling of Lexical Acquisition

lecture

25.08.2005-28.08.2005

Split, Croatia

Related research fields

Computer Science, Information and Communication Sciences, Philology