Corpus-Based Diacritic Restoration for South Slavic Languages

Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja

Data source: CROSBI

Corpus-Based Diacritic Restoration for South Slavic Languages (CROSBI ID 643369)

Conference paper in proceedings | original scientific paper | international peer review

Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja. Corpus-Based Diacritic Restoration for South Slavic Languages // Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), 2016

Statement of responsibility

Authors

Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja

Basic data in the original language
Basic data in other languages

Language

English

Title

Corpus-Based Diacritic Restoration for South Slavic Languages

Abstract

In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing. Such text is typically easy for humans to understand but very difficult to process computationally, because many words become ambiguous or unknown. Letter-level approaches to diacritic restoration generalise better and do not require a lot of training data, but word-level approaches tend to yield better results. However, word-level approaches typically rely on a lexicon, which is an expensive resource that does not cover non-standard forms and is often unavailable for less-resourced languages. In this paper we present diacritic restoration models that are trained on easy-to-acquire corpora. We test three different types of corpora (Wikipedia, general web, Twitter) for three South Slavic languages (Croatian, Serbian and Slovene) and evaluate them on two types of text: standard (Wikipedia) and non-standard (Twitter). The proposed approach considerably outperforms charlifter, so far the only open-source tool available for this task. We make the best-performing systems freely available.
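
To illustrate the general idea of learning word-level restoration from a corpus instead of a hand-built lexicon, the following is a minimal sketch of a most-frequent-form baseline. It is not the system described in the paper; the function names (`strip_diacritics`, `train`, `restore`), the toy corpus, and the choice to map "đ" to "d" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumed mapping from diacritic letters to their ASCII stand-ins.
# Mapping "đ" to "d" is a simplification chosen for this sketch.
STRIP = str.maketrans({
    "č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "d",
    "Č": "C", "Ć": "C", "Š": "S", "Ž": "Z", "Đ": "D",
})


def strip_diacritics(text):
    """Return the text with diacritic letters replaced by ASCII letters."""
    return text.translate(STRIP)


def train(tokens):
    """Map each stripped form to the diacritised form seen most often."""
    counts = defaultdict(Counter)
    for tok in tokens:
        form = tok.lower()
        counts[strip_diacritics(form)][form] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def restore(tokens, model):
    """Replace each known token with its most frequent diacritised form."""
    restored = []
    for tok in tokens:
        best = model.get(tok.lower())
        if best is None:
            # Unknown word: leave it unchanged.
            restored.append(tok)
            continue
        if tok[0].isupper():
            # Carry over the initial capital of the input token.
            best = best[0].upper() + best[1:]
        restored.append(best)
    return restored


# Toy training corpus (hypothetical); a real one would be far larger.
corpus = ["kuća", "kuća", "kuca", "što", "žena"]
model = train(corpus)
result = restore(["Kuca", "sto", "pas"], model)
```

A baseline of this kind picks one form per stripped word regardless of context, so it cannot resolve words whose correct form depends on the surrounding sentence.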

Keywords

computer-mediated communication; diacritic restoration; South-Slavic languages

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Data on the contribution

Year of publication

2016

Publication status

published

Data on the parent publication

Title

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Publisher

European Language Resources Association (ELRA)

ISBN

978-2-9517408-9-1

Data on the conference

Conference

Language Resources and Evaluation Conference

Type of participation

poster

Conference dates

23-28 May 2016

Conference venue

Portorož, Slovenia

Related entities

Related persons

Nikola Ljubešić (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Information and communication sciences