
Bibliographic record number: 852064

Corpus-Based Diacritic Restoration for South Slavic Languages


Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja
Corpus-Based Diacritic Restoration for South Slavic Languages // Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Portorož, Slovenia: European Language Resources Association (ELRA), 2016. (poster, international peer review, full paper (in extenso), scientific)


Title
Corpus-Based Diacritic Restoration for South Slavic Languages

Authors
Ljubešić, Nikola ; Erjavec, Tomaž ; Fišer, Darja

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), 2016

ISBN
978-2-9517408-9-1

Conference
Language Resources and Evaluation Conference

Place and date
Portorož, Slovenia, 23-28 May 2016

Type of participation
Poster

Type of review
International peer review

Keywords
Computer-mediated communication; diacritic restoration; South-Slavic languages

Abstract
In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing. Such text is typically easy for humans to understand but very difficult to process computationally, because many words become ambiguous or unknown. Letter-level approaches to diacritic restoration generalise better and do not require much training data, but word-level approaches tend to yield better results. However, word-level approaches typically rely on a lexicon, which is an expensive resource that does not cover non-standard forms and is often unavailable for less-resourced languages. In this paper we present diacritic restoration models that are trained on easy-to-acquire corpora. We test three different types of corpora (Wikipedia, general web, Twitter) for three South Slavic languages (Croatian, Serbian and Slovene) and evaluate them on two types of text: standard (Wikipedia) and non-standard (Twitter). The proposed approach considerably outperforms charlifter, so far the only open source tool available for this task. We make the best performing systems freely available.
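
The word-level, corpus-based idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple most-frequent-form baseline, in which every corpus token is indexed under its diacritic-free form and restoration picks the most frequent diacritised candidate. The function names (strip_diacritics, train, restore) and the hand-written mapping of đ to d are assumptions made for this example only.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: "đ" has no Unicode decomposition, so it is mapped by hand.
_EXTRA = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(word):
    """Return the word with diacritics removed (š -> s, ć -> c, đ -> d)."""
    decomposed = unicodedata.normalize("NFD", word.translate(_EXTRA))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(corpus_tokens):
    """Map each diacritic-free form to its most frequent corpus form."""
    counts = defaultdict(Counter)
    for token in corpus_tokens:
        counts[strip_diacritics(token)][token] += 1
    return {
        stripped: forms.most_common(1)[0][0]
        for stripped, forms in counts.items()
    }


def restore(tokens, model):
    """Replace each token with its learned form; unseen tokens stay as-is."""
    return [model.get(token, token) for token in tokens]


if __name__ == "__main__":
    # Toy corpus; a real model would be trained on a large text collection.
    model = train(["kuća", "kuca", "kuća", "išao", "đak"])
    print(restore(["kuca", "isao", "dak", "nepoznato"], model))
```

A lookup of this kind needs no lexicon, only raw text, which is the property the abstract highlights. It ignores context, so ambiguous forms always resolve to the more frequent candidate.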

Original language
English

Scientific fields
Information and communication sciences



WORK AFFILIATIONS


Institutions
Filozofski fakultet, Zagreb

Author with registration number:
Nikola Ljubešić, (272820)