Bibliographic record no.: 792814

Standardizing Tweets with Character-level Machine Translation


Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja
Standardizing Tweets with Character-level Machine Translation // Computational Linguistics and Intelligent Text Processing / Gelbukh, Alexander (ed.).
Berlin: Springer, 2014. pp. 164-175


CROSBI ID: 792814

Title
Standardizing Tweets with Character-level Machine Translation

Authors
Ljubešić, Nikola ; Erjavec, Tomaž ; Fišer, Darja

Type, subtype and category of work
Book chapters, scientific

Book
Computational Linguistics and Intelligent Text Processing

Editor(s)
Gelbukh, Alexander

Publisher
Springer

City
Berlin

Year
2014

Page range
164-175

ISBN
978-3-642-54903-8

Keywords
twitterese, standardization, character-level machine translation

Abstract
This paper presents the results of standardizing Slovene tweets, which are full of colloquial, dialectal and foreign-language elements. To minimize the human input required, we produced a manually normalized lexicon of the most salient out-of-vocabulary (OOV) tokens and used it to train a character-level statistical machine translation (CSMT) system. The best results were obtained by combining the manually constructed lexicon with CSMT as a fallback, yielding an overall improvement of 9.9% on all tokens and 31.3% on OOV tokens. Preparing the data manually as a lexicon proved more efficient for this task than normalizing running text. Finally, we performed an extrinsic evaluation in which we automatically lemmatized the test corpus, taking as input either the original or the automatically standardized wordforms, and achieved 75.1% per-token accuracy with the former and 83.6% with the latter, thus demonstrating that standardization has significant benefits for upstream processing.

Original language
English

Scientific fields
Information and communication sciences



WORK AFFILIATIONS


Institutions:
Filozofski fakultet, Zagreb

Profiles:

Nikola Ljubešić (author)

Cite this publication:

Ljubešić, N., Erjavec, T. & Fišer, D. (2014) Standardizing Tweets with Character-level Machine Translation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. Berlin, Springer, pp. 164-175.
@incollection{inbook,
  author    = {Ljube\v{s}i\'{c}, Nikola and Erjavec, Toma\v{z} and Fi\v{s}er, Darja},
  title     = {Standardizing Tweets with Character-level Machine Translation},
  booktitle = {Computational Linguistics and Intelligent Text Processing},
  editor    = {Gelbukh, Alexander},
  publisher = {Springer},
  address   = {Berlin},
  year      = {2014},
  pages     = {164--175},
  isbn      = {978-3-642-54903-8},
  keywords  = {twitterese, standardization, character-level machine translation}
}



