Predicting the level of text standardness in user-generated content

Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž; Čibej, Jaka; Marko, Dafne; Pollak, Senja; Škrjanec, Iza

Data source: CROSBI

Predicting the level of text standardness in user-generated content (CROSBI ID 674799)

Conference paper in proceedings | original scientific paper | international peer review

Ljubešić, Nikola ; Fišer, Darja ; Erjavec, Tomaž ; Čibej, Jaka ; Marko, Dafne ; Pollak, Senja ; Škrjanec, Iza Predicting the level of text standardness in user-generated content // Recent Advances in Natural Language Processing : proceedings / Galia Angelova, Kalina Bontcheva, Ruslan Mitkov (eds.). Hisarya, 2015. pp. 371-378

Responsibility details

Authors

Ljubešić, Nikola ; Fišer, Darja ; Erjavec, Tomaž ; Čibej, Jaka ; Marko, Dafne ; Pollak, Senja ; Škrjanec, Iza

Basic information in the original language
Basic information in other languages

Language

English

Title

Predicting the level of text standardness in user-generated content

Abstract

Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results, where the mean absolute error of the best performing method on a three-point scale is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required to perform good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and provide linguists who are interested in non-standard language with a simple way of selecting only such texts for their research.
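
The two quantities the abstract names, an OOV-ratio baseline and the mean absolute error, can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenisation, the toy lexicon, the example sentences, and the 1 to 3 score values are all assumptions made for illustration, and the helper names `oov_ratio` and `mean_absolute_error` are hypothetical.

```python
import re


def oov_ratio(text, lexicon):
    """Share of word tokens in `text` that are missing from `lexicon`.

    Assumption: tokens are lowercased runs of word characters; the
    record does not say how the paper tokenises text.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in lexicon)
    return unknown / len(tokens)


def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted scores."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("score lists must be non-empty and equally long")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


# Toy lexicon and texts, invented for illustration only.
lexicon = {"this", "is", "a", "standard", "sentence"}
print(oov_ratio("This is a standard sentence", lexicon))  # 0.0
print(oov_ratio("dis iz a standard sentence", lexicon))   # 0.4

# Toy gold and predicted scores on an assumed 1 to 3 scale.
print(mean_absolute_error([1, 2, 3], [1.5, 2.0, 2.0]))    # 0.5
```

A higher OOV ratio suggests a less standard text, which is presumably why it serves as a baseline; the regression models described in the abstract would be scored with the same error measure.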

Keywords

Natural language processing systems, Computational linguistics, text normalization

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Contribution details

Pages

371-378.

Year of publication

2015.

Publication status

published

Parent publication details

Title

Recent Advances in Natural Language Processing : proceedings

Editors

Galia Angelova, Kalina Bontcheva, Ruslan Mitkov

Publisher

Hisarya:

ISSN

1313-8502

Conference details

Conference

10th International Conference on Recent Advances in Natural Language Processing, RANLP 2015; Hissar, Bulgaria; 7 September 2015 through 9 September 2015

Type of participation

lecture

Conference dates

07.09.2015-09.09.2015

Conference venue

Hisar, Bulgaria

Related entities

Related persons

Nikola Ljubešić (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Computer science, Information and communication sciences

Links

lml.bas.bg

Indexing

Scopus