Predicting the level of text standardness in user-generated content

Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž; Čibej, Jaka; Marko, Dafne; Pollak, Senja; Škrjanec, Iza

Bibliographic record number: 995375

Predicting the level of text standardness in user-generated content // Recent Advances in Natural Language Processing : proceedings / Galia Angelova, Kalina Bontcheva, Ruslan Mitkov (eds.).
Hisarya, 2015, pp. 371-378 (lecture, international peer review, full paper (in extenso), scientific)

CROSBI ID: 995375

Title
Predicting the level of text standardness in user-generated content

Authors
Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž; Čibej, Jaka; Marko, Dafne; Pollak, Senja; Škrjanec, Iza

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Recent Advances in Natural Language Processing : proceedings / Galia Angelova, Kalina Bontcheva, Ruslan Mitkov - Hisarya, 2015, 371-378

Conference
10th International Conference on Recent Advances in Natural Language Processing, RANLP 2015; Hissar, Bulgaria; 7 September 2015 through 9 September 2015

Place and date
Hisar, Bulgaria, 07.09.2015 - 09.09.2015

Type of participation
Lecture

Type of peer review
International peer review

Keywords
Natural language processing systems, Computational linguistics, text normalization

Abstract
Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results: on a three-point scale, the mean absolute error of the best-performing method is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required for good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and can give linguists who are interested in non-standard language a simple way of selecting only such texts for their research.
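
As a reading aid for the two quantities named in the abstract, the following is a minimal Python sketch of an out-of-vocabulary (OOV) ratio and of mean absolute error. It is an illustration under stated assumptions, not the authors' implementation: the function names, the whitespace-tokenised input, the lowercasing step and the toy lexicon are all hypothetical, and this record does not say how the paper's baseline maps an OOV ratio onto the three-point scale.

```python
from typing import Sequence, Set


def oov_ratio(tokens: Sequence[str], lexicon: Set[str]) -> float:
    """Share of tokens that are not in the lexicon (0.0 for empty input).

    Lowercasing before lookup is an assumption made for this sketch.
    """
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token.lower() not in lexicon)
    return unknown / len(tokens)


def mean_absolute_error(gold: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between gold and predicted scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical toy data: a three-token text and a two-word lexicon.
toy_lexicon = {"to", "je"}
toy_ratio = oov_ratio(["To", "je", "kul"], toy_lexicon)

# Hypothetical gold and predicted scores on a 1-3 scale.
toy_mae = mean_absolute_error([1, 2, 3], [1, 3, 3])
```

Here one of the three toy tokens is missing from the lexicon and one of the three predictions is off by a single point, so both values come out at one third.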

Original language
English

Scientific fields
Computing, Information and communication sciences

RELATED ENTITIES

Institutions:
Filozofski fakultet, Zagreb

Profiles:

Nikola Ljubešić (author)

Links to the full text of the paper:

lml.bas.bg

Indexed in:

Scopus
