Data source: CROSBI

Predicting the level of text standardness in user-generated content (CROSBI ID 674799)

Conference paper in proceedings | original scientific paper | international peer review

Ljubešić, Nikola ; Fišer, Darja ; Erjavec, Tomaž ; Čibej, Jaka ; Marko, Dafne ; Pollak, Senja ; Škrjanec, Iza Predicting the level of text standardness in user-generated content // Recent Advances in Natural Language Processing : proceedings / Galia Angelova, Kalina Bontcheva, Ruslan Mitkov (eds.). Hisarya, 2015. pp. 371-378

Responsibility details

Ljubešić, Nikola ; Fišer, Darja ; Erjavec, Tomaž ; Čibej, Jaka ; Marko, Dafne ; Pollak, Senja ; Škrjanec, Iza

English

Predicting the level of text standardness in user-generated content

Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results, where the mean absolute error of the best performing method on a three-point scale is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required to perform good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and provide linguists who are interested in non-standard language with a simple way of selecting only such texts for their research.

Natural language processing systems, Computational linguistics, text normalization


Contribution details

371-378.

2015.

published

Host publication details

Recent Advances in Natural Language Processing : proceedings

Galia Angelova, Kalina Bontcheva, Ruslan Mitkov

Hisarya:

1313-8502

Conference details

10th International Conference on Recent Advances in Natural Language Processing, RANLP 2015 ; Hissar ; Bulgaria ; 7 September 2015 through 9 September 2015

lecture

07.09.2015-09.09.2015

Hisar, Bulgaria

Associated fields

Computing, Information and communication sciences
