Predicting the level of text standardness in user-generated content (CROSBI ID 674799)
Conference paper in proceedings | original scientific paper | international peer review
Responsibility details
Ljubešić, Nikola ; Fišer, Darja ; Erjavec, Tomaž ; Čibej, Jaka ; Marko, Dafne ; Pollak, Senja ; Škrjanec, Iza
English
Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts along these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results: on a three-point scale, the mean absolute error of the best-performing method is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required for good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and can provide linguists who are interested in non-standard language with a simple way of selecting only such texts for their research.
Natural language processing systems, Computational linguistics, text normalization
Contribution details
371-378.
2015.
published
Host publication details
Recent Advances in Natural Language Processing : proceedings
Galia Angelova, Kalina Bontcheva, Ruslan Mitkov
Hisarya:
1313-8502
Conference details
10th International Conference on Recent Advances in Natural Language Processing, RANLP 2015 ; Hissar ; Bulgaria ; 7 September 2015 through 9 September 2015
lecture
07.09.2015-09.09.2015
Hisar, Bulgaria