Bibliographic record number: 995375
Predicting the level of text standardness in user-generated content
Predicting the level of text standardness in user-generated content // Recent Advances in Natural Language Processing : proceedings / Galia Angelova, Kalina Bontcheva, Ruslan Mitkov (eds.).
Hisarya, 2015, pp. 371-378 (lecture, international peer review, full paper (in extenso), scientific)
CROSBI ID: 995375
Title
Predicting the level of text standardness in user-generated content
Authors
Ljubešić, Nikola ; Fišer, Darja ; Erjavec, Tomaž ; Čibej, Jaka ; Marko, Dafne ; Pollak, Senja ; Škrjanec, Iza
Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific
Source
Recent Advances in Natural Language Processing : proceedings
/ Galia Angelova, Kalina Bontcheva, Ruslan Mitkov - Hisarya, 2015, 371-378
Conference
10th International Conference on Recent Advances in Natural Language Processing, RANLP 2015 ; Hissar ; Bulgaria ; 7 September 2015 through 9 September 2015
Place and date
Hisar, Bulgaria, 07.09.2015 - 09.09.2015
Type of participation
Lecture
Type of review
International peer review
Keywords
Natural language processing systems, Computational linguistics, text normalization
Abstract
Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and nonstandard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results, where the mean absolute error of the best performing method on a three-point scale is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required to perform good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and provide linguists who are interested in nonstandard language with a simple way of selecting only such texts for their research.
Original language
English
Scientific fields
Computer science, Information and communication sciences
Indexed in:
- Scopus