Representing regional variation in corpora

Košutar, Sara; Schmidt, Larissa

Bibliographic record no.: 1168636

Representing regional variation in corpora // Interactive workshop on regional markedness in text
Zürich, Switzerland, 2021. (workshop, review information not available, other)

CROSBI ID: 1168636

Title
Representing regional variation in corpora

Authors
Košutar, Sara; Schmidt, Larissa

Type, subtype and category of work
Conference abstracts, other, other

Event
Interactive workshop on regional markedness in text

Place and date
Zürich, Switzerland, 6-7 November 2021

Type of participation
Workshop

Type of review
Review information not available

Keywords
regional variation, dialectometry, corpora, natural language processing

Abstract
Regional differences are usually studied using a corpus-based methodology within the framework of dialectometry, an approach to language comparison that focuses on the quantitative and computational study of language variation. Early dialectometric research used questionnaires to study variation, whereas nowadays dialectometry relies heavily on corpus-based methods. First, corpora contain samples of naturally occurring linguistic production and can therefore be used to identify and measure variation beyond the predefined word-level variables normally captured by traditional questionnaires. In this way, analyses are extended to examine variation at the syntactic, pragmatic and discourse levels. Second, corpora provide information on the frequency of language use, i.e. how often a particular item is used, and the observed frequency is objective. Increasingly, new techniques are being used in corpus-based studies to sample spontaneously occurring, regionally diverse language production from the Internet, traditional media, social media, and recordings of authentic, everyday language use. These new data sources are used to compare similar languages and varieties outside of traditional dialectology, yielding a more comprehensive study of regional variation in language use, referred to as modern dialectology. The comparison of similar languages, varieties and dialects is treated in the linguistic literature under the common term micro-variation. The question of how to represent the variability of languages in corpora has recently gained considerable importance in natural language processing (NLP) research. NLP combines linguistics and artificial intelligence to enable computers to understand human language. Since the performance of NLP tools is known to decrease when they are applied to text that differs from the training examples, more attention needs to be paid to representing variability.
The CLARIN (Common Language Resources and Technology Infrastructure) repository contains a wide range of corpora for studying different languages, as well as a number of parallel and manually tagged corpora, lexicons and language models that can be used in language tools. The data can be used to train or evaluate such tools, or to study language use in a variety of ways, such as distinguishing between two languages, standard variants, dialects, unedited Internet language, etc. In this session, we will provide an overview of data sources that can be used to study language variability.

Original language
English

Scientific fields
Philology

CONNECTIONS OF THE WORK

Projects:
UIP-2017-05-6603 - Multilevel approach to spoken discourse in language development (MultiDis) (Hržica, Gordana, HRZZ - 2017-05)

Institutions:
Faculty of Education and Rehabilitation Sciences (Edukacijsko-rehabilitacijski fakultet), Zagreb

Profiles:

Sara Košutar (author)
