Bibliographic record number: 713092

TweetCaT: a tool for building Twitter corpora of smaller languages


Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž
TweetCaT: a tool for building Twitter corpora of smaller languages // Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Reykjavík, Iceland: European Language Resources Association (ELRA), 2014. (poster, international peer review, full paper (in extenso), scientific)


CROSBI ID: 713092

Title
TweetCaT: a tool for building Twitter corpora of smaller languages

Authors
Ljubešić, Nikola ; Fišer, Darja ; Erjavec, Tomaž

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), 2014

ISBN
978-2-9517408-8-4

Conference
Language Resources and Evaluation Conference

Place and date
Reykjavík, Iceland, 25.05.2014 - 31.05.2014

Type of participation
Poster

Type of peer review
International peer review

Keywords
Twitter; small languages; corpus creation

Abstract
This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages. Using the Twitter search API and a set of seed terms, the tool identifies users tweeting in the language of interest together with their friends and followers. By running the tool for 235 days, we tested it on the task of collecting two monitor corpora, one for Croatian and Serbian and the other for Slovene, thus also creating new and valuable resources for these languages. A post-processing step on the collected corpus is also described, which filters out users that tweet predominantly in a foreign language and thus further cleans the collected corpora. Finally, an experiment on discriminating between Croatian and Serbian Twitter users is reported.

Original language
English

Scientific fields
Information and communication sciences



AFFILIATIONS OF THE WORK


Institutions:
Filozofski fakultet, Zagreb

Profiles:

Nikola Ljubešić (author)


Cite this publication:

Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž
TweetCaT: a tool for building Twitter corpora of smaller languages // Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Reykjavík, Iceland: European Language Resources Association (ELRA), 2014. (poster, international peer review, full paper (in extenso), scientific)
Ljubešić, N., Fišer, D. & Erjavec, T. (2014) TweetCaT: a tool for building Twitter corpora of smaller languages. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).
@article{article,
  author = {Ljube\v{s}i\'{c}, Nikola and Fi\v{s}er, Darja and Erjavec, Toma\v{z}},
  year = {2014},
  keywords = {Twitter, small languages, corpus creation},
  isbn = {978-2-9517408-8-4},
  title = {TweetCaT: a tool for building Twitter corpora of smaller languages},
  publisher = {European Language Resources Association (ELRA)},
  publisherplace = {Reykjav\'{\i}k, Iceland}
}



