
Bibliographic record no. 552901

hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene


Ljubešić, Nikola; Erjavec, Tomaž
hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene // Text, Speech and Dialogue, Lecture Notes in Computer Science / Ivan Habernal and Vaclav Matousek (eds.).
Berlin : Heidelberg: Springer, 2011. pp. 395-402


CROSBI ID: 552901

Title
hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene

Authors
Ljubešić, Nikola ; Erjavec, Tomaž

Type, subtype and category of work
Book chapter, scientific

Book
Text, Speech and Dialogue, Lecture Notes in Computer Science

Editor(s)
Ivan Habernal and Vaclav Matousek

Publisher
Springer

City
Berlin : Heidelberg

Year
2011

Page range
395-402

ISBN
978-3-642-23537-5

Keywords
web corpus, Croatian, Slovene, topic modeling

Abstract
Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a standard “Web as Corpus” pipeline, modified to account for the limited amount of available web data. The modifications are described in the paper, focusing on content extraction from HTML pages, which combines high precision of extracted language content with decent recall. The paper also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.

Original language
English

Scientific fields
Information and communication sciences



RELATED ENTITIES


Projects:
130-1301679-1380 - Hrvatska rječnička baština i hrvatski europski identitet (Boras, Damir, MZOS)

Institutions:
Faculty of Humanities and Social Sciences (Filozofski fakultet), Zagreb

Profiles:

Nikola Ljubešić (author)


Cite this publication:

Ljubešić, N. & Erjavec, T. (2011) hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. In: Ivan Habernal and Vaclav Matousek (eds.) Text, Speech and Dialogue, Lecture Notes in Computer Science. Berlin : Heidelberg, Springer, pp. 395-402.
@incollection{inbook,
  author = {Ljube\v{s}i\'{c}, Nikola and Erjavec, Toma\v{z}},
  title = {{hrWaC} and {slWaC}: Compiling Web Corpora for Croatian and Slovene},
  booktitle = {Text, Speech and Dialogue, Lecture Notes in Computer Science},
  editor = {Habernal, Ivan and Matousek, Vaclav},
  publisher = {Springer},
  address = {Berlin, Heidelberg},
  year = {2011},
  pages = {395--402},
  isbn = {978-3-642-23537-5},
  keywords = {web corpus, Croatian, Slovene, topic modeling}
}



