Bibliographic record number: 1131596
Corpus Analysis of Complex Names with Common Nouns in Croatian
Corpus Analysis of Complex Names with Common Nouns in Croatian // Computational and Corpus-based Phraseology: Proceedings of the Third International Conference EUROPHRAS 2019 / Corpas Pastor, Gloria ; Mitkov, Ruslan ; Kunilovskaya, Maria ; Losey León, María Araceli (eds.).
Geneva: Editions Tradulex, 2019. pp. 106-113 doi:10.26615/978-2-9701095-6-3_014
CROSBI ID: 1131596
Title
Corpus Analysis of Complex Names with Common Nouns in Croatian
Authors
Matas Ivanković, Ivana ; Blagus Bartolec, Goranka
Type, subtype and category of work
Book chapters, scientific
Book
Computational and Corpus-based Phraseology: Proceedings of the Third International Conference EUROPHRAS 2019
Editor(s)
Corpas Pastor, Gloria ; Mitkov, Ruslan ; Kunilovskaya, Maria ; Losey León, María Araceli
Publisher
Editions Tradulex
City
Geneva
Year
2019
Page range
106-113
ISBN
978-2-9701095-6-3
Keywords
Complex Names, Croatian Orthography, Corpus Search
Abstract
The goal of this corpus-based research is to determine whether complex names that contain common nouns can be extracted from the Croatian hrWaC v2.2 corpus by using regular expressions, i.e. to what extent a capital letter (other than one that follows a full stop, an exclamation mark or a question mark) can be taken as an indication of a name. A common noun can be used as a regular noun or as a constituent of a complex name, which, on the one hand, makes it difficult to tag such nouns automatically and, on the other hand, affects the lexicographic description. Using regular expressions, we searched for capitalized common nouns and for sequences in which a capitalized attribute comes first and the common noun follows it. After analyzing 1000 examples from each search, we divided the results into two groups: names, and sequences with an uppercase letter that are not names. Some of the causes of extracting “false” names are technical (e.g. punctuation: sentences separated by a paragraph mark (¶), missing punctuation at the end of a sentence; whole parts of text written in upper case...), and others lie in the texts crawled for hrWaC, which are not written in accordance with Croatian orthography.
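The kind of search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' query: the noun list, the sample sentence, the function name `find_candidates` and the choice to treat ¶ as a sentence boundary are all assumptions made for illustration, and the real study ran its searches on the hrWaC corpus rather than on plain strings. The sketch looks for a capitalized attribute followed by a common noun and discards hits where the capital letter is sentence-initial.

```python
import re

# Hypothetical list of common nouns (nominative only); not taken from the paper.
COMMON_NOUNS = ["ulica", "trg", "škola"]

# Characters treated as sentence boundaries. Including the paragraph mark is an
# assumption, motivated by the abstract naming it as a source of false hits.
SENTENCE_END = ".!?¶"

# Capitalized word (Croatian alphabet) + whitespace + one of the common nouns.
PATTERN = re.compile(
    r"\b([A-ZČĆĐŠŽ][a-zčćđšž]+)\s+(" + "|".join(COMMON_NOUNS) + r")\b"
)


def find_candidates(text):
    """Return 'Attribute noun' sequences whose capital is not sentence-initial."""
    hits = []
    for match in PATTERN.finditer(text):
        before = text[:match.start()].rstrip()
        # Skip matches at the start of the text or right after . ! ? ¶,
        # where the capital letter says nothing about name status.
        if not before or before[-1] in SENTENCE_END:
            continue
        hits.append(match.group(0))
    return hits


sample = (
    "Zove se Vukovarska ulica. Velika ulica je bučna. "
    "Tamo je Osnovna škola Ivana Gundulića."
)
print(find_candidates(sample))
```

On the sample, "Velika ulica" is skipped because it opens a sentence, while the two mid-sentence sequences are kept as name candidates. Text with missing final punctuation or written entirely in upper case would still yield the "false" names the abstract mentions, so such output would need manual review.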
Original language
English
Scientific fields
Philology
WORK AFFILIATIONS
Institutions:
Institut za hrvatski jezik i jezikoslovlje, Zagreb