Bibliographic record number: 726449
Searching for Semantically Correct Postal Addresses on the Croatian Web
Searching for Semantically Correct Postal Addresses on the Croatian Web // Proceedings of Central European Conference on Information and Intelligent Systems 2014 / Hunjak, Tihomir ; Lovrenčić, Sandra ; Tomičić, Igor (eds.).
Varaždin: Fakultet organizacije i informatike Sveučilišta u Zagrebu, 2014, pp. 276-283 (lecture, international peer review, full paper (in extenso), scientific)
CROSBI ID: 726449
Title
Searching for Semantically Correct Postal Addresses on the Croatian Web
Authors
Ugrina, Ivo ; Žigo, Mislav
Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific
Source
Proceedings of Central European Conference on Information and Intelligent Systems 2014
/ Hunjak, Tihomir ; Lovrenčić, Sandra ; Tomičić, Igor - Varaždin : Fakultet organizacije i informatike Sveučilišta u Zagrebu, 2014, 276-283
Conference
Central European Conference on Information and Intelligent Systems
Place and date
Varaždin, Croatia, 17.09.2014 - 19.09.2014
Type of participation
Lecture
Type of review
International peer review
Keywords
postal addresses ; string similarity ; machine learning ; geographic location ; address extraction
Abstract
This article presents a method of extraction and simultaneous verification of postal addresses within web pages written in a highly inflective language (Croatian). The method uses a combined approach of direct city name extraction, a string similarity measure (Jaro-Winkler) for street names, an algorithm for treating overlapping addresses and a machine learning classifier (decision trees) to derive Semantically Correct Postal Addresses. A Semantically Correct Postal Address is defined as one that was meant to be written by an author of the text and is not simply there by a lucky ordering of words. The presented method jointly does geoparsing and geocoding. For the initial search of cities and streets, the method relies on a database containing most of the streets and cities in Croatia. The method was evaluated on a data set consisting of 13,000,000 documents (from 35,000 web domains) and resulted in 4,000,000 addresses found in 2,750,000 documents. The quality of classifiers was tested on a hand-annotated set, giving F1 scores greater than 0.9.
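The street-name matching step mentioned in the abstract can be illustrated with a minimal sketch of the Jaro-Winkler similarity. This is not the authors' implementation: the function names, the two-entry street list and the 0.9 threshold are assumptions chosen only to show why a prefix-weighted measure suits an inflected language, where a street name such as "Ilica" may appear in the text as "Ilici".

```python
from typing import Iterable, Optional, Tuple


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def best_street_match(
    candidate: str, streets: Iterable[str], threshold: float = 0.9
) -> Optional[Tuple[str, float]]:
    """Return the most similar known street and its score, or None below threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    best = max(((s, jaro_winkler(candidate, s)) for s in streets),
               key=lambda pair: pair[1], default=None)
    if best is None or best[1] < threshold:
        return None
    return best


# Hypothetical street database; "ilici" is an inflected form of "ilica".
known_streets = ["ilica", "vukovarska"]
match = best_street_match("ilici", known_streets)
```

Because the prefix bonus rewards strings that agree at the start, inflected forms that differ only in their ending tend to score close to the base form, while unrelated names fall well below the threshold.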
Original language
English
Scientific fields
Mathematics, Computer Science, Information and Communication Sciences
AFFILIATIONS
Institutions:
Prirodoslovno-matematički fakultet, Matematički odjel, Zagreb,
Prirodoslovno-matematički fakultet, Zagreb