Searching for Semantically Correct Postal Addresses on the Croatian Web (CROSBI ID 616627)
Conference contribution in a journal | original scientific paper | international peer review
Statement of responsibility
Ugrina, Ivo ; Žigo, Mislav
English
Searching for Semantically Correct Postal Addresses on the Croatian Web
This article presents a method for extracting and simultaneously verifying postal addresses in web pages written in a highly inflective language (Croatian). The method combines direct city name extraction, a string similarity measure (Jaro-Winkler) for street names, an algorithm for treating overlapping addresses, and a machine learning classifier (decision trees) to derive Semantically Correct Postal Addresses. A Semantically Correct Postal Address is defined as one that the author of the text meant to write and that is not simply there by a lucky ordering of words. The presented method performs geoparsing and geocoding jointly. For the initial search of cities and streets, the method relies on a database containing most of the streets and cities in Croatia. The method was evaluated on a data set of 13,000,000 documents (from 35,000 web domains) and found 4,000,000 addresses in 2,750,000 documents. The quality of the classifiers was tested on a hand-annotated set, giving F1 scores greater than 0.9.
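The abstract names Jaro-Winkler as the similarity measure used for street names. The following is a minimal sketch of that measure in Python, not the authors' implementation. The helper `best_street_match`, the 0.9 threshold, the case folding, and the sample street list are all illustrative assumptions and do not come from the paper.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def best_street_match(candidate, streets, threshold=0.9):
    """Hypothetical helper: return the closest known street name, or None.

    The threshold value is an assumption chosen for illustration only.
    """
    scored = [(jaro_winkler(candidate.casefold(), s.casefold()), s)
              for s in streets]
    score, street = max(scored, default=(0.0, None))
    return street if score >= threshold else None


# An inflected form ("Ilici", locative) still matches the base form "Ilica".
print(best_street_match("Ilici", ["Ilica", "Vukovarska", "Savska cesta"]))
```

The prefix boost suits inflective languages in which case endings change the end of a word while the stem at the start stays fixed, which is one plausible reason for preferring Jaro-Winkler over plain Jaro here.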
postal addresses ; string similarity ; machine learning ; geographic location ; address extraction
Contribution details
276-283.
2014.
published
Host publication details
Central European conference on information and intelligent systems
Hunjak, Tihomir ; Lovrenčić, Sandra ; Tomičić, Igor
Varaždin: Fakultet organizacije i informatike Sveučilišta u Zagrebu
1847-2001
1848-2295
Conference details
Central European Conference on Information and Intelligent Systems
lecture
17.09.2014-19.09.2014
Varaždin, Croatia
Related fields
Information and communication sciences, Mathematics, Computing