Bibliographic record number: 1253070
Training a Genre Classifier for Automatic Classification of Web Pages
Training a Genre Classifier for Automatic Classification of Web Pages // Journal of Computing and Information Technology, 15 (2007), 4; 305-311 doi:10.2498/cit.1001137 (international peer review, article, scientific)
CROSBI ID: 1253070
Title
Training a Genre Classifier for Automatic Classification of Web Pages
Authors
Vidulin, Vedrana ; Luštrek, Mitja ; Gams, Matjaž
Source
Journal of Computing and Information Technology (1330-1136) 15 (2007), 4; 305-311
Type, subtype and category of work
Journal papers, article, scientific
Keywords
genre classification, web page, genre features, ensemble algorithm
Abstract
This paper presents experiments on classifying web pages by genre. First, a corpus of 1,539 manually labeled web pages was prepared. Second, 502 genre features were selected based on the literature and on observation of the corpus. Third, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for the induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; the F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
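The abstract compares the two algorithms on precision, recall, F-measure and accuracy. As a rough illustration of how those four measures relate, the sketch below computes them for a single genre treated as a yes/no decision. The function name `binary_metrics` and the toy label lists are assumptions made up for this example; they are not data or code from the paper.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F-measure and accuracy for one genre (1 = page has the genre)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
single_tree_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # more hits, more false alarms
ensemble_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]     # more cautious predictions

tree = binary_metrics(y_true, single_tree_pred)
ensemble = binary_metrics(y_true, ensemble_pred)
print("single tree:", tree)
print("ensemble:   ", ensemble)
```

In this made-up case the more cautious classifier gains precision and accuracy, loses recall, and ends up with the same F-measure, which is the general shape of the trade-off the abstract reports. The numbers themselves carry no relation to the paper's results.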
Original language
English
Journal indexed in:
- Scopus