Bibliographic record no.: 1253070

Training a Genre Classifier for Automatic Classification of Web Pages


Vidulin, Vedrana; Luštrek, Mitja; Gams, Matjaž
Training a Genre Classifier for Automatic Classification of Web Pages // Journal of Computing and Information Technology, 15 (2007), 4; 305-311 doi:10.2498/cit.1001137 (international peer review, article, scientific)


CROSBI ID: 1253070

Title
Training a Genre Classifier for Automatic Classification of Web Pages

Authors
Vidulin, Vedrana ; Luštrek, Mitja ; Gams, Matjaž

Source
Journal of Computing and Information Technology (1330-1136) 15 (2007), 4; 305-311

Type, subtype and category of work
Journal papers, article, scientific

Keywords
genre classification, web page, genre features, ensemble algorithm

Abstract
This paper presents experiments on classifying web pages by genre. First, a corpus of 1,539 manually labeled web pages was prepared. Second, 502 genre features were selected based on the literature and on observation of the corpus. Third, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
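
The abstract compares the two classifiers on precision, recall, F-measure and accuracy. As a minimal sketch of how these four measures relate for a single genre treated as a yes/no label, the Python below computes them from true and predicted labels. The function name binary_metrics and the toy label lists are hypothetical and chosen only for illustration; they are not the paper's data, features, or results.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, F-measure and accuracy for one genre.

    Labels are 1 (page belongs to the genre) or 0 (it does not).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Toy labels (invented for illustration, NOT taken from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
liberal_pred = [1, 1, 1, 1, 1, 0, 0, 0]       # labels many pages as the genre
conservative_pred = [1, 1, 0, 0, 0, 0, 0, 0]  # labels fewer pages as the genre

liberal = binary_metrics(y_true, liberal_pred)
conservative = binary_metrics(y_true, conservative_pred)
print(liberal)
print(conservative)
```

In this toy case the more conservative predictor gains precision and accuracy while losing some recall, which is the same kind of trade-off the abstract reports between the ensemble and the single decision tree.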

Original language
English



RELATED ITEMS


Profiles:

Vedrana Vidulin (author)

Links to the full text of the work:

doi cit.fer.hr

Cite this publication:

Vidulin, Vedrana; Luštrek, Mitja; Gams, Matjaž
Training a Genre Classifier for Automatic Classification of Web Pages // Journal of Computing and Information Technology, 15 (2007), 4; 305-311 doi:10.2498/cit.1001137 (international peer review, article, scientific)
Vidulin, V., Luštrek, M. & Gams, M. (2007) Training a Genre Classifier for Automatic Classification of Web Pages. Journal of Computing and Information Technology, 15 (4), 305-311 doi:10.2498/cit.1001137.
@article{article,
  author = {Vidulin, Vedrana and Lu\v{s}trek, Mitja and Gams, Matja\v{z}},
  title = {Training a Genre Classifier for Automatic Classification of Web Pages},
  journal = {Journal of Computing and Information Technology},
  year = {2007},
  volume = {15},
  number = {4},
  pages = {305-311},
  issn = {1330-1136},
  doi = {10.2498/cit.1001137},
  keywords = {genre classification, web page, genre features, ensemble algorithm}
}

Journal indexed in:


  • Scopus

