Data source: CROSBI

An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult (CROSBI ID 295723)

Journal article | original scientific paper | international peer review

Dudjak, Mario ; Martinović, Goran An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult // Expert systems with applications, 182 (2021), 115297, 22. doi: 10.1016/j.eswa.2021.115297

Authorship details

Dudjak, Mario ; Martinović, Goran

English

An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult

Learning from data stemming from real-world problems is inherently challenging and difficult due to the numerous intrinsic characteristics present in datasets. The problem of class imbalance is known to significantly impair classification performance and has attracted increasing attention from researchers. On the other hand, some studies suggest that the detrimental effects of class imbalance occur only when the dataset encompasses other intrinsic characteristics such as small disjuncts, class overlapping, noise or data rarity. However, the literature is often ambiguous in terms of understanding and distinguishing the influence of these characteristics on the behaviour of standard classification algorithms. This paper provides a contemporary empirical study of the behaviour and performance of five well-known classifiers on a large number of imbalanced datasets exhibiting numerous combinations of the stated characteristics. The aim of the study is to identify and rank difficulty factors when learning from imbalanced data, depending on the type of classification algorithm used. In general, the obtained results suggest that if classifiers conceptually have no problem with class separation into sub-concepts, noise is the characteristic that most impairs their performance, closely followed by class overlapping and class imbalance. To alleviate these problems, oversampling and undersampling procedures were tested and directions are given for selecting appropriate techniques when dealing with the problem of class imbalance.

Classification ; Class imbalance ; Class overlapping ; Data intrinsic characteristics ; Noise ; Small disjuncts

Publication details

Volume: 182

Year: 2021

Article number: 115297

Number of pages: 22

Status: published

ISSN: 0957-4174

e-ISSN: 1873-6793

DOI: 10.1016/j.eswa.2021.115297

Paper associations

Computing
