Bibliographic record no. 1132489
An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult
An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult // Expert systems with applications, 182 (2021), 115297, 22 doi:10.1016/j.eswa.2021.115297 (international peer review, article, scientific)
CROSBI ID: 1132489
Title
An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult
Authors
Dudjak, Mario ; Martinović, Goran
Source
Expert systems with applications (0957-4174) 182 (2021); 115297, 22
Type, subtype and category of work
Journal papers, article, scientific
Keywords
Classification ; Class imbalance ; Class overlapping ; Data intrinsic characteristics ; Noise ; Small disjuncts
Abstract
Learning from data stemming from real-world problems is inherently challenging and difficult due to the numerous intrinsic characteristics present in datasets. The problem of class imbalance is known to significantly impair classification performance and has attracted increasing attention from researchers. On the other hand, some studies suggest that the detrimental effects of class imbalance occur only when the dataset encompasses other intrinsic characteristics such as small disjuncts, class overlapping, noise or data rarity. However, the literature is often ambiguous in terms of understanding and distinguishing the influence of these characteristics on the behaviour of standard classification algorithms. This paper provides a contemporary empirical study of the behaviour and performance of five well-known classifiers on a large number of imbalanced datasets exhibiting numerous combinations of the stated characteristics. The aim of the study is to identify and rank difficulty factors when learning from imbalanced data, depending on the type of classification algorithm used. In general, the obtained results suggest that if classifiers conceptually have no problem with class separation into sub-concepts, noise is the characteristic that most impairs their performance, closely followed by class overlapping and class imbalance. To alleviate these problems, oversampling and undersampling procedures were tested and directions are given for selecting appropriate techniques when dealing with the problem of class imbalance.
Original language
English
Scientific fields
Computing
WORK AFFILIATIONS
Institutions:
Fakultet elektrotehnike, računarstva i informacijskih tehnologija Osijek
Journal indexed in:
- Current Contents Connect (CCC)
- Web of Science Core Collection (WoSCC)
- Science Citation Index Expanded (SCI-EXP)
- SCI-EXP, SSCI and/or A&HCI
- Scopus