Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Krleža, Dalibor; Vrdoljak, Boris; Brčić, Mario

Data source: CROSBI

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams (CROSBI ID 282672)

Journal contribution | original scientific paper | international peer review

Krleža, Dalibor ; Vrdoljak, Boris ; Brčić, Mario Statistical hierarchical clustering algorithm for outlier detection in evolving data streams // Machine learning, 1 (2020), 1, 40. doi: 10.1007/s10994-020-05905-4

Responsibility data

Authors

Krleža, Dalibor ; Vrdoljak, Boris ; Brčić, Mario

Basic data in the original language
Basic data in other languages

Language

English

Title

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Abstract

Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse data streams is impossible due to processing power and memory issues. To solve this, the traditional clustering algorithm complexity needed to be reduced, which led to the creation of sequential clustering algorithms. The usual approach is two-phase clustering, which uses online phase to relax data details and complexity, and offline phase to cluster concepts created in the online phase. Detecting anomalies in a data stream is usually solved in the online phase, as it requires unreduced data. Contrarily, producing good macro-clustering is done in the offline phase, which is the reason why two-phase clustering algorithms have difficulty being equally good in anomaly detection and macro-clustering. In this paper, we propose a statistical hierarchical clustering algorithm equally suitable for both detecting anomalies and macro-clustering. The proposed algorithm is single-phased and uses statistical inference on the input data stream, resulting in statistical distributions that are constantly updated. This makes the classification adaptable, allowing agglomeration of outliers into clusters, tracking population evolution, and to be used without knowing the expected number of clusters and outliers. The proposed algorithm was tested against typical clustering algorithms, including two-phase algorithms suitable for data stream analysis. A number of typical test cases were selected, to show the universality and qualities of the proposed clustering algorithm.

Keywords

Big data ; Clustering ; Anomaly detection ; Fraud detection

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication data

Journal

Machine learning

Volume (issue)

Year

2020

Article number

Number of pages

Publication status

published

ISSN

0885-6125

e-ISSN

1573-0565

DOI

10.1007/s10994-020-05905-4

Related records

Related persons

Dalibor Krleža (author)

Boris Vrdoljak (author)

Mario Brčić (author)

Related institutions

Fakultet elektrotehnike i računarstva (036) (author's institution)

Field

Information and communication sciences, Mathematics, Computer science

Links

doi.org

Indexed in

Scopus

Current Contents Connect (CCC)

Web of Science Core Collection, Science Citation Index Expanded (WoSCC-SCI-Exp)

Web of Science Core Collection, SCI-Exp, SSCI & A&HCI (WoSCC-SCI-Exp, SSCI, A&HCI)