Overview of bibliographic record number: 1263868
CASSED: Context-based Approach for Structured Sensitive Data Detection
CASSED: Context-based Approach for Structured Sensitive Data Detection // Expert systems with applications, 223 (2023), 119924, 10 doi:10.1016/j.eswa.2023.119924 (international peer review, article, scientific)
CROSBI ID: 1263868
Title
CASSED: Context-based Approach for Structured Sensitive Data Detection
Authors
Kužina, Vjeko ; Petric, Ana-Marija ; Barišić, Marko ; Jović, Alan
Source
Expert systems with applications (0957-4174) 223 (2023); 119924, 10
Type, subtype and category of work
Journal papers, article, scientific
Keywords
sensitive data detection ; privacy protection ; structured data ; machine learning ; transformers ; context-based detection
Abstract
The need for sensitive data detection and identification has increased in recent years. Sensitive data detection and identification are necessary steps for privacy protection. The focus in this field has been on unstructured data detection using natural language processing (NLP) approaches, while there has been little progress in the field of structured data. Most of the structured data approaches consider independent feature representations of cells, without taking potentially relevant context into account. In this work, we introduce a novel context-based approach named CASSED, which stands for Context-based Approach for Structured SEnsitive Data Detection. CASSED addresses the problem of sensitive data detection in structured data through the lens of NLP, using the transformer-based BERT method. Our approach aims to actively capture relations both within and between cells in the same column as the assumption is that the data present in the same column in a table are mostly very similar. CASSED works as a classifier for columns in database tables with the task of predicting a label or multiple labels for different types of sensitive data that a column may represent. Since there is no officially recognized dataset for the task, we compared CASSED on datasets used for similar tasks from related work. Furthermore, we created our own dataset focused on sensitive data to evaluate CASSED. Our method outperformed methods from related work both on their datasets and achieved significantly better results on our own dataset compared to our baseline model as well as models from related work. Our research suggests that treating structured data as context-rich is a viable strategy for sensitive data detection and identification.
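The column-level, multi-label framing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the serialization format (header followed by cell values joined with a `[SEP]` marker), the helper names `serialize_column` and `predict_labels`, the cell limit, and the 0.5 threshold are all assumptions made for illustration, and the logits stand in for the output of a fine-tuned BERT-style classifier that is not shown here.

```python
import math
from typing import Dict, List, Sequence


def serialize_column(header: str, cells: Sequence[str],
                     max_cells: int = 5, sep: str = " [SEP] ") -> str:
    """Join a column header and its first non-empty cell values into one text
    sequence, so that a sequence model can see cells of the same column together.
    (Hypothetical format; the abstract does not specify the serialization.)"""
    values = [str(c).strip() for c in cells if str(c).strip()]
    return sep.join([header.strip()] + values[:max_cells])


def predict_labels(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label decision: keep every label whose sigmoid score reaches the
    threshold. `logits` is a placeholder for per-label classifier outputs."""
    return sorted(
        label for label, z in logits.items()
        if 1.0 / (1.0 + math.exp(-z)) >= threshold
    )


if __name__ == "__main__":
    text = serialize_column("email", ["ana@example.com", "marko@example.org"])
    print(text)
    # Made-up logits, only to show how one column can receive several labels.
    print(predict_labels({"email": 2.0, "name": -1.0, "contact": 0.3}))
```

Serializing a whole column into one sequence is what lets a transformer attend across cells, which is the kind of within-column context the abstract argues for; a per-cell feature approach would score each value in isolation.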
Original language
English
Scientific fields
Computing
RELATED TO THIS WORK
Projects:
EK-EFRR-KK.01.2.1.02.0038 - Digital platform for privacy protection and misuse prevention through personal data life-cycle management (AIPD2) (Golub, Marin, EK)
Institutions:
Faculty of Electrical Engineering and Computing, Zagreb
Links to the full text of the work:
doi, www.sciencedirect.com, www.zemris.fer.hr
The journal is indexed in:
- Current Contents Connect (CCC)
- Web of Science Core Collection (WoSCC)
- Science Citation Index Expanded (SCI-EXP)
- SCI-EXP, SSCI and/or A&HCI
- Scopus