Data Acquisition and Corpus Creation for Phishing Detection

Dunđer, Ivan; Seljan, Sanja; Odak, Marko

Bibliographic record number: 1277747

Data Acquisition and Corpus Creation for Phishing Detection // MIPRO Proceedings - ICT and Electronics Convention / Skala, Karolj (ed.).
Rijeka: Hrvatska udruga za informacijsku i komunikacijsku tehnologiju, elektroniku i mikroelektroniku - MIPRO, 2023, pp. 589-594 (lecture, international peer review, full paper (in extenso), scientific)

CROSBI ID: 1277747

Title
Data Acquisition and Corpus Creation for Phishing Detection

Authors
Dunđer, Ivan; Seljan, Sanja; Odak, Marko

Type, subtype and category of work
Papers in conference proceedings, full paper (in extenso), scientific

Source
MIPRO Proceedings - ICT and Electronics Convention / Skala, Karolj - Rijeka : Hrvatska udruga za informacijsku i komunikacijsku tehnologiju, elektroniku i mikroelektroniku - MIPRO, 2023, 589-594

Conference
46th ICT and Electronics Convention

Place and date
Opatija, Croatia, 22.05.2023 - 26.05.2023

Type of participation
Lecture

Type of review
International peer review

Keywords
data acquisition; digital corpus creation; computational data analysis; natural language processing; phishing; information privacy; information security

Abstract
Detecting phishing attacks is not straightforward, since many obstacles arise from language complexity and technical aspects. Studying phishing attacks and related issues relies heavily on computer datasets, i.e. digital corpora that reflect these linguistic and technical intricacies. Diverse studies using phishing datasets have been performed, but mainly for the English language. Research on other languages is scarce, especially for less widely spoken languages. For the Croatian language there is an evident lack of corpora, which are essential for diverse analyses and for constructing models capable of recognizing phishing attacks and protecting users. These datasets are necessary for natural language processing and for building machine learning workflows, where results largely depend on corpora that must be specifically crafted for this purpose. Therefore, creating high-quality domain-specific corpora is of great importance in the domain of information security. Such corpora can be employed for teaching purposes in various higher education courses, and can be analyzed in numerous ways in order to understand the underlying principles of phishing attack strategies. The aim of this paper is to demonstrate the entire process of data acquisition and corpus creation for the phishing detection domain. In addition, an analysis of the corpus is presented with regard to different aspects, such as descriptive attributes, terminology characteristics, metadata and language.

Original language
English

Scientific fields
Computer science, Information and communication sciences

RELATED RECORDS

Projects:
--11-933-1053 - Machine learning and natural language processing in the domain of computer security – Part II (Seljan, Sanja) (CroRIS)
EK-EFRR-KK.01.2.1.02.0267 - Research on natural language processing (for the Croatian language) and development of the PhisHRban product for increasing cybersecurity (PhisHRban) (Pejić Bach, Mirjana; Seljan, Sanja, EK - KK.01.2.1.02) (CroRIS)

Institutions:
Faculty of Humanities and Social Sciences, Zagreb

Profiles:

Sanja Seljan (author)