An Unsupervised Approach for Content-Based Clustering of Emails into Spam and Ham through Multiangular Feature Formulation

Research output: Contribution to journal › Article › peer-review


Abstract

The rapid growth of spam email attacks, and the inherent malicious dynamism of those attacks on a range of social, personal and business activities, warrants an intelligent and automated anti-spam framework. Threats such as malware propagation, identity theft, pilfering of sensitive data, and monetary as well as reputational damage are increasing sharply, endangering the privacy of victims. Current solutions are rather incomplete when the multidimensional feature range of email is taken into account. We believe a methodology based on Artificial Intelligence, especially unsupervised machine learning, is the way forward. This research investigates the application of unsupervised learning to the clustering of spam and ham emails. The overall goal is to develop a framework that depends solely on unsupervised methodologies, using a clustering approach that includes multiple algorithms and relies primarily on the email content (body) and the subject header. The clustering was performed on a novel binary dataset of 22,000 ham and spam emails described by ten features (reduced from eleven by feature reduction). Seven of these ten features are unique to this study, engineered to represent impactful analytical email characteristics from a multiangular point of view. Of the five clustering algorithms investigated in this work, OPTICS produced the optimum clustering, demonstrating a 0.26% higher average efficacy than its nearest performer, DBSCAN. The average balanced accuracy for OPTICS and DBSCAN was found to be ≈75.76%.
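As a rough illustration of the balanced-accuracy figure quoted above, the sketch below scores a clustering against known ham/spam labels. It is not the authors' evaluation code: the abstract does not say how clusters were matched to classes, so the majority-vote mapping, the function names `map_clusters_to_labels` and `balanced_accuracy`, and the toy data are all assumptions made for this example. Balanced accuracy itself is the standard mean of per-class recall.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(true_labels, cluster_ids):
    """Assign each cluster the majority true label among its members.

    Assumption: majority voting is used here only for illustration;
    the abstract does not describe the mapping used in the study.
    """
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)
    return {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in members.items()
    }


def balanced_accuracy(true_labels, predicted_labels):
    """Return the mean of per-class recall over the classes in true_labels."""
    recalls = []
    for cls in sorted(set(true_labels)):
        total = sum(1 for t in true_labels if t == cls)
        hits = sum(
            1
            for t, p in zip(true_labels, predicted_labels)
            if t == cls and p == cls
        )
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical toy data: eight emails and the cluster each one landed in.
true_labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]

mapping = map_clusters_to_labels(true_labels, cluster_ids)
predicted = [mapping[c] for c in cluster_ids]
score = balanced_accuracy(true_labels, predicted)
print(mapping, score)
```

Density-based algorithms such as DBSCAN and OPTICS can also mark points as noise; in this sketch a noise label would simply be treated as one more cluster, which is a simplification.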

Original language: English
Pages (from-to): 135186-135209
Number of pages: 24
Journal: IEEE Access
Volume: 9
Publication status: Published - Sep 2021
