A Comprehensive Framework of Machine Learning Model for Clustering Ham and Spam Emails based on Unsupervised Learning

    Student thesis: Doctor of Philosophy (PhD) - CDU

    Abstract

    Spam emails are unsolicited bulk emails that can serve malicious purposes. This research addresses the problem by proposing a framework that separates ham from spam emails using unsupervised clustering alone.

    Spammers distribute spam emails worldwide, both for simple marketing and to launch more harmful activities, such as reputational damage and financial disruption to individuals and institutions, often through ransomware. A number of recent research initiatives have proposed supervised, semi-supervised and bio-inspired machine learning approaches to address the problem [20]. However, no framework proposed so far relies entirely on unsupervised methods. This research attempts to address that gap.

    The study was conducted on publicly available datasets. The investigation was carried out in three steps: first, clustering based on header information, excluding the subject header; second, clustering based on email content and the subject header; and finally, clustering based on the two feature sets combined. The comprehensive framework includes a novel feature reduction algorithm. Several unsupervised algorithms were applied: DBSCAN, HDBSCAN, OPTICS, Spectral, K-means, K-modes and BIRCH.

    OPTICS, K-means and Spectral clustering produced the most effective clustering of the header information. For the body and the subject header, OPTICS and DBSCAN demonstrated the best performance after a series of validation steps, and K-means also showed promise. For the final model evaluation, based on the combined set of header and content features used in the previous studies, OPTICS, K-means and K-modes showed the most promise. These top-performing algorithms achieved an average balanced accuracy and purity of over 98%, demonstrating the usefulness of unsupervised clustering with intuitively engineered features in differentiating ham from spam emails.
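
    To make the two reported measures concrete, the following is a minimal sketch of how purity and balanced accuracy could be computed for a clustering when ground-truth ham/spam labels are available. It is not taken from the thesis: the function names, the toy labels and the choice to map each cluster to its majority class before computing balanced accuracy are all assumptions made for illustration.

```python
from collections import Counter


def _class_counts_per_cluster(labels_true, labels_pred):
    """Count how many emails of each true class fall into each cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[truth] += 1
    return clusters


def purity(labels_true, labels_pred):
    """Fraction of emails that belong to the majority class of their cluster."""
    clusters = _class_counts_per_cluster(labels_true, labels_pred)
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def balanced_accuracy(labels_true, labels_pred):
    """Mean per-class recall, with each cluster mapped to its majority class.

    The majority-class mapping is an assumption of this sketch, not a
    description of the procedure used in the thesis.
    """
    clusters = _class_counts_per_cluster(labels_true, labels_pred)
    cluster_to_class = {
        cluster: counts.most_common(1)[0][0] for cluster, counts in clusters.items()
    }
    totals = Counter(labels_true)
    hits = Counter()
    for truth, cluster in zip(labels_true, labels_pred):
        if cluster_to_class[cluster] == truth:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical toy data: six ham and two spam emails assigned to two clusters.
example_truth = ["ham"] * 6 + ["spam"] * 2
example_clusters = [0, 0, 0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print("purity:", purity(example_truth, example_clusters))
    print("balanced accuracy:", balanced_accuracy(example_truth, example_clusters))
```

    On imbalanced data the two measures can differ, because purity weights every email equally while balanced accuracy averages the recall of each class.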


    Date of Award: Dec 2021
    Original language: English
    Supervisor: Kannoorpatti Krishnan (Supervisor), Sami Azam (Supervisor) & Bharanidharan Shanmugam (Supervisor)
