Predicting likelihood of legitimate data loss in email DLP

Mohamed Falah Faiz, Junaid Arshad, Mamoun Alazab, Andrii Shalaginov

    Research output: Contribution to journal › Article › peer-review

    15 Citations (Scopus)


    The volume and variety of data collected for modern organisations have increased significantly over the last decade, necessitating the detection and prevention of disclosure of sensitive data. Data loss prevention is an embedded process used to protect against disclosure of sensitive data to external uncontrolled environments. A typical Data Loss Prevention (DLP) system uses custom policies to identify and prevent accidental and malicious data leakage, producing a large number of security alerts, including a significant volume of false positives. Consequently, identifying legitimate data loss can be very challenging, as each incident comprises different characteristics, often requiring extensive intervention by a domain expert to review alerts individually. This limits the ability to detect data loss alerts in real time, making organisations vulnerable to financial and reputational damage. The aim of this research is to strengthen the data loss detection capabilities of a DLP system by implementing a machine learning model to predict the likelihood of legitimate data loss. We conducted extensive experimentation using Decision Tree and Random Forest algorithms with historical email incident data collected by a globally established telecommunication enterprise. The final model, produced with the Random Forest algorithm, was identified as the most effective, as it was able to predict approximately 95% of data loss incidents accurately with an average true positive value of 90%. Furthermore, the proposed solution enables identification of legitimate data loss in email DLP whilst facilitating prioritisation of real data loss through human-understandable explanation of the decision, thereby improving the efficiency of the process.
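
As a rough illustration of the two reported measures and of likelihood-based prioritisation, the sketch below scores a handful of alerts and orders them for review. It does not reproduce the paper's model, features, or data: the `Alert` type, the scores, the labels, and the 0.5 threshold are all hypothetical, and reading "true positive value" as the true positive rate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    alert_id: str
    likelihood: float    # hypothetical model score in [0, 1]
    is_real_loss: bool   # made-up analyst label, not data from the paper


def evaluate(alerts, threshold=0.5):
    """Return (accuracy, true positive rate) at the given score threshold."""
    tp = fp = tn = fn = 0
    for alert in alerts:
        predicted = alert.likelihood >= threshold
        if predicted and alert.is_real_loss:
            tp += 1
        elif predicted and not alert.is_real_loss:
            fp += 1
        elif not predicted and alert.is_real_loss:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(alerts)
    # Guard against a sample with no real data loss incidents.
    true_positive_rate = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, true_positive_rate


def prioritise(alerts):
    """Order alerts so the most likely real data loss is reviewed first."""
    return sorted(alerts, key=lambda alert: alert.likelihood, reverse=True)


# Illustrative alerts only; every value here is invented for the example.
ALERTS = [
    Alert("A1", 0.92, True),
    Alert("A2", 0.15, False),
    Alert("A3", 0.71, True),
    Alert("A4", 0.40, True),
    Alert("A5", 0.65, False),
    Alert("A6", 0.05, False),
    Alert("A7", 0.88, True),
    Alert("A8", 0.30, False),
]

accuracy, true_positive_rate = evaluate(ALERTS)
review_queue = [alert.alert_id for alert in prioritise(ALERTS)]
```

In this toy sample, one real incident (A4) falls below the threshold and one false alarm (A5) rises above it, which is why accuracy and true positive rate are reported separately: a model can classify most alerts correctly while still missing some real data loss.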

    Original language: English
    Pages (from-to): 744-757
    Number of pages: 14
    Journal: Future Generation Computer Systems
    Early online date: 11 Nov 2019
    Publication status: Published - Sept 2020

