Predicting likelihood of legitimate data loss in email DLP

Mohamed Falah Faiz, Junaid Arshad, Mamoun Alazab, Andrii Shalaginov

Research output: Contribution to journal › Article › Research › peer-review

Abstract

The volume and variety of data collected by modern organisations have increased significantly over the last decade, necessitating the detection and prevention of disclosure of sensitive data. Data loss prevention is an embedded process used to protect against disclosure of sensitive data to external uncontrolled environments. A typical Data Loss Prevention (DLP) system uses custom policies to identify and prevent accidental and malicious data leakage, producing a large number of security alerts, including a significant volume of false positives. Consequently, identifying legitimate data loss can be very challenging, as each incident comprises different characteristics, often requiring extensive intervention by a domain expert to review alerts individually. This limits the ability to detect data loss alerts in real time, making organisations vulnerable to financial and reputational damage. The aim of this research is to strengthen the data loss detection capabilities of a DLP system by implementing a machine learning model to predict the likelihood of legitimate data loss. We conducted extensive experimentation using Decision Tree and Random Forest algorithms with historical email incident data collected by a globally established telecommunication enterprise. The final model, produced with the Random Forest algorithm, was identified as the most effective, as it was able to predict approximately 95% of data loss incidents accurately with an average true positive value of 90%. Furthermore, the proposed solution enables identification of legitimate data loss in email DLP whilst facilitating prioritisation of real data loss through human-understandable explanation of the decision, thereby improving the efficiency of the process.
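
The two figures quoted in the abstract can be read as an overall accuracy and a true positive rate. The sketch below shows how such figures are commonly computed from labelled alerts. It assumes that "true positive value" refers to the true positive rate (recall), which the abstract does not confirm. The function name and the toy labels are hypothetical and are not taken from the paper or its data.

```python
def accuracy_and_tpr(actual, predicted):
    """Return (accuracy, true positive rate) for binary labels.

    Label 1 means a legitimate data loss incident, 0 means a false alert.
    This is an illustrative helper, not the authors' evaluation code.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # real loss, flagged
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # false alert, dismissed
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # real loss, missed

    accuracy = (tp + tn) / len(pairs)
    # Guard against a sample with no real incidents at all.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, tpr


# Toy labels, invented purely for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
accuracy, tpr = accuracy_and_tpr(actual, predicted)
print(f"accuracy={accuracy:.2f}, true positive rate={tpr:.2f}")
```

Reporting both numbers matters for alert triage: when most alerts are false positives, a model can reach high accuracy while still missing real incidents, which only the true positive rate reveals.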

Original language: English
Pages (from-to): 1-14
Number of pages: 14
Journal: Future Generation Computer Systems
DOI: 10.1016/j.future.2019.11.004
Publication status: E-pub ahead of print - 11 Nov 2019

Fingerprint

Loss prevention
Electronic mail
Decision trees
Telecommunication
Learning systems

Cite this

Faiz, Mohamed Falah; Arshad, Junaid; Alazab, Mamoun; Shalaginov, Andrii. Predicting likelihood of legitimate data loss in email DLP. In: Future Generation Computer Systems. 2019; pp. 1-14.
@article{461a9a2e07064a218da432dd8a160bc9,
title = "Predicting likelihood of legitimate data loss in email DLP",
keywords = "Data loss prevention, Email DLP, Insider threats, Machine learning, Threat prediction",
author = "Faiz, {Mohamed Falah} and Junaid Arshad and Mamoun Alazab and Andrii Shalaginov",
year = "2019",
month = "11",
day = "11",
doi = "10.1016/j.future.2019.11.004",
language = "English",
pages = "1--14",
journal = "Future Generation Computer Systems: the international journal of grid computing: theory, methods and applications",
issn = "0167-739X",
publisher = "Elsevier",

}

TY - JOUR

T1 - Predicting likelihood of legitimate data loss in email DLP

AU - Faiz, Mohamed Falah

AU - Arshad, Junaid

AU - Alazab, Mamoun

AU - Shalaginov, Andrii

PY - 2019/11/11

Y1 - 2019/11/11

KW - Data loss prevention

KW - Email DLP

KW - Insider threats

KW - Machine learning

KW - Threat prediction

UR - http://www.scopus.com/inward/record.url?scp=85075425403&partnerID=8YFLogxK

U2 - 10.1016/j.future.2019.11.004

DO - 10.1016/j.future.2019.11.004

M3 - Article

SP - 1

EP - 14

JO - Future Generation Computer Systems: the international journal of grid computing: theory, methods and applications

JF - Future Generation Computer Systems: the international journal of grid computing: theory, methods and applications

SN - 0167-739X

ER -