Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model

Farhan Ullah, Junfeng Wang, Sohail Jabbar, Fadi Al-Turjman, Mamoun Alazab

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Source Code Authorship Attribution (SCAA) aims to identify the real author of source code in a corpus. Although it poses a privacy threat to open-source programmers, it can be significantly helpful for developing forensic applications such as ghostwriting detection, copyright dispute settlement, and other code analysis applications. Efficient feature extraction is the key challenge in attributing specific source code to its real author. In this paper, the Program Dependence Graph with Deep Learning (PDGDL) methodology is proposed to identify authors from source code written in different programming languages. First, the PDG is used to extract control and data dependencies from the source code. Second, a preprocessing technique is applied to convert PDG features into small instances with frequency details. Third, the Term Frequency Inverse Document Frequency (TFIDF) technique is used to emphasize the importance of each PDG feature in the source code. Fourth, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to tackle the class imbalance problem. Finally, a deep learning algorithm is applied to extract the coding-style features of each programmer and to attribute the code to its real author. The deep learning algorithm is further fine-tuned with a dropout layer, learning error rate, loss and activation functions, and dense layers for better accuracy. The proposed work is evaluated on data from 1000 programmers, collected from Google Code Jam (GCJ). The dataset covers three programming languages: C++, Java, and C#. The results outperform existing techniques in terms of classification accuracy, precision, recall, and F-measure.
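
The second to fourth steps of the abstract (frequency counts, TFIDF weighting, and SMOTE) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the PDG feature tokens (for example "ctrl:if->call"), the author labels, the plain tf * log(N/df) weighting, and the helper names tfidf and smote_like are all hypothetical, and the oversampling step is simplified to interpolation toward the single nearest minority neighbour rather than a randomly chosen one of k neighbours as in full SMOTE.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Weight each feature token by term frequency times inverse document frequency."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of programs in which each feature occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append(
            [(counts[tok] / total) * math.log(n_docs / df[tok]) for tok in vocab]
        )
    return vocab, vectors


def smote_like(minority, n_new, rng):
    """Simplified SMOTE: interpolate between a minority sample and its nearest minority neighbour."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = min(
            (v for v in minority if v is not base),
            key=lambda v: math.dist(base, v),
        )
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical PDG edge tokens per program; real PDG features would come from a parser.
programs = [
    ["ctrl:if->call", "data:decl->use", "data:decl->use"],
    ["ctrl:for->assign", "data:decl->use"],
    ["ctrl:for->assign", "ctrl:for->assign", "data:decl->use"],
    ["ctrl:while->call", "data:decl->use"],
    ["ctrl:while->call", "ctrl:if->call", "data:decl->use"],
]
authors = ["A", "A", "A", "B", "B"]  # author B is the minority class

vocab, vectors = tfidf(programs)
majority = [v for v, a in zip(vectors, authors) if a == "A"]
minority = [v for v, a in zip(vectors, authors) if a == "B"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = smote_like(minority, len(majority) - len(minority), rng)
balanced_minority = minority + synthetic
```

In this toy corpus the feature "data:decl->use" occurs in every program, so its weight is zero and it carries no information about authorship, while rarer dependency patterns receive larger weights. The balanced vectors would then be passed to a classifier, which is outside the scope of this sketch.
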
Original language: English
Article number: 8848478
Pages (from-to): 141987-141999
Number of pages: 13
Journal: IEEE Access
Volume: 7
DOI: 10.1109/ACCESS.2019.2943639
Publication status: Published - 25 Sep 2019

Cite this

Ullah, Farhan ; Wang, Junfeng ; Jabbar, Sohail ; Al-Turjman, Fadi ; Alazab, Mamoun. / Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model. In: IEEE Access. 2019 ; Vol. 7. pp. 141987-141999.
@article{56f35c7b6f83464c9ecabda66021c960,
title = "Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model",
keywords = "Code authorship attribution, deep learning, program dependence graph, software forensics and security, software plagiarism",
author = "Farhan Ullah and Junfeng Wang and Sohail Jabbar and Fadi Al-Turjman and Mamoun Alazab",
year = "2019",
month = "9",
day = "25",
doi = "10.1109/ACCESS.2019.2943639",
language = "English",
volume = "7",
pages = "141987--141999",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
}

Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model. / Ullah, Farhan; Wang, Junfeng; Jabbar, Sohail; Al-Turjman, Fadi; Alazab, Mamoun.

In: IEEE Access, Vol. 7, 8848478, 25.09.2019, p. 141987-141999.

TY  - JOUR
T1  - Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model
AU  - Ullah, Farhan
AU  - Wang, Junfeng
AU  - Jabbar, Sohail
AU  - Al-Turjman, Fadi
AU  - Alazab, Mamoun
PY  - 2019/9/25
Y1  - 2019/9/25
KW  - Code authorship attribution
KW  - deep learning
KW  - program dependence graph
KW  - software forensics and security
KW  - software plagiarism
UR  - http://www.scopus.com/inward/record.url?scp=85077692068&partnerID=8YFLogxK
U2  - 10.1109/ACCESS.2019.2943639
DO  - 10.1109/ACCESS.2019.2943639
M3  - Article
VL  - 7
SP  - 141987
EP  - 141999
JO  - IEEE Access
JF  - IEEE Access
SN  - 2169-3536
M1  - 8848478
ER  -