TY - JOUR
T1 - Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model
AU - Ullah, Farhan
AU - Wang, Junfeng
AU - Jabbar, Sohail
AU - Al-Turjman, Fadi
AU - Alazab, Mamoun
PY - 2019/9/25
Y1 - 2019/9/25
N2 - Source Code Authorship Attribution (SCAA) is the task of identifying the real author of a piece of source code in a corpus. Although it poses a privacy threat to open-source programmers, it can be significantly helpful in developing forensic applications such as ghostwriting detection, copyright dispute settlement, and other code analysis applications. Efficient feature extraction is the key challenge in classifying the real authors of specific source code. In this paper, the Program Dependence Graph with Deep Learning (PDGDL) methodology is proposed to identify authors from source code written in different programming languages. First, the PDG is used to extract control and data dependencies from the source code. Second, a preprocessing technique is applied to convert the PDG features into small instances with frequency details. Third, the Term Frequency Inverse Document Frequency (TFIDF) technique is used to emphasize the importance of each PDG feature in the source code. Fourth, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to address the class imbalance problem. Finally, a deep learning algorithm is applied to extract coding-style features for each programmer and to attribute the real authors. The deep learning algorithm is further fine-tuned with respect to the dropout layer, learning error rate, loss and activation functions, and dense layers for better accuracy. The proposed work is evaluated on data from 1000 programmers, collected from Google Code Jam (GCJ). The dataset covers three programming languages: C++, Java, and C#. The results show that the proposed approach outperforms existing techniques in terms of classification accuracy, precision, recall, and F-measure.
AB - Source Code Authorship Attribution (SCAA) is the task of identifying the real author of a piece of source code in a corpus. Although it poses a privacy threat to open-source programmers, it can be significantly helpful in developing forensic applications such as ghostwriting detection, copyright dispute settlement, and other code analysis applications. Efficient feature extraction is the key challenge in classifying the real authors of specific source code. In this paper, the Program Dependence Graph with Deep Learning (PDGDL) methodology is proposed to identify authors from source code written in different programming languages. First, the PDG is used to extract control and data dependencies from the source code. Second, a preprocessing technique is applied to convert the PDG features into small instances with frequency details. Third, the Term Frequency Inverse Document Frequency (TFIDF) technique is used to emphasize the importance of each PDG feature in the source code. Fourth, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to address the class imbalance problem. Finally, a deep learning algorithm is applied to extract coding-style features for each programmer and to attribute the real authors. The deep learning algorithm is further fine-tuned with respect to the dropout layer, learning error rate, loss and activation functions, and dense layers for better accuracy. The proposed work is evaluated on data from 1000 programmers, collected from Google Code Jam (GCJ). The dataset covers three programming languages: C++, Java, and C#. The results show that the proposed approach outperforms existing techniques in terms of classification accuracy, precision, recall, and F-measure.
KW - code authorship attribution
KW - deep learning
KW - program dependence graph
KW - software forensics and security
KW - software plagiarism
UR - http://www.scopus.com/inward/record.url?scp=85077692068&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2019.2943639
DO - 10.1109/ACCESS.2019.2943639
M3 - Article
AN - SCOPUS:85077692068
SN - 2169-3536
VL - 7
SP - 141987
EP - 141999
JO - IEEE Access
JF - IEEE Access
M1 - 8848478
ER -