Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data

Solomon H. Ebenuwa, Mhd Saeed Sharif, Mamoun Alazab, Ameer Al-Nemrat

Research output: Contribution to journal, Article, Research, peer-review

Abstract

Data are generated and used to support all aspects of healthcare provision, from policy formation to the delivery of primary care services. In particular, with the shift of emphasis from curative to preventive medicine, the growing importance of data-based research such as data mining and machine learning has highlighted the issue of class distribution in datasets. In typical predictive modeling, the inability to effectively address class imbalance in real-life datasets is an important shortcoming of existing machine learning algorithms. Most algorithms assume balanced classes in their design, resulting in poor performance when predicting the minority target class. Ironically, the minority target class is usually the focus of the prediction. Misclassification of the minority target class has serious consequences in the detection of chronic diseases, fraud, and intrusion, where positive cases are erroneously predicted as negative. This paper presents a new attribute selection technique, called variance ranking, for handling the class imbalance problem in a dataset. The results obtained were compared with those of two well-known attribute selection techniques: Pearson correlation and information gain. The paper also uses a novel similarity measurement technique, ranked order similarity (ROS), to evaluate variance ranking attribute selection against Pearson correlation and information gain. Further validation was carried out using three binary classifiers: logistic regression, support vector machine, and decision tree. The proposed variance ranking and ranked order similarity techniques showed better results than the benchmarks. The ROS technique provided an excellent means of grading and measuring the similarities where other similarity measurement techniques were inadequate or not applicable.
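As an illustration of one of the benchmark techniques named in the abstract, the following is a minimal sketch of filter-style attribute ranking by absolute Pearson correlation with a binary target. It is not the paper's variance ranking or ROS method, whose details are not given on this page. The attribute names (`attr_a`, `attr_b`, `attr_c`), the toy data, and the helper names `pearson` and `rank_by_pearson` are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no linear relationship with anything.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_pearson(columns, labels):
    """Return attribute names ordered by |r| with the labels, strongest first."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical imbalanced toy data: 8 majority (0) and 2 minority (1) cases.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
columns = {
    "attr_a": [1, 2, 1, 2, 1, 2, 1, 2, 9, 8],  # separates the two classes
    "attr_b": [5, 1, 4, 2, 5, 1, 4, 2, 3, 3],  # unrelated to the label
    "attr_c": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # constant column
}

ranking = rank_by_pearson(columns, labels)
print(ranking)
```

In this toy case the attribute that separates the minority class should rank first. A ranking of this kind is only a baseline; the abstract reports that the proposed variance ranking performed better than such benchmarks on the datasets studied.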

Original language: English
Article number: 8651567
Pages (from-to): 24649-24666
Number of pages: 18
Journal: IEEE Access
Volume: 7
DOI: 10.1109/ACCESS.2019.2899578
Publication status: Published - 25 Feb 2019

Fingerprint

Learning systems
Decision trees
Learning algorithms
Medicine
Support vector machines
Data mining
Logistics

Cite this

Ebenuwa, Solomon H. ; Sharif, Mhd Saeed ; Alazab, Mamoun ; Al-Nemrat, Ameer. / Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data. In: IEEE Access. 2019 ; Vol. 7. pp. 24649-24666.
@article{7ff4077019b04378ad236f6bf7ca67ad,
title = "Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data",
keywords = "binary class, class distribution, decision tree, imbalance ratio, Imbalanced dataset, logistic regression, majority class, minority class, oversampling, peak threshold accuracy, ranked order similarity, support vector machine, under sampling",
author = "Ebenuwa, {Solomon H.} and Sharif, {Mhd Saeed} and Mamoun Alazab and Ameer Al-Nemrat",
year = "2019",
month = "2",
day = "25",
doi = "10.1109/ACCESS.2019.2899578",
language = "English",
volume = "7",
pages = "24649--24666",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",

}

TY - JOUR
T1 - Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data
AU - Ebenuwa, Solomon H.
AU - Sharif, Mhd Saeed
AU - Alazab, Mamoun
AU - Al-Nemrat, Ameer
PY - 2019/2/25
Y1 - 2019/2/25
KW - binary class
KW - class distribution
KW - decision tree
KW - imbalance ratio
KW - Imbalanced dataset
KW - logistic regression
KW - majority class
KW - minority class
KW - oversampling
KW - peak threshold accuracy
KW - ranked order similarity
KW - support vector machine
KW - under sampling
UR - http://www.scopus.com/inward/record.url?scp=85062706680&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2019.2899578
DO - 10.1109/ACCESS.2019.2899578
M3 - Article
VL - 7
SP - 24649
EP - 24666
JO - IEEE Access
JF - IEEE Access
SN - 2169-3536
M1 - 8651567
ER -