Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data

Guhdar A. A. Mulla; Yıldırım Demir; Masoud Hassan

doi:10.17798/bitlisfen.939733

Abstract

Imbalanced data classification is a common problem in data mining, in which classifiers are biased towards the larger (majority) class. Classification of high-dimensional skewed (imbalanced) data is of great interest to decision-makers because it is more difficult. Dimension reduction, a process in which the number of variables is reduced, allows high-dimensional datasets to be interpreted more easily at the cost of some information loss. In this study, a method combining SMOTE oversampling with principal component analysis (PCA) is proposed to address the imbalance problem in high-dimensional data. Three classification algorithms (Logistic Regression, K-Nearest Neighbor, and Decision Tree) and two separate datasets were used to evaluate the efficacy of the proposed method and to determine the performance of the classifiers. The raw datasets and the datasets transformed by PCA, SMOTE, and SMOTE+PCA were each analyzed with these algorithms. The analyses were performed using WEKA. The results suggest that almost all classification algorithms improve their classification performance when PCA, SMOTE, or SMOTE+PCA is applied. However, for data rebalancing, SMOTE alone gave better results than PCA and SMOTE+PCA. The experimental results also suggest that the K-Nearest Neighbor classifier achieved higher classification performance than the other algorithms.
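The abstract does not spell out the steps, so the following is only a minimal sketch of what a SMOTE-then-PCA preprocessing stage could look like. The study itself used WEKA; Python is used here purely for illustration. The function names `smote` and `pca`, the toy data, the choice of applying SMOTE before PCA, and the use of power iteration to find the principal components are all assumptions made for this example, not details taken from the paper. The classification step (Logistic Regression, K-Nearest Neighbor, Decision Tree) is left out.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority samples, nearest to `base` first.
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def pca(rows, n_components, iters=200):
    """Project rows onto the leading principal components, found by power
    iteration with deflation on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _component in range(n_components):
        v = [1.0 / math.sqrt(d)] * d  # unit-length starting vector
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:  # no variance left to extract
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
        components.append(v)
        # Deflate: subtract the component just found from the covariance.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centred]


# Toy data invented for this example: 8 majority and 3 minority samples.
majority = [[1.0, 2.0, 0.5], [1.2, 1.9, 0.4], [0.8, 2.2, 0.6], [1.1, 2.1, 0.5],
            [0.9, 1.8, 0.7], [1.3, 2.0, 0.3], [1.0, 2.3, 0.5], [0.7, 1.7, 0.6]]
minority = [[3.0, 0.5, 1.5], [3.2, 0.4, 1.4], [2.9, 0.7, 1.6]]

# Step 1 (assumed order): oversample the minority class until balanced.
synthetic = smote(minority, n_new=len(majority) - len(minority), k=2)
X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))

# Step 2: reduce the balanced data from 3 features to 2 components.
X_reduced = pca(X, n_components=2)
```

In practice the oversampling should be fitted on the training split only, so that synthetic points derived from test samples do not leak into evaluation. A library implementation of both steps would normally replace these hand-written functions.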

Details

Primary Language

English

Journal Section

Research Article

Authors

Guhdar A. A. Mulla
0000-0001-6742-0083
Türkiye

Yıldırım Demir*
0000-0002-6350-8122
Türkiye

Masoud Hassan
0000-0003-3461-0942
Iraq

Publication Date

September 17, 2021

Submission Date

May 20, 2021

Acceptance Date

July 28, 2021

Published in Issue

Year 2021 Volume: 10 Number: 3

DOI

https://doi.org/10.17798/bitlisfen.939733

IZ

https://izlik.org/JA84WN24EN

Cite

APA

Mulla, G. A. A., Demir, Y., & Hassan, M. (2021). Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data. Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, 10(3), 858-869. https://doi.org/10.17798/bitlisfen.939733

Cited By

Artificial Intelligence-based Colon Cancer Prediction by Identifying Genomic Biomarkers

Medical Records

https://doi.org/10.37990/medr.1077024

Predictive Modeling of Student Dropout in MOOCs and Self-Regulated Learning

Computers

https://doi.org/10.3390/computers12100194

Prediction of Lake Van Water Level using Artificial Neural Network Model with Meteorological Parameters and Multiple Linear Regression Analysis: A Comparative Study

Bitlis Eren Üniversitesi Fen Bilimleri Dergisi

https://doi.org/10.17798/bitlisfen.1316881

Enhancing Lung Cancer Classification and Prediction With Deep Learning and Multi-Omics Data

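TÜRKÇE KONUŞMADA DUYGU TANIMA İÇİN MAKİNE ÖĞRENME YÖNTEMLERİ VE DERİN ÖĞRENME TABANLI MODELLERİN KARŞILAŞTIRILMASI [Comparison of Machine Learning Methods and Deep Learning-Based Models for Emotion Recognition in Turkish Speech]

Çoklu Doğrusal Bağlantı Olması Durumunda Veri Madenciliği Algoritmaları Performanslarının Karşılaştırılması [Comparison of the Performance of Data Mining Algorithms in the Presence of Multicollinearity]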

Identification of key predictors of acute GVHD in pediatric acute Leukemia using machine learning methods

Using Machine Learning Detection Malware in IoHT System