Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data

İbrahim Halil Gümüş; Serkan Güldal

doi:10.35414/akufemubid.1011058

Research Article

Year 2022, Volume 22, Issue 3, 570 - 576, 30.06.2022

Abstract

Advances in science and technology have caused data sizes to grow rapidly, and unbalanced data has emerged along with this growth. A dataset is unbalanced if its classes are not approximately equally represented. Classifying such data lowers performance values, because classification algorithms are developed on the assumption that datasets are balanced. Because classification accuracy favors the majority class, the minority class is often misclassified. Most datasets, especially those used in the medical field, have an unbalanced distribution. Several recent studies have sought to balance this distribution, using either undersampling or oversampling. In this study, a distance- and mean-based resampling method is used to produce synthetic samples from the minority class. For the resampling process, the nearest neighbors of every data point in the minority class were determined using the Euclidean distance. Based on these neighbors, and using the Heinz mean, the desired number of new synthetic samples was generated between each sample and its neighbors to balance the dataset. The Random Forest (RF) and Support Vector Machine (SVM) algorithms were used to classify the raw and balanced datasets, and the results were compared. The proposed method was also compared with other well-known methods: Random Over Sampling (ROS), Random Under Sampling (RUS), and the Synthetic Minority Oversampling Technique (SMOTE). The dataset balanced with the proposed resampling method was shown to increase classification efficiency compared with the raw dataset and the other methods. For raw and resampled data respectively, the accuracy of RF is 0.751 and 0.799, and the accuracy of SVM is 0.762 and 0.781. The other metrics, such as Precision, Recall, and F1 Score, improve likewise.
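
The resampling step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The Heinz mean of two non-negative numbers a and b with parameter v in [0, 1] is H_v(a, b) = (a^v b^(1-v) + a^(1-v) b^v) / 2; it equals the arithmetic mean at v = 0 and the geometric mean at v = 1/2. The function names (heinz_mean, nearest_neighbors, heinz_oversample), the neighbor count k, the random choice of base sample, neighbor, and v, and the feature-wise application of the mean are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def heinz_mean(a, b, v):
    """Heinz mean H_v(a, b) = (a**v * b**(1-v) + a**(1-v) * b**v) / 2.

    Assumes a, b >= 0 and 0 <= v <= 1. v = 0.5 gives the geometric mean,
    v = 0 (or 1) gives the arithmetic mean.
    """
    return 0.5 * (a**v * b**(1.0 - v) + a**(1.0 - v) * b**v)


def nearest_neighbors(points, k):
    """For each point, return the indices of its k nearest other points
    under the Euclidean distance."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(others[:k])
    return result


def heinz_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows built from minority-class rows.

    Illustrative choices (assumptions, not taken from the paper): the base
    sample, the neighbor, and the Heinz parameter v are drawn at random,
    and the mean is applied feature by feature.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    if any(x < 0 for row in minority for x in row):
        raise ValueError("the Heinz mean requires non-negative features")
    rng = random.Random(seed)
    neighbors = nearest_neighbors(minority, min(k, len(minority) - 1))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))   # base minority sample
        j = rng.choice(neighbors[i])       # one of its nearest neighbors
        v = rng.uniform(0.0, 0.5)          # position between the two means
        synthetic.append(
            [heinz_mean(a, b, v) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic
```

Because the Heinz mean always lies between the geometric and arithmetic means of its two arguments, each synthetic feature value falls between the two parent values, so generated samples stay inside the range spanned by the minority class. To balance a dataset under this sketch, n_new would be set to the difference between the majority and minority class sizes.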

Keywords

Machine learning, Synthetic data, Balanced data, Heinz mean

Turkish title: Tıbbi Verilerde Heinz Ortalamasına Dayalı Yeni Sentetik Veriler Üreterek Veri Kümesini Dengeleme

Details

Primary Language	English
Subjects	Software Testing, Verification and Validation
Journal Section	Articles
Authors	İbrahim Halil Gümüş 0000-0002-3071-1159 Serkan Güldal 0000-0002-4247-0786
Publication Date	June 30, 2022
Submission Date	October 18, 2021
Published in Issue	Year 2022, Volume 22, Issue 3

Cite

APA	Gümüş, İ. H., & Güldal, S. (2022). Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, 22(3), 570-576. https://doi.org/10.35414/akufemubid.1011058
AMA	Gümüş İH, Güldal S. Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. June 2022;22(3):570-576. doi:10.35414/akufemubid.1011058
Chicago	Gümüş, İbrahim Halil, and Serkan Güldal. “Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 22, no. 3 (June 2022): 570-76. https://doi.org/10.35414/akufemubid.1011058.
EndNote	Gümüş İH, Güldal S (June 1, 2022) Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 22 3 570–576.
IEEE	İ. H. Gümüş and S. Güldal, “Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data”, Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, vol. 22, no. 3, pp. 570–576, 2022, doi: 10.35414/akufemubid.1011058.
ISNAD	Gümüş, İbrahim Halil - Güldal, Serkan. “Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 22/3 (June 2022), 570-576. https://doi.org/10.35414/akufemubid.1011058.
JAMA	Gümüş İH, Güldal S. Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. 2022;22:570–576.
MLA	Gümüş, İbrahim Halil and Serkan Güldal. “Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, vol. 22, no. 3, 2022, pp. 570-6, doi:10.35414/akufemubid.1011058.
Vancouver	Gümüş İH, Güldal S. Balancing the Dataset by Generating New Synthetic Data Based on Heinz Mean in Medical Data. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. 2022;22(3):570-6.