SMOTE vs. KNNOR: An evaluation of oversampling techniques in machine learning

İsmet Abacı; Kazım Yıldız

doi:10.17714/gumusfenbil.1253513

Research Article

Year 2023, Volume: 13 Issue: 3, 767 - 779, 15.07.2023

Cited By: 1

Abstract

The increasing availability of big data has led to the development of applications that make human life easier. To process this data correctly, useful and valid information must be extracted from large data warehouses through the knowledge discovery in databases (KDD) process. Data mining is an important part of this process; it involves exploring data and developing models that extract previously unknown patterns. The quality of the data used in supervised machine learning algorithms plays a significant role in the success of predictions. One factor that improves data quality is a balanced dataset, in which the classes are represented in similar proportions. In practice, however, many datasets are unbalanced. To overcome this problem, oversampling techniques are used to generate synthetic data that is as close to real data as possible. In this study, we compared the performance of two oversampling techniques, SMOTE and KNNOR, on a variety of datasets using different machine learning algorithms. Our results showed that SMOTE and KNNOR did not always improve model accuracy; on many datasets they reduced it, while on certain datasets both techniques increased it. These results indicate that the effectiveness of oversampling depends on the specific dataset and machine learning algorithm. It is therefore crucial to assess these methods case by case to determine the best approach for a given dataset and algorithm.
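To make the idea of generating synthetic minority samples concrete, the following is a minimal sketch of SMOTE-style interpolation in the spirit of Chawla et al. (2002). It is not the code used in the study and it does not reproduce KNNOR. The function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for illustration, and the base point is drawn at random rather than iterating over every minority sample as the original algorithm does.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    minority: list of equal-length numeric feature lists (minority class only).
    Hypothetical helper for illustration; not a library API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy unbalanced data (made up): 8 majority points, 3 minority points.
majority = [[float(x), float(y)] for x in range(4) for y in range(2)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 12.0]]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints "8 8"
```

Because every synthetic point is a convex combination of two real minority points, it stays inside the region the minority class already occupies. In an evaluation like the one described above, such resampling would normally be applied to the training split only, so that the test set contains real samples.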

Keywords

KNNOR, Machine learning, Oversampling, SMOTE, Unbalanced data


SMOTE ve KNNOR: Makine öğreniminde aşırı örnekleme tekniklerinin değerlendirilmesi

Abstract

The growing availability of big data has led to the development of applications that make human life easier. To process this data correctly, useful and valid information must be extracted from large data warehouses through the process known as knowledge discovery in databases (KDD). Data mining, an important part of KDD, involves exploring data and developing models to extract unknown patterns. The quality of the data used in supervised machine learning algorithms plays an important role in determining prediction success. One factor that improves data quality is that the input values are distributed close to one another. In practice, however, many datasets are unbalanced. To overcome this problem, oversampling techniques are used to generate synthetic data that is as close as possible to real data. In this study, we compared the performance of two oversampling techniques, SMOTE and KNNOR, on different datasets using different machine learning algorithms. Our results showed that SMOTE and KNNOR do not always increase model accuracy and that, on many datasets, these techniques can even reduce it. On certain datasets, however, SMOTE and KNNOR succeeded in increasing model accuracy. Our findings suggest that the effectiveness of oversampling techniques can vary with the specific dataset and machine learning algorithm. It is therefore important to evaluate the performance of these techniques case by case to determine the best approach for a given dataset and algorithm.

Keywords

KNNOR, Machine learning, Oversampling, SMOTE, Unbalanced data

There are 28 citations in total.

Details

Primary Language	English
Subjects	Engineering
Journal Section	Articles
Authors	İsmet Abacı (ORCID 0000-0003-4646-2000); Kazım Yıldız (ORCID 0000-0001-6999-1410)
Publication Date	July 15, 2023
Submission Date	February 20, 2023
Acceptance Date	June 23, 2023
Published in Issue	Year 2023 Volume: 13 Issue: 3

Cite

APA	Abacı, İ., & Yıldız, K. (2023). SMOTE vs. KNNOR: An evaluation of oversampling techniques in machine learning. Gümüşhane Üniversitesi Fen Bilimleri Dergisi, 13(3), 767-779. https://doi.org/10.17714/gumusfenbil.1253513

Cited By

Gebelikte Anne Sağlığı Risk Gruplarının Tahminine Yönelik Makine Öğrenmesi Tabanlı Bir Karar Destek Sistem Tasarımı

Black Sea Journal of Engineering and Science

https://doi.org/10.34248/bsengineering.1455473
