Research Article


The Robustness of Normalization Methods to Outliers and Their Impact on Machine Learning Performance: A Systematic Simulation Study

Year 2026, Volume: 9 Issue: 2, 1129 - 1140, 16.03.2026
https://doi.org/10.47495/okufbed.1846741
https://izlik.org/JA97TF67TH

Abstract

This research provides a systematic investigation into how data corruption impacts the performance of Machine Learning (ML) classification algorithms. Our focus is on the interplay between outlier ratios, data dimensionality, and feature scaling techniques. We constructed a comprehensive factorial simulation based on multivariate logistic regression, benchmarking four widely used ML algorithms (XGBoost, Random Forest, LightGBM, and SVM) under nine preprocessing conditions: eight feature-scaling methods and one unnormalized baseline. The primary goal was to identify the normalization strategy that maintains the highest level of robustness as sample size, feature count, and corruption severity escalate. The findings indicate that the Median/MAD (MD) method offers consistently superior resilience, especially when the outlier ratio is substantial (up to 50%), effectively counteracting the performance collapse seen with conventional methods such as Z-Score. While ensemble tree methods (XGBoost, Random Forest, LightGBM) naturally tolerate outliers better than the kernel-based SVM, pairing a robust scaler such as MD with SVM markedly improves the latter's stability in highly contaminated datasets. This work underscores the necessity of context-aware data preprocessing, offering empirically grounded recommendations for practitioners seeking to build resilient models in data science applications.
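To illustrate the contrast the abstract draws between Median/MAD scaling and Z-Score standardization, the sketch below applies both to a toy column containing one gross outlier. This is a minimal sketch, not the authors' implementation; the function names `mad_scale` and `z_scale` are our own, and the paper's exact variant (e.g. whether a consistency constant is applied to the MAD) may differ.

```python
import numpy as np

def mad_scale(X):
    """Median/MAD scaling: a robust analogue of Z-score standardization.

    Each column is centered at its median and divided by its MAD
    (median absolute deviation), so a handful of extreme values
    cannot distort the location or spread estimates.
    """
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # guard against constant columns
    return (X - med) / mad

def z_scale(X):
    """Classical Z-score standardization (mean/std), sensitive to outliers."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# One gross outlier shifts the mean and inflates the std, so Z-score
# compresses the inliers toward zero; Median/MAD keeps the inliers on
# a sensible scale and pushes only the outlier far out.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
print(z_scale(x).ravel())
print(mad_scale(x).ravel())  # [-2., -1., 0., 1., 997.]
```

Note that scikit-learn's `RobustScaler` also centers on the median but scales by the interquartile range rather than the MAD, so it is a related but distinct robust scaler.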

References

  • Alshdaifat E., Alshdaifat DA., Alsarhan A., Hussein F., El Salhi SMFS. The effect of preprocessing techniques, applied to numeric features, on classification algorithms' performance. Data 2021; 6(2): 11.
  • Ashraf MWA., Habib I., Akram MU., Al Ghamdi M. A hybrid approach using support vector machine rule extraction for anomaly detection. Sci Rep 2024; 14(1): 27058.
  • Bilal M., Ali G., Iqbal MW., Anwar M., Malik MSA., Kadir RA. Auto-prep: efficient and automated data preprocessing pipeline. IEEE Access 2022; 10: 107764-107784.
  • Bischl B., Casalicchio G., Feurer M., Gijsbers P., Hutter F., Lang M., Mantovani RG., van Rijn JN., Vanschoren J. OpenML benchmarking suites. arXiv [Preprint] 2021. arXiv:1708.03731. Available from: https://arxiv.org/abs/1708.03731
  • Çetin V., Yıldız O. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Univ J Eng Sci 2022; 28(2): 299-312.
  • Choudhury A., Mondal A., Sarkar S. Decision tree based machine learning algorithms: A comparative study of Random Forest, AdaBoost, XGBoost and LightGBM. Eur Phys J Spec Top 2024; 233(15): 2425-2463.
  • Dash CSK., Behera R., Dehuri S., Ghosh A. An outliers detection and elimination framework in classification task of data mining. Decis Anal J 2023; 6: 100164.
  • Defilippis L., Xu Y., Girardin J., Troiani E., Erba V., Zdeborová L., Loureiro B., Krzakala F. Scaling laws and spectra of shallow neural networks in the feature learning regime. arXiv [Preprint] 2025. arXiv:2509.24882.
  • Glotsos D., Kostopoulos S., Liaparinos P. Investigating student success rates in biomedical engineering education using machine learning and descriptive statistics. WSEAS Trans Adv Eng Educ 2025; 22: 107-113.
  • Huang L., Qin J., Zhou Y., Zhu F., Liu L., Shao L. Normalization techniques in training DNNs: methodology, analysis and application. IEEE Trans Pattern Anal Mach Intell 2023; 45(8): 10173-10196.
  • Kalinina I., Gozhyj A., Bidyuk P., Gozhyi V., Korobchynskyi M., Nadraga V. A systematic approach to data normalization and standardization in machine learning problems. In: Babichev S., Lytvynenko V., editors. Lecture Notes in Data Engineering, Computational Intelligence, and Decision-Making. Vol 2. Lecture Notes on Data Engineering and Communications Technologies; vol 244. Cham: Springer 2025; p. 206-219.
  • Kandanaarachchi S., Muñoz MA., Hyndman RJ., Smith-Miles K. On normalization and algorithm selection for unsupervised outlier detection. Data Min Knowl Discov 2020; 34: 309-354.
  • Khan H., Rasheed MT., Zhang S., Wang X., Liu H. Empirical study of outlier impact in classification context. Expert Syst Appl 2024; 256: 124953.
  • Kord A., Zand M., Zand S. Academic course planning recommendation using machine learning models. Educ Inf Technol 2024; 29: 1245-1268.
  • Kord A., Aboelfetouh A., Shohieb SM. Academic course planning recommendation and students’ performance prediction multi-modal based on educational data mining techniques. J Comput High Educ 2025.
  • Lartey C., Liu J., Asamoah RK., Greet C., Zanin M., Skinner W. Effective outlier detection for ensuring data quality in flotation data modelling using machine learning (ML) algorithms. Minerals 2024; 14(9): 925.
  • Li X., Liu F. A mathematical comparison of data assimilation and machine learning in earth system state estimation from a Bayesian inference viewpoint. Information Geography 2025; 1(1): 100001.
  • Li Z., Zhang L. An ensemble outlier detection method based on information entropy-weighted subspaces for high-dimensional data. Entropy 2023; 25(8): 1185.
  • Mishra P., Singh U., Sahoo S. New data preprocessing trends based on ensemble of multiple preprocessing techniques. TrAC Trends Anal Chem 2020; 132: 116045.
  • Scikit-learn developers. Compare the effect of different scalers on data with outliers. In: scikit-learn: Machine Learning in Python. Version 1.8.0 2025. Available from: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
  • Singh D., Singh B. Feature wise normalization: An effective way of normalizing data. Pattern Recognit 2022; 122: 108307.
  • Suenaga D., Takase Y., Abe T., Orita G., Ando S. Prediction accuracy of Random Forest, XGBoost, LightGBM, and artificial neural network. Structures 2023; 50: 1252-1263.
  • Sujon KM., Hassan RB., Towshi ZT., Othman MA., Samad MA., Choi K. When to use standardization and normalization: Empirical evidence from machine learning models and XAI. IEEE Access 2024; 12: 135300-135314.
  • Thakur S., Tiwari VK., Agrawal J. Performance analysis of linear kernel support vector machine models on real-world datasets. Int J Adv Netw Appl 2025; 17(1): 6753-6760.
  • Tseng CY., Salguero JA., Breidenbach JD., et al. Evaluation of normalization strategies for mass spectrometry-based multi-omics datasets. Metabolomics 2025; 21: 98.
  • Vafaei N., Ribeiro RA., Camarinha-Matos LM. Comparison of normalization techniques on data sets with outliers. Int J Decis Support Syst Technol 2022; 14(1): 1-17.
  • Wyatt M., Radford B., Callow N., Brown L. Using ensemble methods to improve the robustness of deep learning for image classification. Methods Ecol Evol 2022; 13(6): 1317-1328.
  • Zhang D., Gong Y. The comparison of LightGBM and XGBoost coupling factor analysis and prediagnosis of acute liver failure. IEEE Access 2020; 8: 220990-221003.
  • Zhu X., Li Y. Robust feature scaling for time series forecasting using interquartile range. Expert Syst Appl 2023; 228: 120401.
There are 29 citations in total.

Details

Primary Language Turkish
Subjects Context Learning, Deep Learning
Journal Section Research Article
Authors

Hamza Yalçin (ORCID: 0000-0003-0733-7821)

İbrahim Öztürk

Submission Date December 22, 2025
Acceptance Date February 5, 2026
Publication Date March 16, 2026
DOI https://doi.org/10.47495/okufbed.1846741
IZ https://izlik.org/JA97TF67TH
Published in Issue Year 2026 Volume: 9 Issue: 2

Cite

APA Yalçin, H., & Öztürk, İ. (2026). Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 9(2), 1129-1140. https://doi.org/10.47495/okufbed.1846741
AMA 1.Yalçin H, Öztürk İ. Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması. Osmaniye Korkut Ata University Journal of The Institute of Science and Techno. 2026;9(2):1129-1140. doi:10.47495/okufbed.1846741
Chicago Yalçin, Hamza, and İbrahim Öztürk. 2026. “Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı Ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması”. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi 9 (2): 1129-40. https://doi.org/10.47495/okufbed.1846741.
EndNote Yalçin H, Öztürk İ (March 1, 2026) Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi 9 2 1129–1140.
IEEE [1]H. Yalçin and İ. Öztürk, “Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması”, Osmaniye Korkut Ata University Journal of The Institute of Science and Techno, vol. 9, no. 2, pp. 1129–1140, Mar. 2026, doi: 10.47495/okufbed.1846741.
ISNAD Yalçin, Hamza - Öztürk, İbrahim. “Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı Ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması”. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi 9/2 (March 1, 2026): 1129-1140. https://doi.org/10.47495/okufbed.1846741.
JAMA 1.Yalçin H, Öztürk İ. Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması. Osmaniye Korkut Ata University Journal of The Institute of Science and Techno. 2026;9:1129–1140.
MLA Yalçin, Hamza, and İbrahim Öztürk. “Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı Ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması”. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi, vol. 9, no. 2, Mar. 2026, pp. 1129-40, doi:10.47495/okufbed.1846741.
Vancouver 1.Hamza Yalçin, İbrahim Öztürk. Aykırı Değerlere Karşı Normalleştirme Yöntemlerinin Sağlamlığı ve Makine Öğrenmesi Performansı Üzerindeki Etkileri: Sistematik Bir Simülasyon Çalışması. Osmaniye Korkut Ata University Journal of The Institute of Science and Techno. 2026 Mar. 1;9(2):1129-40. doi:10.47495/okufbed.1846741


*This journal is an international refereed journal.

*The journal does not charge article processing fees at any stage of the publication process.

*This journal publishes five issues online per year (January, March, June, September, December).

*This journal is published in Turkish and English as open access.

This work is licensed under a Creative Commons Attribution 4.0 International License.