Evaluation of Oversampling Methods (OVER, SMOTE, and ROSE) in Classifying Soil Liquefaction Dataset based on SVM, RF, and Naïve Bayes

Selçuk Demir; Emrehan Kutluğ Şahin

doi:10.31590/ejosat.1077867

Abstract

Class-imbalanced datasets are prevalent in real-world applications, including engineering, medicine, and finance, among others. Machine learning (ML)-based prediction models have been applied successfully to a wide range of problems. However, their application to soil liquefaction under class imbalance remains limited. This paper presents the prediction results of random forest (RF), support vector machine (SVM), and naïve Bayes (NB) algorithms trained with different sample sizes for soil liquefaction. The effect of three oversampling methods, namely simple oversampling (OVER), random oversampling examples (ROSE), and the synthetic minority oversampling technique (SMOTE), on the prediction performance of the classification algorithms is also investigated. Performance is evaluated using Accuracy, Kappa, Precision, Recall, and F-measure. The results demonstrate the effectiveness of applying oversampling methods to imbalanced data before the modeling phase. All of the oversampling methods improved the overall performance of the classification models, and SMOTE performed slightly better than the other oversampling methods considered. Furthermore, the SVM model outperformed the RF and NB models when all algorithms were trained on data balanced with SMOTE.
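To make the difference between simple oversampling and SMOTE concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a binary problem, and the function names (`random_oversample`, `smote_like`), the toy data, and the neighbour count `k=3` are illustrative choices that do not come from the paper. ROSE, which draws new samples from a smoothed bootstrap around existing points, is not covered here.

```python
import numpy as np


def random_oversample(X, y, minority, rng):
    """Simple oversampling: duplicate randomly chosen minority rows
    until both classes have the same number of samples (binary case)."""
    min_idx = np.flatnonzero(y == minority)
    n_needed = int(np.sum(y != minority)) - min_idx.size
    if n_needed <= 0:
        return X.copy(), y.copy()
    extra = rng.choice(min_idx, size=n_needed, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


def smote_like(X, y, minority, rng, k=3):
    """SMOTE-style oversampling: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    X_min = X[y == minority]
    n_min = X_min.shape[0]
    n_needed = int(np.sum(y != minority)) - n_min
    if n_needed <= 0 or n_min < 2:
        return X.copy(), y.copy()
    k = min(k, n_min - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n_min, size=n_needed)
    chosen = neighbours[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))  # interpolation factor in [0, 1)
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    new_labels = np.full(n_needed, minority, dtype=y.dtype)
    return np.vstack([X, synthetic]), np.concatenate([y, new_labels])


# Toy imbalanced data (hypothetical): 20 majority and 5 minority samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (5, 2))])
y = np.array([0] * 20 + [1] * 5)

X_over, y_over = random_oversample(X, y, minority=1, rng=rng)
X_smote, y_smote = smote_like(X, y, minority=1, rng=rng)
print(np.bincount(y_over), np.bincount(y_smote))
```

In this sketch both methods return the original rows followed by the added minority rows. Simple oversampling only repeats existing observations, whereas the SMOTE-style routine creates new points between existing minority observations. In practice, oversampling is normally applied to the training split only, so that the test data keep their original class distribution.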

Keywords

SVM, RF ve Naive Bayes'e Dayalı Olarak Zemin Sıvılaşma Veri Setinin Sınıflandırılmasında Aşırı Örnekleme Yöntemlerinin (OVER, SMOTE ve ROSE) Değerlendirilmesi

Abstract (translated from Turkish)

Class-imbalanced datasets are very common in real-world applications, including engineering, medicine, the financial sector, and others. Machine learning (ML)-based prediction models have successfully demonstrated the applicability of various algorithms to different problems. However, their application to the soil liquefaction problem is limited when class imbalance is taken into account. This study presents the prediction results of random forest (RF), support vector machine (SVM), and naïve Bayes (NB) algorithms with different training sample sizes for soil liquefaction. In addition, the effect of oversampling methods such as simple oversampling (OVER), random oversampling examples (ROSE), and the synthetic minority oversampling technique (SMOTE) on the prediction performance of the classification algorithms is investigated. Performance results are evaluated using metrics such as Accuracy, Kappa, Precision, Recall, and F-measure. The results show that applying oversampling methods to imbalanced data before the modeling phase is effective. All of the oversampling methods were also found to help improve the overall performance of the classification models. The SMOTE method was observed to perform slightly better than the other oversampling methods considered. Moreover, when all algorithms were trained with the SMOTE algorithm, the SVM model performed better than the RF and NB models.

Keywords

Details

Primary Language

English

Subjects

Engineering

Journal Section

Research Article

Authors

Selçuk Demir*
0000-0003-2520-4395
Türkiye

Emrehan Kutluğ Şahin
0000-0002-9830-8585
Türkiye

Publication Date

March 31, 2022

Submission Date

February 23, 2022

Acceptance Date

February 23, 2022

Published in Issue

Year 2022 Number: 34

DOI

https://doi.org/10.31590/ejosat.1077867

IZ

https://izlik.org/JA89SD35ZM

Cite

APA

Demir, S., & Şahin, E. K. (2022). Evaluation of Oversampling Methods (OVER, SMOTE, and ROSE) in Classifying Soil Liquefaction Dataset based on SVM, RF, and Naïve Bayes. Avrupa Bilim Ve Teknoloji Dergisi, 34, 142-147. https://doi.org/10.31590/ejosat.1077867