Research Article

The Effect of Regularized Regression and Tree-Based Missing Data Imputation Methods on Classification Performance in High Dimensional Data

Volume: 7 Number: 6 November 15, 2024

Abstract

Missing data are an important problem in the analysis and classification of high dimensional data. The aim of this study is to compare the effects of four missing data imputation methods on classification performance in high dimensional data. The methods were evaluated in a simulation with 1000 replications, using data sets with a binary dependent variable, p=500 independent variables with a mixed correlation structure, and n=150 units. Missing data structures were created at different missing rates, and a separate data set was obtained for each imputation method by imputing the missing values. Two regularized regression methods, least absolute shrinkage and selection operator (lasso) and elastic net regression, were used for imputation, together with support vector machine and classification and regression trees, which the study groups under tree-based methods. At the end of the simulation, classification scores for each imputed data set were obtained with a gradient boosting machine, and imputation performance was evaluated by the distance of these scores from the reference score. The simulation shows that imputation with regularized regression methods gives better classification performance than imputation with tree-based methods in high dimensional data sets. In addition, classification performance decreased for all methods as the proportion of missing values increased.
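The imputation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a scaled-down design (n=150 units but only p=10 equicorrelated predictors instead of 500 with a mixed correlation structure), values missing completely at random in a single column, and a hand-written coordinate-descent lasso. The function names, the penalty `lam=0.05`, the correlation `rho=0.8`, and the mean-imputation baseline are all hypothetical choices made for illustration; the classification step with a gradient boosting machine is not included.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    # Center predictors and response; the intercept is recovered at the end.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]
    beta = [0.0] * p
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (column j added back).
            rho_j = sum(Xc[i][j] * (resid[i] + beta[j] * Xc[i][j]) for i in range(n)) / n
            new_b = soft_threshold(rho_j, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new_b
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta

def simulate(n=150, p=10, rho=0.8, seed=0):
    """Assumed design: Gaussian predictors with pairwise correlation rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)  # shared latent factor that induces the correlation
        rows.append([rho ** 0.5 * z + (1 - rho) ** 0.5 * rng.gauss(0, 1)
                     for _ in range(p)])
    return rows

def compare_imputations(X, target=0, missing_rate=0.3, lam=0.05, seed=1):
    """Hide part of one column at random, then impute it by lasso and by the mean.

    Returns the RMSE of each imputation against the hidden true values.
    """
    rng = random.Random(seed)
    n = len(X)
    miss = sorted(rng.sample(range(n), int(round(missing_rate * n))))
    miss_set = set(miss)
    obs = [i for i in range(n) if i not in miss_set]

    def others(row):
        return [v for j, v in enumerate(row) if j != target]

    # Fit the lasso on rows where the target column is observed.
    b0, beta = lasso_fit([others(X[i]) for i in obs],
                         [X[i][target] for i in obs], lam)
    truth = [X[i][target] for i in miss]
    lasso_pred = [b0 + sum(b * v for b, v in zip(beta, others(X[i]))) for i in miss]
    mean_val = sum(X[i][target] for i in obs) / len(obs)

    def rmse(pred):
        return (sum((a - b) ** 2 for a, b in zip(pred, truth)) / len(truth)) ** 0.5

    return rmse(lasso_pred), rmse([mean_val] * len(truth))

data = simulate()
lasso_rmse, mean_rmse = compare_imputations(data)
print("lasso RMSE:", round(lasso_rmse, 3), "mean RMSE:", round(mean_rmse, 3))
```

Calling `compare_imputations` with several values of `missing_rate` gives a rough feel for how imputation error responds to the missing rate in this toy setting. It does not reproduce the study's classification scores.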

Ethical Statement

Ethics committee approval was not required for this study because it did not involve animals or humans.

Details

Primary Language

English

Subjects

Biostatistics, Statistical Analysis, Applied Statistics

Journal Section

Research Article

Publication Date

November 15, 2024

Submission Date

August 12, 2024

Acceptance Date

October 21, 2024

Published in Issue

Year 2024 Volume: 7 Number: 6

APA
Varol, B., Kurt Omurlu, İ., & Türe, M. (2024). The Effect of Regularized Regression and Tree-Based Missing Data Imputation Methods on Classification Performance in High Dimensional Data. Black Sea Journal of Engineering and Science, 7(6), 1263-1269. https://doi.org/10.34248/bsengineering.1531546
