The Effect of Regularized Regression and Tree-Based Missing Data Imputation Methods on Classification Performance in High Dimensional Data
Year 2024, 1263-1269, 15.11.2024
Buğra Varol, İmran Kurt Omurlu, Mevlüt Türe
Abstract
Missing data is an important problem in the analysis and classification of high-dimensional data. The aim of this study was to compare the effects of four different missing data imputation methods on classification performance in high-dimensional data. The imputation methods were evaluated on simulated data sets with a binary dependent variable, p=500 independent variables correlated with each other at mixed levels, n=150 units, and 1000 simulation runs. Missing data structures were created at different missing rates, and separate data sets were obtained by imputing the missing values with each method. Regularized regression methods, namely the least absolute shrinkage and selection operator (lasso) and elastic net regression, were used for imputation, alongside support vector machine and the tree-based classification and regression trees. At the end of the simulation, the classification scores of the methods were obtained with a gradient boosting machine, and the imputation performances were evaluated according to the distance of these scores from the reference. Our simulation demonstrates that regularized regression methods outperform tree-based methods in classifying high-dimensional data sets. Additionally, increasing the amount of missing values reduced the classification performance of all methods in high-dimensional data.
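The pipeline described above can be sketched in Python with scikit-learn. This is an illustrative assumption, not the authors' implementation (the study does not report its software): `IterativeImputer` stands in for the regression- and tree-based single imputation, a smaller p and a single 10% missing-at-random rate keep the sketch fast, and the gradient boosting AUC per imputer plays the role of the classification score compared across methods.

```python
# Hypothetical sketch of the simulation pipeline (assumed scikit-learn setup,
# reduced dimensions; NOT the study's original code).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n, p = 150, 50                      # study used p=500; reduced here for speed
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) + rng.normal(size=n) > 0).astype(int)

# Introduce values missing completely at random at a 10% rate
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.10] = np.nan

# One imputation model per method compared in the study
estimators = {
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "cart": DecisionTreeRegressor(max_depth=5),
    "svm": SVR(),
}

scores = {}
for name, est in estimators.items():
    # Impute missing values using the chosen regressor on the other columns
    X_imp = IterativeImputer(estimator=est, max_iter=5,
                             random_state=0).fit_transform(X_miss)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_imp, y, test_size=0.3, random_state=0, stratify=y)
    # Classify the imputed data with a gradient boosting machine
    gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])

print(scores)
```

In the full simulation this loop would be repeated 1000 times per missing rate, and each method's mean score would be compared against the score obtained on the complete reference data.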
Ethical Statement
Ethics committee approval was not required for this study because it did not involve animals or humans.