Subset Selection of Best Predictors in Quantile Regression Model Using the Genetic Algorithm and Information Measure of Complexity as the Fitness Function
Year 2025, Volume: 1 Issue: 2, 14 - 34, 20.12.2025
Hamparsum Bozdogan
Abstract
In this paper, we propose a novel method for subset selection of
variables in the quantile regression (QR) model. The method uses the
information-based model selection criterion ICOMP(IFIM)misspec as the
fitness function within the genetic algorithm (GA), which serves as our
optimization tool. The coefficients of the QR model are estimated using
a computationally efficient weighted least squares (WLS) method. A
nonparametric kernel covariance estimator of the QR model is derived and
used as the asymptotic covariance matrix of the model to compute and
score ICOMP(IFIM)misspec. A real numerical example on a benchmark
prostate data set shows the performance of the ICOMP(IFIM)misspec
criterion alongside AIC and SBC in selecting the best subset of
variables to explain the conditional distribution of the response at
several quantile levels, using both all possible subset selection and
the GA. Our results show that the AIC criterion includes more redundant
variables than the ICOMP-based models do, which is not surprising. The
models selected by the GA coincide with those selected via all possible
subset selection. However, the GA is more time-efficient and less costly
than all possible subset selection, especially for high-dimensional data
sets. For model selection, our proposed quantile regression (QR)
approach outperforms the classic linear regression (LR) modeling
approach.
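As a rough illustration of the WLS estimation step mentioned in the abstract, the sketch below approximates a regression quantile by iteratively reweighted least squares. It is a generic textbook-style scheme written under our own assumptions, not the paper's algorithm: the function names `check_loss` and `fit_qr_irls`, the OLS starting value, and the floor `eps` on small residuals are all hypothetical choices made here for illustration.

```python
import numpy as np


def check_loss(residuals, tau):
    """Sum of the quantile check (pinball) loss rho_tau over all residuals."""
    return float(np.sum(residuals * (tau - (residuals < 0))))


def fit_qr_irls(X, y, tau=0.5, n_iter=200, eps=1e-6, tol=1e-10):
    """Approximate the tau-th regression quantile by reweighted least squares.

    Assumption: weights are chosen so that w * r**2 equals rho_tau(r),
    with |r| floored at eps to avoid division by zero.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), eps)
        XtW = X.T * w  # scales column i of X.T (observation i) by w[i]
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Intercept-only model at tau = 0.5: the fit should approach the sample median.
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
X_demo = np.ones((5, 1))
beta_demo = fit_qr_irls(X_demo, y_demo, tau=0.5)
```

In a subset selection setting, a fit like this would be repeated for each candidate set of columns of `X`, and the resulting check loss would feed into whichever criterion is used as the fitness function.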
Ethical Statement
This paper has not been submitted elsewhere.
Supporting Institution
TUBITAK Post Doctoral Scholarship