Model-Based Feature Selection Using Structural Equation Modeling for Enhanced Classification Performance in High-Dimensional Datasets
Year 2025,
Volume: 38 Issue: 3, 1247 - 1260, 01.09.2025
Muammer Albayrak, Kemal Turhan
Abstract
Feature selection is increasingly important in machine learning and data mining. For high-dimensional datasets in particular, irrelevant and redundant features must be filtered out to mitigate overfitting and the curse of dimensionality. We hypothesized that effective feature selection can be performed with a model-based approach using Structural Equation Modeling (SEM). The dataset consists of 2969 samples and 117 features. First, a measurement model was constructed and tested with confirmatory factor analysis (CFA), and the number of features was reduced to 58 by removing statistically insignificant features. In the SEM analysis, sub-feature sets of 55, 52, 41, and 35 features were obtained by removing variables whose standardized regression coefficient (SRC) fell below predetermined threshold values. The resulting sub-feature sets were evaluated with a multilayer perceptron (MLP) to examine their effect on classification performance, and the results were compared against random forest feature importance as a baseline method. SEM and random forest generally performed very similarly: in two-class classification, the sub-feature sets produced by random forest yielded better results, whereas in three- and five-class classification, the sub-feature sets produced by the proposed SEM-based method performed better. These results show that effective feature selection can be achieved with the proposed model-based approach using SEM. By incorporating expert knowledge into the modeling process, this approach makes it possible to obtain sub-feature sets that form a statistically significant model consistent with domain knowledge.
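The SRC-thresholding step described above can be sketched as follows: given standardized regression coefficients estimated from a fitted SEM, drop every feature whose coefficient magnitude falls below a chosen threshold, yielding nested sub-feature sets. This is a minimal illustration only; the feature names, SRC values, and thresholds below are hypothetical and not taken from the paper.

```python
# Minimal sketch of SRC-threshold feature selection (illustrative only).
# In the actual study, SRCs would come from a fitted structural equation
# model; here they are hypothetical placeholder values.

def select_by_src(src, threshold):
    """Keep features whose |SRC| meets or exceeds the threshold."""
    return sorted(f for f, coef in src.items() if abs(coef) >= threshold)

# Hypothetical standardized regression coefficients (not the paper's values).
src = {"age": 0.61, "bmi": 0.34, "smoker": 0.18, "hr": 0.07, "sbp": 0.42}

# Successively stricter thresholds produce successively smaller subsets,
# mirroring the nested sub-feature sets (e.g. 55, 52, 41, 35) in the paper.
subsets = {t: select_by_src(src, t) for t in (0.1, 0.2, 0.3, 0.4)}
for t, feats in subsets.items():
    print(f"threshold {t}: {feats}")
```

Each resulting subset would then be passed to a downstream classifier (an MLP in the paper) to measure how the threshold choice affects classification performance.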
Ethical Statement
For this study, data were obtained from the dbGaP database maintained by the NCBI (National Center for Biotechnology Information), which contains genotype and phenotype datasets. The study page states that the dataset includes no personal identifying information for research participants and that Institutional Review Board (IRB) approval is not required to use it. Permission to share data and samples was obtained from all subjects in the informed consent form. Informed consent was obtained from all subjects by trained research assistants; prior to signing the consent form, a research assistant reviewed the form with the subject and answered any questions.
References
- [1] Zhao, Z. A., and Liu, H., “Spectral feature selection for data mining”, 1st ed., Chapman and Hall/CRC, (2012). DOI: https://doi.org/10.1201/b11426
- [2] Dong, H., Sun, J., Sun, X., Ding, R., “A many-objective feature selection for multi-label classification”, Knowledge-Based Systems, 208, (2020). DOI: https://doi.org/10.1016/j.knosys.2020.106456
- [3] Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., “Recent advances and emerging challenges of feature selection in the context of big data”, Knowledge-Based Systems, 86: 33–45, (2015). DOI: https://doi.org/10.1016/j.knosys.2015.05.014
- [4] Li, M., Wang, H., Yang, L., Liang, Y., Shang, Z., Wan, H., “Fast hybrid dimensionality reduction method for classification based on feature selection and grouped feature extraction”, Expert Systems with Applications, 150, (2020). DOI: https://doi.org/10.1016/j.eswa.2020.113277
- [5] Sha, Z. C., Liu, Z. M., Ma, C., Chen, J., “Feature selection for multi-label classification by maximizing full-dimensional conditional mutual information”, Applied Intelligence, 51(1): 326–340, (2021). DOI: https://doi.org/10.1007/s10489-020-01822-0
- [6] Uysal, A. K., and Gunal, S., “A novel probabilistic feature selection method for text classification”, Knowledge-Based Systems, 36: 226–235, (2012). DOI: https://doi.org/10.1016/j.knosys.2012.06.005
- [7] Remeseiro, B., and Bolon-Canedo, V., “A review of feature selection methods in medical applications”, Computers in Biology and Medicine, 112, (2019). DOI: https://doi.org/10.1016/j.compbiomed.2019.103375
- [8] Jiang, Z., and Zhao, W., “Optimal selection of customized features for implementing seizure detection in wearable electroencephalography sensor”, IEEE Sensors Journal, 20(21): 12941–12949, (2020). DOI: https://doi.org/10.1109/JSEN.2020.3003733
- [9] Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., “A review of feature selection methods on synthetic data”, Knowledge and Information Systems, 34(3): 483–519, (2013). DOI: https://doi.org/10.1007/s10115-012-0487-8
- [10] Colaco, S., Kumar, S., Tamang, A., Biju, V. G., “A review on feature selection algorithms”, Advances in Intelligent Systems and Computing, Vol. 906, (2019). DOI: https://doi.org/10.1007/978-981-13-6001-5_11
- [11] Tang, J., Alelyani, S., Liu, H., “Feature selection for classification: a review”, in C. C. Aggarwal (Ed.), Chapman and Hall/CRC, (2014). DOI: https://doi.org/10.1201/b17320
- [12] Kabir, M. M., Islam, M. M., Murase, K., “A new wrapper feature selection approach using neural network”, Neurocomputing, 73(16–18): 3273–3283, (2010). DOI: https://doi.org/10.1016/j.neucom.2010.04.003
- [13] Gan, J., Wen, G., Yu, H., Zheng, W., Lei, C., “Supervised feature selection by self-paced learning regression”, Pattern Recognition Letters, 132: 30–37, (2020). DOI: https://doi.org/10.1016/j.patrec.2018.08.029
- [14] Mazaheri, V., and Khodadadi, H., “Heart arrhythmia diagnosis based on the combination of morphological, frequency and nonlinear features of ECG signals and metaheuristic feature selection algorithm”, Expert Systems with Applications, 161, (2020). DOI: https://doi.org/10.1016/j.eswa.2020.113697
- [15] Sahebi, G., Movahedi, P., Ebrahimi, M., Pahikkala, T., Plosila, J., Tenhunen, H., “GeFeS: A generalized wrapper feature selection approach for optimizing classification performance”, Computers in Biology and Medicine, 125, (2020). DOI: https://doi.org/10.1016/j.compbiomed.2020.103974
- [16] Hoyle, R. H. (Ed.), “Handbook of structural equation modeling”, The Guilford Press, (2014).
- [17] Congdon, P., “Structural equation models for area health outcomes with model selection”, Journal of Applied Statistics, 38(4): 745–767, (2011). DOI: https://doi.org/10.1080/02664760903563692
- [18] Marsh, H. W., and Hocevar, D., “Application of confirmatory factor analysis to the study of self-concept: First- and higher order factor models and their invariance across groups”, Psychological Bulletin, 97(3): 562–582, (1985). DOI: https://doi.org/10.1037/0033-2909.97.3.562
- [19] Bentler, P. M., “Comparative fit indexes in structural models”, Psychological Bulletin, 107(2): 238–246, (1990). DOI: https://doi.org/10.1037/0033-2909.107.2.238
- [20] Browne, M. W., and Cudeck, R., “Alternative ways of assessing model fit”, Sociological Methods & Research, 21(2): 230–258, (1992). DOI: https://doi.org/10.1177/0049124192021002005
- [21] Kursa, M. B., Jankowski, A., Rudnicki, W. R., “Boruta - a system for feature selection”, Fundamenta Informaticae, 101(4): 271–285, (2010). DOI: https://doi.org/10.3233/FI-2010-288
- [22] Too, J., and Mirjalili, S., “General learning equilibrium optimizer: a new feature selection method for biological data classification”, Applied Artificial Intelligence, 35(3): 247–263, (2020). DOI: https://doi.org/10.1080/08839514.2020.1861407