Research Article

Histogram-Based Feature Selection for Binary Classification

Year 2024, Volume: 1 Issue: 2, 63 - 70, 20.12.2024

Abstract

This paper presents a novel feature selection method for binary classification tasks that scores features using histograms. By measuring how differently a feature's values are distributed in the positive and negative classes, the method generates a score for each feature and uses it to determine the most informative features. The method, called Histogram-Based Feature Selection (HBFS), was evaluated on a variety of datasets and compared against the Fisher Score. Our findings indicate that HBFS matches or outperforms the Fisher Score on most datasets.
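The idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact scoring rule, so the score below (one minus the overlap of the two class-conditional histograms) is an assumption made for illustration and may differ from the published HBFS formula. The function names, the bin count, and the synthetic data are likewise hypothetical. The Fisher Score is shown in its common two-class form, squared mean difference over summed variances, for comparison.

```python
import numpy as np


def histogram_scores(X, y, bins=10):
    """Score each feature by how little its class-conditional histograms overlap.

    ASSUMPTION: score = 1 - sum(min(p_pos, p_neg)) over shared bins. This is an
    illustrative choice, not necessarily the scoring rule used in the paper.
    Both classes must be present in y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        col = X[:, j]
        # Shared bin edges keep the two class histograms comparable.
        edges = np.histogram_bin_edges(col, bins=bins)
        pos, _ = np.histogram(col[y], bins=edges)
        neg, _ = np.histogram(col[~y], bins=edges)
        p = pos / pos.sum()  # relative frequencies, positive class
        q = neg / neg.sum()  # relative frequencies, negative class
        scores[j] = 1.0 - np.minimum(p, q).sum()
    return scores


def fisher_scores(X, y):
    """Two-class Fisher Score: (mu1 - mu0)^2 / (var1 + var0) per feature."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    mu1, mu0 = X[y].mean(axis=0), X[~y].mean(axis=0)
    v1, v0 = X[y].var(axis=0), X[~y].var(axis=0)
    # Small constant avoids division by zero for constant features.
    return (mu1 - mu0) ** 2 / (v1 + v0 + 1e-12)


# Synthetic data (hypothetical): one informative feature, one pure-noise feature.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
informative = rng.normal(loc=2.0 * y, scale=1.0)  # class means 0 and 2
noise = rng.normal(size=n)                        # independent of the label
X = np.column_stack([informative, noise])

hist = histogram_scores(X, y)
fisher = fisher_scores(X, y)
ranking = np.argsort(hist)[::-1]  # feature indices, most informative first
print("histogram scores:", hist)
print("fisher scores:   ", fisher)
print("ranking:         ", ranking)
```

Under this assumed rule, a score near 1 means the two classes occupy almost disjoint bins, and a score near 0 means their histograms nearly coincide. Unlike the Fisher Score, an overlap-based score does not depend only on class means and variances, so it can in principle also respond to differences in distribution shape.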

References

  • Li, K., Wang, F., Yang, L., & Liu, R. (2023). Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. Neurocomputing, 538, 126186.
  • Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification. Hoboken, NJ: Wiley.
  • Abiodun, E. O., Alabdulatif, A., Abiodun, O. I., Alawida, M., Alabdulatif, A., & Alkhawaldeh, R. S. (2021). A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities. Neural Computing and Applications, 33(22), 15091-15118.
  • Gan, M., & Zhang, L. (2021). Iteratively local fisher score for feature selection. Applied Intelligence, 51, 6167-6181.
  • He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Advances in neural information processing systems, 18.
  • Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157-1182.
  • Khan, Z., Ali, A., & Aldahmani, S. (2024). Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data. arXiv preprint arXiv:2401.12667.
  • Jagdhuber, R., Lang, M., Stenzl, A., Neuhaus, J., & Rahnenführer, J. (2020). Cost-Constrained feature selection in binary classification: adaptations for greedy forward selection and genetic algorithms. BMC bioinformatics, 21, 1-21.
  • Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8), 1226-1238.
  • Datasets: Feature selection. (n.d.). Retrieved from https://jundongl.github.io/scikit-feature/datasets.html
  • Davide Nardone. (2019). Biological datasets for SMBA. https://doi.org/10.5281/zenodo.2709491
  • GEO Accession viewer. (n.d.). https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4412
  • Freije, W. A., Castro-Vargas, F. E., Fang, Z., Horvath, S., Cloughesy, T., Liau, L. M., Mischel, P. S., & Nelson, S. F. (2004). Gene expression profiling of gliomas strongly predicts survival. Cancer research, 64(18), 6503–6510. https://doi.org/10.1158/0008-5472.CAN-04-0452
  • Spira, A., Beane, J. E., Shah, V., Steiling, K., Liu, G., Schembri, F., Gilman, S., Dumas, Y. M., Calner, P., Sebastiani, P., Sridhar, S., Beamis, J., Lamb, C., Anderson, T., Gerry, N., Keane, J., Lenburg, M. E., & Brody, J. S. (2007). Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature medicine, 13(3), 361–366. https://doi.org/10.1038/nm1556
  • Gustafson, A. M., Soldi, R., Anderlind, C., Scholand, M. B., Qian, J., Zhang, X., Cooper, K., Walker, D., McWilliams, A., Liu, G., Szabo, E., Brody, J., Massion, P. P., Lenburg, M. E., Lam, S., Bild, A. H., & Spira, A. (2010). Airway PI3K pathway activation is an early and reversible event in lung cancer development. Science translational medicine, 2(26), 26ra25. https://doi.org/10.1126/scitranslmed.3000251
  • Wayback machine. (n.d.). https://web.archive.org/web/20150221003104/http://www.nipsfsc.ecs.soton.ac.uk/papers/NIPS2003-Datasets.pdf
  • Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., ... & Varoquaux, G. (2013). API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238.
  • Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63, 3-42.
There are 17 citations in total.

Details

Primary Language English
Subjects Machine Learning Algorithms, Classification Algorithms
Journal Section Research Article
Authors

Selman Delil 0000-0001-8149-3561

Melih Ağraz

Birol Kuyumcu

Publication Date December 20, 2024
Submission Date November 28, 2024
Acceptance Date December 12, 2024
Published in Issue Year 2024 Volume: 1 Issue: 2

Cite

APA Delil, S., Ağraz, M., & Kuyumcu, B. (2024). Histogram-Based Feature Selection for Binary Classification. Transactions on Computer Science and Applications, 1(2), 63-70.