Research Article

PWFS: A scalable parallel Python module for wrapper feature selection

Year 2025, Volume: 5 Issue: 2, 704 - 719, 31.07.2025
https://doi.org/10.61112/jiens.1639780

Abstract

In machine learning, feature selection is a crucial step that can significantly impact the performance of predictive models. Although various time-efficient algorithms exist, the only method that guarantees the optimal feature subset is exhaustive search, which imposes an enormous computational load: beyond a certain feature count, even a lifetime of computation would not suffice. This study proposes a generic, scalable, open-source parallel Python module that finds the best wrapper feature subset in a fully optimized execution time, especially for moderate feature counts. This parallel wrapper feature selection module, PWFS, is independent of any particular machine learning algorithm and can operate with user-defined methods. By leveraging parallel performance and efficiency, the framework maximizes the benefit on the machine learning side. The system design is built on efficient message-passing communication, in which the framework distributes the computational load equally among the parallel agents via feature masking. The module is validated on two workstations, one of which is hyper-threading capable. An overall performance gain of 19.77% is achieved with hyper-threading. Various scenarios and experiments yield speedups and efficiencies up to 96.74%, validating the flexible design of the proposed parallel framework. The source code of the module is available at https://github.com/haeren/parallel-feature-selector and https://pypi.org/project/parallel-feature-selector/.
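The feature-masking idea described in the abstract can be sketched in plain Python (the function names below are illustrative, not the actual PWFS API): each non-empty subset of n features corresponds to a bitmask in [1, 2^n), and dealing the masks round-robin to the parallel agents splits the exhaustive search into nearly equal shares.

```python
from itertools import compress

def masks_for_agent(n_features, n_agents, rank):
    """Yield the bitmasks (non-empty feature subsets) assigned to one agent.

    There are 2**n_features - 1 non-empty subsets; round-robin dealing
    keeps each agent's share within one mask of every other agent's."""
    total = (1 << n_features) - 1
    for mask in range(1 + rank, total + 1, n_agents):
        yield mask

def mask_to_columns(mask, columns):
    """Translate a bitmask into the list of selected column names."""
    bits = [(mask >> i) & 1 for i in range(len(columns))]
    return list(compress(columns, bits))

# Example: 3 features split across 2 agents.
cols = ["age", "height", "weight"]
agent0 = [mask_to_columns(m, cols) for m in masks_for_agent(3, 2, 0)]
agent1 = [mask_to_columns(m, cols) for m in masks_for_agent(3, 2, 1)]
```

In an actual run, each agent would evaluate its subsets with the user-supplied wrapper model (e.g. via mpi4py ranks, as in the cited Dalcin et al. work) and the best local scores would be reduced to a single global winner.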

References

  • Okyay S, Adar N (2018) Parallel 3D brain modeling & feature extraction: ADNI dataset case study. 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, Feb. 20-24. https://doi.org/10.1109/TCSET.2018.8336172
  • Jović A, Brkić K, Bogunović N (2015) A review of feature selection methods with applications. 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, May 25-29. https://doi.org/10.1109/MIPRO.2015.7160458
  • Nersisyan S, Novosad V, Galatenko A, Sokolov A, Bokov G, Konovalov A et al (2022) ExhauFS: exhaustive search-based feature selection for classification and survival regression. PeerJ 10:e13200. https://doi.org/10.7717/peerj.13200
  • Okyay S, Adar N (2021) Filter feature selection analysis to determine the characteristics of dementia. Journal of Engineering and Architecture Faculty of Eskisehir Osmangazi University 29(1):20–7. https://doi.org/10.31796/ogummf.768872
  • Bolón-Canedo V, Sánchez-Marono N, Cervino-Rabunal J (2014) Toward parallel feature selection from vertically partitioned data. ESANN 2014 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, Apr. 23-25.
  • Roffo G (2016) Feature selection library (MATLAB toolbox). arXiv preprint arXiv:1607.01327.
  • Yu K, Ding W, Wu X (2016) LOFS: A library of online streaming feature selection. Knowledge-Based Systems 113:1–3. https://doi.org/10.1016/j.knosys.2016.08.026
  • Horn F, Pack R, Rieger M (2019) The autofeat python library for automated feature engineering and selection. Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, Sep. 16-20.
  • Masoudi-Sobhanzadeh Y, Motieghader H, Masoudi-Nejad A (2019) FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics 20:1–17. https://doi.org/10.1186/s12859-019-2754-0
  • Pilnenskiy N, Smetannikov I (2020) Feature selection algorithms as one of the python data analytical tools. Future Internet 12(3):54. https://doi.org/10.3390/fi12030054
  • Zhao Z, Zhang R, Cox J, Duling D, Sarle W (2013) Massively parallel feature selection: an approach based on variance preservation. Machine Learning 92:195–220. https://doi.org/10.1007/s10994-013-5373-4
  • Stojanovski TD (2014) Performance of exhaustive search with parallel agents. Turkish Journal of Electrical Engineering and Computer Sciences 22(5):1382–94. https://doi.org/10.3906/elk-1210-105
  • Sun Z, Li Z (2014) Data intensive parallel feature selection method study. International Joint Conference on Neural Networks (IJCNN), Beijing, China, Jul. 6-11. https://doi.org/10.1109/IJCNN.2014.6889409
  • Zhou Y, Porwal U, Zhang C, Ngo HQ, Nguyen X, Ré C et al (2014) Parallel feature selection inspired by group testing. Advances in Neural Information Processing Systems 27.
  • El-Alfy ESM, Alshammari MA (2016) Towards scalable rough set based attribute subset selection for intrusion detection using parallel genetic algorithm in MapReduce. Simulation Modelling Practice and Theory 64:18–29. https://doi.org/10.1016/j.simpat.2016.01.010
  • Gieseke F, Polsterer KL, Mahabal A, Igel C, Heskes T (2017) Massively-parallel best subset selection for ordinary least-squares regression. IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, Nov. 27 – Dec. 1. https://doi.org/10.1109/SSCI.2017.8285225
  • Li Z, Lu W, Sun Z, Xing W (2017) A parallel feature selection method study for text classification. Neural Computing and Applications 28:513–24. https://doi.org/10.1007/s00521-016-2351-3
  • González-Domínguez J, Bolón-Canedo V, Freire B, Touriño J (2019) Parallel feature selection for distributed-memory clusters. Information Sciences 496:399–409. https://doi.org/10.1016/j.ins.2019.01.050
  • Nguyen T, Phan N, Nguyen N, Nguyen BT, Halvorsen P, Riegler MA (2022) Parallel feature selection based on the trace ratio criterion. International Joint Conference on Neural Networks (IJCNN), Padua, Italy, Jul. 18-23. https://doi.org/10.1109/IJCNN55064.2022.9892181
  • Vivek Y, Ravi V, Krishna PR (2023) Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment. Cluster Computing 26(3):1949–83. https://doi.org/10.1007/s10586-022-03725-w
  • Dalcin LD, Paz RR, Kler PA, Cosimo A (2011) Parallel distributed computing using Python. Advances in Water Resources 34(9):1124–39. https://doi.org/10.1016/j.advwatres.2011.04.013
  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(85):2825–30.
  • McKinney W (2010) Data structures for statistical computing in Python. SciPy 445(1):51–6. https://doi.org/10.25080/Majora-92bf1922-00a
  • Marr DT, Binns F, Hill DL, Hinton G, Koufaty DA, Miller JA et al (2002) Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal 6(1).
  • Leng T, Ali R, Hsieh J, Mashayekhi V, Rooholamini R (2002) An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution 45.
  • Eager DL, Zahorjan J, Lazowska ED (1989) Speedup versus efficiency in parallel systems. IEEE Transactions on Computers 38(3):408–23. https://doi.org/10.1109/12.21127

There are 26 citations in total.

Details

Primary Language English
Subjects High Performance Computing, Machine Learning Algorithms, Data Mining and Knowledge Discovery, Computer Software
Journal Section Research Articles
Authors

Hakan Alp Eren 0000-0001-6105-158X

Savaş Okyay 0000-0003-3955-6324

Nihat Adar 0000-0002-0555-0701

Publication Date July 31, 2025
Submission Date February 14, 2025
Acceptance Date April 27, 2025
Published in Issue Year 2025 Volume: 5 Issue: 2

Cite

APA Eren, H. A., Okyay, S., & Adar, N. (2025). PWFS: A scalable parallel Python module for wrapper feature selection. Journal of Innovative Engineering and Natural Science, 5(2), 704-719. https://doi.org/10.61112/jiens.1639780


Journal of Innovative Engineering and Natural Science by İdris Karagöz is licensed under CC BY 4.0