Research Article

PWFS: A scalable parallel Python module for wrapper feature selection

Year 2025, Volume: 5 Issue: 2, 704 - 719, 31.07.2025
https://doi.org/10.61112/jiens.1639780

Abstract

In machine learning, feature selection is a crucial step that can significantly impact the performance of predictive models. Although various time-efficient algorithms exist, the only method that guarantees the optimal feature subset is exhaustive search, which imposes an enormous computational load: beyond a certain feature count, even a lifetime of computation would not suffice. This study proposes a generic, scalable, open-source parallel Python module that finds the best wrapper feature subset in a fully optimized execution time, especially for moderate feature counts. This parallel wrapper feature selection module, PWFS, is independent of any particular machine learning algorithm and can operate with user-defined methods. By leveraging parallel performance and efficiency, the framework maximizes the benefit on the machine learning side. The system design is built on efficient message-passing communication, in which the framework distributes the computational load equally among the parallel agents via feature masking. The module is validated on two workstations, one of which is hyper-threading capable. An overall performance gain of 19.77% is achieved with hyper-threading. Various scenarios and experiments yield speedups and efficiencies up to 96.74%, validating the flexible design of the proposed parallel framework. The source code of the module is available at https://github.com/haeren/parallel-feature-selector and https://pypi.org/project/parallel-feature-selector/.
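The feature-masking idea described in the abstract can be sketched in plain Python (the function names below are illustrative, not the actual PWFS API): each non-empty subset of n features corresponds to a bitmask in [1, 2^n), and dealing the masks round-robin to the parallel agents splits the exhaustive search into nearly equal shares.

```python
from itertools import compress

def masks_for_agent(n_features, n_agents, rank):
    """Yield the bitmasks (non-empty feature subsets) assigned to one agent.

    There are 2**n_features - 1 non-empty subsets; round-robin dealing
    keeps each agent's share within one mask of every other agent's."""
    total = (1 << n_features) - 1
    for mask in range(1 + rank, total + 1, n_agents):
        yield mask

def mask_to_columns(mask, columns):
    """Translate a bitmask into the list of selected column names."""
    bits = [(mask >> i) & 1 for i in range(len(columns))]
    return list(compress(columns, bits))

# Example: 3 features split across 2 agents.
cols = ["age", "height", "weight"]
agent0 = [mask_to_columns(m, cols) for m in masks_for_agent(3, 2, 0)]
agent1 = [mask_to_columns(m, cols) for m in masks_for_agent(3, 2, 1)]
```

In an actual run, each agent would evaluate its subsets with the user-supplied wrapper model (e.g. via mpi4py ranks, as in the cited Dalcin et al. work) and the best local scores would be reduced to a single global winner.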

References

  • Okyay S, Adar N (2018) Parallel 3D brain modeling & feature extraction: ADNI dataset case study. 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, Feb. 20-24. https://doi.org/10.1109/TCSET.2018.8336172
  • Jović A, Brkić K, Bogunović N (2015) A review of feature selection methods with applications. 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, May 25-29. https://doi.org/10.1109/MIPRO.2015.7160458
  • Nersisyan S, Novosad V, Galatenko A, Sokolov A, Bokov G, Konovalov A et al (2022) ExhauFS: exhaustive search-based feature selection for classification and survival regression. PeerJ 10:e13200. https://doi.org/10.7717/peerj.13200
  • Okyay S, Adar N (2021) Filter feature selection analysis to determine the characteristics of dementia. Journal of Engineering and Architecture Faculty of Eskisehir Osmangazi University 29(1):20–7. https://doi.org/10.31796/ogummf.768872
  • Bolón-Canedo V, Sánchez-Marono N, Cervino-Rabunal J (2014) Toward parallel feature selection from vertically partitioned data. ESANN 2014 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, Apr. 23-25.
  • Roffo G (2016) Feature selection library (MATLAB toolbox). arXiv preprint arXiv:1607.01327.
  • Yu K, Ding W, Wu X (2016) LOFS: A library of online streaming feature selection. Knowledge-Based Systems 113:1–3. https://doi.org/10.1016/j.knosys.2016.08.026
  • Horn F, Pack R, Rieger M (2019) The autofeat python library for automated feature engineering and selection. Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, Sep. 16-20.
  • Masoudi-Sobhanzadeh Y, Motieghader H, Masoudi-Nejad A (2019) FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics 20:1–17. https://doi.org/10.1186/s12859-019-2754-0
  • Pilnenskiy N, Smetannikov I (2020) Feature selection algorithms as one of the python data analytical tools. Future Internet 12(3):54. https://doi.org/10.3390/fi12030054
  • Zhao Z, Zhang R, Cox J, Duling D, Sarle W (2013) Massively parallel feature selection: an approach based on variance preservation. Machine Learning 92:195–220. https://doi.org/10.1007/s10994-013-5373-4
  • Stojanovski TD (2014) Performance of exhaustive search with parallel agents. Turkish Journal of Electrical Engineering and Computer Sciences 22(5):1382–94. https://doi.org/10.3906/elk-1210-105
  • Sun Z, Li Z (2014) Data intensive parallel feature selection method study. International Joint Conference on Neural Networks (IJCNN), Beijing, China, Jul. 6-11. https://doi.org/10.1109/IJCNN.2014.6889409
  • Zhou Y, Porwal U, Zhang C, Ngo HQ, Nguyen X, Ré C et al (2014) Parallel feature selection inspired by group testing. Advances in Neural Information Processing Systems 27.
  • El-Alfy ESM, Alshammari MA (2016) Towards scalable rough set based attribute subset selection for intrusion detection using parallel genetic algorithm in MapReduce. Simulation Modelling Practice and Theory 64:18–29. https://doi.org/10.1016/j.simpat.2016.01.010
  • Gieseke F, Polsterer KL, Mahabal A, Igel C, Heskes T (2017) Massively-parallel best subset selection for ordinary least-squares regression. IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, Nov. 27 – Dec. 1. https://doi.org/10.1109/SSCI.2017.8285225
  • Li Z, Lu W, Sun Z, Xing W (2017) A parallel feature selection method study for text classification. Neural Computing and Applications 28:513–24. https://doi.org/10.1007/s00521-016-2351-3
  • González-Domínguez J, Bolón-Canedo V, Freire B, Touriño J (2019) Parallel feature selection for distributed-memory clusters. Information Sciences 496:399–409. https://doi.org/10.1016/j.ins.2019.01.050
  • Nguyen T, Phan N, Nguyen N, Nguyen BT, Halvorsen P, Riegler MA (2022) Parallel feature selection based on the trace ratio criterion. International Joint Conference on Neural Networks (IJCNN), Padua, Italy, Jul. 18-23. https://doi.org/10.1109/IJCNN55064.2022.9892181
  • Vivek Y, Ravi V, Krishna PR (2023) Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment. Cluster Computing 26(3):1949–83. https://doi.org/10.1007/s10586-022-03725-w
  • Dalcin LD, Paz RR, Kler PA, Cosimo A (2011) Parallel distributed computing using Python. Advances in Water Resources 34(9):1124–39. https://doi.org/10.1016/j.advwatres.2011.04.013
  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(85):2825–30.
  • McKinney W (2010) Data structures for statistical computing in Python. SciPy 445(1):51–6. https://doi.org/10.25080/Majora-92bf1922-00a
  • Marr DT, Binns F, Hill DL, Hinton G, Koufaty DA, Miller JA et al (2002) Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal 6(1).
  • Leng T, Ali R, Hsieh J, Mashayekhi V, Rooholamini R (2002) An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution 45.
  • Eager DL, Zahorjan J, Lazowska ED (1989) Speedup versus efficiency in parallel systems. IEEE Transactions on Computers 38(3):408–23. https://doi.org/10.1109/12.21127

There are 26 citations in total.

Details

Primary Language English
Subjects High Performance Computing, Machine Learning Algorithms, Data Mining and Knowledge Discovery, Computer Software
Journal Section Research Articles
Authors

Hakan Alp Eren 0000-0001-6105-158X

Savaş Okyay 0000-0003-3955-6324

Nihat Adar 0000-0002-0555-0701

Publication Date July 31, 2025
Submission Date February 14, 2025
Acceptance Date April 27, 2025
Published in Issue Year 2025 Volume: 5 Issue: 2

Cite

APA Eren, H. A., Okyay, S., & Adar, N. (2025). PWFS: A scalable parallel Python module for wrapper feature selection. Journal of Innovative Engineering and Natural Science, 5(2), 704-719. https://doi.org/10.61112/jiens.1639780


Journal of Innovative Engineering and Natural Science by İdris Karagöz is licensed under CC BY 4.0