A k-mer based metaheuristic approach for detecting COVID-19 variants

Hilal Arslan

doi:10.24012/dumf.1195600

Research Article

Year 2023, Volume: 14 Issue: 1, 17 - 26, 23.03.2023

Cited By: 2

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) belongs to the Coronaviridae family. A change in its genetic sequence is called a mutation, and mutations give rise to SARS-CoV-2 variants. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole human genome sequences. The method uses 16 dinucleotide and 64 trinucleotide features to differentiate the variants of concern. The efficacy of the proposed features is demonstrated with four classifiers: k-nearest neighbor, support vector machine (SVM), multilayer perceptron, and random forest. The method is evaluated on a dataset of 223,326 complete human genome sequences covering the recently designated variants of concern Alpha, Beta, Gamma, Delta, and Omicron. Experimental results show that overall accuracy in detecting the variants of concern increases markedly when trinucleotide rather than dinucleotide features are used. Furthermore, we use the whale optimization algorithm, a state-of-the-art method, to reduce the number of features and choose the most relevant ones. The whale optimization method selects 44 of the 64 trinucleotide features while still differentiating SARS-CoV-2 variants with acceptable accuracy. Experimental results indicate that the SVM classifier with the selected features achieves about 99% accuracy, sensitivity, specificity, and precision on average. The proposed method thus performs strongly in detecting SARS-CoV-2 variants.
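The counts of 16 dinucleotide and 64 trinucleotide features follow from the four-letter nucleotide alphabet (4^2 = 16 and 4^3 = 64 possible k-mers). The following is a minimal sketch of how such features could be computed, not the paper's implementation. The function name `kmer_features`, the sliding-window counting, the normalization to relative frequencies, and the skipping of windows with ambiguous bases are all assumptions made for illustration; the abstract does not state these details.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def kmer_features(sequence, k):
    """Return relative k-mer frequencies, ordered lexicographically over ACGT.

    Hypothetical helper: normalization and the handling of ambiguous
    bases are assumptions, not details taken from the paper.
    """
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy sequence for illustration only; real inputs would be full genomes.
dinucleotide = kmer_features("ATGCGATNACG", 2)
trinucleotide = kmer_features("ATGCGATNACG", 3)
print(len(dinucleotide), len(trinucleotide))
```

Each genome then maps to a fixed-length vector (16 or 64 values) regardless of its length, which is what lets standard classifiers such as k-nearest neighbor or SVM consume it directly.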

Keywords

COVID-19, SARS-CoV-2, Whale Optimization Algorithm, Classifiers, Feature Selection, Machine Learning

Turkish title: COVID-19 varyantlarını tespit etmek için k-mer tabanlı bir metasezgisel yaklaşım (A k-mer based metaheuristic approach for detecting COVID-19 variants)

Abstract

The emergence of SARS-CoV-2 variants threatens public health and markedly prolongs the COVID-19 pandemic. Rapid and accurate detection of SARS-CoV-2 variants is crucial for tracking mutations, monitoring changes, measuring the efficacy of current vaccines, assessing the evolution of SARS-CoV-2, and preventing its spread. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole human genome sequences. The method uses 16 dinucleotide and 64 trinucleotide features to differentiate the variants of concern. The efficacy of the proposed features is demonstrated with four classifiers: k-nearest neighbor, support vector machine (SVM), multilayer perceptron, and random forest. The method is evaluated on a dataset of 223,326 complete human genome sequences covering the recently designated variants of concern Alpha, Beta, Gamma, Delta, and Omicron. Experimental results show that overall accuracy in detecting the variants of concern increases markedly when trinucleotide rather than dinucleotide features are used. Furthermore, we use the whale optimization algorithm, a state-of-the-art method, to reduce the number of features and choose the most relevant ones. The whale optimization method selects 44 of the 64 trinucleotide features while still differentiating SARS-CoV-2 variants with acceptable accuracy. Experimental results indicate that the SVM classifier with the selected features achieves about 99% accuracy, sensitivity, specificity, and precision on average. The proposed method thus performs strongly in detecting SARS-CoV-2 variants.
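The four reported measures (accuracy, sensitivity, specificity, precision) can be derived per class from a multiclass confusion matrix by treating each variant as "positive" against all others, then averaging. The sketch below illustrates that one-vs-rest calculation under stated assumptions: the function names are hypothetical, the unweighted (macro) average is an assumed reading of "on average", and the 3x3 matrix is made-up toy data, not a result from the paper.

```python
def per_class_metrics(confusion):
    """Compute one-vs-rest metrics for each class.

    confusion[i][j] is the number of samples of true class i
    predicted as class j. Hypothetical helper for illustration.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class c, predicted as another class
        fp = sum(confusion[r][c] for r in range(n)) - tp  # other classes, predicted as c
        tn = total - tp - fn - fp
        results.append({
            "accuracy": (tp + tn) / total,
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
        })
    return results


def macro_average(results):
    """Unweighted mean of each metric across classes (an assumed averaging scheme)."""
    return {key: sum(r[key] for r in results) / len(results) for key in results[0]}


# Made-up 3-class confusion matrix, for illustration only.
toy_confusion = [
    [50, 0, 0],
    [0, 45, 5],
    [0, 5, 45],
]
toy_results = per_class_metrics(toy_confusion)
toy_average = macro_average(toy_results)
print(toy_average)
```

With imbalanced variant counts, a macro average weights rare variants as heavily as common ones, so it can differ noticeably from a sample-weighted average.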

Keywords

COVID-19, SARS-CoV-2, Classifiers, Feature Selection, Machine Learning

There are 34 citations in total.

Details

Primary Language	English
Journal Section	Research Article
Authors	Hilal Arslan 0000-0002-6449-6952
Submission Date	October 27, 2022
Publication Date	March 23, 2023
Published in Issue	Year 2023 Volume: 14 Issue: 1

Cite

IEEE	[1] H. Arslan, “A k-mer based metaheuristic approach for detecting COVID-19 variants”, DUJE, vol. 14, no. 1, pp. 17–26, Mar. 2023, doi: 10.24012/dumf.1195600.

Cited By

A Parallel Algorithm for Designing Primer and Probe for Accurate Detection of Severe Acute Respiratory Syndrome Coronavirus

Black Sea Journal of Engineering and Science

https://doi.org/10.34248/bsengineering.1324890

Advancing viral genome classification: assesing the efficiency and accuracy of the alignment-free k-mer method in emerging pandemics

Caderno Pedagógico

https://doi.org/10.54033/cadpedv22n8-177
