Research Article

A k-mer based metaheuristic approach for detecting COVID-19 variants

Year 2023, Volume 14, Issue 1, 17 - 26, 23.03.2023
https://doi.org/10.24012/dumf.1195600

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) belongs to the Coronaviridae family. A change in its genetic sequence is called a mutation, and mutations give rise to SARS-CoV-2 variants. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole human genome sequences. The method uses 16 dinucleotide and 64 trinucleotide features to differentiate the variants of concern. The efficacy of the proposed features is demonstrated with four classifiers: k-nearest neighbor, support vector machine (SVM), multilayer perceptron, and random forest. The method is evaluated on a dataset of 223,326 complete human genome sequences covering the recently designated variants of concern: Alpha, Beta, Gamma, Delta, and Omicron. Experimental results show that the overall accuracy of detecting SARS-CoV-2 variants of concern increases markedly when trinucleotide rather than dinucleotide features are used. Furthermore, we use the whale optimization algorithm, a state-of-the-art metaheuristic, to reduce the number of features and choose the most relevant ones. With it, we select 44 of the 64 trinucleotide features, which differentiate SARS-CoV-2 variants with acceptable accuracy. Experimental results indicate that the SVM classifier with the selected features achieves about 99% accuracy, sensitivity, specificity, and precision on average. The proposed method therefore performs strongly in detecting SARS-CoV-2 variants.
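The dinucleotide and trinucleotide features mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each feature is the relative frequency of one k-mer over the alphabet A, C, G, T (4^2 = 16 features for k = 2 and 4^3 = 64 for k = 3) and that windows containing ambiguous bases are skipped. The abstract does not state the normalization or the handling of ambiguous bases, so both choices, along with the function name `kmer_frequencies`, are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_frequencies(sequence, k):
    """Return the relative frequency of every possible k-mer.

    The result has 4**k entries, ordered lexicographically (AA, AC, AG, ...).
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    # Slide a window of length k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skips windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy sequence, not real genome data.
dinucleotide_features = kmer_frequencies("ACGTACGTNACG", 2)   # 16 values
trinucleotide_features = kmer_frequencies("ACGTACGTNACG", 3)  # 64 values
```

Each genome sequence would then be represented by one such fixed-length vector, which is the kind of input the four classifiers named above expect.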

References

  • [1] Volz, E., Mishra, S., Chand, M., Barrett, J. C., et al. (2021). Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England. Nature, 593(7858), 266–269. doi:10.1038/s41586-021-03470-x
  • [2] Lauring, A. S., & Malani, P. N. (2021). Variants of SARS-CoV-2. JAMA, 326(9), 880–880. doi:10.1001/jama.2021.14181
  • [3] Tegally, H., Wilkinson, E., Giovanetti, M., et al. (2021). Detection of a SARS-CoV-2 variant of concern in South Africa. Nature, 592(7854), 438–443. doi:10.1038/s41586-021-03402-9
  • [4] Sabino, E. C., Buss, L. F., Carvalho, M. P. S., et al. (2021). Resurgence of COVID-19 in Manaus, Brazil, despite high seroprevalence. The Lancet, 397(10273), 452–455. doi:10.1016/s0140-6736(21)00183-5
  • [5] Mlcochova, P., Kemp, S. A., Dhar, M. S., et al. (2021). SARS-CoV-2 B.1.617.2 Delta variant replication and immune evasion. Nature, 599(7883), 114–119. doi:10.1038/s41586-021-03944-y
  • [6] Sahoo, J. P., & Samal, K. C. (2021). World on alert: WHO designated south African new COVID strain (Omicron/B.1.1.529) as a variant of concern. Biotica Research Today, 3(11), 1086–1088.
  • [7] Jiang, X., Coffee, M., Bari, A., Wang, J., Jiang, X., Huang, J., … Huang, Y. (2020). Towards an Artificial Intelligence Framework for Data-Driven Prediction of Coronavirus Clinical Severity. Computers, Materials & Continua, 62(3), 537–551. doi:10.32604/cmc.2020.010691
  • [8] Zoabi, Y., Deri-Rozov, S., & Shomron, N. (2021). Machine learning-based prediction of COVID-19 diagnosis based on symptoms. Npj Digital Medicine, 4(1), 3. doi:10.1038/s41746-020-00372-6
  • [9] Muhammad, L. J., Algehyne, E. A., Usman, S. S., Ahmad, A., Chakraborty, C., & Mohammed, I. A. (2021). Supervised Machine Learning Models for Prediction of COVID-19 Infection using Epidemiology Dataset. SN Computer Science, 2(1), 11. doi:10.1007/s42979-020-00394-7
  • [10] Shi, F., Wang, J., Shi, J., Wu, Z., Wang, Q., Tang, Z., … Shen, D. (2021). Review of Artificial Intelligence Techniques in Imaging Data Acquisition, Segmentation, and Diagnosis for COVID-19. IEEE Reviews in Biomedical Engineering, 14, 4–15. doi:10.1109/RBME.2020.2987975
  • [11] Mohamadou, Y., Halidou, A., & Kapen, P. T. (2020). A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of COVID-19. Applied Intelligence, 50(11), 3913–3925. doi:10.1007/s10489-020-01770-9
  • [12] Arslan, H., & Arslan, H. (2021). A new COVID-19 detection method from human genome sequences using CpG island features and KNN classifier. Engineering Science and Technology, an International Journal. doi:10.1016/j.jestch.2020.12.026
  • [13] Arslan, H. (2021a). COVID-19 prediction based on genome similarity of human SARS-CoV-2 and bat SARS-CoV-like coronavirus. Computers & Industrial Engineering, 161, 107666. doi:10.1016/j.cie.2021.107666
  • [14] Arslan, H., & Aygün, B. (2021). Performance Analysis of Machine Learning Algorithms in Detection of COVID-19 from Common Symptoms. 2021 29th Signal Processing and Communications Applications Conference (SIU), 1–4. doi:10.1109/SIU53274.2021.9477809
  • [15] Arslan, H. (2021b). Machine Learning Methods for COVID-19 Prediction Using Human Genomic Data. Proceedings, 74(1). doi:10.3390/proceedings2021074020
  • [16] Ali, S., Tamkanat-E-Ali, Khan, M. A., Khan, I., & Patterson, M. (2021). Effective and scalable clustering of SARS-CoV-2 sequences. arXiv [q-bio.PE]. Retrieved from http://arxiv.org/abs/2108.08143
  • [17] Jamil, S., & Rahman, M. (2021). A Dual-Stage Vocabulary of Features (VoF)-Based Technique for COVID-19 Variants’ Classification. Applied Sciences, 11(24). doi:10.3390/app112411902
  • [18] Ogiela, M. R., & Ogiela, U. (2021). Linguistic methods in healthcare application and COVID-19 variants classification. Neural Computing and Applications. doi:10.1007/s00521-021-06286-y
  • [19] Mann, C., Griffin, J. H., & Downard, K. M. (2021). Detection and evolution of SARS-CoV-2 coronavirus variants of concern with mass spectrometry. Analytical and Bioanalytical Chemistry, 413(29), 7241–7249. doi:10.1007/s00216-021-03649-1
  • [20] Mafarja, M., & Mirjalili, S. (2018). Whale optimization approaches for wrapper feature selection. Applied Soft Computing, 62, 441–453. doi:10.1016/j.asoc.2017.11.006
  • [21] Brown, I., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39(3), 3446–3453.
  • [22] Deng, Z., Zhu, X., Cheng, D., Zong, M., & Zhang, S. (2016). Efficient KNN Classification Algorithm for Big Data. Neurocomputing, 195, 143–148. doi:10.1016/j.neucom.2015.08.112
  • [23] Abu Alfeilat, H., Hassanat, A., Lasassmeh, O., Tarawneh, A., Alhasanat, M., Eyal-Salman, H., & Prasath, S. (2019). Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review. Big Data, 7. doi:10.1089/big.2018.0175
  • [24] Bishop, C. M. (2006). Pattern recognition and Machine Learning. Springer.
  • [25] Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366. doi:10.1016/0893-6080(89)90020-8
  • [26] Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 121–167. doi:10.1023/A:1009715923555
  • [27] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. doi:10.1007/978-1-4757-2440-0
  • [28] Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425. doi:10.1109/72.991427
  • [29] Min, J. H., & Lee, Y.-C. (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications, 28(4), 603–614. doi:10.1016/j.eswa.2004.12.008
  • [30] Keerthi, S. S., & Lin, C.-J. (2003). Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel. Neural Computation, 15(7), 1667–1689. doi:10.1162/089976603321891855
  • [31] Breiman, L. (2001a). Random Forests. Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324
  • [32] Breiman, L. (2001b). Random Forests. Machine Learning, 45(1), 5–32. doi:10.1023/a:1010933404324
  • [33] Shu, Y., & McCauley, J. (2017). GISAID: Global initiative on sharing all influenza data - from vision to reality. Eurosurveillance, 22(13). doi:10.2807/1560-7917.ES.2017.22.13.30494
  • [34] Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. doi:10.1016/j.ipm.2009.03.002

Turkish title: COVID-19 varyantlarını tespit etmek için k-mer tabanlı bir metasezgisel yaklaşım (A k-mer based metaheuristic approach for detecting COVID-19 variants)

Abstract

The emergence of SARS-CoV-2 variants threatens public health and markedly prolongs the COVID-19 pandemic. Rapid and accurate detection of SARS-CoV-2 variants is crucial for tracking mutations, monitoring changes, measuring the efficacy of current vaccines, assessing the evolution of SARS-CoV-2, and preventing its spread. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole human genome sequences. The method uses 16 dinucleotide and 64 trinucleotide features to differentiate the variants of concern. The efficacy of the proposed features is demonstrated with four classifiers: k-nearest neighbor, support vector machine (SVM), multilayer perceptron, and random forest. The method is evaluated on a dataset of 223,326 complete human genome sequences covering the recently designated variants of concern: Alpha, Beta, Gamma, Delta, and Omicron. Experimental results show that the overall accuracy of detecting SARS-CoV-2 variants of concern increases markedly when trinucleotide rather than dinucleotide features are used. Furthermore, we use the whale optimization algorithm, a state-of-the-art metaheuristic, to reduce the number of features and choose the most relevant ones. With it, we select 44 of the 64 trinucleotide features, which differentiate SARS-CoV-2 variants with acceptable accuracy. Experimental results indicate that the SVM classifier with the selected features achieves about 99% accuracy, sensitivity, specificity, and precision on average. The proposed method therefore performs strongly in detecting SARS-CoV-2 variants.
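The four reported measures (accuracy, sensitivity, specificity, precision "on average") can be made concrete with a short sketch. It assumes a one-vs-rest reading of a multiclass confusion matrix followed by an unweighted (macro) average over the variant classes. The abstract does not say which averaging scheme was used, so that choice, the function name `macro_metrics`, and the toy matrix are assumptions made for illustration and are not results from the paper.

```python
def macro_metrics(confusion):
    """Macro-averaged one-vs-rest metrics from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    sums = {"accuracy": 0.0, "sensitivity": 0.0,
            "specificity": 0.0, "precision": 0.0}
    for c in range(n):
        tp = confusion[c][c]                               # true positives
        fn = sum(confusion[c]) - tp                        # missed samples of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # others predicted as c
        tn = total - tp - fn - fp                          # everything else
        sums["accuracy"] += (tp + tn) / total
        sums["sensitivity"] += tp / (tp + fn) if tp + fn else 0.0
        sums["specificity"] += tn / (tn + fp) if tn + fp else 0.0
        sums["precision"] += tp / (tp + fp) if tp + fp else 0.0
    # Unweighted mean over classes (macro average).
    return {name: value / n for name, value in sums.items()}


# Toy two-class matrix for illustration only.
example = macro_metrics([[8, 2], [1, 9]])
```

With strongly imbalanced classes, a macro average and a sample-weighted average can differ noticeably, so the averaging scheme matters when comparing reported figures.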

There are 34 citations in total.

Details

Primary Language English
Journal Section Articles
Authors

Hilal Arslan 0000-0002-6449-6952

Publication Date March 23, 2023
Submission Date October 27, 2022
Published in Issue Year 2023, Volume 14, Issue 1

Cite

IEEE H. Arslan, “A k-mer based metaheuristic approach for detecting COVID-19 variants”, DÜMF MD, vol. 14, no. 1, pp. 17–26, 2023, doi: 10.24012/dumf.1195600.
All articles published by DUJE are licensed under the Creative Commons Attribution 4.0 International License. This allows anyone to copy, redistribute, remix, transmit, and adapt the work, provided that the original work and source are properly cited.