Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) belongs to coronaviridae family and a change in the genetic sequence of SARS-CoV-2 is named as a mutation that causes to variants of SARS-CoV-2. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole human genome sequences. In this method, we describe 16 dinucleotide and 64 trinucleotide features to differentiate SARS-CoV-2 variants of concern. The efficacy of the proposed features is proved by using four classifiers, k-nearest neighbor, support vector machines, multilayer perceptron, and random forest. The proposed method is evaluated on the dataset including 223,326 complete human genome sequences including recently designated variants of concern, Alpha, Beta, Gamma, Delta, and Omicron variants. Experimental results present that overall accuracy for detecting SARS-CoV-2 variants of concern remarkably increases when trinucleotide features rather than dinucleotide features are used. Furthermore, we use the whale optimization algorithm, which is a state-of-the-art method for reducing the number of features and choosing the most relevant features. We select 44 trinucleotide features out of 64 to differentiate SARS-CoV-2 variants with acceptable accuracy as a result of the whale optimization method. Experimental results indicate that the SVM classifier with selected features achieves about 99% accuracy, sensitivity, specificity, precision on average. The proposed method presents an admirable performance for detecting SARS-CoV-2 variants.
COVID-19 SARS-CoV-2 Whale Optimization Algorithm Classifiers Feature Selection Machine Learning
Emergence of SARS-CoV-2 variants threatens the public health and remarkably prolong the COVID-19 pandemic. Rapid and accurate detection of SARS-CoV-2 variants is crucial to track mutations, monitor the changes, measure the efficiency of the current vaccines, assess the evolution of SARS-CoV-2 as well as prevent its spread. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole human genome sequences. In this method, we describe 16 dinucleotide and 64 trinucleotide features to differentiate SARS-CoV-2 variants of concern. The efficacy of the proposed features is proved by using four classifiers, k-nearest neighbor, support vector machines, multilayer perceptron, and random forest. The proposed method is evaluated on the dataset including 223,326 complete human genome sequences including recently designated variants of concern, Alpha, Beta, Gamma, Delta, and Omicron variants. Experimental results present that overall accuracy for detecting SARS-CoV-2 variants of concern remarkably increases when trinucleotide features rather than dinucleotide features are used. Furthermore, we use the whale optimization algorithm, which is the state-of-the-art method for reducing the number of features and choosing the most relevant features. We select
44 trinucleotide features out of 64 to differentiate SARS-CoV-2 variants with acceptable accuracy as a result of the whale optimization method. Experimental results indicate that the SVM classifier with selected features achieves about 99% accuracy, sensitivity, specificity, precision on average. The proposed method presents an admirable performance for detecting SARS-CoV-2 variants.
Primary Language | English |
---|---|
Journal Section | Articles |
Authors | |
Publication Date | March 23, 2023 |
Submission Date | October 27, 2022 |
Published in Issue | Year 2023 |