This study addresses remote homologous protein detection, a bioinformatics problem that contributes greatly to the field of medicine. Protein sequences taken from SCOP, an important and widely used protein database, were used to test remote homologue detection. Feature vectors were extracted from the protein sequences using the bag-of-words model and classified with the k-nearest neighbor algorithm. The following distances were compared in the k-nearest neighbor classifier: Bray-Curtis, Euclidean, Minkowski, Dice, Jaccard, Chebyshev, cosine, Sokal-Sneath, correlation, matching coefficient, Rogers-Tanimoto, Sokal-Michener, Canberra, Hamming, Kulczynski, and Russell-Rao. Two new methods are proposed to counter the imbalanced data problem: a special k-fold value and a novel k-split method. The k-nearest neighbor algorithm with the Bray-Curtis distance performed best, reaching 98.9% accuracy and a 77% ROC score under cross validation with the special k-fold value, and 83.8% accuracy and a 92% ROC score with the novel k-split method.
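The pipeline described in the abstract (bag-of-words features, then a k-nearest neighbor vote under the Bray Curtis distance) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the word length of 2, the 20-letter amino acid alphabet, the toy sequences and family labels, and the function names `bag_of_words`, `bray_curtis`, and `knn_predict`.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def bag_of_words(sequence, k=2):
    """Count overlapping k-letter words in a sequence (word length k is an assumption).

    Words containing letters outside AMINO_ACIDS are ignored.
    """
    vocabulary = ["".join(letters) for letters in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get(word, 0) for word in vocabulary]


def bray_curtis(u, v):
    """Bray-Curtis distance: sum(|u_i - v_i|) / sum(|u_i + v_i|)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(abs(a + b) for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def knn_predict(train_vectors, train_labels, query, k=3, distance=bray_curtis):
    """Return the majority label among the k training vectors nearest to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train_sequences = ["ACACACAC", "ACACACAA", "WYWYWYWY", "WYWYWYWW"]
train_labels = ["family_a", "family_a", "family_b", "family_b"]
train_vectors = [bag_of_words(s) for s in train_sequences]

predicted = knn_predict(train_vectors, train_labels, bag_of_words("ACACAC"), k=3)
print(predicted)  # family_a
```

Passing a different function as `distance` is enough to compare metrics, which mirrors the comparison of distances described above.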
Remote Homologue Protein, k-nearest Neighbor, Bag-of-words model, Distances, k-fold Stratified Cross Validation
Project No: FDK-2019-11621.
This study addresses remote homologous protein detection, a bioinformatics problem that contributes greatly to the field of medicine. Protein sequences taken from SCOP, an important and widely used protein database, were used to test remote homolog detection. Feature vectors were extracted from the protein sequences using the bag-of-words model and classified with the kNN algorithm. The following distances were compared in the kNN classifier: Bray-Curtis, Chebyshev, cosine, Dice, Euclidean, Hamming, Jaccard, Kulczynski, matching coefficient, Minkowski, Rogers-Tanimoto, Russell-Rao, and Sokal-Michener. A special k-fold value formula is proposed to counter the imbalanced data problem. The kNN algorithm with the Bray-Curtis distance, under cross validation with the special k-fold value, performed best, with 99% accuracy.
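The abstract does not state the special k-fold value formula, so the sketch below does not reproduce it. It only illustrates the general idea of choosing the number of folds with class imbalance in mind and then splitting in a stratified way. The capping rule in `capped_k` is a hypothetical stand-in, and the function names and toy labels are likewise assumptions.

```python
from collections import Counter, defaultdict


def capped_k(labels, requested_k=10):
    """Hypothetical rule (not the study's formula): use no more folds than
    the rarest class has samples, with a floor of 2 folds."""
    smallest_class = min(Counter(labels).values())
    return max(2, min(requested_k, smallest_class))


def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds, so each fold
    keeps roughly the class proportions of the full data set."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy imbalanced labels, invented for illustration: 3 positives, 9 negatives.
labels = ["pos"] * 3 + ["neg"] * 9
k = capped_k(labels)
folds = stratified_folds(labels, k)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

With these toy labels the cap gives three folds, so every fold contains a positive example. If the rarest class had a single sample, the floor of 2 folds would leave one fold without it.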
Remote Homolog Protein, k-nearest Neighbor (kNN), Bag-of-words model, Distances, k-fold Cross Validation
Çukurova University Scientific Research Projects (BAP)
Primary Language | English |
---|---|
Subjects | Engineering |
Journal Section | Articles |
Authors | |
Project Number | FDK-2019-11621 |
Publication Date | March 30, 2022 |
Published in Issue | Year 2022 Volume: 23 Issue: 1 |