Research Article

COMPARATIVE OF SUCCESS OF KNN WITH NEW PROPOSED K-SPLIT METHOD AND STRATIFIED CROSS VALIDATION ON REMOTE HOMOLOGUE PROTEIN DETECTION

Volume: 23 Number: 1 March 30, 2022
Fahriye Gemci *, Turgay İbrikçi, Ulus Çevik

Abstract

This study addresses remote homologous protein detection, a bioinformatics problem with significant implications for medicine. Protein sequences taken from SCOP, a widely used protein database, were tested for remote homolog detection. Feature vectors were obtained from the protein sequences using the bag-of-words model and classified with the k-nearest neighbor (kNN) algorithm. The following distance measures were compared in the kNN classifier: Bray-Curtis, Chebyshev, Cosine, Dice, Euclidean, Hamming, Jaccard, Kulczynski, Matching coefficient, Minkowski, Rogers-Tanimoto, Russell-Rao, and Sokal-Michener. A special formula for choosing the number of folds in k-fold cross-validation is proposed to prevent the imbalanced data problem. The kNN algorithm with the Bray-Curtis distance, evaluated using cross-validation with this special fold count, showed the best performance, with 99% accuracy.
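The pipeline described in the abstract (bag-of-words features, a distance measure, kNN voting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the choice of overlapping 2-letter words, the toy sequences, the family labels, and the helper names `bag_of_words`, `bray_curtis`, and `knn_predict` are all assumptions made for illustration, and the proposed fold-count formula is not shown because the abstract does not state it.

```python
from collections import Counter


def bag_of_words(sequence, n=2):
    """Count overlapping n-letter 'words' in a protein sequence.

    The word length n=2 is an illustrative assumption.
    """
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def bray_curtis(u, v):
    """Bray-Curtis distance between two non-negative count vectors.

    d(u, v) = sum(|u_i - v_i|) / sum(u_i + v_i); a Counter returns 0
    for words it has not seen, so sparse vectors are handled directly.
    """
    keys = set(u) | set(v)
    numerator = sum(abs(u[k] - v[k]) for k in keys)
    denominator = sum(u[k] + v[k] for k in keys)
    return numerator / denominator if denominator else 0.0


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: bray_curtis(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy sequences and family labels, not taken from SCOP.
train = [
    (bag_of_words("AAAAGAAA"), "fam_A"),
    (bag_of_words("AAGAAAAA"), "fam_A"),
    (bag_of_words("KLKLKLKL"), "fam_B"),
    (bag_of_words("LKLKLKLK"), "fam_B"),
]
query = bag_of_words("AAAAAGAA")
prediction = knn_predict(train, query, k=3)
print(prediction)
```

Swapping `bray_curtis` for another function with the same two-argument shape is enough to compare distance measures in this sketch, which mirrors the kind of comparison the abstract reports.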

Keywords

Remote Homolog Protein, k-nearest Neighbor (kNN), Bag of words model, Distances, k-fold Cross Validation

Supporting Institution

Çukurova University Scientific Research Projects (BAP)

Project Number

Project No: FDK-2019-11621.

APA
Gemci, F., İbrikçi, T., & Çevik, U. (2022). COMPARATIVE OF SUCCESS OF KNN WITH NEW PROPOSED K-SPLIT METHOD AND STRATIFIED CROSS VALIDATION ON REMOTE HOMOLOGUE PROTEIN DETECTION. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering, 23(1), 87-108. https://doi.org/10.18038/estubtda.970169