Classification of Exon and Intron Regions on DNA Sequences with Hybrid Use of SBERT and ANFIS Approaches
Year 2024, 1043 - 1053, 25.07.2024
Fatma Akalın, Nejat Yumuşak
Abstract
DNA is the part of the genome that holds an enormous amount of information about life. Each amino acid is encoded by a triplet of three nucleotides in this part of the genome, and these triplets are called codons. The frequency of nucleotide triplets in a DNA sequence makes it possible to evaluate protein-coding (exon) and non-protein-coding (intron) regions. Distinguishing these regions enables the analysis of vital functions. This study classifies exon and intron regions for the BCR-ABL and MEFV genes, obtained from the NCBI and Ensembl databases, respectively. The existing DNA sequences are clustered using pretrained models within the SBERT approach. In the clustering process, the K-Means and Agglomerative Clustering approaches are used consecutively. Codon repetition frequencies are then calculated for a representative sample selected from each cluster. A matrix is built from the frequencies of the 64 different codons that make up the genetic code, and this matrix is given as input to the ANFIS structure. The ANFIS approach classifies exon and intron DNA sequences with an accuracy of 88.88%. The study thus produced a successful result independent of DNA sequence length.
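The codon-frequency matrix described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `codon_frequencies` and `feature_matrix` are hypothetical, and the sketch assumes that sequences are read as non-overlapping triplets starting at the first base and that counts are normalised to relative frequencies. The abstract does not state the reading frame or whether raw counts were used.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order (A, C, G, T lexicographic), so that
# every sequence maps to a row with the same column layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(sequence: str) -> list[float]:
    """Return the relative frequency of each of the 64 codons.

    Assumptions (not taken from the paper): the sequence is read in
    frame 0 as non-overlapping triplets; trailing bases and triplets
    containing symbols other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(t for t in triplets if t in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def feature_matrix(sequences: list[str]) -> list[list[float]]:
    """Stack one 64-column frequency row per sequence."""
    return [codon_frequencies(s) for s in sequences]


if __name__ == "__main__":
    # Toy sequences for illustration only, not data from the study.
    matrix = feature_matrix(["ATGGCCATG", "ttttttaaa"])
    print(len(matrix), len(matrix[0]))
```

Because each row is normalised by the number of triplets counted, sequences of different lengths produce rows on the same scale, which is one plausible way to feed variable-length sequences into a fixed-size classifier input.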