## Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors

#### Alper YILMAZ [1]

With the advent of natural language processing (NLP) techniques powered by deep learning, more detailed relationships between words have been uncovered. Word2Vec is quite robust at discovering contextual and semantic relationships. The genome, being a long text, lends itself to similar analyses that may uncover as yet undiscovered relationships between DNA k-mers. Dna2vec applies the Word2Vec approach to a whole genome so that DNA k-mers are represented as vectors. Cosine similarity queries on these DNA vectors reveal unusual relationships between k-mers. In this study, we examined the prediction of mutation susceptibility from DNA sequence alone. First, we generated word vectors for the human and mouse genomes with dna2vec. Separately, we retrieved the coordinates of common and all mutations from dbSNP. For each coordinate, we extracted the 8-nucleotide k-mers intersecting the mutation and aggregated the results so that the number of mutations for each 8-mer was tabulated. These counts were then combined with the dna2vec cosine similarity data. Our results showed that, for a given k-mer, the k-mers with the highest cosine similarity coincide with the k-mers with the highest mutation count. In other words, the neighbor with the highest cosine similarity to a k-mer was also the neighbor that ranked highest by mutation count. The overlap between the dna2vec and mutation-count results was 80% for human and 70% for mouse. In conclusion, dna2vec and other word embedding approaches can be used to reveal mutation or variation characteristics of genomes without sequencing or experimental data, using only the genome sequence itself. This might pave the way for understanding the underlying mechanisms or dynamics of mutations in genomes.
Keywords: mutation, word2vec, k-mer, dna2vec, cosine similarity
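
The counting and comparison steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the author's code: all function names are hypothetical, the embedding vectors are assumed to be supplied as a plain dictionary rather than loaded through dna2vec, and the "neighbors" of a k-mer are assumed here to be the k-mers that differ from it by a single substitution, which the abstract does not specify.

```python
from collections import Counter
import math

K = 8  # k-mer length used in the abstract


def kmers_overlapping(seq, pos, k=K):
    """Return every k-mer of `seq` that covers the 0-based position `pos`."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]


def tally_mutations(seq, positions, k=K):
    """Count, for each k-mer, how many mutation positions it intersects."""
    counts = Counter()
    for pos in positions:
        counts.update(kmers_overlapping(seq, pos, k))
    return counts


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_substitution_neighbors(kmer):
    """ASSUMPTION: neighbors are k-mers differing at exactly one position."""
    return [kmer[:i] + base + kmer[i + 1:]
            for i in range(len(kmer))
            for base in "ACGT" if base != kmer[i]]


def overlap_rate(vectors, counts):
    """Fraction of k-mers whose most cosine-similar neighbor is also the
    neighbor with the highest mutation count."""
    hits = total = 0
    for kmer, vec in vectors.items():
        candidates = [n for n in single_substitution_neighbors(kmer)
                      if n in vectors]
        if not candidates:
            continue
        by_cosine = max(candidates, key=lambda n: cosine(vec, vectors[n]))
        by_count = max(candidates, key=lambda n: counts.get(n, 0))
        hits += by_cosine == by_count
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Toy data only; real input would be a genome sequence, dbSNP
    # coordinates, and vectors trained with dna2vec.
    toy_counts = tally_mutations("ACGTACGTAC", [4, 9], k=3)
    toy_vectors = {"AA": [1.0, 0.0], "AC": [0.9, 0.1], "AG": [0.0, 1.0]}
    print(toy_counts.most_common(3))
    print(overlap_rate(toy_vectors, {"AC": 5, "AG": 1}))
```

With real data, `overlap_rate` would yield one figure per genome, comparable in spirit to the percentages reported above, although the exact value depends on how neighbors and ties are defined.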
Primary Language: en; Computer Science, Artificial Intelligence; January 2020; Research Articles

Author: Alper YILMAZ (Primary Author); ORCID: 0000-0002-8827-4887; Institution: Yildiz Technical University; Country: Turkey

Application Date: January 14, 2020; Acceptance Date: February 7, 2020; Publication Date: March 20, 2020

Cite as: Yılmaz, A. (2020). Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. Journal of Intelligent Systems: Theory and Applications (eISSN 2651-3927), 3(1), 1-6. https://doi.org/10.38016/jista.674910
