Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors

Alper Yılmaz

doi:10.38016/jista.674910

Research Article

Year 2020, Volume: 3 Issue: 1, 1 - 6, 20.03.2020

Cited By: 1

Abstract

With the advent of natural language processing (NLP) techniques powered by deep learning, more detailed
relationships between words have been uncovered. Word2Vec is quite robust in discovering contextual and semantic relationships.
The genome, being a long text, is open to similar studies that aim to uncover as yet undiscovered relationships between DNA k-mers.
Dna2vec applies the Word2Vec approach to the whole genome so that DNA k-mers are represented as vectors. Cosine similarity queries
on these DNA vectors reveal unusual relationships between DNA k-mers.
In this study, we examined DNA sequence-based prediction of mutation susceptibility. First, we generated word vectors for the human
and mouse genomes via dna2vec. Separately, we retrieved the coordinates of common and all mutations from dbSNP. For each
coordinate, we extracted the 8-nucleotide k-mers intersecting the mutation and aggregated the results so that the number of mutations
for each 8-mer was tabulated. These counts were then combined with the dna2vec cosine similarity data. Our results showed that, for a
given k-mer, the k-mer with the highest cosine similarity coincides with the k-mer with the highest mutation count. In other words, the
neighbor with the highest cosine similarity for a k-mer was also the neighbor with the highest mutation count. For human and mouse,
the overlap between dna2vec neighbors and mutation counts is 80% and 70%, respectively. In conclusion, dna2vec and other word
embedding approaches can be used to reveal mutation or variation characteristics of genomes without sequencing or experimental
data, using solely the genome sequence itself. This might pave the way for understanding the underlying mechanisms or dynamics of
mutations in genomes.
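
The tabulation and comparison steps described above can be illustrated with a minimal Python sketch. It is not the author's pipeline: the function names, the 0-based coordinates, the plain-list vectors and the explicit `candidates` argument are all assumptions made for illustration, and the abstract does not specify how the set of neighbors for a k-mer was chosen.

```python
from collections import Counter
import math

K = 8  # k-mer length used in the abstract


def kmers_overlapping(seq, pos, k=K):
    """Return every k-mer window of seq that covers the 0-based position pos."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first_start, last_start + 1)]


def tabulate_mutation_counts(seq, positions, k=K):
    """Count, for each k-mer, how many times it overlaps a mutation position."""
    counts = Counter()
    for pos in positions:
        counts.update(kmers_overlapping(seq, pos, k))
    return counts


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_neighbor_agrees(query, candidates, vectors, counts):
    """Check whether the candidate most similar to query (by cosine similarity)
    is also the candidate with the highest mutation count.

    `candidates` is a hypothetical neighbor set supplied by the caller.
    """
    by_similarity = max(candidates, key=lambda c: cosine(vectors[query], vectors[c]))
    by_count = max(candidates, key=lambda c: counts.get(c, 0))
    return by_similarity == by_count
```

Under these assumptions, an overlap percentage like the one reported would be the fraction of query k-mers for which `top_neighbor_agrees` returns `True`. In practice the vectors would come from a trained dna2vec model and the positions from dbSNP, rather than from toy inputs.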

Keywords

mutation, word2vec, k-mer, dna2vec, cosine similarity

There are 31 citations in total.

Details

Primary Language	English
Subjects	Artificial Intelligence
Journal Section	Research Articles
Authors	Alper Yılmaz 0000-0002-8827-4887
Publication Date	March 20, 2020
Submission Date	January 14, 2020
Published in Issue	Year 2020 Volume: 3 Issue: 1

Cite

APA	Yılmaz, A. (2020). Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. Journal of Intelligent Systems: Theory and Applications, 3(1), 1-6. https://doi.org/10.38016/jista.674910
AMA	Yılmaz A. Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. JISTA. March 2020;3(1):1-6. doi:10.38016/jista.674910
Chicago	Yılmaz, Alper. “Assessment of Mutation Susceptibility in DNA Sequences With Word Vectors”. Journal of Intelligent Systems: Theory and Applications 3, no. 1 (March 2020): 1-6. https://doi.org/10.38016/jista.674910.
EndNote	Yılmaz A (March 1, 2020) Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. Journal of Intelligent Systems: Theory and Applications 3 1 1–6.
IEEE	A. Yılmaz, “Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors”, JISTA, vol. 3, no. 1, pp. 1–6, 2020, doi: 10.38016/jista.674910.
ISNAD	Yılmaz, Alper. “Assessment of Mutation Susceptibility in DNA Sequences With Word Vectors”. Journal of Intelligent Systems: Theory and Applications 3/1 (March 2020), 1-6. https://doi.org/10.38016/jista.674910.
JAMA	Yılmaz A. Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. JISTA. 2020;3:1–6.
MLA	Yılmaz, Alper. “Assessment of Mutation Susceptibility in DNA Sequences With Word Vectors”. Journal of Intelligent Systems: Theory and Applications, vol. 3, no. 1, 2020, pp. 1-6, doi:10.38016/jista.674910.
Vancouver	Yılmaz A. Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. JISTA. 2020;3(1):1-6.

Cited By

Genomic Surveillance of COVID-19 Variants With Language Models and Machine Learning

Frontiers in Genetics

https://doi.org/10.3389/fgene.2022.858252

Journal of Intelligent Systems: Theory and Applications