Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors

Alper Yılmaz

doi:10.38016/jista.674910

Research Article

Year 2020, Volume: 3, Issue: 1, 1 - 6, 20.03.2020

Abstract

With the advent of natural language processing (NLP) techniques powered by deep learning, more detailed
relationships between words have been uncovered. Word2Vec is quite robust in discovering contextual and semantic relationships.
The genome, being a long text, lends itself to similar analyses that may uncover as-yet-undiscovered relationships between DNA k-mers.
dna2vec applies the Word2Vec approach to the whole genome so that DNA k-mers are represented as vectors. Cosine similarity queries on
these DNA vectors reveal unusual relationships between DNA k-mers.
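
As a rough illustration of what a cosine similarity query over k-mer vectors looks like, the sketch below ranks the neighbors of one k-mer. It does not call dna2vec itself: the three-dimensional vectors, the k-mers chosen, and the helper names `cosine_similarity` and `nearest_kmers` are all made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_kmers(query, vectors, top_n=3):
    """Rank every other k-mer by cosine similarity to `query`, best first."""
    scores = [
        (kmer, cosine_similarity(vectors[query], vec))
        for kmer, vec in vectors.items()
        if kmer != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for illustration; real embeddings would come from a
# trained model and have many more dimensions.
toy_vectors = {
    "ACGTACGT": [1.0, 0.0, 0.0],
    "ACGTACGA": [0.9, 0.1, 0.0],
    "TTTTTTTT": [0.0, 1.0, 0.0],
    "GGGGCCCC": [-1.0, 0.0, 0.0],
}

neighbors = nearest_kmers("ACGTACGT", toy_vectors, top_n=2)
```

Here `neighbors` holds the two k-mers whose vectors point most nearly in the same direction as the query's vector, which is the kind of "nearest neighbor" the abstract refers to.
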
In this study, we examined DNA sequence-based prediction of mutation susceptibility. First, we generated word vectors for the human
and mouse genomes via dna2vec. Separately, we retrieved the coordinates of common and all mutations from dbSNP. For each
coordinate, we extracted the 8-nucleotide k-mers (8-mers) intersecting the mutation and aggregated the results so that the number of
mutations for each 8-mer was tabulated. These counts were then combined with the dna2vec cosine similarity data. Our results showed
that, for a given k-mer, the k-mer with the highest cosine similarity coincides with the k-mer with the highest mutation count. In other
words, the neighbor with the highest cosine similarity for a k-mer was also the neighbor with the highest mutation count. For human
and mouse, the overlap between dna2vec similarity and mutation counts was 80% and 70%, respectively. In conclusion, dna2vec and other
word embedding approaches can be used to reveal mutation or variation characteristics of genomes without sequencing or experimental
data, solely using the genome sequence itself. This might pave the way for understanding the underlying mechanisms or dynamics of
mutations in genomes.
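
The tabulation and comparison steps described above might be sketched as follows. This is one possible reading of the abstract, not the author's pipeline: the 0-based coordinates, the toy sequence, the way the "best neighbor" dictionaries are built, and the function names `kmers_overlapping`, `tabulate_mutations`, and `agreement_rate` are all assumptions made for illustration.

```python
from collections import Counter

K = 8  # k-mer length mentioned in the abstract


def kmers_overlapping(sequence, position, k=K):
    """Return every k-mer window of `sequence` that covers 0-based `position`."""
    first_start = max(0, position - k + 1)
    last_start = min(position, len(sequence) - k)
    return [sequence[s:s + k] for s in range(first_start, last_start + 1)]


def tabulate_mutations(sequence, positions, k=K):
    """Count, for each k-mer, how many mutation-covering windows it appears in."""
    counts = Counter()
    for position in positions:
        counts.update(kmers_overlapping(sequence, position, k))
    return counts


def agreement_rate(best_by_similarity, best_by_mutation):
    """Fraction of shared query k-mers whose two 'best neighbor' picks coincide."""
    shared = best_by_similarity.keys() & best_by_mutation.keys()
    if not shared:
        return 0.0
    hits = sum(best_by_similarity[q] == best_by_mutation[q] for q in shared)
    return hits / len(shared)


# Toy inputs, invented for illustration (not dbSNP data).
toy_sequence = "ACGTACGTACGTACGT"
toy_positions = [0, 8]
mutation_counts = tabulate_mutations(toy_sequence, toy_positions)

# Hypothetical per-k-mer picks: the most similar neighbor by cosine similarity
# and the neighbor with the highest mutation count.
best_by_similarity = {"AAAAAAAA": "AAAAAAAC", "CCCCCCCC": "CCCCCCCG"}
best_by_mutation = {"AAAAAAAA": "AAAAAAAC", "CCCCCCCC": "CCCCCCCT"}
rate = agreement_rate(best_by_similarity, best_by_mutation)
```

In this toy case the two picks agree for one of the two query k-mers, so `rate` is 0.5; the 80% and 70% figures in the abstract would correspond to this kind of ratio computed over the real data.
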

Keywords

mutation, word2vec, k-mer, dna2vec, cosine similarity

Details

Primary Language	English
Subjects	Artificial Intelligence
Section	Research Article
Authors	Alper Yılmaz 0000-0002-8827-4887
Publication Date	March 20, 2020
Submission Date	January 14, 2020
Published in Issue	Year 2020, Volume: 3, Issue: 1

Cite

APA	Yılmaz, A. (2020). Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. Journal of Intelligent Systems: Theory and Applications, 3(1), 1-6. https://doi.org/10.38016/jista.674910
AMA	Yılmaz A. Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. jista. March 2020;3(1):1-6. doi:10.38016/jista.674910
Chicago	Yılmaz, Alper. “Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors”. Journal of Intelligent Systems: Theory and Applications 3, no. 1 (March 2020): 1-6. https://doi.org/10.38016/jista.674910.
EndNote	Yılmaz A (01 March 2020) Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. Journal of Intelligent Systems: Theory and Applications 3 1 1–6.
IEEE	A. Yılmaz, “Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors”, jista, vol. 3, no. 1, pp. 1–6, 2020, doi: 10.38016/jista.674910.
ISNAD	Yılmaz, Alper. “Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors”. Journal of Intelligent Systems: Theory and Applications 3/1 (March 2020), 1-6. https://doi.org/10.38016/jista.674910.
JAMA	Yılmaz A. Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. jista. 2020;3:1–6.
MLA	Yılmaz, Alper. “Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors”. Journal of Intelligent Systems: Theory and Applications, vol. 3, no. 1, 2020, pp. 1-6, doi:10.38016/jista.674910.
Vancouver	Yılmaz A. Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors. jista. 2020;3(1):1-6.

Cited By

Genomic Surveillance of COVID-19 Variants With Language Models and Machine Learning

Frontiers in Genetics

https://doi.org/10.3389/fgene.2022.858252

DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models

Frontiers in Medicine

https://doi.org/10.3389/fmed.2025.1503229

Zeki Sistemler Teori ve Uygulamaları Dergisi (Journal of Intelligent Systems: Theory and Applications)