Research Article

Spelling Correction with the Dictionary Method for the Turkish Language Using Word Embeddings
(Turkish title: Kelime Gömmelerini Kullanarak Türkçe Dili İçin Sözlük Metodu ile Yazım Düzeltme)

Year 2020, EJOSAT Special Issue 2020 (ARACONF), 57-63, 01.04.2020
https://doi.org/10.31590/ejosat.araconf8

Abstract

Today, a massive amount of data is being produced, and a significant part of this "big data" consists of text, which has made text processing increasingly important. However, while many studies target the world's major languages, English in particular, relatively few are specific to the Turkish language. Turkish was therefore chosen as the target language for this study. A large, unlabeled Turkish corpus of approximately 10.5 billion words, free of spelling errors, was created, and word vectors were trained on it with the Word2Vec method. Based on this corpus, a new approach called the "Dictionary Method" was proposed: a dictionary covering almost all Turkish words was built from the words in the corpus. Text classification was then applied to a multi-class Turkish dataset containing 10 classes and approximately 1.5 million samples, with the vector values of its tokens transferred from the dictionary via transfer learning. Words not found in the dictionary were treated as misspelled, and an LSTM (Long Short-Term Memory) deep neural network was used to predict correct or semantically similar words as replacements. After this step, the accuracy of the text classification application improved by 8.68%. The Turkish dataset, corpus, and dictionary produced in this work will be shared with researchers to contribute to Turkish text processing studies.
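The pipeline described above (train Word2Vec embeddings on a large clean corpus, build a word-to-vector dictionary, transfer vectors to the tokens of a labeled dataset, and flag out-of-dictionary tokens as misspellings to be replaced) can be illustrated with a short sketch. The snippet below uses Python with the gensim library as an assumed toolchain; the corpus path, hyperparameters, and the sample sentence are hypothetical, and the paper's LSTM replacement step is only indicated, not implemented.

```python
# Minimal sketch of the embedding-dictionary idea described in the abstract.
# Assumptions (not from the paper): file name, hyperparameters, and the
# simple out-of-vocabulary flagging used here in place of the LSTM predictor.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# 1) Train Word2Vec on a large, unlabeled Turkish corpus (one sentence per line).
corpus_path = "turkish_corpus.txt"           # hypothetical corpus file
model = Word2Vec(
    sentences=LineSentence(corpus_path),
    vector_size=300,                         # gensim >= 4.0 argument name
    window=5,
    min_count=5,
    sg=1,                                    # skip-gram
    workers=4,
)

# 2) Build a "dictionary": every known word mapped to its trained vector.
dictionary = {word: model.wv[word] for word in model.wv.key_to_index}

# 3) Transfer vectors to the tokens of a labeled dataset and flag
#    out-of-dictionary tokens, which the study treats as misspellings.
def embed_tokens(tokens):
    vectors, unknown = [], []
    for tok in tokens:
        if tok in dictionary:
            vectors.append(dictionary[tok])
        else:
            unknown.append(tok)              # candidate for correction
    return vectors, unknown

sample = "bu haber ekonomı ile ilgili".split()   # "ekonomı" is a misspelling
vectors, unknown = embed_tokens(sample)
print("unknown tokens:", unknown)
```

In the study itself, the replacement for an unknown token is predicted with an LSTM model trained over the embeddings; for quick experiments, ranking dictionary entries by edit distance or by `model.wv.most_similar` over the neighboring tokens would be a simple stand-in.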

References

  • X. Wu, X. Zhu, G. Q. Wu, W. Ding, Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1), 97-107, (2013).
  • D. Kılınç, A. Özçift, F. Bozyigit, P. Yıldırım, F. Yücalar, E. Borandag, TTC-3600: A new benchmark dataset for Turkish text categorization. Journal of Information Science, 43(2), 174-185, (2015).
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324, (1998).
  • T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR. Scottsdale, (2013).
  • Q. Le, T. Mikolov, Distributed representations of sentences and documents. 31st International Conference on Machine Learning (ICML), China, (2014).
  • J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Qatar, (2014).
  • F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386, (1958).
  • Y. Qi, S. G. Das, R. Collobert, J. Weston, Deep learning for character-based information extraction. Proceedings of the European Conference on Information Retrieval (ECIR), (2014).
  • R. Socher, Y. Bengio, C. D. Manning, Deep learning for NLP (without magic), Tutorial Abstracts of ACL, (2012) 5.
  • M. Tan, C. D. Santos, B. Xiang, B. Zhou, LSTM-based deep learning models for non-factoid answer selection, (2015) arXiv preprint arXiv:1511.04108.
  • P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, (2016) arXiv preprint arXiv:1607.04606.
  • M. Sundermeyer, R. Schlüter, H. Ney, LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, (2012).
  • Y. Wang, M. Huang, L. Zhao, Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, (2016) (pp. 606-615).
  • K. S. Tai, R. Socher, C. D. Manning, Improved semantic representations from tree-structured long short-term memory networks, (2015) arXiv preprint arXiv:1503.00075.
  • C. Zhou, C. Sun, Z. Liu, F. Lau, A C-LSTM neural network for text classification, (2015) arXiv preprint arXiv:1511.08630.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958, (2014).
  • M. Aydogan, A. Karci, Improving the accuracy using pre-trained word embeddings on deep neural networks for Turkish text classification. Physica A: Statistical Mechanics and its Applications, 541, (2019). https://doi.org/10.1016/j.physa.2019.123288


Details

Primary Language: English
Subjects: Engineering
Section: Articles
Authors

Murat Aydoğan (ORCID: 0000-0002-6876-6454)

Ali Karci (ORCID: 0000-0002-8489-8617)

Publication Date: April 1, 2020
Published in Issue: Year 2020, EJOSAT Special Issue 2020 (ARACONF)

How to Cite

APA: Aydoğan, M., & Karci, A. (2020). Spelling Correction with the Dictionary Method for the Turkish Language Using Word Embeddings. Avrupa Bilim ve Teknoloji Dergisi, 57-63. https://doi.org/10.31590/ejosat.araconf8