Spelling Correction with the Dictionary Method for the Turkish Language Using Word Embeddings
Abstract
Today, a massive amount of data is being produced, commonly referred to as “big data.” A significant part of big data consists of text, which has made text processing all the more important. However, a review of text processing studies shows that while many studies target widely used languages, especially English, few have been published specifically for Turkish. Therefore, Turkish was chosen as the target language for this study. A Turkish corpus of approximately 10.5 billion words was created, consisting of unlabeled data containing no spelling errors. Word vectors were trained on this corpus using the Word2Vec method. Based on this corpus, a new method called the “dictionary method” was proposed, and a dictionary covering almost all known Turkish words was created. Text classification was then applied to a multi-class Turkish dataset containing 10 classes and approximately 1.5 million samples. Vector values for the tokens in this dataset were transferred from the dictionary by transfer learning. Words not found in the dictionary were considered incorrect; for these, the proposed method uses LSTM (Long Short-Term Memory), a deep neural network (DNN) architecture, to predict correct or similar words as replacements. Following this process, the accuracy rate improved by 8.68%. The Turkish dataset, corpus, and dictionary created in this study will be shared with researchers to contribute to Turkish text processing studies.
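The lookup-and-replace step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `dictionary`, its vector values, and the `correct_tokens` helper are hypothetical, and the LSTM-based prediction of replacement words is swapped for a simple string-similarity lookup (`difflib.get_close_matches`) so that the example stays self-contained.

```python
from difflib import get_close_matches

# Hypothetical toy dictionary: word -> embedding vector (values are made up).
# In the study, the vectors come from Word2Vec trained on the large corpus.
dictionary = {
    "kitap": [0.12, -0.40, 0.33],
    "kalem": [0.05, 0.27, -0.18],
    "okul": [-0.31, 0.09, 0.44],
}


def correct_tokens(tokens, dictionary, cutoff=0.6):
    """Return (word, vector) pairs; out-of-dictionary tokens are replaced when possible."""
    result = []
    for token in tokens:
        if token in dictionary:
            # Known word: transfer its vector directly from the dictionary.
            result.append((token, dictionary[token]))
            continue
        # Unknown word: treated as misspelled. Assumption: string similarity
        # stands in here for the LSTM predictor described in the abstract.
        candidates = get_close_matches(token, list(dictionary), n=1, cutoff=cutoff)
        if candidates:
            replacement = candidates[0]
            result.append((replacement, dictionary[replacement]))
        else:
            # No sufficiently similar word: keep the token without a vector.
            result.append((token, None))
    return result


corrected = correct_tokens(["kitap", "kalm", "okkul", "xyz"], dictionary)
for word, vector in corrected:
    print(word, vector)
```

In this sketch, a token with no sufficiently similar dictionary entry is kept without a vector; how the original method handles such cases is not stated in the abstract.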
Details
Primary Language
English
Subjects
Engineering
Section
Research Article
Publication Date
April 1, 2020
Submission Date
March 15, 2020
Acceptance Date
March 27, 2020
Published in Issue
Year 2020
APA
Aydoğan, M., & Karci, A. (2020). Spelling Correction with the Dictionary Method for the Turkish Language Using Word Embeddings. Avrupa Bilim ve Teknoloji Dergisi, 57-63. https://doi.org/10.31590/ejosat.araconf8