Research Article

Using NLP Methods For The Discovery Of Semantic Similarities Between Words In Old Turkish Language

Year 2019, Volume: 9 Issue: 3, 536 - 546, 15.07.2019
https://doi.org/10.17714/gumusfenbil.514154

Abstract

Applying machine learning (ML) techniques to natural language processing (NLP)
has become a very active research area, driven by advances in artificial
intelligence. Despite the popularity of the field, there is no known study that
applies ML techniques to the old Turkish language. This study aims to fill that
gap: 32,000 pages of text were downloaded from the websites of the Ministry of
Culture, and a two-layer neural network model was trained on them to discover
semantic similarities between words in old Turkish. The algorithm was run with
different parameters, such as window size, dimension size and sampling size, and
the resulting vector spaces were uploaded to public servers to enable a RESTful
API-based query interface. A web UI was also created so that regular users can
query the models. The services developed can be used for two purposes. The first
is integration into existing old Turkish dictionary websites made available by
third-party providers and by institutions such as the Ministry of Culture and
the Turkish Language Institution. The second is to reduce the errors made when
old Turkish texts are translated into modern Turkish in the fields of history
and Turkish literature. Making these services publicly available should also
encourage other researchers to pursue this line of work and to compare their
results with the experimental results presented in this paper, leading to
further improvements in the field.
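
To make the idea of querying a vector space for semantically close words concrete, the following is a minimal sketch rather than the study's implementation. It assumes cosine similarity as the closeness measure, which is a common choice for word vectors but is not stated in the abstract. The three-dimensional vectors and the function names `cosine` and `most_similar` are invented for illustration.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
# Vector spaces trained on a real corpus have far more words and dimensions.
VECTORS = {
    "gül": [0.9, 0.1, 0.0],
    "bülbül": [0.8, 0.3, 0.1],
    "şarap": [0.1, 0.9, 0.2],
    "mey": [0.2, 0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return up to `topn` (word, score) pairs, most similar first."""
    if word not in vectors:
        raise KeyError(f"word not in vocabulary: {word}")
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


if __name__ == "__main__":
    for neighbour, score in most_similar("gül", VECTORS):
        print(f"{neighbour}\t{score:.3f}")
```

With these toy vectors, the nearest neighbour of "gül" is "bülbül"; the ranking reflects only the made-up numbers above, not any result from the paper.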

References

  • Adıgüzel, H., Şahin, P., Kalpaklı, M., 2012. Line segmentation of Ottoman documents. 20th Signal Processing and Communications Applications Conference, Fethiye Mugla, Turkey.
  • Arifoğlu, D., Duygulu, P., 2011. Word retrieval in ottoman documents. IEEE 19th Signal Processing and Communications Applications Conference, Antalya, Turkey.
  • Ataer, E., Duygulu, P., 2007. Matching ottoman words: an image retrieval approach to historical document indexing. Proceedings of the 6th ACM international conference on Image and video retrieval, Amsterdam, Netherlands.
  • Basu, M., Roy, A., Ghosh, K., Bandyopadhyay, S., Ghosh, S., 2017. A Novel Word Embedding Based Stemming Approach for Microblog Retrieval During Disasters. 39th European Conference on Information Retrieval, Scotland, UK.
  • Can, B., 2017. Unsupervised learning of allomorphs in Turkish. Turkish Journal of Electrical Engineering & Computer Sciences, 25(4), 3253-3260.
  • Chris, B., Faralli, S., Panchenko, A., Ponzetto, S., 2018. A framework for enriching lexical semantic resources with distributional semantics. Natural Language Engineering, Cambridge University Press, 24(1), 265-312.
  • Church, K. W., 2017. Word2Vec. Natural Language Engineering: Cambridge University Press, 155 p.
  • Deniz, K., Özçift, A., Bozyigit, F., Yıldırım, P., Yücalar, F., Borandag, E., 2017. TTC-3600: A new benchmark dataset for Turkish text categorization. Journal of Information Science, 43(2), 174-185.
  • Ganggao, Z., Iglesias, C. A., 2017. Computing Semantic Similarity of Concepts in Knowledge Graphs. IEEE Transactions on Knowledge and Data Engineering, 29(1), 72-85.
  • İlgen, B., Adalı, E., Tantuğ, A., 2016. Exploring feature sets for Turkish word sense disambiguation. Turkish Journal of Electrical Engineering & Computer Sciences, 24(1), 4391-4405.
  • Kalender, M., Korkmaz, E. E., 2018. THINKER - Entity Linking System for Turkish Language. IEEE Transactions on Knowledge and Data Engineering, 30(2), 367-380.
  • Kaya, Y., Ertugrul, O., 2016. A novel feature extraction approach for text-based language identification: Binary patterns. Journal of the Faculty of Engineering and Architecture of Gazi University, 31(4).
  • Kılıç, N., Gorgel, P., Ucan, N., Kala, A., 2008. Multifont Ottoman character recognition using support vector machine. In Communications, Control and Signal Processing. St Julians, Malta.
  • Lushan, H., Finin, T., McNamee, P., Joshi, A., Yesha, Y., 2013. Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy. IEEE Transactions on Knowledge and Data Engineering, 25(6), 1307-1322.
  • Marrero, M., Urbano, J., 2018. A Semi-automatic and low-cost method to learn patterns for named entity recognition. Natural Language Engineering, 24(1), 39-75.
  • Metin, B., Amasyalı, M., 2017. Dependency parsing with stacked conditional random fields for Turkish. Journal of the Faculty of Engineering and Architecture of Gazi University, 32(2).
  • Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space: arXiv preprint, 1301.3781.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J., 2013. Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada.
  • Ozturk, A., Gunes, S., Ozbay, Y., 2000. Multifont Ottoman character recognition. 7th IEEE International Conference on Electronics, Circuits and Systems, Montreal, Quebec.
  • Peipei, L., Wang, H., Zhu, K. Q., Wang, Z., Hu, X., Wu, X., 2015. A Large Probabilistic Semantic Network Based Approach to Compute Term Similarity. IEEE Transactions on Knowledge and Data Engineering, 27(1), 2604-2617.
  • Soon, W. M., Ng, H. T., Lim, D., 2001. A machine learning approach to coreference resolution of noun phrases. Computational linguistics, 27(4), 521-544.
  • URL-1, 2016. http://ekitap.kulturturizm.gov.tr/TR,78353/divanlar-ve-mesneviler.html
  • URL-2, 2018. http://snowball.tartarus.org/
  • URL-3, 2019. https://code.google.com/archive/p/word2vec/source/default/source
  • URL-4, 2019. http://snowball.tartarus.org/algorithms/turkish/stemmer.html
  • URL-5, 2019. https://code.google.com/archive/p/word2vec/

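Eski Dilde Kullanılan Sözcükler Arasındaki Anlamsal Yakınlıkların Doğal Dil İşleme Yöntemleriyle Tespiti (Detecting Semantic Similarities Between Words Used in the Old Language with Natural Language Processing Methods)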


Abstract

The use of machine learning techniques in natural language processing has become
a very popular research topic in recent years. Although natural language
processing methods have been applied widely to other languages, their
applications to Turkish, and especially to old-language texts, remain quite
limited. In this study, which aims to address that gap, 32,000 pages of
documents obtained from Ministry of Culture sources were cleaned, and a
two-layer neural network model was then run on the words extracted from these
texts. The vector spaces of the models, which were built with many different
parameters such as window size, space dimension and sampling amount, were copied
to a server, and a query system and RESTful API services were created. In
addition, a user portal was created so that the query system can be used
directly, and it was made available on the internet together with the RESTful
API. The work is intended to serve two purposes. The first is to integrate the
system into the websites of institutions such as the Turkish Language
Institution and the Ministry of Culture, and of other companies that provide
old-language dictionary services, so that terms close to the searched words are
returned to users. The second is to reduce the errors that arise when texts are
translated into modern Turkish in scholarly work that deals with the old
language, such as history and literature.
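
As an illustration of how a RESTful query over precomputed similarities might be handled, here is a small sketch using only the Python standard library. Everything specific in it is an assumption and not taken from the article: the `/similar` endpoint, the `word` and `topn` parameters, the JSON response shape, and the sample neighbour lists are all hypothetical.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical precomputed neighbours, invented for illustration.
# A real service would look these up in the trained vector space.
NEIGHBOURS = {
    "mey": ["şarap", "bade"],
}


def handle_query(url, neighbours=NEIGHBOURS):
    """Map a GET request URL to a (status_code, json_body) pair."""
    parsed = urlparse(url)
    if parsed.path != "/similar":  # hypothetical endpoint name
        return 404, json.dumps({"error": "unknown endpoint"})

    params = parse_qs(parsed.query)
    word = params.get("word", [""])[0]
    if not word:
        return 400, json.dumps({"error": "missing 'word' parameter"})

    try:
        topn = int(params.get("topn", ["10"])[0])
    except ValueError:
        return 400, json.dumps({"error": "'topn' must be an integer"})
    if topn < 1:
        return 400, json.dumps({"error": "'topn' must be at least 1"})

    # Unknown words yield an empty list instead of an error.
    results = neighbours.get(word, [])[:topn]
    body = json.dumps({"word": word, "similar": results}, ensure_ascii=False)
    return 200, body


if __name__ == "__main__":
    print(handle_query("/similar?word=mey&topn=1"))
```

Keeping the request handling in a plain function makes it easy to test without starting a server; the same function could be called from any HTTP framework's request handler.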

There are 26 citations in total.

Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors: Mustafa Canım (ORCID 0000-0002-3653-267X)

Publication Date: July 15, 2019
Submission Date: January 17, 2019
Acceptance Date: May 7, 2019
Published in Issue: Year 2019, Volume: 9, Issue: 3

Cite

APA Canım, M. (2019). Eski Dilde Kullanılan Sözcükler Arasındaki Anlamsal Yakınlıkların Doğal Dil İşleme Yöntemleriyle Tespiti. Gümüşhane Üniversitesi Fen Bilimleri Dergisi, 9(3), 536-546. https://doi.org/10.17714/gumusfenbil.514154