Eski Dilde Kullanılan Sözcükler Arasındaki Anlamsal Yakınlıkların Doğal Dil İşleme Yöntemleriyle Tespiti

Mustafa Canım

doi:10.17714/gumusfenbil.514154

Research Article

Using NLP Methods For The Discovery Of Semantic Similarities Between Words In Old Turkish Language

Year 2019, Volume: 9 Issue: 3, 536-546, 15.07.2019

Abstract

Applying machine learning techniques to natural language processing (NLP) has
become a very active research area, driven by advances in artificial
intelligence. Despite this popularity, there is no known study applying machine
learning techniques to the old Turkish language. This study aims to fill that
gap: 32,000 pages of text were downloaded from the websites of the Ministry of
Culture, and a two-layer neural network model was trained on them to discover
semantic similarities between words in old Turkish. The algorithm was run with
different parameters, such as window size, dimension size, and sampling size,
and the resulting vector spaces were uploaded to public servers to enable a
RESTful API-based query interface. A web UI was also created so that regular
users can query the system. The services developed can be used for two
purposes. The first is to integrate them into existing old Turkish dictionary
websites offered by third-party providers and by institutions such as the
Ministry of Culture and the Turkish Language Institution. The second is to
reduce the errors made when old Turkish texts are translated into modern
Turkish in the fields of history and Turkish literature. Making these services
publicly available will also encourage other researchers to pursue this line of
work and to compare their results with the experimental results presented in
this paper, so that further improvements can be made in this field.
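
As a rough illustration of how semantic closeness can be read off a trained vector space, the sketch below ranks words by cosine similarity. It is a minimal example and not the paper's implementation: the function names `cosine_similarity` and `most_similar`, and the three-dimensional toy vectors, are invented here for illustration, and cosine similarity is assumed as the closeness measure because it is the usual choice for word embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=3):
    """Return up to top_n (word, score) pairs, most similar first."""
    if word not in vectors:
        return []  # unknown word: nothing to compare against
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for this example; real embeddings would come
# from the trained model and have many more dimensions.
toy_vectors = {
    "gül": [0.9, 0.1, 0.0],
    "bülbül": [0.8, 0.2, 0.1],
    "kılıç": [0.0, 0.1, 0.9],
}

for neighbour, score in most_similar("gül", toy_vectors, top_n=2):
    print(neighbour, round(score, 3))
```

A query service of the kind described above could wrap a function like `most_similar` and return its result as JSON, but the actual request and response format of the published API is not given in the abstract.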

Keywords

Artificial neural networks, Natural language processing, NLP, Word embeddings

Abstract (translated from Turkish)

The use of machine learning techniques in natural language processing has
become a very popular research topic in recent years. Although natural language
processing methods have been applied widely to other languages, their
applications to Turkish, and especially to old-language texts, remain quite
limited. In this study, which aims to address this gap, 32,000 pages of
documents obtained from Ministry of Culture sources were cleaned, and a
two-layer neural network model was then run on the words extracted from these
texts. The vector spaces of models built with many different parameters, such
as window size, vector space dimension, and sampling amount, were copied to a
server, and a query system and RESTful API services were built on them. In
addition, a user portal was created so that the query system can be used
directly, and it was opened to internet use together with the RESTful API. The
work is intended to serve two purposes. The first is to integrate the system
into the websites of institutions such as the Turkish Language Institution and
the Ministry of Culture, and of other companies that provide old-language
dictionary services, so that terms close to the searched words are presented to
users. The second is to reduce the errors that arise when texts are translated
into modern Turkish in scholarly work that uses the old language, such as
history and literature.
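
To make the role of the window-size parameter concrete, the sketch below shows how a cleaned token sequence can be turned into (center, context) training pairs. It assumes a word2vec-style skip-gram setup, which the abstract does not name explicitly, and it uses a fixed window as a simplification. The helper names `clean_tokens` and `context_pairs` are hypothetical, and the cleaning rule (keep runs of letters, drop digits and punctuation) is only a stand-in for whatever cleaning the study actually applied.

```python
import re


def clean_tokens(text):
    """Split text into runs of letters, dropping digits and punctuation.

    Case is left untouched on purpose: Turkish casing is locale-sensitive
    (I/ı, İ/i), so a plain str.lower() would not be a safe default.
    """
    return re.findall(r"[^\W\d_]+", text)


def context_pairs(tokens, window_size):
    """Pair every token with its neighbours within window_size positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical input line, used only to exercise the two helpers.
tokens = clean_tokens("gul, bulbul 12 bahar!")
print(tokens)
print(context_pairs(tokens, window_size=1))
```

A larger window pairs each word with more distant neighbours, which tends to capture broader topical association, while a smaller window keeps the pairs closer to immediate usage.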

Keywords

Natural language processing, Word embeddings, NLP, Artificial neural networks

Details

Primary Language	Turkish
Subjects	Engineering
Section	Articles
Authors	Mustafa Canım 0000-0002-3653-267X
Publication Date	July 15, 2019
Submission Date	January 17, 2019
Acceptance Date	May 7, 2019
Published in Issue	Year 2019 Volume: 9 Issue: 3

Cite

APA	Canım, M. (2019). Eski Dilde Kullanılan Sözcükler Arasındaki Anlamsal Yakınlıkların Doğal Dil İşleme Yöntemleriyle Tespiti. Gümüşhane Üniversitesi Fen Bilimleri Dergisi, 9(3), 536-546. https://doi.org/10.17714/gumusfenbil.514154
