Comparative Analysis of Turkish Proverbs and Idioms Using Natural Language Processing-Based Direct and Semantic Matching Methods
Abstract
This study examined the use of Turkish idioms and proverbs in datasets obtained from various digital environments. Three datasets were used, with the idiom and proverb dictionary prepared by the Turkish Language Association serving as the reference: Turkish news articles, Twitter data, and data from the Ekşi Sözlük website. Two matching methods were applied: direct matching and semantics-based matching. For the latter, five language models based on semantic similarity were used, and the matching performance on proverbs and idioms was evaluated with the SBERT, LaBSE, USE, E5, and DistilBERT models. The results showed that idioms are used more widely in language than proverbs. Models with higher coverage produced more matches but lower precision, while more selective models achieved higher precision. When model performance was evaluated with the F1-score, DistilBERT showed the most balanced performance; the SBERT and E5 models stood out for their high coverage, and the LaBSE and USE models achieved higher precision despite lower recall. The results provide an assessment of how proverbs and idioms are conveyed in different environments and of how language models perceive these elements.
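The two matching strategies compared in the abstract can be sketched as follows — a minimal illustration, not the paper's actual pipeline. The embedding vectors, the similarity threshold of 0.8, and the helper names are hypothetical; in practice the vectors would come from one of the listed models (SBERT, LaBSE, USE, E5, DistilBERT):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def direct_match(text, expression):
    """Direct matching: exact substring search after lowercasing."""
    return expression.lower() in text.lower()

def semantic_match(text_vec, expr_vec, threshold=0.8):
    """Semantic matching: count a hit when the embeddings of the
    text and the dictionary expression exceed a similarity threshold."""
    return cosine_similarity(text_vec, expr_vec) >= threshold
```

Direct matching only finds verbatim occurrences of a dictionary entry, while semantic matching can also flag inflected or paraphrased uses, which is why higher-coverage models trade precision for recall.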
References
- Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., & Guo, W. (2013). SEM 2013 shared task: Semantic textual similarity. In M. Diab, T. Baldwin, & M. Baroni (Eds.), Second Joint Conference on Lexical and Computational Semantics (SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity (pp. 32–43). Association for Computational Linguistics.
- Akın, A. A., & Akın, M. D. (2007). Zemberek, an open source NLP framework for Turkic languages. Structure, 10, 1–5.
- Arslan, A. (2020). Sözlü kültür ürünlerinin aktarımında medya, toplum ve kuşaklararası etkileşim [Media, society, and intergenerational interaction in the transmission of oral culture products]. Uluslararası Sosyal Bilimler Akademi Dergisi, 4, 1037–1053. https://doi.org/10.47994/usbad.808429
- Bayol, E. M. (2022). trnlp 0.2.3a0: Türkçe doğal dil işleme araçları [Turkish natural language processing tools] [Computer software]. GitHub. https://github.com/brolin59/trnlp
- Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In L. Màrquez, C. Callison-Burch, & J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 632–642). Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1075
- Briskilal, J., & Subalalitha, C. (2022). An ensemble model for classifying idioms and literal texts using BERT and RoBERTa. Information Processing & Management, 59(1), Article 102756. https://doi.org/10.1016/j.ipm.2021.102756
- Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y. H., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder. arXiv:1803.11175. https://doi.org/10.48550/arXiv.1803.11175
- Davis, E. (2021). Quantifying proverb dynamics in books, news articles, and tweets [Master's thesis, The University of Vermont and State Agricultural College]. The University of Vermont ScholarWorks. https://scholarworks.uvm.edu/graddis/1394/
- Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT sentence embedding. arXiv:2007.01852. https://doi.org/10.48550/arXiv.2007.01852
Details
Primary Language
English
Subjects
Electrical Engineering (Other)
Journal Section
Research Article
Publication Date
December 22, 2025
Submission Date
August 26, 2025
Acceptance Date
October 20, 2025
Published in Issue
Year 2025 Volume: 3 Number: 2