Comparative Analysis of Turkish Proverbs and Idioms Using Natural Language Processing-Based Direct and Semantic Matching Methods
Abstract
This study examined the use of Turkish idioms and proverbs in datasets obtained from various digital environments. Three datasets were used, with the idiom and proverb dictionary prepared by the Turkish Language Association serving as the reference: Turkish news articles, Twitter data, and data from the Ekşi Sözlük website. Two matching methods were applied: direct matching and semantics-based matching. For the latter, five language models based on semantic similarity were used, and the matching performance on proverbs and idioms was evaluated with the SBERT, LaBSE, USE, E5, and DistilBERT models. The results showed that idioms are used more widely in language than proverbs. Models with higher coverage produced more matches but lower precision, while more selective models achieved higher precision. When model performance was evaluated with the F1-score, DistilBERT showed the most balanced performance; the SBERT and E5 models stood out for their high coverage, and the LaBSE and USE models achieved higher precision despite lower recall. The results provide an assessment of how proverbs and idioms are conveyed in different environments and of how language models perceive these elements.
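The two matching strategies compared in the abstract can be sketched as follows — a minimal illustration, not the paper's actual pipeline. The embedding vectors, the similarity threshold of 0.8, and the helper names are hypothetical; in practice the vectors would come from one of the listed models (SBERT, LaBSE, USE, E5, DistilBERT):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def direct_match(text, expression):
    """Direct matching: exact substring search after lowercasing."""
    return expression.lower() in text.lower()

def semantic_match(text_vec, expr_vec, threshold=0.8):
    """Semantic matching: count a hit when the embeddings of the
    text and the dictionary expression exceed a similarity threshold."""
    return cosine_similarity(text_vec, expr_vec) >= threshold
```

Direct matching only finds verbatim occurrences of a dictionary entry, while semantic matching can also flag inflected or paraphrased uses, which is why higher-coverage models trade precision for recall.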
References
- Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., & Guo, W. (2013). SEM 2013 shared task: Semantic textual similarity. In M. Diab, T. Baldwin, & M. Baroni (Eds.), Second Joint Conference on Lexical and Computational Semantics (SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity (pp. 32–43). Association for Computational Linguistics.
- Akın, A. A., & Akın, M. D. (2007). Zemberek, an open source NLP framework for Turkic languages. Structure, 10, 1–5.
- Arslan, A. (2020). Sözlü kültür ürünlerinin aktarımında medya, toplum ve kuşaklararası etkileşim [Media, society, and intergenerational interaction in the transmission of oral culture products]. Uluslararası Sosyal Bilimler Akademi Dergisi, 4, 1037–1053. https://doi.org/10.47994/usbad.808429
- Bayol, E. M. (2022). trnlp 0.2.3a0: Türkçe doğal dil işleme araçları [Turkish natural language processing tools] [Computer software]. GitHub. https://github.com/brolin59/trnlp
- Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In L. Màrquez, C. Callison-Burch, & J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 632–642). Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1075
- Briskilal, J., & Subalalitha, C. (2022). An ensemble model for classifying idioms and literal texts using BERT and RoBERTa. Information Processing & Management, 59(1), Article 102756. https://doi.org/10.1016/j.ipm.2021.102756
- Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y. H., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder. arXiv:1803.11175. https://doi.org/10.48550/arXiv.1803.11175
- Davis, E. (2021). Quantifying proverb dynamics in books, news articles, and tweets [Master's thesis, The University of Vermont and State Agricultural College]. The University of Vermont ScholarWorks. https://scholarworks.uvm.edu/graddis/1394/
- Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT sentence embedding. arXiv:2007.01852. https://doi.org/10.48550/arXiv.2007.01852
Details
Primary Language
English
Subjects
Electrical Engineering (Other)
Journal Section
Research Article
Publication Date
December 22, 2025
Submission Date
August 26, 2025
Acceptance Date
October 20, 2025
Published in Issue
Year 2025 Volume: 3 Number: 2