Research Article

A Comparative Benchmark Study of Large Language Models on Turkish NLP Tasks: A Comparison of ChatGPT and DeepSeek

Volume: 9, Issue: 2, 29 December 2025

Abstract

Large language models (LLMs) have demonstrated remarkable success in high-resource languages such as English. Their effectiveness in low-resource, morphologically rich languages like Turkish, however, has not been sufficiently investigated, and despite the growing number of multilingual models, comprehensive, task-diverse benchmarks for Turkish Natural Language Processing (NLP) are still lacking. This study presents a performance evaluation of two leading LLMs, ChatGPT (GPT-4o) and DeepSeek-v3, on Turkish NLP tasks, addressing challenges in low-resource and morphologically complex languages. Five task-specific datasets (Turkish NLP Question-Answering, XQuAD, MLSUM, Turkish News Headlines, and Turkish-English Translation) were used to evaluate model performance on question answering, summarization, headline generation, and translation. The evaluation metrics are BLEU, ROUGE, METEOR, and BERTScore, chosen to measure both syntactic accuracy and semantic relevance. ChatGPT outperformed DeepSeek in most tasks. ChatGPT scored ROUGE-1: 0.52, METEOR: 0.62, and BERTScore: 0.68, while DeepSeek scored 0.26, 0.30, and 0.52, respectively. On the MLSUM dataset, ChatGPT scored BLEU: 0.04 and ROUGE-1: 0.62, compared to DeepSeek’s 0.03 and 0.26. Both models performed equally well on the Turkish News Headlines dataset (ROUGE-1, ROUGE-L, METEOR: 1.0; BLEU: 0.83). In translation, ChatGPT held a slight advantage (BLEU: 0.29 vs. 0.23; METEOR: 0.62 vs. 0.60). Although ChatGPT’s overall average metric score was 18% higher, DeepSeek occasionally performed better in BERTScore, which reflects surface-level semantic matching (e.g., XQuAD: 0.89 vs. 0.61). Error analysis showed that ROUGE-L sometimes penalized semantically valid outputs because of differences in phrasing, such as “1156–1241” vs. “He was born in 1156 and died in 1241”. The findings highlight the need for Turkish-specific LLM development and improved evaluation metrics. This study provides comprehensive comparison data and methodological insights to guide future improvements.
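
To make the ROUGE-L penalty mentioned in the abstract concrete, the sketch below computes a ROUGE-L F1 score from the longest common subsequence (LCS) of two token lists and applies it to the date example. It is only an illustration, not the evaluation code used in the study: the function names, the regex tokenizer, and the plain (unweighted) F1 are assumptions, and the abstract does not say which string was the reference and which was the model output (the F1 form is symmetric, so the result is the same either way).

```python
import re


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of word characters.
    # The en dash in "1156–1241" therefore splits the two years.
    return re.findall(r"\w+", text.lower())


def lcs_length(a, b):
    # Classic dynamic-programming LCS length, keeping one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    # ROUGE-L as the harmonic mean of LCS-based precision and recall.
    ref, cand = tokenize(reference), tokenize(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


short_answer = "1156–1241"
long_answer = "He was born in 1156 and died in 1241"
print(round(rouge_l_f1(short_answer, long_answer), 3))
```

Under these assumptions both years are matched (LCS of 2 against token lists of length 2 and 9), giving an F1 of 4/11, or about 0.36, even though the two answers carry the same information. With whitespace-only tokenization the short answer would be a single token and the score would drop to 0, so the size of the penalty depends on preprocessing choices.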

Keywords

Turkish Natural Language Processing, Large Language Models, Benchmarking Metrics, ChatGPT, DeepSeek

Cite

APA
Ünsal, E., & Karakuş, A. T. (2025). A Comparative Benchmark Study of Large Language Models on Turkish NLP Tasks: A Comparison of ChatGPT and DeepSeek. International Journal of Innovative Engineering Applications, 9(2), 184-192. https://doi.org/10.46460/ijiea.1737325