A Comparative Benchmark Study of Large Language Models on Turkish NLP Tasks: A Comparison of ChatGPT and DeepSeek

Emre Ünsal; Ahmet Turan Karakuş

doi:10.46460/ijiea.1737325

Abstract

Large language models (LLMs) have demonstrated remarkable success in high-resource languages such as English. However, their effectiveness in low-resource and morphologically rich languages such as Turkish has not been sufficiently investigated, and despite the increasing development of multilingual models, comprehensive, task-diverse benchmarking for Turkish Natural Language Processing (NLP) is still lacking. This study presents a performance evaluation of two leading LLMs, ChatGPT (GPT-4o) and DeepSeek-v3, on Turkish NLP tasks. Five task-specific datasets (Turkish NLP Question-Answering, XQuAD, MLSUM, Turkish News Headlines, and Turkish-English Translation) were used to evaluate model performance on question answering, summarization, headline generation, and translation. The evaluation metrics are BLEU, ROUGE, METEOR, and BERTScore, which together measure both syntactic accuracy and semantic relevance. ChatGPT outperformed DeepSeek in most tasks. GPT scored ROUGE-1: 0.52, METEOR: 0.62, and BERTScore: 0.68, while DeepSeek scored 0.26, 0.30, and 0.52, respectively. On the MLSUM dataset, GPT scored BLEU: 0.04 and ROUGE-1: 0.62, compared to DeepSeek’s 0.03 and 0.26. Both models performed equally well on the Turkish News Headlines dataset (ROUGE-1, ROUGE-L, METEOR: 1.0; BLEU: 0.83). For translation, GPT held a slight advantage (BLEU: 0.29 vs. 0.23; METEOR: 0.62 vs. 0.60). Although GPT’s overall average metric score was 18% higher, DeepSeek occasionally performed better in BERTScore, which reflects surface-level semantic matching (e.g., XQuAD: 0.89 vs. 0.61). Error analysis showed that semantically valid outputs were sometimes penalized by ROUGE-L because of differences in expression, such as “1156–1241” versus “He was born in 1156 and died in 1241”. The findings highlight the need for Turkish-specific LLM development and improved evaluation metrics. This study provides comprehensive comparison data and methodological insights to guide future improvements.
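The ROUGE-L penalty described in the error analysis can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names (`tokenize`, `lcs_length`, `rouge_l`) are hypothetical, the tokenizer is a simple lowercase word split rather than the one used by any particular ROUGE package, and treating “1156–1241” as the reference answer is an assumption, since the abstract does not say which string was the reference.

```python
import re


def tokenize(text):
    # Assumption: lowercase word tokens; real ROUGE tools may tokenize differently.
    return re.findall(r"\w+", text.lower())


def lcs_length(a, b):
    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    # Returns (precision, recall, F1) based on the LCS of the two token lists.
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed roles: short date range as reference, full sentence as model output.
p, r, f = rouge_l("1156–1241", "He was born in 1156 and died in 1241")
print(round(p, 2), round(r, 2), round(f, 2))
```

Under these assumptions the two strings share only the tokens "1156" and "1241", so recall is 1.0 but precision is 2/9 and the F1 score is about 0.36, even though both answers carry the same information.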

Keywords

Turkish Natural Language Processing, Large Language Models, Benchmarking Metrics, ChatGPT, DeepSeek

Turkish title: Türkçe Doğal Dil İşleme Görevlerinde Büyük Dil Modellerinin Karşılaştırmalı Bir Kıyaslama Çalışması: ChatGPT ve DeepSeek Karşılaştırması

Details

Primary Language

English

Subjects

Software Engineering (Other)

Section

Research Article

Authors

Emre Ünsal*
ORCID: 0000-0001-6042-0742
Türkiye

Ahmet Turan Karakuş
ORCID: 0009-0001-2941-3250
Türkiye

Publication Date

December 29, 2025

Submission Date

July 8, 2025

Acceptance Date

December 13, 2025

Published in Issue

Year: 2025, Volume: 9, Issue: 2

DOI

https://doi.org/10.46460/ijiea.1737325

IZ

https://izlik.org/JA42TE25EJ

APA

Ünsal, E., & Karakuş, A. T. (2025). A Comparative Benchmark Study of Large Language Models on Turkish NLP Tasks: A Comparison of ChatGPT and DeepSeek. International Journal of Innovative Engineering Applications, 9(2), 184-192. https://doi.org/10.46460/ijiea.1737325
