Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi

Merve Nur Türk; Revas Akın; Durmuş Özkan Şahin; Sercan Demirci

doi:10.34248/bsengineering.1913122

Performance and Consistency Analysis of Large Language Models for Mental Health Applications

Abstract

This study compares the performance of nine large language models (LLMs) from different providers and at different architectural scales on mental health question-answering tasks. The models were evaluated under identical experimental conditions on a mental health dataset, and their generated responses were compared with reference answers. The evaluation used multiple automatic metrics covering both surface and semantic similarity. The findings reveal marked performance differences between the models. GPT-4o stands out in overall quality and consistency, while GPT-3.5 Turbo produces results similar to those of larger models in some tasks, indicating that model scale does not always confer a linear advantage. The open-source LLaMA 3.3 model is noteworthy for performing close to commercial systems. These findings indicate that model selection for digital mental health applications depends not only on technical capacity but also on task suitability. Quantitatively, GPT-4o achieved the highest performance on the semantic metrics, with a COMET score of 0.7277 and a METEOR score of 0.1480, indicating greater consistency than the other models.
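To make the idea of comparing a generated response with a reference answer concrete, the following is a minimal sketch of one surface-similarity score, an LCS-based F-measure in the spirit of ROUGE-L. It is an illustration only: the abstract does not say which metric implementations or tokenization the study used, so the function names, the whitespace tokenization, and the sample strings are all assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 between a model answer and a reference answer.

    Assumption: lowercase whitespace tokenization, which is simpler than
    what standard ROUGE packages do.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical reference answer and model outputs, not taken from the study.
    reference = "Talking to a trusted person can help reduce stress"
    answers = {
        "model_a": "Talking to a trusted person can reduce stress",
        "model_b": "Exercise is good for you",
    }
    for name, answer in answers.items():
        print(name, round(rouge_l_f1(answer, reference), 4))
```

A score like this rewards shared word order only, which is why it is usually paired with semantic metrics that can credit a correct paraphrase.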

Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi

Abstract

This study aims to comparatively examine the performance of eight large language models (LLMs) from different providers and at different architectural scales on question-answering tasks in the mental health domain. The models were evaluated under the same experimental conditions using a mental health dataset, and the responses they produced were compared with reference answers. The evaluation drew on multiple automatic metrics that account for both surface and semantic similarity. The findings show marked performance differences between the models. GPT-4o in particular stands out in overall quality and consistency, while the fact that GPT-3.5 Turbo produces results similar to larger models in some tasks shows that model scale does not provide a linear advantage in every case. It is notable that the open-source LLaMA 3.3 model performs close to commercial systems. These findings show that model selection in digital mental health applications depends on task fit as much as on technical capacity. The numerical results show that GPT-4o achieved the highest performance on the semantic metrics; with a COMET score of 0.7277 and a METEOR score of 0.1480, it showed high consistency compared with the other models.

Ethical Statement

No studies on animals or humans were conducted in this research, so ethics committee approval was not obtained.


Details

Primary Language

Turkish

Subjects

Information Systems (Other)

Journal Section

Research Article

Authors

Merve Nur Türk
0009-0009-1061-5854
Türkiye

Revas Akın
0009-0001-4855-7102
Türkiye

Durmuş Özkan Şahin*
0000-0002-0831-7825
Türkiye

Sercan Demirci
0000-0001-6739-7653
Türkiye

Publication Date

May 15, 2026

Submission Date

March 19, 2026

Acceptance Date

May 7, 2026

Published in Issue

Year 2026 Volume: 9 Number: 3

DOI

https://doi.org/10.34248/bsengineering.1913122

IZ

https://izlik.org/JA47AK36RR

Cite

APA

Türk, M. N., Akın, R., Şahin, D. Ö., & Demirci, S. (2026). Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi. Black Sea Journal of Engineering and Science, 9(3), 1338-1349. https://doi.org/10.34248/bsengineering.1913122

AMA

1. Türk MN, Akın R, Şahin DÖ, Demirci S. Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi. BSJ Eng. Sci. 2026;9(3):1338-1349. doi:10.34248/bsengineering.1913122

Chicago

Türk, Merve Nur, Revas Akın, Durmuş Özkan Şahin, and Sercan Demirci. 2026. “Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans Ve Tutarlılık Analizi”. Black Sea Journal of Engineering and Science 9 (3): 1338-49. https://doi.org/10.34248/bsengineering.1913122.

EndNote

Türk MN, Akın R, Şahin DÖ, Demirci S (May 1, 2026) Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi. Black Sea Journal of Engineering and Science 9 3 1338–1349.

IEEE

[1] M. N. Türk, R. Akın, D. Ö. Şahin, and S. Demirci, “Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi”, BSJ Eng. Sci., vol. 9, no. 3, pp. 1338–1349, May 2026, doi: 10.34248/bsengineering.1913122.

ISNAD

Türk, Merve Nur - Akın, Revas - Şahin, Durmuş Özkan - Demirci, Sercan. “Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans Ve Tutarlılık Analizi”. Black Sea Journal of Engineering and Science 9/3 (May 1, 2026): 1338-1349. https://doi.org/10.34248/bsengineering.1913122.

JAMA

1. Türk MN, Akın R, Şahin DÖ, Demirci S. Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi. BSJ Eng. Sci. 2026;9:1338–1349.

MLA

Türk, Merve Nur, et al. “Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans Ve Tutarlılık Analizi”. Black Sea Journal of Engineering and Science, vol. 9, no. 3, May 2026, pp. 1338-49, doi:10.34248/bsengineering.1913122.

Vancouver

1. Merve Nur Türk, Revas Akın, Durmuş Özkan Şahin, Sercan Demirci. Büyük Dil Modellerinin Ruh Sağlığı Uygulamaları İçin Performans ve Tutarlılık Analizi. BSJ Eng. Sci. 2026 May 1;9(3):1338-49. doi:10.34248/bsengineering.1913122