DIAGNOSTIC PERFORMANCE OF CHATGPT, GEMINI, AND DEEPSEEK IN CLINICAL DECISION SUPPORT: A COMPARATIVE ANALYSIS

Yıldız Büyükdereli Atadağ; Tuba Kalınkara Seyhan; Umut Hilaloğlu; Hatice Tuba Akbayram

doi:10.21763/tjfmpc.1809491

Abstract

Objective: This study aimed to compare the diagnostic performance of three large language model (LLM)-based artificial intelligence (AI) tools (ChatGPT-4, Gemini 2.0 Flash, and DeepSeek-V3) in supporting initial clinical decision-making using standardized clinical scenarios.

Methods: Thirty-six clinical scenarios representing five major clinical domains were selected from the diagnostic algorithms in the Guide to Diagnostic Tests (7th ed.). For each scenario, only the first decision step of the relevant diagnostic algorithm was assessed. All questions were presented in Turkish and entered once, using identical prompts, into the publicly available free versions of the three models. Responses were evaluated using a three-point categorical accuracy system (completely correct, partially correct, incorrect).

Results: ChatGPT achieved the highest total score (40/72), followed by Gemini and DeepSeek (36/72 each); the difference was not statistically significant (p>0.05). ChatGPT gave completely correct responses in 36.1% of scenarios, compared with 33.3% for Gemini and 22.2% for DeepSeek. The completely correct responses of ChatGPT and Gemini showed overlapping patterns, although this overlap did not reach statistical significance (p>0.05). Performance varied by category: Gemini performed best in electrolyte disorders and ChatGPT in infectious and systemic conditions, whereas DeepSeek matched the other models only in endocrinology and hematology.
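
The abstract does not state how the three categories map to points. A maximum of 72 over 36 scenarios suggests 2 points for a completely correct response, 1 for a partially correct one, and 0 for an incorrect one; that mapping is an assumption here, not something the abstract confirms. Under it, the reported totals and percentages imply the category counts computed in the sketch below. The function name `implied_counts` and the derived counts are illustrative and are not figures reported by the authors.

```python
# Sketch: recover per-category counts from the reported totals and percentages.
# ASSUMPTION: scoring is 2 (completely correct), 1 (partially correct),
# 0 (incorrect). This is inferred from the 72-point maximum over 36 scenarios.

SCENARIOS = 36

# Values taken from the Results paragraph of the abstract.
REPORTED = {
    "ChatGPT": {"total": 40, "pct_complete": 36.1},
    "Gemini": {"total": 36, "pct_complete": 33.3},
    "DeepSeek": {"total": 36, "pct_complete": 22.2},
}


def implied_counts(total, pct_complete, n=SCENARIOS):
    """Return the category counts implied by a total score and a
    completely-correct percentage, under the assumed 2/1/0 scoring."""
    complete = round(pct_complete / 100 * n)
    partial = total - 2 * complete  # points left over after the 2-point answers
    incorrect = n - complete - partial
    if partial < 0 or incorrect < 0:
        raise ValueError("reported figures are inconsistent with 2/1/0 scoring")
    return {"complete": complete, "partial": partial, "incorrect": incorrect}


if __name__ == "__main__":
    for model, figures in REPORTED.items():
        print(model, implied_counts(**figures))
    # ChatGPT  -> 13 complete, 14 partial,  9 incorrect
    # Gemini   -> 12 complete, 12 partial, 12 incorrect
    # DeepSeek ->  8 complete, 20 partial,  8 incorrect
```

If the assumption holds, Gemini and DeepSeek reach the same total by different routes: DeepSeek has fewer completely correct answers but more partially correct ones.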

Conclusion: While all models showed some diagnostic potential, none reached a level of accuracy sufficient to replace clinical judgment. However, when used for the initial step of diagnostic reasoning based on limited clinical information, these models may offer supportive value to clinicians, particularly when integrated into broader clinical decision-support systems.

Supporting Institution

None

Ethical Statement

Because the study did not use human or animal data, ethics committee approval was not required. The research was conducted in accordance with the principles of scientific integrity and the ethical standards of the Declaration of Helsinki.

Thanks

None

Details

Primary Language

English

Subjects

Family Medicine

Journal Section

Short Report

Authors

Yıldız Büyükdereli Atadağ*
0000-0002-8516-6477
Türkiye

Tuba Kalınkara Seyhan
0009-0000-2612-1810
Türkiye

Umut Hilaloğlu
0009-0005-4669-5607
Türkiye

Hatice Tuba Akbayram
0000-0002-9777-9596
Türkiye

Early Pub Date

May 10, 2026

Publication Date

June 1, 2026

Submission Date

October 23, 2025

Acceptance Date

March 23, 2026

Published in Issue

Year: 2026, Volume: 20, Number: 2

DOI

https://doi.org/10.21763/tjfmpc.1809491

IZ

https://izlik.org/JA99ZP49ZR

Cite

APA

Büyükdereli Atadağ, Y., Kalınkara Seyhan, T., Hilaloğlu, U., & Akbayram, H. T. (2026). DIAGNOSTIC PERFORMANCE OF CHATGPT, GEMINI, AND DEEPSEEK IN CLINICAL DECISION SUPPORT: A COMPARATIVE ANALYSIS. Turkish Journal of Family Medicine and Primary Care, 20(2), 214-219. https://doi.org/10.21763/tjfmpc.1809491
