Research Article

Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study

Year 2026, Volume: 9, Issue: 1, 1–9, 26.03.2026
https://doi.org/10.65396/ejra.1846059
https://izlik.org/JA37CX44UL

Abstract
Objective: To compare the diagnostic accuracy of two advanced large language models (LLMs), ChatGPT-o1 and DeepSeek-V3, in expert-validated simulated otorhinolaryngology cases, and to assess subspecialty-specific performance and inter-rater agreement relative to human specialists.
Methods: A cross-sectional diagnostic accuracy study was conducted using 70 expert-validated clinical vignettes across five ENT subspecialties. Two academic otolaryngologists and two LLMs independently evaluated each case. Both LLMs operated in deterministic mode (temperature = 0) with standardized single-pass prompting in isolated sessions. Diagnostic accuracy, inter-rater agreement (Cohen’s κ), and subspecialty-specific performance were analyzed. A post hoc power analysis (Cohen’s h = 0.22; α = 0.05) assessed the ability to detect moderate effect sizes.
Results: Both LLMs achieved a diagnostic accuracy of 90.0% (63/70), with no significant difference between them (p = 1.00) and substantial inter-model agreement (κ = 0.68). Human evaluators achieved accuracies of 97.1% and 92.9%, with fair inter-rater agreement (κ = 0.26). Subspecialty performance was highest in otology and pediatric ENT (100%) and rhinology (92.3%), with greater variability observed in laryngology and head and neck surgery. Shared error patterns included overestimation of malignancy in high-risk patients. Post hoc power analysis demonstrated 78% power to detect moderate differences.
Conclusion: In controlled, vignette-based evaluations, ChatGPT-o1 and DeepSeek-V3 demonstrated diagnostic accuracy approaching expert-level performance across simulated ENT scenarios, with strong inter-model agreement and subspecialty-dependent variability. These findings highlight the potential of LLMs as diagnostic decision-support tools while underscoring the need for multimodal and real-world validation before clinical implementation.
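For readers unfamiliar with the two agreement and effect-size statistics reported above, the following minimal sketch (plain Python, no external dependencies) shows how Cohen's κ is computed from two raters' binary correct/incorrect labels and how Cohen's h is derived from a pair of proportions. The rater vectors below are hypothetical illustrations, not the study's data.

```python
import math

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' labels (e.g., 1 = correct, 0 = incorrect)."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # observed agreement: fraction of cases where both raters gave the same label
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # chance agreement: product of each rater's marginal label frequencies
    labels = set(rater1) | set(rater2)
    p_chance = sum((rater1.count(l) / n) * (rater2.count(l) / n) for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transformation)."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# Hypothetical example: two raters over four cases
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))  # agreement no better than chance -> 0.0
print(cohens_h(0.971, 0.900))                     # effect size between two accuracies
```

Note that κ penalizes agreement expected by chance alone, which is why two raters with high raw accuracy can still show only fair agreement (e.g., the κ = 0.26 reported for the human evaluators above).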

Ethical Statement

Formal ethics committee approval was not required as this study involved only simulated clinical scenarios without real patient data or human subject involvement.

Supporting Institution

No financial support was received.

References

  • 1. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
  • 2. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? JMIR Med Educ. 2023;9:e45312.
  • 3. Tessler I, Wolfovitz A, Alon EE, et al. ChatGPT's adherence to otolaryngology clinical practice guidelines. Eur Arch Otorhinolaryngol. 2024;281(7):3829–3834.
  • 4. Teixeira-Marques F, Medeiros N, Nazaré F, et al. Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study. Eur Arch Otorhinolaryngol. 2024;281(4):2023–2030.
  • 5. Vaira LA, Lechien JR, Abbate V, et al. Accuracy of ChatGPT-generated information on head and neck and oromaxillofacial surgery: a multicenter collaborative analysis. Otolaryngol Head Neck Surg. 2023;170(6):1492–1503.
  • 6. Hoch CC, Wollenberg B, Lüers JC, et al. ChatGPT's quiz skills in different otolaryngology subspecialties. Eur Arch Otorhinolaryngol. 2023;280(9):4271–4278.
  • 7. Buhr CR, Smith H, Huppertz T, et al. Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology. Acta Otolaryngol. 2024;144(3):237–242.
  • 8. Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol. 2024;281(5):2717–2721.
  • 9. Karimov Z, Allahverdiyev I, Agayarov OY, Demir D, Almuradova E. ChatGPT vs UpToDate: comparative study of usefulness and reliability of Chatbot in common clinical presentations of otorhinolaryngology-head and neck surgery. Eur Arch Otorhinolaryngol. 2024;281(4):2145–2151.
  • 10. Wang L, Li J, Zhuang B, et al. Accuracy of large language models when answering clinical research questions: systematic review and network meta-analysis. J Med Internet Res. 2025;27:e64486.
  • 11. McDuff D, Schaekermann M, Tu T, et al. Towards accurate differential diagnosis with large language models. Nat Mach Intell. 2025;7:155–165.
  • 12. Goh E, Gallo R, Hom J, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open. 2024;7(10):e2440969.
  • 13. Long C, Subburam D, Lowe K, et al. ChatENT: augmented large language model for expert knowledge retrieval in otolaryngology. Otolaryngol Head Neck Surg. 2024;171(3):1042–1051.
  • 14. Bellinger JR, Kwak MW, Ramos GA, Mella JS, Mattos JL. Quantitative comparison of chatbots on common rhinology pathologies. Laryngoscope. 2024;134(10):4225–4231.
  • 15. Qu RW, Qureshi U, Petersen G, Lee SC. Diagnostic and management applications of ChatGPT in structured otolaryngology clinical scenarios. OTO Open. 2023;7(3):e67.
  • 16. Lorenzi A, Pugliese G, Maniaci A, et al. Reliability of large language models for advanced head and neck malignancies management: comparison between ChatGPT 4 and Gemini Advanced. Eur Arch Otorhinolaryngol. 2024;281(9):5001–5006.
  • 17. Pamuk E, Bilen YE, Külekçi Ç, Kuşcu O. ChatGPT-4 vs. multi-disciplinary tumor board decisions for the therapeutic management of primary laryngeal cancer. Acta Otolaryngol. 2025;145(8):714–719.
  • 18. Vural Camalan B, Doluoglu S, Taraf NH, Gunay MM, Ozlugedik S. ChatGPT versus DeepSeek in head and neck cancer staging and treatment planning: guideline-based study. Eur Arch Otorhinolaryngol. 2025;282(9):4815–4824.
There are 18 citations in total.

Details

Primary Language English
Subjects Otorhinolaryngology
Journal Section Research Article
Authors

Nazlım Hilal Taraf

Burcu Vural Çamalan 0000-0002-4157-3396

Sümeyra Doluoğlu 0000-0002-7264-6578

Erhan Arslan 0000-0002-6799-8907

Ahmet Ural 0000-0002-6088-1415

Gülbin Demiroğlu

Atilla Elhan Elhan

Samet Özlügedik

Submission Date December 21, 2025
Acceptance Date February 5, 2026
Publication Date March 26, 2026
DOI https://doi.org/10.65396/ejra.1846059
IZ https://izlik.org/JA37CX44UL
Published in Issue Year 2026 Volume: 9 Issue: 1

Cite

APA Taraf, N. H., Vural Çamalan, B., Doluoğlu, S., Arslan, E., Ural, A., Demiroğlu, G., Elhan, A. E., & Özlügedik, S. (2026). Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study. European Journal of Rhinology and Allergy, 9(1), 1-9. https://doi.org/10.65396/ejra.1846059
AMA 1. Taraf NH, Vural Çamalan B, Doluoğlu S, et al. Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study. Eur J Rhinol Allergy. 2026;9(1):1-9. doi:10.65396/ejra.1846059
Chicago Taraf, Nazlım Hilal, Burcu Vural Çamalan, Sümeyra Doluoğlu, et al. 2026. “Diagnostic Performance of ChatGPT-O1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study”. European Journal of Rhinology and Allergy 9 (1): 1-9. https://doi.org/10.65396/ejra.1846059.
EndNote Taraf NH, Vural Çamalan B, Doluoğlu S, Arslan E, Ural A, Demiroğlu G, Elhan AE, Özlügedik S (March 1, 2026) Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study. European Journal of Rhinology and Allergy 9 1 1–9.
IEEE [1] N. H. Taraf et al., “Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study”, Eur J Rhinol Allergy, vol. 9, no. 1, pp. 1–9, Mar. 2026, doi: 10.65396/ejra.1846059.
ISNAD Taraf, Nazlım Hilal - Vural Çamalan, Burcu - Doluoğlu, Sümeyra - Arslan, Erhan - Ural, Ahmet - Demiroğlu, Gülbin - Elhan, Atilla Elhan - Özlügedik, Samet. “Diagnostic Performance of ChatGPT-O1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study”. European Journal of Rhinology and Allergy 9/1 (March 1, 2026): 1-9. https://doi.org/10.65396/ejra.1846059.
JAMA 1. Taraf NH, Vural Çamalan B, Doluoğlu S, Arslan E, Ural A, Demiroğlu G, Elhan AE, Özlügedik S. Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study. Eur J Rhinol Allergy. 2026;9:1–9.
MLA Taraf, Nazlım Hilal, et al. “Diagnostic Performance of ChatGPT-O1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study”. European Journal of Rhinology and Allergy, vol. 9, no. 1, Mar. 2026, pp. 1-9, doi:10.65396/ejra.1846059.
Vancouver 1. Nazlım Hilal Taraf, Burcu Vural Çamalan, Sümeyra Doluoğlu, Erhan Arslan, Ahmet Ural, Gülbin Demiroğlu, Atilla Elhan Elhan, Samet Özlügedik. Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear Nose and Throat Scenarios: A Comparative Accuracy Study. Eur J Rhinol Allergy. 2026 Mar. 1;9(1):1-9. doi:10.65396/ejra.1846059

You can find the current version of the Instructions to Authors at: https://www.eurjrhinol.org/en/instructions-to-authors-104

Starting in 2020, all content published in the journal is licensed under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 International
License, which allows third parties to share and adapt the content for non-commercial purposes as long as they credit the original work,
promoting the dissemination and use of the research published in the journal.
Content published before 2020 was licensed under traditional copyright, but the archive remains freely accessible.