Research Article

Gemini 2.5 Pro and ChatGPT-5 on the ATLS Exam: Accuracy, Consistency, and Comparison with Physicians

Volume: 6 Number: 1 January 19, 2026

Abstract

Objective: This study aimed to evaluate the accuracy and consistency of two current large language models (LLMs), Gemini 2.5 Pro and ChatGPT-5, on the Advanced Trauma Life Support (ATLS) exam. It also aimed to compare the two artificial intelligence (AI) models with emergency medicine residents and to examine their performance across question types. Materials and Methods: This observational study used the 2023 ATLS exam, which consists of 40 multiple-choice questions. Questions were categorized as either basic-knowledge or scenario-based. To measure response consistency, each question was administered six times to Gemini 2.5 Pro and ChatGPT-5, and once to each of six emergency medicine residents. Accuracy rates were calculated and compared for all examinees. Results: Gemini 2.5 Pro achieved an overall accuracy of 95.8%, ChatGPT-5 achieved 92.9%, and the residents achieved 67.1%. Both AI models performed significantly better than the residents (p < 0.001), and no significant difference was found between Gemini and ChatGPT (p = 0.17). Both models were less accurate on scenario-based questions than on knowledge questions, and their response consistency across repeated exams was moderate. Conclusion: Both Gemini 2.5 Pro and ChatGPT-5 passed the ATLS exam with higher success rates and more consistent performance than the residents. These findings highlight the potential of LLMs as tools for trauma education, rapid access to information, and, potentially, clinical decision support.


Details

Primary Language

English

Subjects

Emergency Medicine

Journal Section

Research Article

Publication Date

January 19, 2026

Submission Date

August 22, 2025

Acceptance Date

October 6, 2025

Published in Issue

Year 2026 Volume: 6 Number: 1