TY - JOUR
T1 - Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash
TT - Kardiyak Acil Durumların Yönetiminde ChatGPT ve Gemini
AU - Günay Polatkan, Şeyda
AU - Sığırlı, Deniz
AU - Durak, Vahide Aslıhan
AU - Alak, Çetin
AU - Kan, Irem Iris
PY - 2025
DA - August
Y2 - 2025
DO - 10.32708/uutfd.1718121
JF - Journal of Uludağ University Medical Faculty
JO - Uludağ Tıp Derg
PB - Bursa Uludağ Üniversitesi
WT - DergiPark
SN - 1300-414X
SP - 239
EP - 246
VL - 51
IS - 2
LA - en
AB - In healthcare, emergent clinical decision-making is complex, and large language models (LLMs) may enhance both the quality and efficiency of care by aiding physicians. Case scenario-based multiple-choice questions (CS-MCQs) are valuable for testing analytical skills and knowledge integration. Moreover, readability is as important as content accuracy. This study aims to compare the diagnostic and treatment capabilities of GPT-4.o and Gemini-1.5-Flash in cardiac emergencies and to evaluate the readability of their responses. A total of 70 single-answer MCQs were randomly selected from the Medscape Case Challenges and ECG Challenges series. The questions concerned cardiac emergencies and were categorized into four subgroups according to whether or not they included a case presentation and whether or not they included an image. The ChatGPT and Gemini platforms were used to assess the selected questions. The Flesch–Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores were used to evaluate the readability of the responses. GPT-4.o had a correct response rate of 65.7%, outperforming Gemini-1.5-Flash, which had a 58.6% correct response rate (p=0.010). When compared by question type, GPT-4.o was inferior to Gemini-1.5-Flash only for non-case questions (52.5% vs. 62.5%, p=0.011). For all other question types, there were no significant performance differences between the two models (p>0.05). Both models performed better on easy questions than on difficult ones, and on questions without images than on those with images. Additionally, GPT-4.o performed better on case questions than on non-case questions. Gemini-1.5-Flash’s FRE score was higher than GPT-4.o’s (median [min-max], 23.75 [0-64.60] vs. 17.0 [0-56.60], p
KW - cardiology
KW - decision making
KW - artificial intelligence
KW - GPT-4.o
KW - Gemini-1.5-Flash
N2 - Sağlık hizmetlerinde, acil klinik karar alma karmaşıktır ve büyük dil modelleri (LLM'ler) hekimlere yardımcı olarak hem bakımın kalitesini hem de verimliliğini artırabilir. Vaka senaryosuna dayalı çoktan seçmeli sorular (VS-ÇSS), analitik becerileri ve bilgi bütünleştirmeyi test etmek için değerlidir. Ayrıca, okunabilirlik, içerik doğruluğu kadar önemlidir. Bu çalışma, GPT-4.o ve Gemini-1.5-Flash'ın tanı ve tedavi yeteneklerini karşılaştırmayı ve kardiyak acil durumlar için yanıtların okunabilirliğini değerlendirmeyi amaçlamaktadır. Medscape Vaka Zorlukları ve EKG Zorlukları serilerinden toplam 70 tek cevaplı ÇSS rastgele seçildi. Sorular kardiyak acil durumlarla ilgiliydi ve sorunun bir vaka sunumu veya bir görüntü içerip içermemesine göre dört alt gruba ayrıldı. Seçilen soruları değerlendirmek için ChatGPT ve Gemini platformları kullanıldı. Yanıtların okunabilirliğini değerlendirmek için Flesch-Kincaid Sınıf Düzeyi (FKGL) ve Flesch Okuma Kolaylığı (FRE) puanları kullanıldı. GPT-4.o'nun doğru yanıt oranı %65,7'ydi ve %58,6 doğru yanıt oranına sahip Gemini-1.5-Flash'ı geride bıraktı (p=0,010). Soru türüne göre karşılaştırıldığında, GPT-4.o yalnızca vaka dışı sorularda Gemini-1.5-Flash'tan daha düşüktü (%52,5'e karşı %62,5, p=0,011). Diğer tüm soru türleri için, iki model arasında önemli bir performans farkı yoktu (p>0,05). Her iki model de kolay sorularda zor sorulara göre ve resimsiz sorularda resimli sorulara göre daha iyi performans gösterdi. Ek olarak, GPT-4.o vaka dışı sorulara göre vaka sorularında daha iyi performans gösterdi. Gemini-1.5-Flash'ın FRE puanı GPT-4.o'dan daha yüksekti (ortanca [min-maks], 23.75 [0-64.60] - 17.0 [0-56.60], p
CR - 1. Labadze L, Grigolia M, Machaidze L. Role of AI chatbots in education: systematic literature review. Int J Educ Technol High Educ. 2023;20(56). doi:10.1186/s41239-023-00416-7
CR - 2. Dwivedi YK, Kshetri N, Hughes L, et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inform Manag. 2023;71:102642. doi:10.1016/j.ijinfomgt.2023.102642
CR - 3. Yenduri G. GPT (Generative Pre-Trained Transformer)—A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access. 2024;12:1-36. doi:10.1109/ACCESS.2024.3389497
CR - 4. Hadi MU, Al-Tashi Q, Qureshi R, et al. Large language models: a comprehensive survey of applications, challenges, limitations, and future prospects. Authorea. Preprint. 2023.
CR - 5. Johnson D, Goodman R, Patrinely J, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. Preprint. 2023. doi:10.21203/rs.3.rs-2924050/v1
CR - 6. Saka A, Taiwo R, Saka N, et al. GPT models in construction industry: opportunities, limitations, and a use case validation. Dev Built Environ. 2024;17:100300. doi:10.1016/j.dibe.2023.100300
CR - 7. Urbina F, Lentzos F, Invernizzi C, Ekins S. Dual use of artificial intelligence-powered drug discovery. Nat Mach Intell. 2022;4(3):189-191. doi:10.1038/s42256-022-00480-0
CR - 8. OpenAI. GPT-4 Technical Report. 2023. Available at: https://cdn.openai.com/papers/gpt-4.pdf. Accessed June 16, 2025.
CR - 9. OpenAI. Introducing GPT-4o and more tools to ChatGPT free users. 2024. Available at: https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/. Accessed June 16, 2025.
CR - 10. Chen CH, Hsieh KY, Huang KE, Lai HY. Comparing vision-capable models, GPT-4 and Gemini, with GPT-3.5 on Taiwan's pulmonologist exam. Cureus. 2024;16(8):e67641. doi:10.7759/cureus.67641
CR - 11. Masanneck L, Schmidt L, Seifert A, et al. Triage performance across large language models, ChatGPT, and untrained doctors in emergency medicine: comparative study. J Med Internet Res. 2024;26:e53297. doi:10.2196/53297
CR - 12. Builoff V, Shanbhag A, Miller RJ, et al. Evaluating AI proficiency in nuclear cardiology: large language models take on the board preparation exam. medRxiv. Preprint. 2024. doi:10.1101/2024.07.16.24310297
CR - 13. Botross M, Mohammadi SO, Montgomery K, Crawford C. Performance of Google's artificial intelligence chatbot "Bard" (now "Gemini") on ophthalmology board exam practice questions. Cureus. 2024;16(3):e57348. doi:10.7759/cureus.57348
CR - 14. Khan MP, O'Sullivan ED. A comparison of the diagnostic ability of large language models in challenging clinical cases. Front Artif Intell. 2024;7:1379297. doi:10.3389/frai.2024.1379297
CR - 15. Hirosawa T, Harada Y, Mizuta K, et al. Evaluating ChatGPT-4's accuracy in identifying final diagnoses within differential diagnoses compared with those of physicians: experimental study for diagnostic cases. JMIR Form Res. 2024;8:e59267. doi:10.2196/59267
CR - 16. Gomez-Cabello CA, Borna S, Pressman SM, Haider SA, Forte AJ. Large language models for intraoperative decision support in plastic surgery: a comparison between ChatGPT-4 and Gemini. Medicina (Kaunas). 2024;60(6):957. doi:10.3390/medicina60060957
CR - 17. Rush R. Assessing readability: formulas and alternatives. Read Teach. 1984;39(3):274-283.
CR - 18. Flesch R. A new readability yardstick. J Appl Psychol. 1948;32(3):221-233. doi:10.1037/h0057532
CR - 19. Medscape. Case Challenges. Available at: https://reference.medscape.com/features/casechallenges?icd=login_success_email_match_norm. Accessed September 6, 2024.
CR - 20. Medscape. Home Page. Available at: https://www.medscape.com/index/section_60_0. Accessed September 6, 2024.
CR - 21. Readable. Flesch Reading Ease and the Flesch Kincaid Grade Level. Available at: https://readable.com/readability/flesch-reading-ease-flesch-kincaid-grade-level/. Accessed June 16, 2025.
CR - 22. Klare GR. The measurement of readability: useful information for communicators. ACM J Comput Doc. 2000;24:107-121.
CR - 23. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: modified essay or multiple choice questions? BMC Med Educ. 2007;7:49. doi:10.1186/1472-6920-7-49
CR - 24. Hays RB, Coventry P, Wilcock D, Hartley K. Short and long multiple-choice question stems in a primary care oriented undergraduate medical curriculum. Educ Prim Care. 2009;20(3):173-177.
CR - 25. Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences. 3rd ed. Philadelphia: National Board of Medical Examiners; 2002.
CR - 26. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582. doi:10.1148/radiol.230582
CR - 27. Rao A, Kim J, Kamineni M, et al. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv. Preprint. 2023. doi:10.1101/2023.02.02.23285399
CR - 28. Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology. 2023;307:e230171.
CR - 29. Momenaei B, Wakabayashi T, Shahlaee A, et al. Appropriateness and readability of ChatGPT-4-generated responses for surgical treatment of retinal diseases. Ophthalmol Retina. 2023;7:862-868.
CR - 30. Al-Sharif EM, Penteado RC, Dib El Jalbout N, et al. Evaluating the accuracy of ChatGPT and Google Bard in fielding oculoplastic patient queries: a comparative study on artificial versus human intelligence. Ophthalmic Plast Reconstr Surg. 2024;40:303-311.
CR - 31. Atkinson CJ, Seth I, Xie Y, et al. Artificial intelligence language model performance for rapid intraoperative queries in plastic surgery: ChatGPT and the deep inferior epigastric perforator flap. J Clin Med. 2024;13:900.
CR - 32. Rizwan A, Sadiq T. The use of AI in diagnosing diseases and providing management plans: a consultation on cardiovascular disorders with ChatGPT. Cureus. 2023;15(8):e43106. doi:10.7759/cureus.43106
CR - 33. Mahendiran T, Thanou D, Senouf O, et al. Deep learning-based prediction of future myocardial infarction using invasive coronary angiography: a feasibility study. Open Heart. 2023;10:e002237.
CR - 34. Skalidis I, Cagnina A, Luangphiphat W, et al. ChatGPT takes on the European Exam in Core Cardiology: an artificial intelligence success story? Eur Heart J Digit Health. 2023;4:279-281.
CR - 35. Herman R, Kisova T, Belmonte M, et al. Artificial intelligence-powered electrocardiogram detecting culprit vessel blood flow abnormality: AI-ECG TIMI study design and rationale. J Soc Cardiovasc Angiogr Interv. 2025;4(3Part B):102494. doi:10.1016/j.jscai.2024.102494
CR - 36. Günay S, Öztürk A, Yiğit Y. The accuracy of Gemini, GPT-4, and GPT-4o in ECG analysis: a comparison with cardiologists and emergency medicine specialists. Am J Emerg Med. 2024;84:68-73. doi:10.1016/j.ajem.2024.07.043
CR - 37. Martínez-Sellés M, Marina-Breysse M. Current and future use of artificial intelligence in electrocardiography. J Cardiovasc Dev Dis. 2023;10(4):175. doi:10.3390/jcdd10040175
CR - 38. Yuan J, Tang R, Jiang X, Hu H. Large language models for healthcare data augmentation: an example on patient-trial matching. AMIA Annu Symp Proc. 2023;2023:1324-1333.
CR - 39. Leslie D, Mazumder A, Peppin A, Wolters MK, Hagerty A. Does “AI” stand for augmenting inequality in the era of COVID-19 healthcare? BMJ. 2021;372:n304.
CR - 40. Zaidi D, Miller T. Implicit bias and machine learning in health care. South Med J. 2023;116:62-64.
UR - https://doi.org/10.32708/uutfd.1718121
L1 - https://dergipark.org.tr/tr/download/article-file/4952341
ER -
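The readability metrics named in the abstract, Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL), are conventionally computed from word, sentence, and syllable counts (see refs. 18 and 21 above). As a reference sketch of the standard formulas only, not a statement of how the study computed its scores, with W, S, and Y denoting total words, sentences, and syllables:

% Standard Flesch formulas; W = total words, S = total sentences, Y = total syllables.
\begin{align*}
\mathrm{FRE}  &= 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W} \\
\mathrm{FKGL} &= 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59
\end{align*}

Higher FRE values indicate easier text, while FKGL approximates a US school grade level; this is why the higher FRE reported for Gemini-1.5-Flash in the abstract corresponds to more readable responses.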