Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash

Şeyda Günay Polatkan; Deniz Sığırlı; Vahide Aslıhan Durak; Çetin Alak; Irem Iris Kan

doi:10.32708/uutfd.1718121

Abstract

Emergency clinical decision-making is complex, and large language models (LLMs) may improve both the quality and the efficiency of care by assisting physicians. Case scenario-based multiple-choice questions (CS-MCQs) are valuable for testing analytical skills and knowledge integration, and the readability of a response matters as much as its accuracy. This study compares the diagnostic and treatment capabilities of GPT-4.o and Gemini-1.5-Flash in cardiac emergencies and evaluates the readability of their responses. A total of 70 single-answer MCQs on cardiac emergencies were randomly selected from the Medscape Case Challenges and ECG Challenges series and categorized into four subgroups according to whether the question included a case presentation and whether it included an image. The ChatGPT and Gemini platforms were used to answer the selected questions. The readability of the responses was evaluated with the Flesch–Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores. GPT-4.o had a correct response rate of 65.7%, outperforming Gemini-1.5-Flash, which had a correct response rate of 58.6% (p=0.010). By question type, GPT-4.o was inferior to Gemini-1.5-Flash only on non-case questions (52.5% vs. 62.5%, p=0.011); for all other question types, the two models did not differ significantly (p>0.05). Both models performed better on easy questions than on difficult ones, and on questions without images than on those with images. Additionally, GPT-4.o performed better on case questions than on non-case questions. Gemini-1.5-Flash’s FRE score was higher than GPT-4.o’s (median [min-max], 23.75 [0-64.60] vs. 17.0 [0-56.60], p<0.001). Although GPT-4.o outperformed Gemini-1.5-Flash overall, both models demonstrated an ability to comprehend the case scenarios and provided reasonable answers.
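The two readability scores named in the abstract follow standard published formulas, which the sketch below illustrates. It is not the authors' tooling: the abstract does not say which calculator was used, and the function names, the regular expressions for splitting sentences and words, and the vowel-group syllable counter are all assumptions made for illustration. Dedicated readability tools count syllables more carefully, so their scores will differ somewhat from this sketch. A higher FRE means easier text, while FKGL approximates a US school grade level.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for English text using the standard Flesch formulas."""
    # Naive sentence split on ., !, ? (decimal numbers would be over-split).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease: higher means easier to read.
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    dense = ("Electrocardiographic interpretation necessitates "
             "considerable cardiological expertise.")
    for sample in (simple, dense):
        fre, fkgl = readability(sample)
        print(f"FRE={fre:.1f}  FKGL={fkgl:.1f}  {sample}")
```

The raw formulas are unbounded, so very dense text can yield a negative FRE. The minimum of 0 reported in the abstract suggests the calculator used in the study may clamp scores to a 0 to 100 range, but that is an inference, not something the abstract states.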

Turkish title: ChatGPT and Gemini in the Management of Cardiac Emergencies

Details

Primary Language

English

Subjects

Emergency Medicine

Section

Research Article

Authors

Şeyda Günay Polatkan ^*
0000-0003-0012-345X
Türkiye

Deniz Sığırlı
0000-0002-4006-3263
Türkiye

Vahide Aslıhan Durak
0000-0003-0836-7862
Türkiye

Çetin Alak
0000-0003-1875-2078
Türkiye

Irem Iris Kan
0000-0002-1600-9531
Türkiye

Publication Date

August 28, 2025

Submission Date

June 12, 2025

Acceptance Date

July 2, 2025

Published in Issue

Year: 2025, Volume: 51, Issue: 2

DOI

https://doi.org/10.32708/uutfd.1718121

IZ

https://izlik.org/JA94RU89FB

Cite

APA

Günay Polatkan, Ş., Sığırlı, D., Durak, V. A., Alak, Ç., & Kan, I. I. (2025). Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash. Journal of Uludağ University Medical Faculty, 51(2), 239-246. https://doi.org/10.32708/uutfd.1718121

AMA

1. Günay Polatkan Ş, Sığırlı D, Durak VA, Alak Ç, Kan II. Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash. Uludağ Tıp Derg. 2025;51(2):239-246. doi:10.32708/uutfd.1718121

Chicago

Günay Polatkan, Şeyda, Deniz Sığırlı, Vahide Aslıhan Durak, Çetin Alak, and Irem Iris Kan. 2025. “Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash”. Journal of Uludağ University Medical Faculty 51 (2): 239-46. https://doi.org/10.32708/uutfd.1718121.

EndNote

Günay Polatkan Ş, Sığırlı D, Durak VA, Alak Ç, Kan II (August 1, 2025) Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash. Journal of Uludağ University Medical Faculty 51 2 239–246.

IEEE

[1] Ş. Günay Polatkan, D. Sığırlı, V. A. Durak, Ç. Alak, and I. I. Kan, “Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash”, Uludağ Tıp Derg, vol. 51, no. 2, pp. 239–246, Aug. 2025, doi: 10.32708/uutfd.1718121.

ISNAD

Günay Polatkan, Şeyda - Sığırlı, Deniz - Durak, Vahide Aslıhan - Alak, Çetin - Kan, Irem Iris. “Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash”. Journal of Uludağ University Medical Faculty 51/2 (August 1, 2025): 239-246. https://doi.org/10.32708/uutfd.1718121.

JAMA

1. Günay Polatkan Ş, Sığırlı D, Durak VA, Alak Ç, Kan II. Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash. Uludağ Tıp Derg. 2025;51:239–246.

MLA

Günay Polatkan, Şeyda, et al. “Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash”. Journal of Uludağ University Medical Faculty, vol. 51, no. 2, Aug. 2025, pp. 239-46, doi:10.32708/uutfd.1718121.

Vancouver

1. Şeyda Günay Polatkan, Deniz Sığırlı, Vahide Aslıhan Durak, Çetin Alak, Irem Iris Kan. Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4.o and Gemini-1.5-Flash. Uludağ Tıp Derg. 01 August 2025;51(2):239-46. doi:10.32708/uutfd.1718121