Research Article

Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4o and Gemini-1.5-Flash

Year 2025, Volume: 51, Issue: 2, 239-246, 28.08.2025
https://doi.org/10.32708/uutfd.1718121

Abstract

In healthcare, emergent clinical decision-making is complex, and large language models (LLMs) may enhance both the quality and efficiency of care by aiding physicians. Case scenario-based multiple-choice questions (CS-MCQs) are valuable for testing analytical skills and knowledge integration. Moreover, readability is as important as content accuracy. This study aims to compare the diagnostic and treatment capabilities of GPT-4o and Gemini-1.5-Flash and to evaluate the readability of their responses for cardiac emergencies. A total of 70 single-answer MCQs were randomly selected from the Medscape Case Challenges and ECG Challenges series. The questions concerned cardiac emergencies and were further categorized into four subgroups according to whether or not they included a case presentation and whether or not they included an image. The ChatGPT and Gemini platforms were used to answer the selected questions. The Flesch–Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores were used to evaluate the readability of the responses. GPT-4o had a correct response rate of 65.7%, outperforming Gemini-1.5-Flash, which had a 58.6% correct response rate (p=0.010). When compared by question type, GPT-4o was inferior to Gemini-1.5-Flash only for non-case questions (52.5% vs. 62.5%, p=0.011); for all other question types, there were no significant performance differences between the two models (p>0.05). Both models performed better on easy questions than on difficult ones, and on questions without images than on those with images. Additionally, GPT-4o performed better on case questions than on non-case questions. Gemini-1.5-Flash's FRE score was higher than GPT-4o's (median [min-max], 23.75 [0-64.60] vs. 17.0 [0-56.60], p<0.001). Although GPT-4o outperformed Gemini-1.5-Flash overall, both models demonstrated an ability to comprehend the case scenarios and provided reasonable answers.
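For orientation, both readability indices cited above reduce to two ratios: average words per sentence and average syllables per word (a higher FRE means easier text, while the FKGL approximates a US school grade). The Python sketch below is a minimal illustration of the standard Flesch formulas, not the tooling used in the study (the abstract does not name one); the vowel-group syllable counter and the sample response text are simplifying assumptions for demonstration only.

import re

def count_syllables(word):
    # Rough heuristic: one syllable per run of consecutive vowels (an assumption,
    # not the validated counter used by dedicated readability software).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text):
    # Standard formulas:
    #   FRE  = 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    #   FKGL = 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    words_per_sent = n_words / n_sent
    syll_per_word = sum(count_syllables(w) for w in words) / n_words
    fre = 206.835 - 1.015 * words_per_sent - 84.6 * syll_per_word
    fkgl = 0.39 * words_per_sent + 11.8 * syll_per_word - 15.59
    return fre, fkgl

# Hypothetical model response used only to show the calculation.
response = ("The ECG shows ST-segment elevation in the inferior leads, "
            "so emergent coronary angiography is indicated.")
fre, fkgl = readability_scores(response)
print(f"FRE = {fre:.1f}, FKGL = {fkgl:.1f}")

Scores computed this way on short, jargon-dense answers can fall near or below zero on the FRE scale, which is consistent with the low medians reported in the abstract.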

References

  • 1. Labadze L, Grigolia M, Machaidze L. Role of AI chatbots in education: systematic literature review. Int J Educ Technol High Educ. 2023;20(56). doi:10.1186/s41239-023-00416-7
  • 2. Dwivedi YK, Kshetri N, Hughes L, et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inform Manag. 2023;71:102642. doi:10.1016/j.ijinfomgt.2023.102642
  • 3. Yenduri G. GPT (Generative Pre-Trained Transformer)—A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access. 2024;12:1-36. doi:10.1109/ACCESS.2024.3389497
  • 4. Hadi MU, Al-Tashi Q, Qureshi R, et al. Large language models: a comprehensive survey of applications, challenges, limitations, and future prospects. Authorea. Preprint. 2023.
  • 5. Johnson D, Goodman R, Patrinely J, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. 2023. doi:10.21203/rs.3.rs-2924050/v1
  • 6. Saka A, Taiwo R, Saka N, et al. GPT models in construction industry: opportunities, limitations, and a use case validation. Dev Built Environ. 2024;17:100300. doi:10.1016/j.dibe.2023.100300
  • 7. Urbina F, Lentzos F, Invernizzi C, Ekins S. Dual use of artificial intelligence-powered drug discovery. Nat Mach Intell. 2022;4(3):189-191. doi:10.1038/s42256-022-00480-0
  • 8. OpenAI. GPT-4 Technical Report. 2023. Available at: https://cdn.openai.com/papers/gpt-4.pdf. Accessed June 16, 2025.
  • 9. OpenAI. Introducing GPT-4o and more tools to ChatGPT free users. 2024. Available at: https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/. Accessed June 16, 2025.
  • 10. Chen CH, Hsieh KY, Huang KE, Lai HY. Comparing vision-capable models, GPT-4 and Gemini, with GPT-3.5 on Taiwan's pulmonologist exam. Cureus. 2024;16(8):e67641. doi:10.7759/cureus.67641
  • 11. Masanneck L, Schmidt L, Seifert A, et al. Triage performance across large language models, ChatGPT, and untrained doctors in emergency medicine: comparative study. J Med Internet Res. 2024;26:e53297. doi:10.2196/53297
  • 12. Builoff V, Shanbhag A, Miller RJ, et al. Evaluating AI proficiency in nuclear cardiology: large language models take on the board preparation exam. medRxiv. Preprint. 2024. doi:10.1101/2024.07.16.24310297
  • 13. Botross M, Mohammadi SO, Montgomery K, Crawford C. Performance of Google's artificial intelligence chatbot "Bard" (now "Gemini") on ophthalmology board exam practice questions. Cureus. 2024;16(3):e57348. doi:10.7759/cureus.57348
  • 14. Khan MP, O'Sullivan ED. A comparison of the diagnostic ability of large language models in challenging clinical cases. Front Artif Intell. 2024;7:1379297. doi:10.3389/frai.2024.1379297
  • 15. Hirosawa T, Harada Y, Mizuta K, et al. Evaluating ChatGPT-4's accuracy in identifying final diagnoses within differential diagnoses compared with those of physicians: experimental study for diagnostic cases. JMIR Form Res. 2024;8:e59267. doi:10.2196/59267
  • 16. Gomez-Cabello CA, Borna S, Pressman SM, Haider SA, Forte AJ. Large language models for intraoperative decision support in plastic surgery: a comparison between ChatGPT-4 and Gemini. Medicina (Kaunas). 2024;60(6):957. doi:10.3390/medicina60060957
  • 17. Rush R. Assessing readability: formulas and alternatives. Read Teach. 1984;39(3):274-283.
  • 18. Flesch R. A new readability yardstick. J Appl Psychol. 1948;32(3):221-233. doi:10.1037/h0057532
  • 19. Medscape. Case Challenges. Available at: https://reference.medscape.com/features/casechallenges?icd=login_success_email_match_norm. Accessed September 6, 2024.
  • 20. Medscape. Home Page. Available at: https://www.medscape.com/index/section_60_0. Accessed September 6, 2024.
  • 21. Readable. Flesch Reading Ease and the Flesch Kincaid Grade Level. Available at: https://readable.com/readability/flesch-reading-ease-flesch-kincaid-grade-level/. Accessed June 16, 2025.
  • 22. Klare GR. The measurement of readability: useful information for communicators. ACM J Comput Doc. 2000;24:107-121.
  • 23. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: modified essay or multiple choice questions? BMC Med Educ. 2007;7:49. doi:10.1186/1472-6920-7-49
  • 24. Hays RB, Coventry P, Wilcock D, Hartley K. Short and long multiple-choice question stems in a primary care oriented undergraduate medical curriculum. Educ Prim Care. 2009;20(3):173-177.
  • 25. Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences. 3rd ed. Philadelphia: National Board of Medical Examiners; 2002.
  • 26. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582. doi:10.1148/radiol.230582
  • 27. Rao A, Kim J, Kamineni M, et al. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv. 2023. doi:10.1101/2023.02.02.23285399
  • 28. Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology. 2023;307:e230171.
  • 29. Momenaei B, Wakabayashi T, Shahlaee A, et al. Appropriateness and readability of ChatGPT-4-generated responses for surgical treatment of retinal diseases. Ophthalmol Retina. 2023;7:862-868.
  • 30. Al-Sharif EM, Penteado RC, Dib El Jalbout N, et al. Evaluating the accuracy of ChatGPT and Google Bard in fielding oculoplastic patient queries: a comparative study on artificial versus human intelligence. Ophthalmic Plast Reconstr Surg. 2024;40:303-311.
  • 31. Atkinson CJ, Seth I, Xie Y, et al. Artificial intelligence language model performance for rapid intraoperative queries in plastic surgery: ChatGPT and the deep inferior epigastric perforator flap. J Clin Med. 2024;13:900.
  • 32. Rizwan A, Sadiq T. The use of AI in diagnosing diseases and providing management plans: a consultation on cardiovascular disorders with ChatGPT. Cureus. 2023;15(8):e43106. doi:10.7759/cureus.43106
  • 33. Mahendiran T, Thanou D, Senouf O, et al. Deep learning-based prediction of future myocardial infarction using invasive coronary angiography: a feasibility study. Open Heart. 2023;10:e002237.
  • 34. Skalidis I, Cagnina A, Luangphiphat W, et al. ChatGPT takes on the European Exam in Core Cardiology: an artificial intelligence success story? Eur Heart J Digit Health. 2023;4:279-281.
  • 35. Herman R, Kisova T, Belmonte M, et al. Artificial intelligence-powered electrocardiogram detecting culprit vessel blood flow abnormality: AI-ECG TIMI study design and rationale. J Soc Cardiovasc Angiogr Interv. 2025;4(3Part B):102494. doi:10.1016/j.jscai.2024.102494
  • 36. Günay S, Öztürk A, Yiğit Y. The accuracy of Gemini, GPT-4, and GPT-4o in ECG analysis: a comparison with cardiologists and emergency medicine specialists. Am J Emerg Med. 2024;84:68-73. doi:10.1016/j.ajem.2024.07.043
  • 37. Martínez-Sellés M, Marina-Breysse M. Current and future use of artificial intelligence in electrocardiography. J Cardiovasc Dev Dis. 2023;10(4):175. doi:10.3390/jcdd10040175
  • 38. Yuan J, Tang R, Jiang X, Hu H. Large language models for healthcare data augmentation: an example on patient-trial matching. AMIA Annu Symp Proc. 2023;2023:1324-1333.
  • 39. Leslie D, Mazumder A, Peppin A, Wolters MK, Hagerty A. Does “AI” stand for augmenting inequality in the era of COVID-19 healthcare? BMJ. 2021;372:n304.
  • 40. Zaidi D, Miller T. Implicit bias and machine learning in health care. South Med J. 2023;116:62-64.

Turkish Title: Kardiyak Acil Durumların Yönetiminde ChatGPT ve Gemini (ChatGPT and Gemini in the Management of Cardiac Emergencies)

Yıl 2025, Cilt: 51 Sayı: 2, 239 - 246, 28.08.2025
https://doi.org/10.32708/uutfd.1718121

Öz

Sağlık hizmetlerinde, acil klinik karar alma karmaşıktır ve büyük dil modelleri (LLM'ler) hekimlere yardımcı olarak hem bakımın kalitesini hem de verimliliğini artırabilir. Vaka senaryosuna dayalı çoktan seçmeli sorular (VS-ÇSS), analitik becerileri ve bilgi bütünleştirmeyi test etmek için değerlidir. Ayrıca, okunabilirlik, içerik doğruluğu kadar önemlidir. Bu çalışma, GPT-4.o ve Gemini-1.5-Flash'ın tanı ve tedavi yeteneklerini karşılaştırmayı ve kardiyak acil durumlar için yanıtların okunabilirliğini değerlendirmeyi amaçlamaktadır. Medscape Vaka Zorlukları ve EKG Zorlukları serilerinden toplam 70 tek cevaplı ÇSS rastgele seçildi. Sorular kardiyak acil durumlarla ilgiliydi ve sorunun bir vaka sunumu veya bir görüntü içerip içermemesine göre dört alt gruba ayrıldı. Seçilen soruları değerlendirmek için CahtGPT ve Gemini platformları kullanıldı. Yanıtların okunabilirliğini değerlendirmek için Flesch-Kincaid Sınıf Düzeyi (FKGL) ve Flesch Okuma Kolaylığı (FRE) puanları kullanıldı. GPT-4.o'nun doğru yanıt oranı %65,7'ydi ve %58,6 doğru yanıt oranına sahip Gemini-1.5-Flash'ı geride bıraktı (p=0,010). Soru türüne göre karşılaştırıldığında, GPT-4.o yalnızca vaka dışı sorularda Gemini-1.5-Flash'tan daha düşüktü (%52,5'e karşı %62,5, p=0,011). Diğer tüm soru türleri için, iki model arasında önemli bir performans farkı yoktu (p>0,05). Her iki model de kolay sorularda zor sorulara göre ve resimsiz sorularda resimli sorulara göre daha iyi performans gösterdi. Ek olarak, GPT-4.o vaka dışı sorulara göre vaka sorularında daha iyi performans gösterdi. Gemini-1.5-Flash'ın FRE puanı GPT-4.o'dan daha yüksekti (ortanca [min-maks], 23.75 [0-64.60] - 17.0 [0-56.60], p<0.001). Her ne kadar toplamda GPT-4.o, Gemini-1.5-Flash'tan daha iyi performans gösterse de, her iki model de durum senaryolarını anlama becerisi gösterdi ve makul yanıtlar sağladı.


Details

Primary Language: English
Subjects: Emergency Medicine
Section: Original Research Articles
Authors

Şeyda Günay Polatkan 0000-0003-0012-345X

Deniz Sığırlı 0000-0002-4006-3263

Vahide Aslıhan Durak 0000-0003-0836-7862

Çetin Alak 0000-0003-1875-2078

Irem Iris Kan 0000-0002-1600-9531

Publication Date: August 28, 2025
Submission Date: June 12, 2025
Acceptance Date: July 2, 2025
Published in Issue: Year 2025, Volume: 51, Issue: 2

How to Cite

AMA Günay Polatkan Ş, Sığırlı D, Durak VA, Alak Ç, Kan II. Performance of Generative AI Models on Cardiology Practice in Emergency Service: A Pilot Evaluation of GPT-4o and Gemini-1.5-Flash. Uludağ Tıp Derg. August 2025;51(2):239-246. doi:10.32708/uutfd.1718121

ISSN: 1300-414X, e-ISSN: 2645-9027



Journal of Uludag University Medical Faculty is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
