A Comparative Analysis of GPT-3.5, GPT-4 and GPT-4.o in Heart Failure

Şeyda Günay-polatkan; Deniz Sığırlı

doi:10.32708/uutfd.1543370

Abstract

Digitalization has increasingly penetrated healthcare. Generative artificial intelligence (AI) is a type of AI technology that can generate new content, and patients can use AI-powered chatbots to obtain medical information. Heart failure (HF) is a syndrome with high morbidity and mortality, and patients commonly search for information about it on many websites. This study aimed to assess the accuracy of three large language models (LLMs), ChatGPT-3.5, GPT-4 and GPT-4.o, in answering questions about HF. Thirteen questions regarding the definition, causes, signs and symptoms, complications, treatment and lifestyle recommendations of HF were evaluated. These questions were taken from a previous study in the literature that assessed the knowledge and awareness of medical students about heart failure. Of the students who participated in that previous study, 158 (58.7%) were first-year students and 111 (41.3%) were sixth-year students, who had taken their cardiology internship in their fourth year. The questions were entered in Turkish, and two cardiologists with over ten years of experience evaluated the responses generated by GPT-3.5, GPT-4 and GPT-4.o. ChatGPT-3.5 yielded “correct” responses to 8/13 (61.5%) of the questions, whereas GPT-4 yielded “correct” responses to 11/13 (84.6%). All of the responses of GPT-4.o were accurate and complete. The medical students did not achieve 100% correct answers on any question. This study revealed that the performance of GPT-4.o was superior to that of GPT-3.5 but similar to that of GPT-4.
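The reported percentages follow directly from the counts of correct answers out of thirteen questions. The minimal Python sketch below reproduces that arithmetic and adds an illustrative pairwise comparison. The abstract does not state which statistical test the authors used, so the two-sided Fisher exact test here is an assumption, and the names `percent_correct` and `fisher_exact_two_sided` are hypothetical helpers introduced only for this example.

```python
from math import comb

TOTAL_QUESTIONS = 13

# Counts of "correct" responses as reported in the abstract.
correct_counts = {"GPT-3.5": 8, "GPT-4": 11, "GPT-4.o": 13}


def percent_correct(correct, total=TOTAL_QUESTIONS):
    """Share of correct answers as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)


def fisher_exact_two_sided(correct_a, correct_b, total=TOTAL_QUESTIONS):
    """Two-sided Fisher exact p-value for two models that each answered
    `total` questions. ASSUMPTION: the abstract does not name its test."""
    total_correct = correct_a + correct_b

    def prob(k):
        # Hypergeometric probability that model A has exactly k correct
        # answers, given the combined number of correct answers.
        return (comb(total, k) * comb(total, total_correct - k)
                / comb(2 * total, total_correct))

    observed = prob(correct_a)
    lowest = max(0, total_correct - total)
    highest = min(total, total_correct)
    # Sum the probabilities of all tables at least as extreme as observed.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= observed * (1 + 1e-9))


if __name__ == "__main__":
    for model, correct in correct_counts.items():
        print(model, percent_correct(correct))
    print(fisher_exact_two_sided(correct_counts["GPT-4.o"], correct_counts["GPT-3.5"]))
    print(fisher_exact_two_sided(correct_counts["GPT-4.o"], correct_counts["GPT-4"]))
```

Under this assumed test, the p-value for GPT-4.o versus GPT-3.5 is about 0.04 and for GPT-4.o versus GPT-4 about 0.48, which is in line with the conclusion that GPT-4.o outperformed GPT-3.5 but performed similarly to GPT-4. With only thirteen questions per model, such a comparison is a rough illustration and not the authors' reported analysis.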

Ethical Statement

The article entitled “A Comparative Analysis of GPT-3.5, GPT-4, GPT-4.o and Human Performance in Heart Failure”, which we submitted to the Journal of Uludağ University Medical Faculty, was conducted by posing questions to artificial intelligence models. There were no human participants. As in similar studies in the literature, ethics committee approval is not required for this research.

Details

Primary Language

English

Subjects

Cardiovascular Medicine and Haematology (Other)

Journal Section

Research Article

Authors

Şeyda Günay-polatkan *
0000-0003-0012-345X
Türkiye

Deniz Sığırlı
0000-0002-4006-3263
Türkiye

Publication Date

January 12, 2025

Submission Date

September 4, 2024

Acceptance Date

November 18, 2024

Published in Issue

Year 2024 Volume: 50 Number: 3

DOI

https://doi.org/10.32708/uutfd.1543370

IZ

https://izlik.org/JA29UJ77HM

Cite

APA

Günay-polatkan, Ş., & Sığırlı, D. (2025). A Comparative Analysis of GPT-3.5, GPT-4 and GPT-4.o in Heart Failure. Journal of Uludağ University Medical Faculty, 50(3), 443-447. https://doi.org/10.32708/uutfd.1543370

Cited By

Comparative Analysis of Artificial Intelligence Chatbots for Heart Failure Care

Muğla Sıtkı Koçman Üniversitesi Tıp Dergisi

https://doi.org/10.47572/muskutd.1858353