Research Article

Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students

Volume: 24 Issue: 74, 22 December 2025

Abstract

Background: Medical education in Türkiye is delivered through a six-year, discipline-based curriculum aligned with global trends. The assessment process relies largely on multiple-choice questions (MCQs), placing a significant preparation burden on faculty members. AI-powered large language models such as ChatGPT have the potential to ease exam preparation, enhance feedback quality, and support personalized learning. The aim of this study is to evaluate how successfully the ChatGPT-4o model answers MCQs on medical education exams. Additionally, by comparing its exam performance and consistency with student achievement, we explore the potential benefits of AI-supported models for medical education. Methods: This cross-sectional, analytical study was carried out at the [XX] University Faculty of Medicine in Türkiye. During the 2023–2024 academic year, ChatGPT solved multiple-choice questions from seven board exams and one final exam for third-year students, and its results were compared with the students' scores. Statistical analysis included descriptive statistics, correlation analyses, chi-square tests, McNemar tests, and independent-samples t-tests. Results: With a correct-response rate of 90.2%, ChatGPT outperformed all 293 students in the class. There was no significant difference in the correct-response rate among the surgical, internal, and basic medical sciences (p = 0.742). In several fields, such as psychiatry, neurology, and medical genetics, a 100% success rate was attained. Forensic medicine, family medicine, medical ethics, pulmonary medicine, and thoracic surgery all had success rates below 80%. A retest conducted two months later showed that ChatGPT's success rate had risen slightly, with response consistency at 91.4%. Conclusions: With its high success rate on medical education exams, ChatGPT shows considerable promise as a support tool for both students and instructors.
However, given its limitations in areas such as clinical reasoning, ethical evaluation, and humanistic medical education, the integration of AI models into educational systems should be strategic and human-centered. Future instructional strategies should be designed to combine artificial intelligence technologies with human skills.


Details

Primary Language

English

Subjects

Medical Education

Section

Research Article

Publication Date

22 December 2025

Submission Date

28 June 2025

Acceptance Date

8 September 2025

Published Issue

Year 2025 Volume: 24 Issue: 74

How to Cite

APA
Kaleci, A. O., Şahinbaş, B., Ağadayı, E., Çelikkaya, S. İ., Altun, A., & Kardan, E. K. (2025). Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. Tıp Eğitimi Dünyası, 24(74), 135-143. https://doi.org/10.25282/ted.1729174
AMA
1. Kaleci AO, Şahinbaş B, Ağadayı E, Çelikkaya Sİ, Altun A, Kardan EK. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. TED. 2025;24(74):135-143. doi:10.25282/ted.1729174
Chicago
Kaleci, Ahmet Ozan, Burcu Şahinbaş, Ezgi Ağadayı, Sümeyye İdil Çelikkaya, Ahmet Altun, and Emre Kemal Kardan. 2025. “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”. Tıp Eğitimi Dünyası 24 (74): 135-43. https://doi.org/10.25282/ted.1729174.
EndNote
Kaleci AO, Şahinbaş B, Ağadayı E, Çelikkaya Sİ, Altun A, Kardan EK (01 December 2025) Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. Tıp Eğitimi Dünyası 24 74 135–143.
IEEE
[1] A. O. Kaleci, B. Şahinbaş, E. Ağadayı, S. İ. Çelikkaya, A. Altun, and E. K. Kardan, “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”, TED, vol. 24, no. 74, pp. 135–143, Dec. 2025, doi: 10.25282/ted.1729174.
ISNAD
Kaleci, Ahmet Ozan - Şahinbaş, Burcu - Ağadayı, Ezgi - Çelikkaya, Sümeyye İdil - Altun, Ahmet - Kardan, Emre Kemal. “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”. Tıp Eğitimi Dünyası 24/74 (01 December 2025): 135-143. https://doi.org/10.25282/ted.1729174.
JAMA
1. Kaleci AO, Şahinbaş B, Ağadayı E, Çelikkaya Sİ, Altun A, Kardan EK. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. TED. 2025;24:135–143.
MLA
Kaleci, Ahmet Ozan, et al. “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”. Tıp Eğitimi Dünyası, vol. 24, no. 74, December 2025, pp. 135-43, doi:10.25282/ted.1729174.
Vancouver
1. Ahmet Ozan Kaleci, Burcu Şahinbaş, Ezgi Ağadayı, Sümeyye İdil Çelikkaya, Ahmet Altun, Emre Kemal Kardan. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. TED. 01 December 2025;24(74):135-43. doi:10.25282/ted.1729174