Research Article

Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students

Volume: 24 Issue: 74, 22 December 2025

Abstract

Background: Medical education in Türkiye is delivered through a six-year, discipline-based curriculum aligned with global trends. The assessment process relies largely on multiple-choice questions (MCQs), placing a significant preparation burden on faculty members. AI-powered large language models such as ChatGPT have the potential to ease exam preparation, enhance feedback quality, and support personalized learning. The aim of this study is to evaluate how successfully the ChatGPT-4o model answers MCQs on medical education exams. Additionally, by comparing its exam performance and consistency with student achievement, we explore the potential benefits of AI-supported models for medical education. Methods: This cross-sectional, analytical study was carried out at the [XX] University Faculty of Medicine in Türkiye. During the 2023–2024 academic year, ChatGPT solved multiple-choice questions from seven board exams and one final exam for third-year students, and its results were compared with the students' scores. Statistical analysis included descriptive statistics, correlation analyses, chi-square tests, McNemar tests, and independent-samples t-tests. Results: With a correct-response rate of 90.2%, ChatGPT outperformed all 293 students in the class. There was no significant difference in the correct-response rate among the surgical, internal, and basic medical sciences (p = 0.742). In several fields, such as psychiatry, neurology, and medical genetics, a 100% success rate was attained. Forensic medicine, family medicine, medical ethics, pulmonary medicine, and thoracic surgery all had success rates below 80%. A retest conducted two months later showed that ChatGPT's success rate had risen slightly, with response consistency at 91.4%. Conclusions: With its high success rate on medical education exams, ChatGPT shows considerable promise as a support tool for both students and instructors.
However, given its limitations in areas such as clinical reasoning, ethical evaluation, and humanistic medical education, the integration of AI models into educational systems should be strategic and human-centered. Future instructional strategies should be designed to combine artificial intelligence technologies with human skills.


Details

Primary Language

English

Subjects

Medical Education

Section

Research Article

Publication Date

22 December 2025

Submission Date

28 June 2025

Acceptance Date

8 September 2025

Published Issue

Year 2025 Volume: 24 Issue: 74

How to Cite

APA
Kaleci, A. O., Şahinbaş, B., Ağadayı, E., Çelikkaya, S. İ., Altun, A., & Kardan, E. K. (2025). Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. Tıp Eğitimi Dünyası, 24(74), 135-143. https://doi.org/10.25282/ted.1729174
AMA
1. Kaleci AO, Şahinbaş B, Ağadayı E, Çelikkaya Sİ, Altun A, Kardan EK. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. TED. 2025;24(74):135-143. doi:10.25282/ted.1729174
Chicago
Kaleci, Ahmet Ozan, Burcu Şahinbaş, Ezgi Ağadayı, Sümeyye İdil Çelikkaya, Ahmet Altun, and Emre Kemal Kardan. 2025. “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”. Tıp Eğitimi Dünyası 24 (74): 135-43. https://doi.org/10.25282/ted.1729174.
EndNote
Kaleci AO, Şahinbaş B, Ağadayı E, Çelikkaya Sİ, Altun A, Kardan EK (01 December 2025) Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. Tıp Eğitimi Dünyası 24 74 135–143.
IEEE
[1] A. O. Kaleci, B. Şahinbaş, E. Ağadayı, S. İ. Çelikkaya, A. Altun, and E. K. Kardan, “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”, TED, vol. 24, no. 74, pp. 135–143, Dec. 2025, doi: 10.25282/ted.1729174.
ISNAD
Kaleci, Ahmet Ozan - Şahinbaş, Burcu - Ağadayı, Ezgi - Çelikkaya, Sümeyye İdil - Altun, Ahmet - Kardan, Emre Kemal. “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”. Tıp Eğitimi Dünyası 24/74 (01 December 2025): 135-143. https://doi.org/10.25282/ted.1729174.
JAMA
1. Kaleci AO, Şahinbaş B, Ağadayı E, Çelikkaya Sİ, Altun A, Kardan EK. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. TED. 2025;24:135–143.
MLA
Kaleci, Ahmet Ozan, et al. “Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students”. Tıp Eğitimi Dünyası, vol. 24, no. 74, December 2025, pp. 135-43, doi:10.25282/ted.1729174.
Vancouver
1. Ahmet Ozan Kaleci, Burcu Şahinbaş, Ezgi Ağadayı, Sümeyye İdil Çelikkaya, Ahmet Altun, Emre Kemal Kardan. Performance of Large Language Models in Medical Exams: A Comparison Between ChatGPT and Medical Students. TED. 01 December 2025;24(74):135-43. doi:10.25282/ted.1729174