Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education

Birsen Özdemir; Mevlüt Okan Aydin; Esra Akdeniz

doi:10.32322/jhsm.1842373

Abstract

Aims: This study systematically evaluates the performance of two large language model-based generative artificial intelligence (Gen-AI) tools, Gemini and Copilot, in generating and assessing multiple-choice questions (MCQs) for use in medical education.

Methods: A total of 335 MCQs were generated from two virtual patient cases using standardized prompts. The Gen-AI tools selected the 56 best-quality items according to criteria covering the intended distributions of acceptable level of performance (ALP), Miller's competency pyramid (Miller) and Bloom's revised taxonomy (Bloom) levels, as well as alignment with learning objectives (LOs). Expert medical educators and current Gen-AI tools then assessed these items on the following: identification of misleading or confusing distractor(s) for borderline candidates, i.e. minimally competent examinees (used to calculate ALP values); identification of the key(s); Miller and Bloom levels; LO alignment; stem appropriateness; and technical item flaws. The "AI-extended consensus" served as the intersubjective consensus model (the gold standard). Generation performance was quantified by alignment with this consensus, and assessment performance by the degree to which the Gen-AIs shifted or preserved the Expert assessments. Analyses included the ICC for reliability, Po and Cohen's/Fleiss' Kappa for categorical agreement, and inferential tests (exact McNemar and Wilcoxon signed-rank) for detecting systematic bias and directional shifts.

Results: The Gen-AIs showed markedly different performance patterns in assigning cognitive levels. For Miller, Gemini-generated MCQs were more consistent with the intersubjective consensus (ICC(2,k)=0.82), whereas for Bloom, Copilot-generated MCQs were (ICC(2,k)=0.97). Both tools performed well in LO alignment and key identification, but their approaches to stem structure diverged substantially. Experts perceived the MCQs to be easier than the Gen-AIs claimed, and the current Gen-AI versions found them even easier than both the generating versions and the Experts did. In terms of assessment behaviour, the Gen-AIs showed a systematic stringency tendency in Miller classifications, shifting the Expert consensus from 'knows' to 'knows how' to a statistically significant degree (p<0.001). For Bloom classifications, their assessment patterns reflected a central tendency bias, pulling extreme Expert ratings toward the middle categories. In the analysis of item-writing flaws, the Gen-AIs were adept at detecting formal flaws, whereas the Experts were more attuned to logical flaws.

Conclusion: This study suggests that Gen-AI tools can serve as a 'control mechanism', or play a 'corrective and confirmatory role', for extreme views within assessment processes in medical education. The participation of Gen-AIs in expert consensus affects assessment reliability in ways that depend on the model and the metric. The results indicate that Gen-AI tools can increase efficiency in hybrid, human-supervised assessment systems in medical education, and they offer promising evidence for the controlled integration of such tools.
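The abstract names its agreement statistics (Po, Cohen's Kappa, ICC(2,k), exact McNemar) without showing how they are computed. As an illustration only, the following minimal sketch implements the textbook definitions of these four quantities in plain Python. The function names and the toy ratings are hypothetical and are not the authors' code or data; the study's actual software and settings are not stated here.

```python
from collections import Counter
from math import comb


def observed_agreement_and_kappa(rater_a, rater_b):
    """Return (Po, Cohen's kappa) for two equal-length lists of category labels."""
    n = len(rater_a)
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions, summed over categories.
    pe = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return po, (po - pe) / (1 - pe)


def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability up to the smaller discordant count, doubled.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def icc_2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, mean of k raters.

    `ratings` is a list of rows (one per item), each holding one score per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ms_r = ss_rows / (n - 1)                                      # between-item mean square
    ms_c = ss_cols / (k - 1)                                      # between-rater mean square
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))   # residual mean square
    return (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n)


# Hypothetical toy data: Miller levels assigned to four items by two raters.
expert = ["knows", "knows", "knows how", "knows how"]
gen_ai = ["knows", "knows how", "knows how", "knows how"]
po, kappa = observed_agreement_and_kappa(expert, gen_ai)
print(po, kappa)               # 0.75 0.5
print(exact_mcnemar_p(0, 5))   # 0.0625
```

In this sketch a one-directional pattern of disagreements (for example, every changed item moving from 'knows' to 'knows how') is what drives the exact McNemar p-value down, which is the kind of directional shift the abstract describes.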

Ethical Statement

The study was approved by the Bursa Uludağ University Clinical Research Ethics Committee (Date: 11.01.2023, Decision No: 2023-1/47).

Acknowledgements

The authors wish to thank the participating experts for their contributions and the sworn translator for verifying the English-Turkish translations of the MCQs.

Details

Primary Language

English

Subjects

Medical Education

Journal Section

Research Article

Authors

Birsen Özdemir*
0000-0001-6277-4878
Türkiye

Mevlüt Okan Aydin
0000-0002-8060-8803
Türkiye

Esra Akdeniz
0000-0002-3549-5416
Türkiye

Publication Date

March 12, 2026

Submission Date

December 15, 2025

Acceptance Date

January 10, 2026

Published in Issue

Year 2026 Volume: 9 Number: 2

DOI

https://doi.org/10.32322/jhsm.1842373

IZ

https://izlik.org/JA36YS57WR

Cite

APA

Özdemir, B., Aydin, M. O., & Akdeniz, E. (2026). Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. Journal of Health Sciences and Medicine, 9(2), 276-286. https://doi.org/10.32322/jhsm.1842373

AMA

1. Özdemir B, Aydin MO, Akdeniz E. Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. J Health Sci Med / JHSM. 2026;9(2):276-286. doi:10.32322/jhsm.1842373

Chicago

Özdemir, Birsen, Mevlüt Okan Aydin, and Esra Akdeniz. 2026. “Pushing the Boundaries of Generative AI: Multiple-Choice Question Generation and Assessment Performance Within Medical Education”. Journal of Health Sciences and Medicine 9 (2): 276-86. https://doi.org/10.32322/jhsm.1842373.

EndNote

Özdemir B, Aydin MO, Akdeniz E (March 1, 2026) Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. Journal of Health Sciences and Medicine 9(2), 276–286.

IEEE

[1] B. Özdemir, M. O. Aydin, and E. Akdeniz, “Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education”, J Health Sci Med / JHSM, vol. 9, no. 2, pp. 276–286, Mar. 2026, doi: 10.32322/jhsm.1842373.

ISNAD

Özdemir, Birsen - Aydin, Mevlüt Okan - Akdeniz, Esra. “Pushing the Boundaries of Generative AI: Multiple-Choice Question Generation and Assessment Performance Within Medical Education”. Journal of Health Sciences and Medicine 9/2 (March 1, 2026): 276-286. https://doi.org/10.32322/jhsm.1842373.

JAMA

1. Özdemir B, Aydin MO, Akdeniz E. Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. J Health Sci Med / JHSM. 2026;9(2):276–286.

MLA

Özdemir, Birsen, et al. “Pushing the Boundaries of Generative AI: Multiple-Choice Question Generation and Assessment Performance Within Medical Education”. Journal of Health Sciences and Medicine, vol. 9, no. 2, Mar. 2026, pp. 276-86, doi:10.32322/jhsm.1842373.

Vancouver

1. Birsen Özdemir, Mevlüt Okan Aydin, Esra Akdeniz. Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. J Health Sci Med / JHSM. 2026 Mar. 1;9(2):276-86. doi:10.32322/jhsm.1842373