Research Article

Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education

Volume: 9 Number: 2 March 12, 2026

Abstract

Aims: This study systematically evaluates the performance of two large language model-based generative artificial intelligence (Gen-AI) tools, Gemini and Copilot, in generating and assessing multiple-choice questions (MCQs) for use in medical education.

Methods: A total of 335 MCQs were generated from two virtual patient cases using standardized prompts. The Gen-AI tools then selected the 56 best-quality items according to criteria covering the intended distributions of acceptable level of performance (ALP), Miller's competency pyramid (Miller) levels, and Bloom's revised taxonomy (Bloom) levels, as well as alignment with learning objectives (LOs). Expert medical educators and current Gen-AI versions assessed these items by identifying distractors that would mislead or confuse borderline candidates (minimally competent examinees), used to calculate ALP values, and by identifying the key(s), as well as rating Miller and Bloom levels, LO alignment, stem appropriateness, and technical item flaws. An "AI-extended consensus" served as the intersubjective consensus model (the gold standard). Generation performance was quantified as alignment with this consensus, and assessment performance as the degree to which the Gen-AIs shifted or preserved expert assessments. Analyses included intraclass correlation coefficients (ICC) for reliability; percentage agreement (Po), Cohen's kappa, and Fleiss' kappa for categorical agreement; and inferential tests (exact McNemar and Wilcoxon signed-rank tests) to detect systematic bias and directional shifts.

Results: The Gen-AI tools demonstrated markedly different performance patterns in assigning cognitive levels. For Miller, Gemini-generated MCQs showed superior consistency with the intersubjective consensus (ICC(2,k)=0.82), whereas for Bloom, Copilot-generated MCQs did (ICC(2,k)=0.97). Both tools performed well in LO alignment and key identification, but their approaches to stem structure diverged substantially. Experts perceived the MCQs to be easier than the Gen-AIs claimed, and the current Gen-AI versions found them even easier than both the generating versions and the experts did. In assessment behaviour, the Gen-AIs showed a systematic stringency tendency in Miller classifications, shifting the expert consensus from 'knows' toward 'knows how' (p<0.001). For Bloom classifications, their assessment patterns reflected a central tendency bias, pulling extreme expert ratings toward the middle categories. In the analysis of item-writing flaws, the Gen-AIs were adept at detecting formal flaws, whereas the experts were more attuned to logical flaws.

Conclusion: This study suggests that Gen-AI tools can serve as a control mechanism, playing a corrective and confirmatory role for extreme views within medical education assessment processes. The participation of Gen-AIs in expert consensus affects assessment reliability depending on the model and the metric. The results indicate that Gen-AI tools can increase the efficiency of hybrid, human-supervised assessment systems in medical education and offer promising evidence for their controlled integration.
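The agreement statistics named in the Methods map directly onto routines in common Python libraries (pingouin, scikit-learn, SciPy, statsmodels). The sketch below is a minimal, hypothetical illustration on synthetic data of how ICC(2,k), Po, Cohen's kappa, Fleiss' kappa, the exact McNemar test, and the Wilcoxon signed-rank test could be computed; the column names (item, rater, miller), the three-rater layout, and the binary recoding for McNemar's test are assumptions for demonstration, not the authors' actual pipeline.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical long-format ratings: one row per (item, rater) pair.
# 'miller' is an ordinal code, e.g. 1=knows, 2=knows how, 3=shows how, 4=does.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "item": np.repeat(np.arange(56), 3),
    "rater": np.tile(["expert", "gemini", "copilot"], 56),
    "miller": rng.integers(1, 5, 56 * 3),
})

# ICC(2,k): two-way random effects, absolute agreement, mean of k raters.
icc = pg.intraclass_corr(data=df, targets="item", raters="rater", ratings="miller")
print(icc.loc[icc["Type"] == "ICC2k", ["ICC", "CI95%"]])

# Pairwise expert vs. Gen-AI agreement: Po and Cohen's kappa.
wide = df.pivot(index="item", columns="rater", values="miller")
po = (wide["expert"] == wide["gemini"]).mean()
kappa = cohen_kappa_score(wide["expert"], wide["gemini"])
print(f"Po = {po:.2f}, Cohen's kappa = {kappa:.2f}")

# Fleiss' kappa across all three raters.
counts, _ = aggregate_raters(wide.to_numpy())
print("Fleiss' kappa:", fleiss_kappa(counts, method="fleiss"))

# Exact McNemar test for a systematic shift between paired binary codings,
# e.g. 'knows' (level 1) vs. higher Miller levels, expert vs. Gen-AI.
expert_bin = (wide["expert"] >= 2).astype(int)
genai_bin = (wide["gemini"] >= 2).astype(int)
print(mcnemar(pd.crosstab(expert_bin, genai_bin).to_numpy(), exact=True))

# Wilcoxon signed-rank test for a directional shift in the ordinal ratings.
print(wilcoxon(wide["expert"], wide["gemini"]))
```

Note that ICC treats the ordinal Miller levels as approximately interval-scaled, which is the usual convention when reliability coefficients are reported for Miller/Bloom ratings; kappa-family statistics avoid that assumption but ignore the ordering unless weighted.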

Keywords

Ethical Statement

The study was approved by the Bursa Uludağ University Clinical Research Ethics Committee (Date: 11.01.2023, Decision No: 2023-1/47).

Thanks

The authors wish to thank the participating experts for their contributions and the sworn translator for verifying the English-Turkish translations of the MCQs.


Details

Primary Language

English

Subjects

Medical Education

Journal Section

Research Article

Publication Date

March 12, 2026

Submission Date

December 15, 2025

Acceptance Date

January 10, 2026

Published in Issue

Year 2026 Volume: 9 Number: 2

APA
Özdemir, B., Aydin, M. O., & Akdeniz, E. (2026). Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. Journal of Health Sciences and Medicine, 9(2), 276-286. https://doi.org/10.32322/jhsm.1842373
AMA
1. Özdemir B, Aydin MO, Akdeniz E. Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. J Health Sci Med / JHSM. 2026;9(2):276-286. doi:10.32322/jhsm.1842373
Chicago
Özdemir, Birsen, Mevlüt Okan Aydin, and Esra Akdeniz. 2026. “Pushing the Boundaries of Generative AI: Multiple-Choice Question Generation and Assessment Performance Within Medical Education”. Journal of Health Sciences and Medicine 9 (2): 276-86. https://doi.org/10.32322/jhsm.1842373.
EndNote
Özdemir B, Aydin MO, Akdeniz E (March 1, 2026) Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. Journal of Health Sciences and Medicine 9 2 276–286.
IEEE
[1] B. Özdemir, M. O. Aydin, and E. Akdeniz, “Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education”, J Health Sci Med / JHSM, vol. 9, no. 2, pp. 276–286, Mar. 2026, doi: 10.32322/jhsm.1842373.
ISNAD
Özdemir, Birsen - Aydin, Mevlüt Okan - Akdeniz, Esra. “Pushing the Boundaries of Generative AI: Multiple-Choice Question Generation and Assessment Performance Within Medical Education”. Journal of Health Sciences and Medicine 9/2 (March 1, 2026): 276-286. https://doi.org/10.32322/jhsm.1842373.
JAMA
1. Özdemir B, Aydin MO, Akdeniz E. Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. J Health Sci Med / JHSM. 2026;9(2):276-286.
MLA
Özdemir, Birsen, et al. “Pushing the Boundaries of Generative AI: Multiple-Choice Question Generation and Assessment Performance Within Medical Education”. Journal of Health Sciences and Medicine, vol. 9, no. 2, Mar. 2026, pp. 276-86, doi:10.32322/jhsm.1842373.
Vancouver
1. Birsen Özdemir, Mevlüt Okan Aydin, Esra Akdeniz. Pushing the boundaries of generative AI: multiple-choice question generation and assessment performance within medical education. J Health Sci Med / JHSM. 2026 Mar. 1;9(2):276-86. doi:10.32322/jhsm.1842373

Interuniversity Board (UAK) Equivalency: Article published in a ULAKBİM TR Index journal [10 POINTS]; article published in another internationally indexed journal (1d, excluding 1a, 1b, 1c) [5 POINTS].

The directories (indexes) and platforms in which the journal is included are listed at the bottom of the page.

Note: Our journal is not indexed in Web of Science (WoS) and therefore does not have a Q (quartile) classification.

You can download the Council of Higher Education (CoHE) [Yüksek Öğretim Kurumu (YÖK)] criteria decisions on predatory/questionable journals, the author's clarification text, and the journal charge policy from: https://dergipark.org.tr/tr/journal/2316/file/4905/show

The indexes of the journal are ULAKBİM TR Dizin, Index Copernicus (ICI World of Journals), DOAJ, Directory of Research Journals Indexing (DRJI), General Impact Factor, ASOS Index, WorldCat (OCLC), MIAR, OpenAIRE, EuroPub, Türkiye Citation Index, Türk Medline Index, InfoBase Index, Scilit, etc.


The platforms of the journal are Google Scholar, CrossRef (DOI), ResearchBib, Open Access, COPE, ICMJE, NCBI, ORCID, Creative Commons, etc.





Journal articles are evaluated by double-blind peer review.

Our journal has adopted an open access policy, and articles in JHSM are open access and fully comply with open access requirements. All articles in the system can be accessed and read without a journal user account. https://dergipark.org.tr/tr/pub/jhsm/page/9535

Journal charge policy: https://dergipark.org.tr/tr/pub/jhsm/page/10912

Our journal has been indexed in DOAJ as of May 18, 2020.

Our journal has been indexed in TR-Dizin as of March 12, 2021.



Articles published in the Journal of Health Sciences and Medicine are open access and licensed under the Creative Commons CC BY-NC-ND 4.0 International License.