Research Article

ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions

Volume: 12, Number: 4, December 5, 2025

Abstract

This study examined the psychometric quality of multiple-choice questions generated by two AI tools, ChatGPT and DeepSeek, within the context of an undergraduate Educational Measurement and Evaluation course. Guided by ten learning outcomes (LOs) aligned with Bloom’s Taxonomy, each tool was prompted to generate one five-option multiple-choice item per LO. Following expert review (Kendall’s W = .58), revisions were made, and the finalized test was administered to 120 students. Item analyses revealed no statistically significant differences between the two AI models in item difficulty, discrimination, variance, or reliability. A few items (two from ChatGPT and one from DeepSeek) had suboptimal discrimination indices. Tetrachoric correlation analyses of item pairs generated by the two tools for the same LO revealed that only one pair showed a non-significant association, whereas all other pairs demonstrated statistically significant and generally moderate correlations. KR-20 and split-half reliability coefficients reflected acceptable internal consistency for a classroom-based assessment, with the DeepSeek-generated half showing a slightly stronger correlation with total scores. Expert feedback indicated that while the AI tools generally produced valid stems and correct answers, most revisions targeted distractor quality, highlighting the need for human refinement. Generalizability and Decision studies confirmed the consistency of expert ratings and indicated that a minimum of seven experts is needed for reliable evaluations. In conclusion, both AI tools demonstrated the capacity to generate psychometrically comparable items, underscoring their potential to support educators and test developers in test construction. The study closes with practical recommendations for effectively incorporating AI into test-development workflows.
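To make the reported statistics concrete, the sketch below shows how the classical item indices named in the abstract (item difficulty, corrected item-total discrimination, KR-20, and Spearman-Brown split-half reliability) are conventionally computed from a dichotomous (0/1) response matrix. This is a minimal illustration of standard classical test theory formulas, not the authors' analysis code; the function name, the 20-item test length, and the simulated response data are assumptions for demonstration only (the 120-examinee sample size mirrors the study).

```python
import numpy as np

def item_statistics(responses):
    """Classical test theory statistics for a dichotomous (0/1) response matrix.

    responses: array of shape (n_students, n_items).
    Returns item difficulties, corrected item-total discriminations,
    KR-20 reliability, and Spearman-Brown split-half reliability.
    """
    responses = np.asarray(responses, dtype=float)
    n_students, n_items = responses.shape
    total = responses.sum(axis=1)

    # Item difficulty: proportion of examinees answering the item correctly.
    p = responses.mean(axis=0)

    # Item discrimination: correlation between each item and the total
    # score with that item removed (corrected item-total correlation).
    disc = np.array([
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
        for j in range(n_items)
    ])

    # KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores).
    q = 1.0 - p
    kr20 = (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / total.var(ddof=1))

    # Split-half reliability: correlate odd- and even-item half scores,
    # then step up to full test length with the Spearman-Brown correction.
    odd = responses[:, 0::2].sum(axis=1)
    even = responses[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    split_half = 2 * r_half / (1 + r_half)

    return p, disc, kr20, split_half

# Demonstration on simulated data: 120 examinees (as in the study) and
# 20 hypothetical items generated from a simple one-factor model.
rng = np.random.default_rng(42)
ability = rng.normal(size=(120, 1))
sim = (ability + rng.normal(size=(120, 20)) > 0).astype(int)
p, disc, kr20, split_half = item_statistics(sim)
print(f"mean difficulty: {p.mean():.2f}")
print(f"mean discrimination: {disc.mean():.2f}")
print(f"KR-20: {kr20:.2f}  split-half: {split_half:.2f}")
```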

Keywords

Ethical Statement

The necessary ethical approval for this study was obtained from the institutional ethics committee (Approval Number: 2025/03, Aksaray University Human Research Ethics Committee).

References

  1. Alafnan, M.A. (2025). DeepSeek vs. ChatGPT: A comparative evaluation of AI tools in composition, business writing, and communication tasks. Journal of Artificial Intelligence and Technology, 5, 202-210. https://doi.org/10.37965/jait.2025.0740
  2. Anderson, L.W., & Krathwohl, D.R. (2001). A taxonomy for learning, teaching and assessing: A revision of Bloom’s taxonomy of educational objectives: Complete edition. Longman.
  3. Atılgan, H. (2004). Genellenebilirlik kuramı ve çok değişkenlik kaynaklı Rasch modelinin karşılaştırılmasına ilişkin bir araştırma [A research on the comparison of generalizability theory and the many-facet Rasch model] [Unpublished doctoral dissertation]. Hacettepe University.
  4. Atılgan, H. (2009). Madde ve test istatistikleri [Item and test statistics]. In H. Atılgan, A. Kan, & N. Doğan (Eds.), Eğitimde ölçme ve değerlendirme [Measurement and evaluation in education] (3rd ed., pp. 293-314). Anı Publishing.
  5. Baykul, Y. (2000). Eğitimde ve psikolojide ölçme: Klasik test teorisi ve uygulaması [Measurement in education and psychology: Classical test theory and its applications]. ÖSYM Publications.
  6. Bloom, B., Engelhart, M., Furst, E., Hill, W., & Krathwohl, D. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. Longmans, Green.
  7. Brennan, R.L. (2001). Generalizability theory. Springer-Verlag.
  8. Clauser, B. (2008). A review of the EDUG software for generalizability analysis. International Journal of Testing, 8(3), 296-301. https://doi.org/10.1080/15305050802262357

Details

Primary Language

English

Subjects

Measurement Theories and Applications in Education and Psychology, Classroom Measurement Practices

Journal Section

Research Article

Early Pub Date

October 1, 2025

Publication Date

December 5, 2025

Submission Date

April 12, 2025

Acceptance Date

August 3, 2025

Published in Issue

Year 2025 Volume: 12 Number: 4

APA
Gündeğer Kılcı, C. (2025). ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions. International Journal of Assessment Tools in Education, 12(4), 1055-1079. https://doi.org/10.21449/ijate.1674995
