ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions
Abstract
This study examined the psychometric quality of multiple-choice questions generated by two AI tools, ChatGPT and DeepSeek, in the context of an undergraduate Educational Measurement and Evaluation course. Guided by ten learning outcomes (LOs) aligned with Bloom’s Taxonomy, each tool was prompted to generate one five-option multiple-choice item per LO. Following expert review (Kendall’s W = .58), revisions were made, and the finalized test was administered to 120 students. Item analyses revealed no statistically significant differences between the two AI models in item difficulty, discrimination, variance, or reliability. Three items (two from ChatGPT and one from DeepSeek) had suboptimal discrimination indices. Tetrachoric correlation analyses of the item pairs generated by the two tools for the same LO showed that only one pair had a non-significant association; all other pairs were significantly and, in general, moderately correlated. KR-20 and split-half reliability coefficients reflected acceptable internal consistency for a classroom-based assessment, with the DeepSeek-generated half showing a slightly stronger correlation with total scores. Expert feedback indicated that the AI tools generally produced valid stems and correct answers, and that most revisions targeted distractor quality, highlighting the need for human refinement. Generalizability and Decision studies confirmed consistency in expert ratings and recommended a minimum of seven experts for reliable evaluations. In conclusion, both AI tools were able to generate psychometrically comparable items, which points to their potential to support educators and test developers in test construction. The study closes with practical recommendations for incorporating AI into test development workflows.
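For readers less familiar with the classical item statistics named in the abstract, the following is a minimal sketch of how item difficulty, a discrimination index, and KR-20 can be computed from a 0/1 scored response matrix. It is not the study's own analysis code. The toy `responses` matrix is hypothetical, the function names are illustrative, and the corrected item-total correlation is only one common discrimination index; the study may have used a different one.

```python
from statistics import mean, pvariance

def item_difficulty(responses):
    """Proportion of examinees answering each item correctly (p-value)."""
    n_items = len(responses[0])
    return [mean(row[j] for row in responses) for j in range(n_items)]

def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomously scored items.

    Uses the population variance of total scores so that it is on the
    same footing as the item variances p * (1 - p).
    """
    k = len(responses[0])
    p = item_difficulty(responses)
    totals = [sum(row) for row in responses]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

def discrimination(responses, j):
    """Corrected item-total correlation for item j.

    Correlates the item score with the total score on the remaining
    items. Undefined (division by zero) if the item or the rest score
    has zero variance.
    """
    item = [row[j] for row in responses]
    rest = [sum(row) - row[j] for row in responses]
    m_item, m_rest = mean(item), mean(rest)
    cov = mean((a - m_item) * (b - m_rest) for a, b in zip(item, rest))
    return cov / (pvariance(item) * pvariance(rest)) ** 0.5

# Hypothetical data: 5 examinees (rows) by 4 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(item_difficulty(responses))
print(kr20(responses))
print(discrimination(responses, 0))
```

With real data, each row would be one of the 120 students and each column one of the AI-generated items, so the statistics could be computed separately for the ChatGPT and DeepSeek halves.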
Ethical Statement
The necessary ethical approval for this study was obtained from the institutional ethics committee (Approval Number: 2025/03, Aksaray University Human Research Ethics Committee).
Details
Primary Language
English
Subjects
Measurement Theories and Applications in Education and Psychology, Classroom Measurement Practices
Journal Section
Research Article
Authors
Ceylan Gündeğer Kılcı
Early Pub Date
October 1, 2025
Publication Date
December 5, 2025
Submission Date
April 12, 2025
Acceptance Date
August 3, 2025
Published in Issue
Year 2025 Volume: 12 Number: 4
APA
Gündeğer Kılcı, C. (2025). ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions. International Journal of Assessment Tools in Education, 12(4), 1055-1079. https://doi.org/10.21449/ijate.1674995