Research Article

ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions

Volume: 12, Number: 4, December 5, 2025

Abstract

This study examined the psychometric quality of multiple-choice questions generated by two AI tools, ChatGPT and DeepSeek, within the context of an undergraduate Educational Measurement and Evaluation course. Guided by ten learning outcomes (LOs) aligned with Bloom’s Taxonomy, each tool was prompted to generate one five-option multiple-choice item per LO. Following expert review (Kendall’s W = .58), revisions were made, and the finalized test was administered to 120 students. Item analyses revealed no statistically significant differences between the two AI models in item difficulty, discrimination, variance, or reliability. A few items (two from ChatGPT and one from DeepSeek) had suboptimal discrimination indices. Tetrachoric correlation analyses of item pairs generated by the two tools for the same LO revealed that only one pair showed a non-significant association, whereas all other pairs demonstrated statistically significant and generally moderate correlations. KR-20 and split-half reliability coefficients reflected acceptable internal consistency for a classroom-based assessment, with the DeepSeek-generated half showing a slightly stronger correlation with total scores. Expert feedback indicated that while the AI tools generally produced valid stems and correct answers, most revisions targeted distractor quality, highlighting the need for human refinement. Generalizability and Decision studies confirmed the consistency of expert ratings and indicated that a minimum of seven experts is needed for reliable evaluations. In conclusion, both AI tools demonstrated the capacity to generate psychometrically comparable items, underscoring their potential to support educators and test developers in test construction. The study closes with practical recommendations for effectively incorporating AI into test-development workflows.
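To make the reported statistics concrete, the sketch below shows how the classical item indices named in the abstract (item difficulty, corrected item-total discrimination, KR-20, and Spearman-Brown split-half reliability) are conventionally computed from a dichotomous (0/1) response matrix. This is a minimal illustration of standard classical test theory formulas, not the authors' analysis code; the function name, the 20-item test length, and the simulated response data are assumptions for demonstration only (the 120-examinee sample size mirrors the study).

```python
import numpy as np

def item_statistics(responses):
    """Classical test theory statistics for a dichotomous (0/1) response matrix.

    responses: array of shape (n_students, n_items).
    Returns item difficulties, corrected item-total discriminations,
    KR-20 reliability, and Spearman-Brown split-half reliability.
    """
    responses = np.asarray(responses, dtype=float)
    n_students, n_items = responses.shape
    total = responses.sum(axis=1)

    # Item difficulty: proportion of examinees answering the item correctly.
    p = responses.mean(axis=0)

    # Item discrimination: correlation between each item and the total
    # score with that item removed (corrected item-total correlation).
    disc = np.array([
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
        for j in range(n_items)
    ])

    # KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores).
    q = 1.0 - p
    kr20 = (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / total.var(ddof=1))

    # Split-half reliability: correlate odd- and even-item half scores,
    # then step up to full test length with the Spearman-Brown correction.
    odd = responses[:, 0::2].sum(axis=1)
    even = responses[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    split_half = 2 * r_half / (1 + r_half)

    return p, disc, kr20, split_half

# Demonstration on simulated data: 120 examinees (as in the study) and
# 20 hypothetical items generated from a simple one-factor model.
rng = np.random.default_rng(42)
ability = rng.normal(size=(120, 1))
sim = (ability + rng.normal(size=(120, 20)) > 0).astype(int)
p, disc, kr20, split_half = item_statistics(sim)
print(f"mean difficulty: {p.mean():.2f}")
print(f"mean discrimination: {disc.mean():.2f}")
print(f"KR-20: {kr20:.2f}  split-half: {split_half:.2f}")
```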

Keywords

Ethical Statement

The necessary ethical approval for this study was obtained from the institutional ethics committee (Approval Number: 2025/03, Aksaray University Human Research Ethics Committee).

References

  1. Alafnan, M.A. (2025). DeepSeek vs. ChatGPT: A comparative evaluation of AI tools in composition, business writing, and communication tasks. Journal of Artificial Intelligence and Technology, 5, 202-210. https://doi.org/10.37965/jait.2025.0740
  2. Anderson, L.W., & Krathwohl, D.R. (2001). A taxonomy for learning, teaching and assessing: A revision of Bloom’s taxonomy of educational objectives: Complete edition. Longman.
  3. Atılgan, H. (2004). Genellenebilirlik kuramı ve çok değişkenlik kaynaklı Rasch modelinin karşılaştırılmasına ilişkin bir araştırma [A research on the comparison of generalizability theory and the many-facet Rasch model] [Unpublished doctoral dissertation]. Hacettepe University.
  4. Atılgan, H. (2009). Madde ve test istatistikleri [Item and test statistics]. In H. Atılgan, A. Kan, & N. Doğan (Eds.), Eğitimde ölçme ve değerlendirme [Measurement and evaluation in education] (3rd ed., pp. 293-314). Anı Publishing.
  5. Baykul, Y. (2000). Eğitimde ve psikolojide ölçme: Klasik test teorisi ve uygulaması [Measurement in education and psychology: Classical test theory and its applications]. ÖSYM Publications.
  6. Bloom, B., Engelhart, M., Furst, E., Hill, W., & Krathwohl, D. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. Longmans, Green.
  7. Brennan, R.L. (2001). Generalizability theory. Springer-Verlag.
  8. Clauser, B. (2008). A review of the EDUG software for generalizability analysis. International Journal of Testing, 8(3), 296-301. https://doi.org/10.1080/15305050802262357

Details

Primary Language

English

Subjects

Measurement Theories and Applications in Education and Psychology, Classroom Measurement Practices

Journal Section

Research Article

Early Pub Date

October 1, 2025

Publication Date

December 5, 2025

Submission Date

April 12, 2025

Acceptance Date

August 3, 2025

Published in Issue

Year 2025 Volume: 12 Number: 4

APA
Gündeğer Kılcı, C. (2025). ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions. International Journal of Assessment Tools in Education, 12(4), 1055-1079. https://doi.org/10.21449/ijate.1674995
