Research Article

ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions

Year 2025, Volume: 12 Issue: 4, 1055 - 1079
https://doi.org/10.21449/ijate.1674995

Abstract

This study examined the psychometric quality of multiple-choice questions generated by two AI tools, ChatGPT and DeepSeek, within the context of an undergraduate Educational Measurement and Evaluation course. Guided by ten learning outcomes (LOs) aligned with Bloom’s Taxonomy, each tool was prompted to generate one five-option multiple-choice item per LO. Following expert review (Kendall’s W = .58), revisions were made, and the finalized test was administered to 120 students. Item analyses revealed no statistically significant differences between the two AI models in item difficulty, discrimination, variance, or reliability. A few items (two from ChatGPT and one from DeepSeek) had suboptimal discrimination indices. Tetrachoric correlation analyses of item pairs generated by the two AI tools for the same LO revealed that only one pair showed a non-significant association, whereas all other pairs demonstrated statistically significant and generally moderate correlations. KR-20 and split-half reliability coefficients reflected acceptable internal consistency for a classroom-based assessment, with the DeepSeek-generated half showing a slightly stronger correlation with total scores. Expert feedback indicated that while both AI tools generally produced valid stems and correct answers, most revisions focused on improving distractor quality, highlighting the need for human refinement. Generalizability and Decision studies confirmed consistency in expert ratings and recommended a minimum of seven experts for reliable evaluations. In conclusion, both AI tools demonstrated the capacity to generate psychometrically comparable items, highlighting their potential to support educators and test developers in test construction. The study closes with practical recommendations for effectively incorporating AI into test development workflows.
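
The statistics reported in the abstract are all standard classical test theory and generalizability theory quantities. As a reading aid, the sketch below shows in R (the language of the analysis packages listed in the references) how each reported index can be computed. Everything in it is illustrative: the response matrix, the expert rankings, and the variance components are simulated, not the study's data, and this is not the authors' script.

    ## Illustrative item analysis in R; `resp`, the expert rankings, and the
    ## variance components below are simulated, not study data.
    set.seed(1)
    resp <- matrix(rbinom(120 * 20, 1, 0.6), nrow = 120,
                   dimnames = list(NULL, paste0("item", 1:20)))  # 120 students x 20 items
    total <- rowSums(resp)

    ## Item difficulty: proportion answering each item correctly.
    p <- colMeans(resp)

    ## Item discrimination: corrected point-biserial (item vs. rest score).
    disc <- sapply(seq_len(ncol(resp)),
                   function(j) cor(resp[, j], total - resp[, j]))

    ## KR-20, the dichotomous-item special case of Cronbach's alpha:
    ## KR20 = k/(k-1) * (1 - sum(p*q) / var(total)).
    k <- ncol(resp)
    kr20 <- (k / (k - 1)) * (1 - sum(p * (1 - p)) / var(total))

    ## Split-half reliability: correlate the two AI-generated halves
    ## (items 1-10 vs. 11-20 here), then step up with Spearman-Brown: 2r/(1+r).
    r_half <- cor(rowSums(resp[, 1:10]), rowSums(resp[, 11:20]))
    r_sb <- 2 * r_half / (1 + r_half)

    ## Tetrachoric correlation for one LO's item pair (psych package).
    library(psych)
    rho_pair <- tetrachoric(resp[, c("item1", "item11")])$rho[1, 2]

    ## Kendall's W for m experts ranking n items:
    ## W = 12*S / (m^2 * (n^3 - n)), with S the sum of squared deviations of
    ## the rank sums (a tie-corrected form is needed for tied ratings).
    ranks <- replicate(7, sample(20))          # 7 hypothetical experts
    m <- ncol(ranks); n <- nrow(ranks)
    S <- sum((rowSums(ranks) - mean(rowSums(ranks)))^2)
    W <- 12 * S / (m^2 * (n^3 - n))

    ## Decision-study projection (items x raters design): relative G
    ## coefficient as a function of the number of raters n_r, given variance
    ## components for items (sigma2_p) and interaction/error (sigma2_pr).
    g_coef <- function(sigma2_p, sigma2_pr, n_r) sigma2_p / (sigma2_p + sigma2_pr / n_r)
    round(sapply(1:10, g_coef, sigma2_p = 0.40, sigma2_pr = 0.35), 2)

With hypothetical variance components of this size, the projected G coefficient rises from about .53 with one rater to about .89 with seven; a Decision-study curve of this kind is what underlies a "minimum of seven experts" recommendation.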

Ethical Statement

The necessary ethical approval for this study was obtained from the institutional ethics committee (Aksaray University Human Research Ethics Committee, Approval Number: 2025/03, dated 21/01/2025).

Supporting Institution

-

Project Number

-

Thanks

-

References

  • Alafnan, M.A. (2025). DeepSeek vs. ChatGPT: A comparative evaluation of AI tools in composition, business writing, and communication tasks. Journal of Artificial Intelligence and Technology, 5, 202-210. https://doi.org/10.37965/jait.2025.0740
  • Anderson, L.W., & Krathwohl, D.R. (2001). A taxonomy for learning, teaching and assessing: A revision of Bloom’s taxonomy of educational objectives: Complete edition. Longman.
  • Atılgan, H. (2004). Genellenebilirlik kuramı ve çok değişkenlik kaynaklı Rasch modelinin karşılaştırılmasına ilişkin bir araştırma [A research on comparisons of generalizability theory and many-facet Rasch measurement] [Unpublished doctoral dissertation]. Hacettepe University.
  • Atılgan, H. (2009). Madde ve test istatistikleri [Item and test statistics]. In H. Atılgan, A. Kan, & N. Doğan (Eds.), Eğitimde ölçme ve değerlendirme [Measurement and evaluation in education] (3rd ed., pp. 293-314). Anı Publishing.
  • Baykul, Y. (2000). Eğitimde ve psikolojide ölçme: Klasik test teorisi ve uygulaması [Measurement in education and psychology: Classical test theory and its applications]. ÖSYM Publications.
  • Bloom, B., Engelhart, M., Furst, E., Hill, W., & Krathwohl, D. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. Longmans, Green.
  • Brennan, R.L. (2001). Generalizability theory. Springer-Verlag.
  • Clauser, B. (2008). A review of the EDUG software for generalizability analysis. International Journal of Testing, 8(3), 296-301. https://doi.org/10.1080/15305050802262357
  • Clauser, B.E., Margolis, M.J., & Case, S.M. (2006). Testing for licensure and certification in the professions. In R.L. Brennan (Ed.), Educational measurement (4th ed., pp. 701-730). Praeger Publications.
  • Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Holt, Rinehart, and Winston Inc.
  • Daily Sabah. (2023, October 6). Türkiye's student assessment center to use AI to set exam question. https://www.dailysabah.com/turkiye/education/turkiyes-student-assessment-center-to-use-ai-to-set-exam-question
  • Dancey, C.P., & Reidy, J. (2017). Statistics without maths for psychology. Pearson.
  • DeepSeek AI. (2024). DeepSeek v2: Advancing open source large language models. https://www.deepseek.com
  • Diedenhofen, B., & Musch, J. (2015). cocor: A comprehensive solution for the statistical comparison of correlations. PLOS ONE, 10(4), Article e0121945. https://doi.org/10.1371/journal.pone.0121945
  • Downing, S.M., & Haladyna, T.M. (2011). Handbook of test development. Lawrence Erlbaum Associates Publishers.
  • Educational Testing Service. (2025a). e-rater® scoring engine. ETS. https://www.ets.org/erater.html
  • Educational Testing Service. (2025b). SpeechRater® scoring engine. ETS. https://www.ets.org/speechrater.html
  • Fraenkel, J.R., Wallen, N.E., & Hyun, H.H. (2012). How to design and evaluate research in education. McGraw-Hill.
  • Gao, H., Hashim, H., & Md Yunus, M. (2025). Assessing the reliability and relevance of DeepSeek in EFL writing evaluation: A generalizability theory approach. Language Testing in Asia, 15, Article 33. https://doi.org/10.1186/s40468-025-00369-6
  • Gierl, M.J., & Haladyna, T.M. (2013). Automatic item generation: Theory & practice. Routledge.
  • Gierl, M.J., Lai, H., & Tanygin, V. (2021). Advanced methods in automatic item generation. Routledge.
  • Gierl, M.J., Lai, H., & Turner, S. (2012). Using automatic item generation to create multiple-choice items for assessments in medical education. Medical Education, 46(8), 757-765. https://doi.org/10.1111/j.1365-2923.2012.04289.x
  • Graduate Management Admission Council. (2009). Fairness of automated essay scoring of GMAT AWA. https://www.gmac.com/market-intelligence-and-research/research-library/gmat-test-taker-data/research-reports-gmat-related/fairness-of-automated-essay-scoring-of-gmat-awa
  • Graesser, A.C., Conley, M.W., & Olney, A. (2012). Intelligent tutoring systems. In K. R. Harris, S. Graham, T. Urdan, A. G. Bus, S. Major, & H. L. Swanson (Eds.), APA educational psychology handbook, vol. 3. application to learning and teaching (pp. 451-473). American Psychological Association. https://doi.org/10.1037/13275-018
  • Haladyna, T.M., & Rodriguez, M.C. (2013). Developing and validating test items. Routledge.
  • Irwing, P., & Hughes, D.J. (2018). Test development. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale and test development (First ed., pp. 3-48), John Wiley & Sons Ltd. https://doi.org/10.1002/9781118489772.ch1
  • Kanık, M. (2024). The use of ChatGPT in assessment. International Journal of Assessment Tools in Education, 11(3), 608-621. https://doi.org/10.21449/ijate.1379647
  • Kendall, M.G., & Smith, B.B. (1939). The problem of m rankings. The Annals of Mathematical Statistics, 10(3), 275-287. http://www.jstor.org/stable/2235668
  • Kıyak, Y.S., & Emekli, E. (2024). ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: A literature review. Postgraduate Medical Journal, 100(1189), 858-865. https://doi.org/10.1093/postmj/qgae065
  • Landis, J.R., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. https://doi.org/10.2307/2529310
  • Leslie, T., & Gierl, M.J. (2023). Using automatic item generation to create multiple-choice questions for pharmacy assessment. American Journal of Pharmaceutical Education, 87(10), 1-7. https://doi.org/10.1016/j.ajpe.2023.100081
  • Liu, Z., He, X., Liu, L., Liu, T., & Zhai, X. (2023). Context matters: A strategy to pre-train language model for science education. In N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, & O.C. Santos (Eds.), Artificial intelligence in education. Springer. https://doi.org/10.1007/978-3-031-36336-8_103
  • Lo, C.K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13(4), Article 410. https://doi.org/10.3390/educsci13040410
  • Makowski, D., Wiernik, B., Patil, I., Lüdecke, D., & Ben-Shachar, M. (2022). correlation: Methods for correlation analysis (Version 0.8.3) [R package]. https://CRAN.R-project.org/package=correlation
  • Malik, M., Rehan, S., Zimbittas, G., & Manna, S. (2024). Multiple-choice questions reimagined: Exploring the ethical and pedagogical implications of GenAI for higher education. In T. Fujita (Ed.), Proceedings of the British Society for Research into Learning Mathematics (BSRLM), 44(1). https://bsrlm.org.uk/wp-content/uploads/2024/05/BSRLM-CP-44-1-05.pdf
  • Meng, X.-L., Rosenthal, R., & Rubin, D.B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111(1), 172-175. https://doi.org/10.1037/0033-2909.111.1.172
  • Murphy, K.R., & Davidshofer, C.O. (1991). Psychological testing: Principles and applications. Prentice Hall.
  • Ngo, A., Gupta, S., Perrine, O., Reddy, R., Ershadi, S., & Remick, D. (2024). ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Academic Pathology, 11(1), 1-5. https://doi.org/10.1016/j.acpath.2023.100099
  • OECD. (2023). PISA 2022 results (Volume I): The state of learning and equity in education. OECD Publishing. https://www.oecd.org/en/publications/pisa-2022-results-volume-i_53f23881-en/full-report/adaptive-testing-in-pisa-2022_21364c8d.html
  • OpenAI. (2023). GPT-4 technical report. https://openai.com/research/gpt-4
  • ÖSYM. (2024, February 13). ÖSYM Başkanı Ersoy: Yapay zekâ ile soru üreteceğiz [ÖSYM President Ersoy: We will generate questions with artificial intelligence]. https://www.osym.gov.tr/TR,29174/osym-baskani-ersoy-yapay-zeka-ile-soru-uretecegiz-13022024.html
  • Özçelik, D.A. (1992). Ölçme ve değerlendirme [Measurement and assessment]. ÖSYM Publication.
  • Özçelik, D.A. (2013). Test hazırlama kılavuzu [Test preparation manual]. Pegem.
  • R Core Team. (2023). R: A language and environment for statistical computing (Version 4.3.2). R Foundation for Statistical Computing. https://www.R-project.org/
  • Revelle, W. (2023). psych: Procedures for psychological, psychometric, and personality research (Version 2.3.9) [R package]. Northwestern University. https://CRAN.R-project.org/package=psych
  • Robitzsch, A. (2024). sirt: Supplementary item response theory models (Version 4.1-15) [R package]. https://CRAN.R-project.org/package=sirt
  • Rudner, L., & Schafer, W. (2002). What teachers need to know about assessment. National Education Association.
  • Rycroft-Smith, L., & Macey, D. (2024). Using AI for question generation in mathematics education: What are the advantages and disadvantages? In T. Fujita (Ed.), Proceedings of the British Society for Research into Learning Mathematics (BSRLM), 44(1). https://bsrlm.org.uk/wp-content/uploads/2024/05/BSRLM-CP-44-1-07.pdf
  • The Princeton Review. (2024). Digital SAT security and fairness. https://www.princetonreview.com/college-advice/digital-security-and-fairness
  • Ratner, B. (2009). The correlation coefficient: Its values range between +1/−1, or do they? Journal of Targeting, Measurement and Analysis for Marketing, 17, 139-142. https://doi.org/10.1057/jt.2009.5
  • Schober, P., Boer, C., & Schwarte, L.A. (2018). Correlation coefficients: Appropriate use and interpretation. Anesthesia & Analgesia, 126(5), 1763-1768.
  • Seldon, A., & Abidoye, O. (2018). The fourth education revolution. University of Buckingham Press.
  • Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. SAGE Publications.
  • Shin, D. (2023). A case study on English test item development training for secondary school teachers using AI tools: Focusing on ChatGPT. Language Research, 59(1), 21-42. https://doi.org/10.30961/lr.2023.59.1.21
  • Swiss Society for Research in Education Working Group. (2010). EduG user guide. Edumetrics.
  • Tan, Ş. (2014). Öğretimde ölçme ve değerlendirme: KPSS el kitabı [Assessment and evaluation in instruction: KPSS handbook]. Pegem.
  • Thorndike, R.M., & Thorndike-Christ, T. (2014). Measurement and evaluation in psychology and education. Pearson.
  • Turgut, M.F., & Baykul, Y. (2012). Eğitimde ölçme ve değerlendirme [Measurement and evaluation in education]. Pegem.
  • Urhan, S., Gençaslan, O., & Dost, Ş. (2024). An argumentation experience regarding concepts of calculus with ChatGPT. Interactive Learning Environments, 32(10), 7186-7211. https://doi.org/10.1080/10494820.2024.2308093
  • Ünal, D., Erdem, Z.Ç., & Şahin, Z.G. (2025). Will artificial intelligence succeed in passing this test? Creating an achievement test utilizing ChatGPT. Education & Information Technologies, 30, 17263-17287. https://doi.org/10.1007/s10639-025-13461-4
  • Wang, X. (2024, June 21). AI scores high in gaokao language tests, low in math. China Daily. https://www.chinadaily.com.cn/a/202406/21/WS6674bb00a31095c51c50a0a9.html
  • Wang, J., & Heung, K. (2025). Educational innovation driven by artificial intelligence: The impact of DeepSeek on teachers’ teaching models. Learning & Education, 14(1), 38-42. https://ojs.piscomed.com/index.php/L-E/article/view/4291
  • Wardat, Y., Tashtoush, M.A., Alali, R., & Jarrah, A.M. (2023). ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia Journal of Mathematics, Science and Technology Education, 19(7), Article em2286. https://doi.org/10.29333/ejmste/13272
  • Wickham, H., & Bryan, J. (2023). readxl: Read Excel files (Version 1.4.3) [R package]. https://CRAN.R-project.org/package=readxl
  • Wickham, H., François, R., Henry, L., & Müller, K. (2023). dplyr: A grammar of data manipulation (Version 1.1.4) [R package]. https://CRAN.R-project.org/package=dplyr
  • Willse, J.T. (2014). CTT: Classical Test Theory functions (Version 2.3.3) [R package]. https://CRAN.R-project.org/package=CTT
  • Yogesh, A., Telon, G., Lovely, T.F., & Braiton, M. (2025). A comparative study: Evaluating ChatGPT and DeepSeek AI tools in practice. International Journal of Open Information Technologies, 13(5), 67-70. https://cyberleninka.ru/article/n/a-comparative-study-evaluating-chatgpt-and-deepseek-ai-tools-in-practice
  • Zhai, X. (2025). DeepSeek: Transforming the foundations of education. Preprints. https://doi.org/10.20944/preprints202503.1776.v1

There are 68 citations in total.

Details

Primary Language English
Subjects Measurement Theories and Applications in Education and Psychology, Classroom Measurement Practices
Journal Section Articles
Authors

Ceylan Gündeğer Kılcı (ORCID: 0000-0003-3572-1708)

Project Number -
Early Pub Date October 1, 2025
Publication Date October 12, 2025
Submission Date April 12, 2025
Acceptance Date August 3, 2025
Published in Issue Year 2025 Volume: 12 Issue: 4

Cite

APA Gündeğer Kılcı, C. (2025). ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions. International Journal of Assessment Tools in Education, 12(4), 1055-1079. https://doi.org/10.21449/ijate.1674995
