Comparison of the Psychometric Properties of 8th Grade Inequality Math Questions Prepared with Artificial Intelligence Support and Past LGS Questions
Year: 2025, Volume: 7, Issue: Special Issue (Özel Sayı), pp. 185–210, 29.11.2025
Hatice Dökmeci
Şeyma Uyar
Abstract
The study examined the functionality of AI-supported automatic item generation in educational measurement and evaluation. Questions on the topic of “inequalities” for 8th-grade mathematics, generated with the Claude AI model via the Python programming language, were psychometrically compared with questions previously administered by the Ministry of National Education as part of the High School Transition System (LGS). The study group consisted of 500 eighth-grade students attending four public middle schools in Uşak province. The AI-generated questions, whose content validity was established through expert review, and the past LGS questions were administered to the students as a single 18-item test form with a 40-minute response time. The data were analyzed using TAP and JAMOVI. The findings revealed that the AI-generated questions were similar to the LGS questions in terms of difficulty, discrimination, and reliability. In conclusion, AI-supported automatic item generation offers time and cost advantages in measurement and evaluation processes, but it must be supported by expert review to ensure validity and pedagogical appropriateness.
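The abstract does not reproduce the authors' Claude-based generation pipeline. As a purely illustrative sketch of template-based automatic item generation for the same topic (first-degree inequalities), one can vary the parameters of a fixed item model and derive distractors from common student errors; the function name, item model, and distractor rules below are hypothetical, not the study's actual generator:

```python
import random

def generate_inequality_item(seed=None):
    """Generate one multiple-choice item from a simple item model:
    solve a*x + b < c for x, with distractors built from common errors."""
    rng = random.Random(seed)
    a = rng.choice([2, 3, 4, 5])          # coefficient of x
    b = rng.randint(1, 9)                 # constant term
    diff = rng.choice([d for d in range(-9, 10) if d != 0])
    c = b + diff                          # guarantees c != b
    boundary = diff / a                   # a*x + b < c  ->  x < (c - b)/a
    stem = f"Which values of x satisfy the inequality {a}x + {b} < {c}?"
    key = f"x < {boundary}"
    distractors = [
        f"x > {boundary}",   # reversed inequality direction
        f"x < {diff}",       # forgot to divide by a
        f"x > {c + b}",      # added instead of subtracting, reversed direction
    ]
    options = [key] + distractors
    rng.shuffle(options)
    return {"stem": stem, "options": options, "key": key}
```

In a template-based design like this, every generated item is correct by construction; an LLM-based pipeline, by contrast, requires the kind of expert content review the study describes.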
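The reported comparison rests on classical test theory statistics (item difficulty, discrimination, reliability) computed with TAP and JAMOVI. As a minimal sketch of what those indices are — not the authors' analysis scripts — they can be computed from a 0/1-scored response matrix as follows, using the corrected (rest-score) point-biserial for discrimination and one common formulation of KR-20 for reliability:

```python
import statistics

def _pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_analysis(responses):
    """Classical test theory statistics for a 0/1 response matrix
    (rows = students, columns = items): difficulty p, corrected
    point-biserial discrimination, and KR-20 reliability."""
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        p = sum(item) / n_students                               # difficulty index
        rest = [totals[i] - item[i] for i in range(n_students)]  # rest score
        stats.append({"p": p, "r_pbis": _pearson(item, rest)})
    # KR-20: (k/(k-1)) * (1 - sum(p*q) / var(total score))
    k = n_items
    pq = sum(s["p"] * (1 - s["p"]) for s in stats)
    var_total = statistics.pvariance(totals)
    kr20 = (k / (k - 1)) * (1 - pq / var_total)
    return stats, kr20
```

Comparing the two question sets then amounts to comparing the distributions of p and r_pbis, and the reliabilities, across the AI-generated and past LGS subsets of the 18-item form.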
Ethical Statement
This study was conducted with ethics committee approval (decision no. GO 2025/1527) obtained from the MAKÜ Non-Interventional Clinical Research Ethics Committee.
Thanks
Part of this study was presented as an oral paper at the 3rd International 21st Century Educational Research Congress (INER-2025, 25–27 April).
References
-
Adiguzel, T., Kaya, M. H., & Cansu, F. K. (2023). Revolutionizing education with AI: Exploring the transformative potential of ChatGPT. Contemporary Educational Technology, 15(3). https://doi.org/10.30935/cedtech/13152
Agarwal, M., Sharma, P., & Wani, P. (2025). Evaluating the accuracy and reliability of large language models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in answering item-analyzed multiple-choice questions on blood physiology. Cureus 17(4): e81871. https://doi.org/10.7759/cureus.81871
-
Antropic, (2024). Claude pro. https://www.anthropic.com/news/claude-pro
-
Ayre, C., & Scally, A. J. (2014). Critical values for Lawshe’s content validity ratio: Revisiting the original methods of calculation. Measurement and Evaluation in Counseling and Development, 47(1), 79–86. https://doi.org/10.1177/0748175613513808
-
Baykul, Y. (2010). Eğitimde ve psikolojide ölçme: Klasik test teorisi ve uygulaması. Pegem Akademi.
-
Bhandari, S., Liu, Y., & Pardos, Z. A. (2023). Evaluating ChatGPT-generated textbook questions using IRT. NeurIPS 2023 Workshop on Generative AI for Education (GAIED), University of California, Berkeley. https://doi.org/10.1109/TLT.2022.3224232
-
Bozkurt, A. (2023). Chatgpt, üretken yapay zeka ve algoritmik paradigma değişikliği. Alanyazın, 4(1), 63-72. https://doi.org/10.59320/alanyazin.1283282
-
Boztepe, C. (2025). Eğitimde yapay zekâ uygulamaları: Fırsatlar, sınırlılıklar ve etik tartışmalar [Review article]. Dumlupınar Üniversitesi Eğitim Bilimleri Enstitüsü Dergisi, 9(1). https://doi.org/10.71272/debder.1706141
-
Bulut, G., & Akyıldız, M. (2024). Yapay zekâ ile üretilen soruların ve madde parametrelerinin MST test koşullarında karşılaştırılması. Dijital Teknolojiler ve Eğitim Dergisi, 3(1), 1-12. https://doi.org/10.5281/zenodo.12637347
-
Bulut, O., & Yıldırım-Erbaşlı, S. N. (2022). Automatic story and item generation for reading comprehension assessments with transformers. International Journal of Assessment Tools in Education, 9(Special Issue), 72-87. https://doi.org/10.21449/ijate.1124382
-
Büyüköztürk, Ş. (2016). Sınavlar üzerine düşünceler. Kalem Eğitim ve İnsan Bilimleri Dergisi, 6 (2), 345 – 356
-
Büyüköztürk, Ş. (2018). Sosyal bilimler için veri analizi el kitabı: İstatistik, araştırma deseni, SPSS uygulamaları ve yorum (24. baskı). Pegem Akademi.
-
Büyüköztürk, Ş., Kılıç Çakmak, E., Akgün, Ö. E., Karadeniz, Ş., & Demirel, F. (2020). Bilimsel araştırma yöntemleri (27. baskı). Pegem Akademi.
-
Büyükuslu, A. R. (2020). Koronavirüs sonrası yenidünya düzeni ekonomi-devlet-yapay zekâ. Der Yayınları.
-
Can, A. (2014). SPSS ile bilimsel araştırma sürecinde nicel veri analizi (2. Baskı). Pegem A Yayıncılık.
-
Chalasani, S. H., Syed, J., Ramesh, M., Patil, V., & Kumar, T. P. (2023). Artificial intelligence in the field of pharmacy practice: A literature review. Exploratory research in clinical and social pharmacy, 12, 100346. https://doi.org/10.1016/j.rcsop.2023.100346
-
Chen, L., Chen, P., & Lin, Z. (2020). Artificial intelligence in education: A review. IEEE Access, 8, 75264-75278. https://doi.org/10.1109/ACCESS.2020.2988510.
-
Chen, B., Zilles, C., West, M., & Bretl, T. (2019). Effect of discrete and continuous parameter variation on difficulty in automatic item generation. Artificial Intelligence in Education. 20th International Conference, AIED 2019, Chicago, IL, USA, June 25-29, Proceedings, Part I 20,
-
Creswell, J. W., & Plano Clark, V. L. (2018). Designing and conducting mixed methods research (3rd ed.). SAGE Publications.
-
Crocker, L., & Algina, J. (2006). Introduction to classical and modern test theory. Cengage Learning.
-
Çavuş, M. N. (2024). Eğitimde yapay zekâ tabanlı ölçme ve değerlendirme üzerine bir derleme. Uluslararası Özel Amaçlar için İngilizce Dergisi, 2(1), 39-54. https://dergipark.org.tr/en/pub/joinesp/issue/85082/1492424
-
Çetin, M. ve Baklavacı, G. Y. (2024). Endüstri̇ 4.0 perspekti̇fi̇nde yapay zekanin eği̇ti̇mde uygulanabi̇li̇rli̇ği̇ i̇le i̇lgi̇li̇ öğretmen görüşleri̇ni̇n i̇ncelenmesi̇. İstanbul Ticaret Üniversitesi Girişimcilik Dergisi, 7(14), 1-21. https://doi.org/10.55830/tje.1404165
-
Derici, S. (2025). Karar vermede yapay zeka tabanlı derin öğrenme ve makine öğrenmesi algoritmaları: Yapay zeka ile bütünleşme. E. Atalay ve S. Derici (Ed.) Yapay Zekâ: Dijital Çağın Anahtarı içinde, s. 83-97. Akademisyen Kitapevi
-
Doğan, S., & Oktay, Y. (2022). Liselere geçiş sınavı (LGS) hazırlık sürecinin değerlendirilmesi. İnsan ve Toplum Bilimleri Araştırmaları Dergisi, 11(2), 963-992. https://doi.org/10.15869/itobiad.1054829
-
Dorsey, D. W., & Michaels, H. R. (2022). Validity arguments meet artificial intelligence in innovative educational assessment: A discussion and look forward. Journal of Educational Measurement, 59(3), 389-394.
-
Downing, S. M. (2005). The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Advances in Health Sciences Education, 10(2), 133-43. https://doi.org/10.1007/s10459-004-4019-5.
-
Ebel, R. L. & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.). Prentice Hall.
-
Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Psychology Press.
-
Ertel, W. (2024). Introduction to artificial intelligence. Springer Nature
-
Feng, W., Lee, J., McNichols, H., Scarlatos, A., Smith, D., Woodhead, S., ... & Lan, A. (2024). Exploring automated distractor generation for math multiple-choice questions via large language models. Findings of the Association for Computational Linguistics: NAACL 2024, 3067–3082. https://doi.org/10.48550/arxiv.2404.02124 .
-
Feuerriegel, S., Hartmann, J., Janiesch, C., & Zschech, P. (2024). Generative AI. Business & Information Systems Engineering, 66 (1), 111–126. https://doi.org/10.1007/s12599-023-00834-7
-
Field, A. (2013). Discovering statistics using IBM SPSS Statistics (4th ed.). SAGE Publications.
-
Gierl, M.J., & Haladyna, T.M. (2012). Automatic item generation: Theory and practice. Routledge.
Gierl, M.J., & Lai, H. (2012). The role of item models in automatic item generation. International Journal of Testing, 12(3), 273-298.
-
Göksün, D. O. (2025). Öğretmen adaylarının üst düzey düşünme becerilerini geliştirecek soru etkinliklerinde yapay zekâ araçlarının kullanımına yönelik swot analizleri. Erzincan Üniversitesi Eğitim Fakültesi Dergisi, 27(2), 278-293. https://doi.org/10.17556/erziefd.1648080
-
Gündeğer Kılcı, C. (2025). ChatGPT vs. DeepSeek: A comparative psychometric evaluation of AI tools in generating multiple-choice questions. International Journal of Assessment Tools in Education, 12(4), 1055-1079. https://doi.org/10.21449/ijate.1674995
-
Haladyna, T. M. & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.
-
Halaweh, M. (2023). ChatGPT in education: Strategies for responsible implementation. Contemporary Educational Technology, 15(2), ep421. https://doi.org/10.30935/cedtech/13036
-
Hambleton, R. K., Swaminathan, H. & Rogers, H. J. (1991). Fundamentals of item response theory. SAGE.
-
Holmes, W., Bialik, M. & Fadel, C. (2019). Artificial intelligence in education: Promises and implications for teaching and learning. Center for Curriculum Redesign.
-
Hou, J., Li, Z., & Liu, G. (2022). Macro education approach to improve learning interest under the background of artificial intelligence. Wireless Communications and Mobile Computing, 2022, 4295887. https://doi.org/10.1155/2022/4295887
-
Huang, A. Y., Lu, O. H., & Yang, S. J. (2023). Effects of artificial intelligence-enabled personalized recommendations on learners’ learning engagement, motivation, and outcomes in a flipped classroom. Computers & Education, 194, 104684. https://doi.org/10.1016/j.compedu.2022.104684
-
Indran, I. R., Paranthaman, P., Gupta, N., & Mustafa, N. (2024). Twelve tips to leverage AI for efficient and effective medical question generation: A guide for educators using Chat GPT. Medical Teacher, 46(8):1021-1026 https://doi.org/10.1080/0142159X.2023.2294703
-
Jiang, Q., Gao, Z., & Karniadakis, G.E. (2025). DeepSeek vs. ChatGPT vs. Claude: A comparative study for scientific computing and scientific machine learning tasks. Theoretical and Applied Mechanics Letters. https://doi.org/10.1016/j.taml.2025.100583
-
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39(1), 31–36.
-
Kelley, T. L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30(1), 17–24. https://doi.org/10.1037/h0057123
-
Kıyak, Y. S., Budakoğlu, I. İ., Coşkun Ö. & Koyun E. (2023). The first automatic ıtem generation in Turkish for assessment of clinical reasoning in medical education. World of Medical Education, 22(66), 72-90. https://doi.org/10.25282/ted.1225814
-
Kline, P. (2000). The handbook of psychological testing (2nd ed.). Routledge.
-
Kosh, A. E., Simpson, M. A., Bickel, L., Kellogg, M., & Sanford‐Moore, E. (2019). A cost– benefit analysis of automatic item generation. Educational Measurement: Issues and Practice, 38(1), 48-53. https://doi.org/10.1111/emip.12237
-
Krumm, S., Thiel, A. M., Reznik, N., Freudenstein, J.-P., Schäpers, P., & Mussel, P. (2024). Creating a psychological test in a few seconds: Can ChatGPT develop a psychometrically sound situational judgment test?European Journal of Psychological Assessment. Advance online publication. https://doi.org/10.1027/1015-5759/a000878
-
Kumar, N. S. (2019). Implementation of artificial intelligence in imparting education and evaluating student performance. Journal of Artificial Intelligence, 1(01), 1-9. https://doi.org/10.36548/jaicn.2019.1. 001
-
Laverghetta, A., & Licato, J. (2023). Generating better items for cognitive assessments using large language models. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 414–428). Association for Computational Linguistics.
-
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563–575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
-
Lee, K., Park, J., Kim, I., & Choi, Y. (2018). Predicting movie success with machine learning techniques: Ways to improve accuracy. Information Systems Frontiers, 20, 577–588. https://doi.org/10.1007/s10796-016-9689-z
-
Lee, J., Smith, D., Woodhead, S., & Lan, A. (2024). Math multiple choice question generation via human-large language model collaboration. 17th International Conference on Educational Data Mining (EDM 2024). arXiv. https://doi.org/10.48550/arXiv.2405.00864
-
Lin, P. Y., Chai, C. S., Jong, M. S. Y., Dai, Y., Guo, Y., & Qin, J. (2021). Modelling the structural relationship among primary students’ motivation to learn artificial intelligence. Computers and Education: Artificial Intelligence, 2, 100006. https://doi.org/10.1016/j.caeai.2020.100006
-
Lin, Z., & Chen, H. (2024). Investigating the capability of ChatGPT for generating multiple-choice reading comprehension items. System, 123, 103344. https://doi.org/10.1016/j.system.2024.103344
-
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1–35. https://doi.org/10.1145/3560815
-
Luckin, R., Holmes, W., Griffiths, M., & Forcier, L. B. (2016). Intelligence unleashed: An argument for AI in education. Pearson.
-
Luo, Y., Han, X., & Zhang, C. (2022). Prediction of learning outcomes with a machine learning algorithm based on online learning behavior data in blended courses. Asia Pacific Education Review. https://doi.org/10.1007/s12564-022-09749-6
-
Malec, W. (2024). Investigating the quality of AI-generated distractors for a multiple-choice vocabulary test. CSEDU, 1, 836–843. https://doi.org/10.5220/0012762400003693
-
Meehan, K., Wilson, S., & Farrelly, W. (2024). Evaluating alternatives to essay-based assessment in a world where generative AI is pervasive. ICERI2024 Proceedings (pp. 1152–1159). 11–13 November, Spain.
-
Memarian, B., & Doleck, T. (2024). A review of assessment for learning with artificial intelligence. Computers in Human Behavior: Artificial Humans, 2(1), 100040. https://doi.org/10.1016/j.chbah.2023.100040
-
Mena-Guacas, A. F., Urueña Rodríguez, J. A., Santana Trujillo, D. M., Gómez-Galán, J., & López-Meneses, E. (2023). Collaborative learning and skill development for educational growth of artificial intelligence: A systematic review. Contemporary Educational Technology, 15(3), ep428. https://doi.org/10.30935/cedtech/13123
-
Millî Eğitim Bakanlığı [MEB]. (2018–2023). LGS örnek ve çıkmış sorular arşivi. https://odsgm.meb.gov.tr/
-
Millî Eğitim Bakanlığı [MEB]. (2023). LGS Değerlendirme Raporu. Ankara: MEB.
-
Omopekunola, M. O., & Kardanova, E. Y. (2024). Automatic generation of physics items with large language models (LLMs). Research and Evaluation in Education, 10(2). https://doi.org/10.21831/reid.v10i2.76864
-
Owan, V. J., Abang, K. B., Idika, D. O., Etta, E. O., & Bassey, B. A. (2023). Exploring the potential of artificial intelligence tools in educational measurement and assessment. Eurasia Journal of Mathematics, Science and Technology Education, 19(8), em2307. https://doi.org/10.29333/ejmste/13428
-
Özgeldi, M. (2019). Yapay zeka ve insan kaynakları. G. Telli (Ed.), Yapay zeka ve gelecek içinde (s. 198-222). İstanbul: Doğu Kitapevi.
-
Pais, J., Silva, A., Guimarães, B., Povo, A., Coelho, E., Silva-Pereira, F., ... & Severo, M. (2016). Do item-writing flaws reduce examinations psychometric quality? BMC Research Notes, 9, 399. https://doi.org/10.1186/s13104-016-2202-4
-
Peters, U., & Chin‐Yee, B. (2025). Generalization bias in large language model summarization of scientific research. Royal Society Open Science, 12(4). https://doi.org/10.1098/rsos.241776
-
Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and recommendations. Research in Nursing & Health, 29(5), 489–497. https://doi.org/10.1002/nur.20147
-
Pressey, S. L. (1950). Development and appraisal of devices providing immediate automatic scoring of objective tests and concomitant self-instruction. The Journal of Psychology, 29(2), 417–447.
-
Rodrigues, L., Pereira, F. D., Cabral, L., Gašević, D., Ramalho, G., & Mello, R. F. (2024). Assessing the quality of automatic-generated short answers using GPT4. Computers and Education: Artificial Intelligence, 7, 100248. https://doi.org/10.1016/j.caeai.2024.100248
-
Rubinstein, S. M., Muhsin, A., Banerjee, R., Mishra, S., Kwok, M., Yang, P., Warner, J. L., & Cowan, A. W. (2025). Summarizing clinical evidence utilizing large language models: A blinded comparative analysis. Frontiers in Digital Health, 5, 1569554. https://doi.org/10.3389/fdgth.2025.1569554
-
Rudner, L. M. (2009). Implementing the graduate management admission test computerized adaptive test. In Elements of adaptive testing (pp. 151–165). Springer New York. https://doi.org/10.1007/978-0-387-85461-8_8
-
Sayın, A., & Bozdağ, S. (2024). Çeşitli yönleri ile otomatik madde üretimi. EPODDER.
-
Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research, 8(2), 23–74. https://psycnet.apa.org/record/2003-08119-003
-
Schumacker, R. E., & Lomax, R. G. (2010). A beginner's guide to structural equation modeling. New York: Routledge.
-
Shin, D., & Lee, J. H. (2023). Can ChatGPT make reading comprehension testing items on par with human experts? Language Learning & Technology, 27(3), 27–40. https://hdl.handle.net/10125/73530
-
Soylu, A., Coşkun, Ö., Budakoğlu, I. İ., & Peker, T. V. (2024). Discrimination and difficulty indices of ChatGPT-generated multiple-choice questions in anatomy. 24. Ulusal Anatomi Kongresi (pp. 76–77). İstanbul, Turkey.
-
Şad, S. N., & Şahiner, Y. K. (2016). Temel eğitimden ortaöğretime geçiş (TEOG) sistemine ilişkin öğrenci, öğretmen ve veli görüşleri. İlköğretim Online, 15(1). https://doi.org/10.17051/io.2016.78720
-
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson.
-
Tan, B., Armoush, N., Mazzullo, E., Bulut, O., & Gierl, M. J. (2025). A review of automatic item generation techniques leveraging large language models. International Journal of Assessment Tools in Education, 12(2), 317–340. https://doi.org/10.21449/ijate.1602294
-
Taşdelen-Teker, G., Bakan-Kalaycıoğlu, D., Şahin, M. G., & Esenboğa, S. (2024). Yapay zekâ tarafından oluşturulan olgu temelli çoktan seçmeli maddelerin değerlendirilmesi: Pediatri örneği. Uluslararası Ölçme, Seçme ve Yerleştirme Sempozyumu. ÖSYM, Ankara.
-
Tarrant, M., Knierim, A., Hayes, S. K., & Ware, J. (2006). The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Education Today, 26(8), 662–671. https://doi.org/10.1016/j.nedt.2006.07.006
-
Tavşancıl, E. (2010). Tutumların ölçülmesi ve SPSS ile veri analizi (5. baskı). Nobel Yayıncılık.
-
Tekin, H. (1996). Eğitimde ölçme ve değerlendirme. Yargı Yayınları.
-
Turgut, M. F., & Baykul, Y. (2012). Eğitimde ölçme ve değerlendirme. Pegem Akademi Yayıncılık.
-
Ulusoy, B. (2020). 8. sınıf öğrencilerinin liselere geçiş sınavına (LGS) ilişkin algılarının metaforlar aracılığıyla incelenmesi. Necmettin Erbakan Üniversitesi Ereğli Eğitim Fakültesi Dergisi, 2(2), 186–202. https://doi.org/10.51119/ereegf.2020.5
-
Wallen, N. E., & Fraenkel, J. R. (2013). Educational research: A guide to the process. Routledge.
-
Woolf, B. P. (2010). Building intelligent interactive tutors: Student-centered strategies for revolutionizing e-learning. Morgan Kaufmann.
-
Yapıcı Coşkun, Z., Kıyak, Y. S., Coşkun, Ö., Budakoğlu, I. İ., & Özdemir, Ö. (2025). Large language models for generating script concordance test in obstetrics and gynecology: ChatGPT and Claude. Medical Teacher, 47(11), 1767–1771. https://doi.org/10.1080/0142159X.2025.2497888
-
Yeşilyurt, S., Dündar, R., & Aydın, M. (2024). Sosyal bilgiler eğitimi alanında lisansüstü eğitimini sürdüren öğrencilerin yapay zekâ hakkındaki görüşleri. Asya Studies, 8(27), 1–14. https://doi.org/10.31455/asya.1406649
-
Yüzüak, B. N., & Yılmaz, F. N. (2025). Soruları gözden geçirmede farklı yapay zeka uygulamalarının karşılaştırılması: Doğru-yanlış önermeleri üzerine bir uygulama. Dokuz Eylül Üniversitesi Buca Eğitim Fakültesi Dergisi, 63, 500–520. https://doi.org/10.53444/deubefd.1532545
-
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39. https://doi.org/10.1186/s41239-019-0171-0