Madde Güçlüklerinin Tahmin Edilmesinde Uzman Görüşleri ve ChatGPT Performansının Karşılaştırılması / Comparison of Expert Opinions and ChatGPT Performance in Predicting Item Difficulties

Erdem Boduroğlu; Oğuz Koç; Mahmut Sami Yiğiter

doi:10.57135/jier.1296255

Research Article

Year 2023, Volume 7, Issue 15, 202 - 210, 30.08.2023

Abstract

This study investigates the use of ChatGPT, an artificial intelligence technology, as a supporting tool in education. ChatGPT's performance in answering multiple-choice test items and in classifying their difficulty levels was examined. Item difficulty levels were determined from the responses of 4930 students to a 20-item multiple-choice test with five options per item. The relationships between these difficulty levels and the classifications made by ChatGPT and by experts were then examined. The findings showed that ChatGPT's performance in answering the multiple-choice items correctly was not high (55%). In classifying item difficulty levels, however, ChatGPT showed a correlation of 0.748 with the actual item difficulty levels and 0.870 with expert opinions. These results suggest that ChatGPT can support test development in cases where a pilot administration cannot be carried out or expert opinions cannot be obtained. Artificial intelligence technologies similar to ChatGPT may also be used in large-scale exams under expert supervision.
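As an illustration of how empirical item difficulty can be derived from student responses, the following minimal Python sketch computes the classical difficulty index (the proportion of students answering an item correctly) and maps it to a difficulty level. The function names, the three-level scheme, and the cut-off values 0.70 and 0.30 are assumptions made for this example; the abstract does not state which categories or thresholds the authors used.

```python
from typing import List


def item_difficulty(responses: List[List[int]]) -> List[float]:
    """Return the proportion of correct answers for each item.

    responses[s][i] is 1 if student s answered item i correctly, else 0.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students for i in range(n_items)]


def classify(p: float, easy_cut: float = 0.70, hard_cut: float = 0.30) -> str:
    """Map a difficulty index to a level.

    The cut-offs are illustrative assumptions, not values from the study.
    """
    if p >= easy_cut:
        return "easy"
    if p <= hard_cut:
        return "hard"
    return "medium"


# Toy data: 4 students (rows) by 3 items (columns), scored 0/1.
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
p_values = item_difficulty(toy)
levels = [classify(p) for p in p_values]
```

With real data, each row would hold one student's scored responses to the 20 items, and the resulting levels could be compared with the levels assigned by experts or by ChatGPT.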

Keywords

ChatGPT, artificial intelligence, item difficulty, educational technology

References

Anıl, D. (2002). Deneme uygulamasının yapılamadığı durumlarda madde ve test parametrelerinin klasik ve örtük özellikler test teorilerine göre kestirilmesi. Yayımlanmamış doktora tezi, Hacettepe Üniversitesi Sosyal Bilimler Enstitüsü, Ankara.
Baykul, Y., & Sezer, S. (1993). Deneme yapılamayan durumlarda madde güçlük ve ayırıcılık gücü indekslerinin ve bunlara bağlı test istatistiklerinin kestirilmesi. Eğitim ve Bilim, 17(83).
Baykul, Y. (2015). Eğitimde ve psikolojide ölçme: Klasik test teorisi ve uygulaması. Ankara: Pegem Akademi.
Bozkurt, A., Xiao, J., Lambert, S., Pazurek, A., Crompton, H., Koseoglu, S., Farrow, R., Bond, M., Nerantzi, C., Honeychurch, S., Bali, M., Dron, J., Mir, K., Stewart, B., Costello, E., Mason, J., Stracke, C., Romero-Hall, E., Koutropoulos, A., . . . Jandrić, P. (2023). Speculative futures on ChatGPT and Generative Artificial Intelligence (AI): A collective reflection from the educational landscape. Asian Journal of Distance Education, 18(1), 53-130. https://www.asianjde.com/ojs/index.php/AsianJDE/article/view/709
Choi, J. H., Hickman, K. E., Monahan, A. B. & Schwarcz, D. (2023). ChatGPT Goes to Law School. Minnesota Legal Studies Research Paper No. 23-03.
CNN (2023). ChatGPT Passes Exams from Law and Business Schools. Available online: https://edition.cnn.com/2023/01/26/tech/chatgpt-passes-exams (accessed on 10 March 2023).
Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. USA: Harcourt Brace Jovanovich College Publishers.
Deng, J., & Lin, Y. (2022). The benefits and challenges of ChatGPT: An overview. Frontiers in Computing and Intelligent Systems, 2(2), 81-83. https://doi.org/10.54097/fcis.v2i2.4465
Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2012). How to design and evaluate research in education (Vol. 7, p. 429). New York: McGraw-Hill.
Frieder, S., Pinchetti, L., Griffiths, R. R., Salvatori, T., Lukasiewicz, T., Petersen, P. C., ... & Berner, J. (2023). Mathematical capabilities of ChatGPT. arXiv preprint arXiv:2301.13867.
Güler, N., İlhan, M., & Taşdelen-Teker, G. (2021). Çoktan seçmeli maddelerde uzmanlarca öngörülen ve ampirik olarak hesaplanan güçlük indekslerinin karşılaştırılması. Journal of Computer and Education Research, 9(18), 1022-1036. DOI: 10.18009/jcer.1000934
Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69-81.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., ... & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.
Khademi, A. (2023). Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability Analysis against Human Performance. arXiv preprint arXiv:2304.05372.
Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., ... & Tseng, V. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS digital health, 2(2), e0000198.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
Lo, C. K. (2023). What Is the Impact of ChatGPT on Education? A Rapid Review of the Literature. Education Sciences, 13(4), 410.
Lorge, I., & Diamond, L. K. (1954). The value of information to good and poor judges of item difficulty. Educational and Psychological Measurement, 14(1), 29–33. https://doi.org/10.1177/001316445401400103
OpenAI (2023). Introducing OpenAI. Accessed on 08.05.2023. Available at: https://openai.com/blog/introducing-openai
Quereshi, M. Y., & Fisher, T. L. (1977). Logical versus empirical estimates of item difficulty. Educational and Psychological Measurement, 37(1), 91–100. https://doi.org/10.1177/001316447703700110
Ryznar, M. (2023). Exams in the Time of ChatGPT. Washington and Lee Law Review Online, 80(5), 305.
Sezer, S. (1992). Ön deneme yapılamayan durumlarda madde güçlük ve ayırıcılık gücü indekslerinin ve bunlara bağlı test istatistiklerinin kestirilmesi. Yayımlanmamış doktora tezi, Hacettepe Üniversitesi Sosyal Bilimler Enstitüsü, Ankara.
Shakarian, P., Koyyalamudi, A., Ngu, N., & Mareedu, L. (2023). An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP). arXiv preprint arXiv:2302.13814.
Sok, S., & Heng, K. (2023). ChatGPT for education and research: A review of benefits and risks. Available at SSRN 4378735.
Tinkelman, S. (1947). Difficulty prediction of test items. Teachers College Contributions to Education, 941, 55.
Urbina, S. (2014). Essentials of psychological testing (2nd ed.). Hoboken, New Jersey: Wiley.
Walter, S. D., Eliasziw, M., & Donner, A. (1998). Sample size and optimal designs for reliability studies. Statistics in Medicine, 17(1), 101-110.
Zhai, X. (2023). ChatGPT for next generation science learning. XRDS: Crossroads, The ACM Magazine for Students, 29(3), 42-46.

Abstract

In this study, ChatGPT's performance in answering multiple-choice test items and in classifying the difficulty levels of these items was examined. The items' actual difficulty levels were determined from the responses of 4930 students to a 20-item multiple-choice test with five options per item. The relationships between these difficulty levels and the classifications made by ChatGPT and by experts were tested. The findings demonstrated that ChatGPT's performance in answering the multiple-choice items correctly was at a moderate level (55%). However, in classifying item difficulty levels, ChatGPT showed a correlation of 0.748 with the actual item difficulty levels and 0.870 with expert opinions. These results suggest that ChatGPT can be used to support test development in cases where a pilot administration cannot be conducted or expert opinions cannot be consulted. In large-scale exams, ChatGPT-like artificial intelligence technologies can be utilized under expert supervision.
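To make the reported agreement figures more concrete, the sketch below computes a rank correlation between two sets of ordinal difficulty classifications. It is a minimal illustration that assumes Spearman's coefficient with average ranks for ties; the abstract does not say which correlation coefficient the authors used, and the level coding (1 = easy, 2 = medium, 3 = hard) and sample values are invented for the example.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical difficulty levels for six items (1 = easy, 2 = medium, 3 = hard).
empirical = [1, 2, 2, 3, 3, 1]
predicted = [1, 2, 3, 3, 2, 1]
rho = spearman(empirical, predicted)
```

In the study's setting, one sequence would hold the levels derived from student responses and the other the levels assigned by ChatGPT or by the experts, one entry per item.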

Keywords

ChatGPT, artificial intelligence, item difficulties, expert opinion

There are 27 citations in total.

Details

Primary Language	Turkish
Subjects	Other Fields of Education
Journal Section	Educational Sciences
Authors	Erdem Boduroğlu (0000-0001-8318-4914); Oğuz Koç (0000-0002-8656-6069); Mahmut Sami Yiğiter (0000-0002-2896-0201)
Publication Date	August 30, 2023
Submission Date	May 12, 2023
Acceptance Date	July 25, 2023
Published in Issue	Year 2023

Cite

APA	Boduroğlu, E., Koç, O., & Yiğiter, M. S. (2023). Madde Güçlüklerinin Tahmin Edilmesinde Uzman Görüşleri ve ChatGPT Performansının Karşılaştırılması / Comparison of Expert Opinions and ChatGPT Performance in Predicting Item Difficulties. Disiplinlerarası Eğitim Araştırmaları Dergisi, 7(15), 202-210. https://doi.org/10.57135/jier.1296255

The Aim of The Journal

The Journal of Interdisciplinary Educational Researches (JIER), published by the Interdisciplinary Educational and Research Association (JIERA), is an internationally eminent journal.

JIER, a nonprofit NGO, is concerned with improving the education system in line with its corporate objectives and social responsibility policies. JIER has the potential to help solve educational problems and greatly values the contributions of qualified scientific researchers.

JIER aims to help build an education system that gives each individual the knowledge and skills they need, first in our country and then around the world. JIER also disseminates academic work that contributes to the professional development of teachers and academics and that offers concrete solutions to problems at all levels of education, from preschool to higher education.

JIER gives priority to contributing to better school practices. Creating and managing content in this context will help it advance toward its goal of being a "focus journal" and a "journal school", and will also form the basis for a holistic view of educational issues. It also acts as an intermediary in building a shared understanding for sustainable development and education.