Investigation of ChatGPT and Real Raters in Scoring Open-Ended Items in Terms of Inter-Rater Reliability

Seda Demir

doi:10.46778/goputeb.1345752

Research Article

Year 2023, Volume: 2023 Issue: 21, 1072 - 1099, 31.10.2023

Cited By: 1

Abstract

This study examines the inter-rater reliability of scores assigned to responses to open-ended items by ChatGPT, an artificial intelligence-based tool, and by two real (human) raters, all working from the same scoring keys. The study group consisted of 30 students aged 13 to 15 who were attending school in Eskişehir province during the 2022-2023 academic year. Data were collected face-to-face using 16 open-ended items selected from the published sample questions in the reading skills domain of the Programme for International Student Assessment (PISA). Inter-rater reliability was assessed with correlation, percentage of agreement, and Generalizability theory. SPSS 25 was used for the correlation analyses, Excel for the percentage-of-agreement analyses, and EduG 6.1 for the Generalizability theory analyses. The results showed a positive, high correlation between the raters and a high level of agreement among them, whereas the reliability (G) coefficients calculated with Generalizability theory were lower than both the correlation values and the percentages of agreement. For short-answer items whose answers appear directly in the text, all raters showed perfect positive correlation and full agreement with one another. The Generalizability theory results further indicated that, among the main effects, the items (i) explained the largest share of the total variance and that, among the interaction effects, the student-item interaction (s×i) explained the largest share. In practice, educators may therefore consider drawing on artificial intelligence-based tools such as ChatGPT when scoring open-ended items, which take a long time to score, particularly in crowded classes or when time is limited.

Keywords

Inter-rater Reliability, Scoring Open-Ended Items, Generalizability Theory, ChatGPT
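
As a rough illustration of two of the indices named in the abstract, the sketch below computes a pairwise correlation and a percentage of agreement for three raters. It is not the author's procedure (the study used SPSS 25 and Excel): the scores are invented for illustration, the function names are hypothetical, and the choice of the Pearson coefficient is an assumption, since the abstract does not state which correlation coefficient was used.

```python
import math
from itertools import combinations

# Hypothetical scores (0-2) given to six responses; NOT data from the study.
scores = {
    "rater_1": [2, 1, 0, 2, 1, 0],
    "rater_2": [2, 1, 0, 2, 0, 0],
    "chatgpt": [2, 1, 1, 2, 1, 0],
}


def pearson_r(a, b):
    """Pearson correlation between two equally long score lists."""
    if len(a) != len(b):
        raise ValueError("score lists must have the same length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    if ss_a == 0 or ss_b == 0:
        # A rater who gives every response the same score has zero variance,
        # so the correlation is undefined.
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(ss_a * ss_b)


def percent_agreement(a, b):
    """Percentage of responses that both raters scored identically."""
    if len(a) != len(b):
        raise ValueError("score lists must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def pairwise_summary(rater_scores):
    """Compute both indices for every pair of raters."""
    summary = {}
    for (name_a, a), (name_b, b) in combinations(rater_scores.items(), 2):
        summary[(name_a, name_b)] = {
            "pearson_r": pearson_r(a, b),
            "agreement_pct": percent_agreement(a, b),
        }
    return summary


if __name__ == "__main__":
    for pair, stats in pairwise_summary(scores).items():
        print(pair, f"r={stats['pearson_r']:.2f}",
              f"agreement={stats['agreement_pct']:.1f}%")
```

Both indices compare raters two at a time. A Generalizability theory analysis instead partitions score variance across students, items, raters, and their interactions, and is not covered by this sketch.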

There are 44 citations in total.

Details

Primary Language	English
Subjects	Measurement Theories and Applications in Education and Psychology
Journal Section	Articles
Authors	Seda Demir 0000-0003-4230-5593
Publication Date	October 31, 2023
Submission Date	August 18, 2023
Acceptance Date	September 22, 2023
Published in Issue	Year 2023 Volume: 2023 Issue: 21

Cite

APA	Demir, S. (2023). Investigation of ChatGPT and Real Raters in Scoring Open-Ended Items in Terms of Inter-Rater Reliability. International Journal of Turkish Education Sciences, 2023(21), 1072-1099. https://doi.org/10.46778/goputeb.1345752

Cited By

AN INVESTIGATION ON THE EFFECTIVENESS OF CHATBOTS IN EVALUATING WRITING ASSIGNMENTS IN EFL CONTEXTS

Mehmet Akif Ersoy Üniversitesi Eğitim Fakültesi Dergisi

https://doi.org/10.21764/maeuefd.1425384
