Research Article

Inter-Rater Reliability Analysis in Performance-Based Assessment: A Comparison of Generalizability Coefficients and Rater Consistency

Year 2025, Volume: 7, Issue: 2, 218-234, 30.09.2025
https://doi.org/10.38151/akef.2025.158

Abstract

This study investigates the reliability of a performance-based assessment tool used to evaluate university students’ basic statistical skills within the framework of Generalizability Theory (GT). A total of 80 students from the Guidance and Psychological Counseling program participated in a two-hour examination consisting of 10 applied tasks. The tasks were scored independently by two raters using a detailed analytic rubric. The scores were analyzed using a fully crossed design (p × i × r), with variance components estimated via the maximum likelihood method, and 95% confidence intervals calculated using a cluster bootstrap procedure (1,000 resamples). Results showed that 50.2% of the total variance was attributable to students, 25.6% to items, and 16.6% to raters, while interaction terms remained at low levels. The initial relative generalizability coefficient was calculated as .98, and the absolute decision coefficient (Φ) was .81. When the number of items was increased to 15 and the number of raters to five, the Φ coefficient improved to .90, and absolute error variance decreased to .45. Findings indicated that true performance differences among students were strongly captured, although rater effects could not be completely eliminated. Expanding task coverage and increasing the number of raters were found to be effective strategies for reducing both absolute and relative error variances. The study supports the importance of rubric use, investment in rater training, a multi-task–multi-rater approach, and GT-based revision cycles in high-stakes performance assessments. The findings are expected to inform practical assessment strategies aimed at improving statistical literacy in teacher education programs. Additionally, it is recommended that the study be replicated with larger and more diverse samples across disciplines to enhance internal validity. Future directions may include implementing rater feedback cycles through online platforms and integrating rubric-supported scoring systems.
Keywords: Generalizability theory, performance-based assessment, rater reliability
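
To make the measurement design concrete, the sketch below estimates variance components for a fully crossed p × i × r (persons × items × raters) design and then projects relative (Eρ²) and absolute (Φ) coefficients for alternative D-study configurations, including the 15-item, five-rater scenario described in the abstract. This is a minimal Python illustration using the classical ANOVA (expected-mean-squares) estimator and simulated placeholder scores; it is not the article's actual analysis, which estimated components by maximum likelihood and added cluster-bootstrap confidence intervals.

```python
# Minimal G-study / D-study sketch for a fully crossed p x i x r design.
# Assumptions: scores live in a NumPy array of shape (n_persons, n_items, n_raters);
# variance components are estimated with the ANOVA (expected-mean-squares) method,
# not the maximum likelihood procedure used in the article, and no bootstrap
# confidence intervals are computed. Coefficient formulas follow Brennan (2001).
import numpy as np

def g_study(X):
    """Return variance component estimates for a fully crossed p x i x r design."""
    n_p, n_i, n_r = X.shape
    grand = X.mean()
    m_p, m_i, m_r = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    m_pi, m_pr, m_ir = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares for main effects, two-way interactions, and the residual
    ms_p  = n_i * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms_i  = n_p * n_r * np.sum((m_i - grand) ** 2) / (n_i - 1)
    ms_r  = n_p * n_i * np.sum((m_r - grand) ** 2) / (n_r - 1)
    ms_pi = n_r * np.sum((m_pi - m_p[:, None] - m_i[None, :] + grand) ** 2) / ((n_p - 1) * (n_i - 1))
    ms_pr = n_i * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) / ((n_p - 1) * (n_r - 1))
    ms_ir = n_p * np.sum((m_ir - m_i[:, None] - m_r[None, :] + grand) ** 2) / ((n_i - 1) * (n_r - 1))
    resid = (X - m_pi[:, :, None] - m_pr[:, None, :] - m_ir[None, :, :]
             + m_p[:, None, None] + m_i[None, :, None] + m_r[None, None, :] - grand)
    ms_e = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1) * (n_r - 1))

    # Solve the expected-mean-squares equations; negative estimates are set to 0
    v = {"pir,e": ms_e}
    v["pi"] = max((ms_pi - ms_e) / n_r, 0.0)
    v["pr"] = max((ms_pr - ms_e) / n_i, 0.0)
    v["ir"] = max((ms_ir - ms_e) / n_p, 0.0)
    v["p"]  = max((ms_p - ms_pi - ms_pr + ms_e) / (n_i * n_r), 0.0)
    v["i"]  = max((ms_i - ms_pi - ms_ir + ms_e) / (n_p * n_r), 0.0)
    v["r"]  = max((ms_r - ms_pr - ms_ir + ms_e) / (n_p * n_i), 0.0)
    return v

def d_study(v, n_i, n_r):
    """Relative (E-rho^2) and absolute (Phi) coefficients for n_i items and n_r raters."""
    rel_err = v["pi"] / n_i + v["pr"] / n_r + v["pir,e"] / (n_i * n_r)
    abs_err = rel_err + v["i"] / n_i + v["r"] / n_r + v["ir"] / (n_i * n_r)
    return v["p"] / (v["p"] + rel_err), v["p"] / (v["p"] + abs_err), abs_err

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 0-10 rubric scores for 80 students x 10 tasks x 2 raters
    scores = rng.integers(0, 11, size=(80, 10, 2)).astype(float)
    comps = g_study(scores)
    for n_i, n_r in [(10, 2), (15, 5)]:  # observed design vs. projected D-study
        e_rho2, phi, abs_err = d_study(comps, n_i, n_r)
        print(f"items={n_i}, raters={n_r}: E-rho^2={e_rho2:.3f}, Phi={phi:.3f}, abs. error={abs_err:.3f}")
```

With the real 80 × 10 × 2 score matrix substituted for the simulated data, the same d_study call produces the kind of comparison reported above: Φ for the observed 10-item, two-rater design versus the projected 15-item, five-rater design.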

References

  • Andersen, S. A. W., Nayahangan, L. J., Park, Y. S., & Konge, L. (2021). Use of Generalizability Theory for exploring reliability of and sources of variance in assessment of technical skills: A systematic review and meta-analysis. Academic Medicine, 96(11), 1609–1619. https://doi.org/10.1097/ACM.0000000000004150
  • Bacon, D. R. (2003). Assessing Learning Outcomes: A Comparison of Multiple-Choice and Short-Answer Questions in a Marketing Context. Journal of Marketing Education, 25(1), 31-36. https://doi.org/10.1177/0273475302250570
  • Baykul, Y. (2000). Measurement in education and psychology: Classical test theory and its application. Ankara: ÖSYM.
  • Brennan, R. L. (2000). Performance assessments from the perspective of generalizability theory. Applied Psychological Measurement, 24(4), 339–353. https://doi.org/10.1177/01466210022031796
  • Brennan, R. L. (2001). Generalizability theory. Springer-Verlag Publishing. https://doi.org/10.1007/978-1-4757-3456-0
  • Burry-Stock, J. A., Shaw, D. G., Laurie, C., & Chissom, B. S. (1996). Rater Agreement Indexes for Performance Assessment. Educational and Psychological Measurement, 56(2), 251-262. https://doi.org/10.1177/0013164496056002006
  • Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37-46. https://doi.org/10.1177/001316446002000104
  • Cohen, J., Swerdlik, M. E., & Phillips, S. M. (1996). Psychological testing and assessment (4th ed.). California: Mayfield.
  • Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. Wiley.
  • Dunbar, S.B., Koretz, D., & Hoover, H.D. (1991). Quality Control in the Development and Use of Performance Assessments. Applied Measurement in Education, 4, 289-303.
  • Efron, B., & Tibshirani, R.J. (1994). An Introduction to the Bootstrap (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593
  • Fitzpatrick, R., & Morrison, F. (1971). Performance-based testing. Educational Technology, 11(5), 63-64.
  • Güler, N. (2009). Generalizability Theory and Comparison of the Results of G and D Studies Computed by SPSS and GENOVA Packet Programs. Education and Science, 34(154). https://doi.org/10.15390/ES.2009.840
  • Huebner, A., & Lucht, M. (2019). Generalizability theory in R. Practical Assessment, Research & Evaluation, 24(5), 1–12. https://doi.org/10.7275/5065-gc10
  • Jiang, Z., Raymond, M., Shi, D., & DiStefano, C. (2020). Using a linear mixed-effect model framework to estimate multivariate generalizability theory parameters in R. Behavior Research Methods, 52, 2383–2393. https://doi.org/10.3758/s13428-020-01399-z
  • Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130-144. https://doi.org/10.1016/j.edurev.2007.05.002
  • Karasar, N. (2020). Scientific research method (35th ed.). Ankara: Nobel Academic Publishing.
  • Kim, G. Y., Schatschneider, C., Wanzek, J., Gatlin, B., & Al Otaiba, S. (2017). Writing evaluation: Rater and task effects on the reliability of writing scores for children in grades 3 and 4. Reading and Writing, 30(6), 1287–1310. https://doi.org/10.1007/s11145-017-9724-6
  • Koo, T. K., & Li, M. Y. (2016). A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
  • Kutlu, Ö., Doğan, C. D., & Karakaya, İ. (2009). Determining student achievement: Performance and portfolio-based assessment. Ankara: Pegem Akademi. http://dx.doi.org/10.14527/9786053647003
  • Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15–21. https://doi.org/10.3102/0013189X020008015
  • Mandinach, E. B., & Gummer, E. S. (2016). What does it mean for teachers to be data literate? Teaching and Teacher Education, 60, 366–376. https://doi.org/10.1016/j.tate.2016.07.011
  • Mertler, C. A. (2000). Designing scoring rubrics for your classroom. Practical Assessment, Research, and Evaluation, 7(1), 25. https://doi.org/10.7275/gcy8-0w24
  • Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research, and Evaluation, 7(1), 10. https://doi.org/10.7275/q7rm-gg74
  • Nitko, A. J. (2001). Educational assessment of students (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
  • OECD. (2019). OECD Learning Compass 2030: A series of concept notes. Paris: OECD.
  • Panadero, E., Jonsson, A., Pinedo, L., et al. (2023). Effects of rubrics on academic performance, self-regulated learning, and self-efficacy: A meta-analytic review. Educational Psychology Review, 35, 113. https://doi.org/10.1007/s10648-023-09823-4
  • Peeters, M. J., Cor, M. K., Petite, S. E., & Schroeder, M. N. (2021). Validation evidence using Generalizability Theory for an Objective Structured Clinical Examination. Innovations in Pharmacy, 12(1). https://doi.org/10.24926/iip.v12i1.2110
  • Shavelson, R. J. & Webb, N. M. (1991). Generalizability theory: A primer. Sage.
  • Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30(3), 215–232. https://doi.org/10.1111/j.1745-3984.1993.tb00424.x
  • Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. https://doi.org/10.1037/0033-2909.86.2.420
  • Sturgis, P. W., Marchand, L., Miller, M. D., Xu, W., & Castiglioni, A. (2022). Generalizability Theory and its application to institutional research (AIR Professional File No. 156). Association for Institutional Research. https://doi.org/10.34315/apf1562022
  • Sudweeks, R. R., Reeve, S., & Bradshaw, W. S. (2004). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9(3), 239–261. https://doi.org/10.1016/j.asw.2004.11.001
  • T.C. Millî Eğitim Bakanlığı. (2024). Türkiye Yüzyılı Maarif Modeli: Okuryazarlık Becerileri. Ankara: MEB.
  • Viera, A. J., & Garrett, J. M. (2005). Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5), 360-363.
  • Villa, K. R., Sprunger, T. L., Walton, A. M., Costello, T. J., & Isaacs, A. N. (2020). Inter-rater reliability of a clinical documentation rubric within pharmacotherapy problem-based learning courses. American Journal of Pharmaceutical Education, 84(7), 7648. https://doi.org/10.5688/ajpe7648
  • Volpe, R. J., McConaughy, S. H., & Hintze, J. M. (2009). Generalizability of Classroom Behavior Problem and On-Task Scores From the Direct Observation Form. School Psychology Review, 38(3), 382–401. https://doi.org/10.1080/02796015.2009.12087822
  • Wind, S. A. (2018). Examining the Impacts of Rater Effects in Performance Assessments. Applied Psychological Measurement, 43(2), 159-171. https://doi.org/10.1177/0146621618789391
  • Yılmaz, F. N. (2024). Comparing the reliability of performance task scores obtained from rating scale and analytic rubric using the generalizability theory. Studies in Educational Evaluation, 83. https://doi.org/10.1016/j.stueduc.2024.101413

Details

Primary Language: English
Subjects: Measurement Theories and Applications in Education and Psychology
Section: Research Articles
Authors

Mustafa Köroğlu (ORCID: 0000-0001-9610-8523)

Early View Date: September 30, 2025
Publication Date: September 30, 2025
Submission Date: July 19, 2025
Acceptance Date: September 23, 2025
Published Issue: Year 2025, Volume: 7, Issue: 2

Cite

APA Köroğlu, M. (2025). Inter-Rater Reliability Analysis in Performance-Based Assessment: A Comparison of Generalizability Coefficients and Rater Consistency. Ahmet Keleşoğlu Eğitim Fakültesi Dergisi, 7(2), 218-234. https://doi.org/10.38151/akef.2025.158
