Research Article

Comparison of Inter-Rater Reliability Techniques in Performance-Based Assessment

Volume: 9 Number: 2 June 26, 2022

Abstract

The aim of this study is to analyse the importance of the number of raters and to compare results obtained with techniques based on Classical Test Theory (CTT) and Generalizability (G) Theory. Inter-rater reliability was determined using the CTT-based Kappa and Krippendorff alpha techniques. In this descriptive study, the data consist of twenty individual investigation performance reports prepared by learners in the International Baccalaureate Diploma Programme (IBDP) and the scores of the five raters who rated these reports. The raters used an analytical rubric developed by the International Baccalaureate Organization (IBO) as the scoring tool. The CTT results show that the Kappa and Krippendorff alpha statistics failed to provide information about the sources of error causing disagreement on the criteria. The analyses based on G Theory, in contrast, provided comprehensive data about the sources of error and showed that increasing the number of raters would also increase the reliability of the scores. The raters nevertheless emphasized the importance of developing the descriptors of the criteria in the rubric.
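To illustrate the kind of CTT-based agreement statistic the study uses, the following sketch computes Cohen's kappa for two raters from scratch. The rater names and scores are hypothetical, not the study's data; kappa corrects the observed agreement for the agreement expected by chance from each rater's marginal score distribution.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' marginal distributions.
    """
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    # Observed agreement: proportion of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected chance agreement from each rater's category frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(r1) | set(r2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rubric levels (0-3) assigned by two raters to ten reports.
rater_a = [2, 3, 1, 2, 0, 3, 2, 1, 2, 3]
rater_b = [2, 3, 1, 1, 0, 3, 2, 2, 2, 3]
print(round(cohens_kappa(rater_a, rater_b), 3))  # → 0.714
```

As the abstract notes, a single coefficient of this kind summarizes agreement but says nothing about *why* raters disagree; decomposing the error variance into sources (raters, tasks, their interactions) is what the G Theory analysis adds.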


Details

Primary Language

English

Subjects

Studies on Education

Journal Section

Research Article

Publication Date

June 26, 2022

Submission Date

September 10, 2021

Acceptance Date

May 17, 2022

Published in Issue

Year 2022 Volume: 9 Number: 2

APA
Arslan Mancar, S., & Gülleroğlu, H. D. (2022). Comparison of Inter-Rater Reliability Techniques in Performance-Based Assessment. International Journal of Assessment Tools in Education, 9(2), 515-533. https://doi.org/10.21449/ijate.993805
