Comparison of Inter-Rater Reliability Techniques in Performance-Based Assessment
Year 2022, 515 - 533, 26.06.2022
Sinem Arslan Mancar, Hamide Deniz Gülleroğlu
Abstract
The aim of this study is to analyse the importance of the number of raters and to compare the results obtained with techniques based on Classical Test Theory (CTT) and Generalizability (G) Theory. The Kappa and Krippendorff alpha techniques, both based on CTT, were used to determine inter-rater reliability. In this descriptive study, the data consist of twenty individual investigation performance reports prepared by learners in the International Baccalaureate Diploma Programme (IBDP) and the scores that five raters assigned to these reports. The raters used an analytical rubric developed by the International Baccalaureate Organization (IBO) as the scoring tool. The results of the CTT analyses show that the Kappa and Krippendorff alpha statistics did not provide information about the sources of error behind the disagreement among raters on the criteria. The analyses based on G Theory provided comprehensive information about the sources of error and indicated that increasing the number of raters would also increase reliability. In addition, the raters pointed out that it is important to develop the descriptors of the criteria in the rubric.
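As an illustration of the two families of techniques named in the abstract, the sketch below computes Cohen's (1960) kappa for a pair of raters and the relative G coefficient from a decision (D) study for different numbers of raters. It is a minimal sketch under stated assumptions, not the authors' analysis: the scores and variance components are invented for illustration, the function names are hypothetical, and a simple crossed person x rater design is assumed, which may differ from the design used in the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's (1960) kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters give the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def g_coefficient(var_person, var_residual, n_raters):
    """Relative G coefficient for a crossed person x rater design (D study).

    var_person is the variance component for the objects of measurement,
    var_residual the person-by-rater interaction confounded with error.
    """
    return var_person / (var_person + var_residual / n_raters)


if __name__ == "__main__":
    # Hypothetical rubric scores from two raters on eight reports.
    scores_a = [3, 2, 4, 4, 1, 3, 2, 4]
    scores_b = [3, 2, 3, 4, 1, 2, 2, 4]
    print("kappa:", round(cohen_kappa(scores_a, scores_b), 3))

    # Hypothetical variance components; only the trend matters here.
    for n in (1, 2, 3, 5, 8):
        print(n, "raters:", round(g_coefficient(4.0, 2.0, n), 3))
```

Kappa condenses agreement into a single index, whereas the G coefficient is built from separate variance components, so the effect of adding raters can be read directly from the formula: the error term is divided by the number of raters, and the coefficient rises as that number grows.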