Group Moderation Assessment Model: An Example of an Open-Ended Mathematics Exam
Year 2021, 357 - 378, 30.08.2021
Mithat Takunyacı, Emin Aydın
Abstract
This study examines the effects of the Group Moderation Assessment Model on the assessment of an open-ended mathematics exam. The Group Moderation Assessment Model is a process in which teachers share their expectations and their understanding of standards with one another in order to improve the consistency of their judgments about student learning. Convenience sampling, a non-random sampling method, was used. The exam papers came from three mathematics exams taken in one term by 22 tenth-grade students, and they were scored by an assessment team of five mathematics teachers working within the group moderation assessment model. The findings show that the raters influenced one another positively and, by making judgments, formed a reliable assessment system. In addition, the raters scored consistently with one another in the exams conducted after the group moderation assessment model workshops. In conclusion, the teachers' sharing of knowledge and opinions during these workshops positively affected their ability to assess exam papers.