Research Article

Investigating a new method for standardising essay marking using levels-based mark schemes

Year 2019, Volume: 6, Issue: 2, 218-234, 15.07.2019
https://doi.org/10.21449/ijate.564824

Abstract

Standardisation is a procedure used by Awarding Organisations to maximise marking reliability by training examiners to judge scripts consistently using a mark scheme. However, research shows that people are better at comparing two objects than at judging each object individually. Consequently, Oxford, Cambridge and RSA (OCR, a UK awarding organisation) proposed investigating a new procedure in which essays are ranked, so that the quality of each essay is judged in comparison with other essays. This study investigated the marking reliability yielded by traditional standardisation and ranking standardisation. The study entailed a marking experiment followed by examiners completing a questionnaire. In the control condition, live procedures were emulated as authentically as possible within the confines of a study. The experimental condition involved ranking the quality of essays from the best to the worst and then assigning marks. After each standardisation procedure, the examiners marked 50 essays from an AS History unit. All participants experienced both procedures, and marking reliability was measured. Additionally, the participants’ questionnaire responses were analysed to gain insight into the examiners’ experience. It is concluded that the Ranking Procedure is unsuitable for use in public examinations in its current form. The Traditional Procedure produced statistically significantly more reliable marking, whilst the Ranking Procedure involved a complex decision-making process. However, the Ranking Procedure produced slightly more reliable marking at the extremities of the mark range, where previous research has shown that marking tends to be less reliable.
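The abstract notes that marking reliability was measured but, being an abstract, does not state which statistics were used. The sketch below is a minimal illustration only, assuming a common approach of comparing each examiner's marks against reference marks for the same essays; the metric choices (mean absolute difference and a Pearson correlation), the function names, and the toy data are hypothetical and are not drawn from the study.

    # Hypothetical illustration: quantifying an examiner's agreement with
    # reference ("definitive") marks for the same essays. Metric choice and
    # data are assumptions; they do not reproduce the study's analysis.
    from statistics import mean, stdev

    def mean_absolute_difference(examiner_marks, reference_marks):
        """Average size of the gap between examiner and reference marks."""
        return mean(abs(e - r) for e, r in zip(examiner_marks, reference_marks))

    def pearson_correlation(xs, ys):
        """Linear association between two mark series (no edge-case handling)."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
        return cov / (stdev(xs) * stdev(ys))

    # Toy data: marks out of 30 for five essays (invented for illustration).
    reference = [24, 12, 18, 6, 28]
    examiner  = [22, 13, 17, 8, 27]

    print(mean_absolute_difference(examiner, reference))  # 1.4 marks on average
    print(pearson_correlation(examiner, reference))       # ~0.99, i.e. high agreement

Smaller mean absolute differences and correlations closer to 1 would indicate closer agreement; either statistic could in principle be compared across the two standardisation conditions.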

Details

Primary Language English
Subjects Studies on Education
Journal Section Articles
Authors

Jackie Greatorex (ORCID: 0000-0002-2303-0638)

Tom Sutch (ORCID: 0000-0001-8157-277X)

Magda Werno

Jess Bowyer

Karen Dunn (ORCID: 0000-0002-7499-9895)

Publication Date July 15, 2019
Submission Date January 18, 2019
Published in Issue Year 2019 Volume: 6 Issue: 2

Cite

APA Greatorex, J., Sutch, T., Werno, M., Bowyer, J., & Dunn, K. (2019). Investigating a new method for standardising essay marking using levels-based mark schemes. International Journal of Assessment Tools in Education, 6(2), 218-234. https://doi.org/10.21449/ijate.564824
