Investigating the Impact of Rater Training on Rater Errors in the Process of Assessing Writing Skill

Mehmet Şata; İsmail Karakaya

doi:10.21449/ijate.877035

Research Article

Year 2022, Volume: 9 Issue: 2, 492 - 514, 26.06.2022

Abstract

In measuring and assessing high-level cognitive skills, rater errors that interfere with measurements are a constant concern and lower objectivity. The main purpose of this study was to investigate the impact of rater training on rater errors in the assessment of individual performance. The study used a pretest-posttest control group quasi-experimental design. A total of 45 raters participated, 23 in the control group and 22 in the experimental group. The data collection tools were a writing task developed by IELTS and an analytic rubric developed to assess academic writing skills. As the experimental procedure, rater training was provided that combined rater error training and frame-of-reference training. The findings showed that the control and experimental groups were similar before the experiment; after the experimental process, however, the experimental group made more valid and reliable measurements. It was concluded that the rater training had an impact on rater errors such as rater severity, rater leniency, central tendency, and the halo effect. Based on these findings, some suggestions were offered for researchers and future studies.

Keywords

Rater training, Rater errors, Many-facet Rasch model, Validity, Reliability


Investigating the Impact of Rater Training on Rater Errors in the Process of Assessing Writing Skill

Year 2022, Volume: 9 Issue: 2, 492 - 514, 26.06.2022

Mehmet Şata , İsmail Karakaya

https://doi.org/10.21449/ijate.877035

Cited By: 3

Abstract

Keywords

Rater training , Rater errors , Many Facet Rasch Model , Validity , Reliability

References

Abu Kassim, N.L. (2011). Judging behaviour and rater errors: an application of the many-facet Rasch model. GEMA Online Journal of Language Studies, 11(3), 179-197.
Abu Kassim, N.L. (2007). Exploring rater judging behaviour using the many-facet Rasch model. Paper Presented in the Second Biennial International Conference on Teaching and Learning of English in Asia: Exploring New Frontiers (TELiA2), Universiti Utara, Malaysia.
Aguinis, H., Mazurkiewicz, M.D., & Heggestad, E.D. (2009). Using web‐based frame‐of‐reference training to decrease biases in personality‐based job analysis: An experimental field study. Personnel Psychology, 62(2), 405-438. https://doi.org/10.1111/j.1744-6570.2009.01144.x
Athey, T.R., & McIntyre, R.M. (1987). Effect of rater training on rater accuracy: Levels–of–processing theory and social facilitation theory perspectives. Journal of Applied Psychology, 72, 567–572. https://doi.org/10.1037/0021-9010.72.4.567
Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a generic approach in automated essay scoring. Journal of Technology, Learning, and Assessment, 10(3), 1-16.
Baird, J.A., Hayes, M., Johnson, R., Johnson, S., & Lamprianou, I. (2013). Marker effects and examination reliability. A Comparative exploration from the perspectives of generalisability theory, Rash model and multilevel modelling. Oxford: University of Oxford for Educational Assessment.
Bennet, J. (1998). Human resources management. Singapore: Prentice Hall.
Bernardin, H.J. (1978). Effects of rater training on leniency and halo errors in student ratings of instructors. Journal of Applied Psychology, 63(3), 301 308. http://dx.doi.org/10.1037/0021-9010.63.3.301
Bernardin, H.J., & Buckley, M.R. (1981). Strategies in rater training. Academy of Management Review, 6(2), 205-212.
Bernardin, H.J. & Pence, E.C. (1980). Effects of rater training: New response sets and decreasing accuracy. Journal of Applied Psychology, 65, 60 66. https://doi.org/10.1037/0021-9010.65.1.60
Bijani, H. (2018). Investigating the validity of oral assessment rater training program: A mixed-methods study of raters’ perceptions and attitudes before and after training. Cogent Education, 5(1), 1-20. https://doi.org/10.1080/2331186X.2018.1460901
Bond, T., & Fox, C.M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. Routledge. https://doi.org/10.4324/9781315814698
Borman, W.C. (1975). Effects of instructions to avoid halo error on reliability and validity of performance evaluation ratings. Journal of Applied Psychology, 60(5), 556-560. https://doi.org/10.1037/0021-9010.60.5.556
Brennan, R.L., Gao, X., & Colton, D.A. (1995). Generalizability analyses of work key listening and writing tests. Educational and Psychological Measurement, 55(2), 157-176. https://doi.org/10.1177/0013164495055002001
Brijmohan, A. (2016). A many-facet RASCH measurement analysis to explore rater effects and rater training in medical school admissions [Doctoral dissertation]. https://hdl.handle.net/1807/74534
Brookhart, S.M. (2013). How to create and use rubrics for formative assessment and grading. ASCD.
Brown, H.D. (2004). Language assessment: Principles and classroom practices. Pearson Education.
Brown, H.D. (2007). Teaching by principles: An interactive approach to language pedagogy. Pearson Education.
Brown, J.D., & Hudson, T. (1998). The alternatives in language assessment. TESOL quarterly, 32(4), 653-675. https://doi.org/10.2307/3587999
Burstein, J., Kukich, K., Wolff, S., Lu, C., Chodorow, M., Braden-Harder, L., & Harris, M.D. (1998). Automated scoring using a hybrid feature identification technique. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Quebec, Canada. https://doi.org/10.3115/980845.980879
Büyüköztürk, Ş. (2011). Deneysel desenler- öntest-sontest kontrol grubu desen ve veri analizi [Experimental designs-pretest-posttest control group design and data analysis]. Pegem Akademi.
Chen, W.H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289. https://doi.org/10.3102/10769986022003265
Congdon, P., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163 178. https://doi.org/10.1111/j.1745-3984.2000.tb01081.x
Cronbach, L.I. (1990). Essentials of psychological testing. Harper and Row.
Çokluk, Ö., Şekercioğlu, G., & Büyüköztürk, Ş. (2012). Sosyal bilimler için çok değişkenli istatistik: SPSS ve LISREL uygulamaları [Multivariate statistics for social sciences: SPSS and LISREL applications]. Pegem Akademi.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117 135. https://doi.org/10.1177/0265532215582282
Dunbar, N.E., Brooks, C.F., & Miller, T.K. (2006). Oral communication skills in higher education: Using a performance-based evaluation rubric to assess communication skills. Innovative Higher Education, 31(2), 115-128. https://doi.org/10.1007/s10755-006-9012-x
Ebel, R.L. (1965). Measuring educational achievement. Prentice- Hall Press.
Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement. Prentice Hall Press.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155 185. https://doi.org/10.1177/0265532207086780
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Peter Lang.
Ellis, R.O.D., Johnson, K.E., & Papajohn, D. (2002). Concept mapping for rater training. TESOL Quarterly, 36(2), 219-233. https://doi.org/10.2307/3588333
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many‐faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112. https://doi.org/10.1111/j.1745-3984.1994.tb00436.x
Esfandiari, R. (2015). Rater errors among peer-assessors: applying the many-facet Rasch measurement model. Iranian Journal of Applied Linguistics, 18(2), 77-107. https://doi.org/10.18869/acadpub.ijal.18.2.77
Fahim, M., & Bijani, H. (2011). The effects of rater training on raters’ severity and bias in second language writing assessment. Iranian Journal of Language Testing, 1(1), 1-16.
Farrokhi, F., & Esfandiari, R. (2011). A many-facet Rasch model to detect halo effect in three types of raters. Theory & Practice in Language Studies, 1(11), 1531-1540. https://doi.org/10.4304/tpls.1.11.1531-1540
Farrokhi, F., Esfandiari, R., & Schaefer, E. (2012). A many-facet Rasch measurement of differential rater severity/leniency in three types of assessment. JALT Journal, 34(1), 79-101.
Farrokhi, F., Esfandiari, R., & Vaez Dalili, M. (2011). Applying the many-facet Rasch model to detect centrality in self-assessment, peer-assessment and teacher assessment. World Applied Sciences Journal, 15(11), 76-83.
Feldman, M., Lazzara, E.H., Vanderbilt, A.A., & DiazGranados, D. (2012). Rater training to support high‐stakes simulation‐based assessments. Journal of Continuing Education in the Health Professions, 32(4), 279-286. https://doi.org/10.1002/chp.21156
Goodrich, H. (1997). Understanding Rubrics: The dictionary may define" rubric," but these models provide more clarity. Educational Leadership, 54(4), 14-17.
Gronlund, N.E. (1977). Constructing achievement test. Prentice-Hall Press.
Haladyna, T.M. (1997). Writing test items in order to evaluate higher order thinking. Allyn & Bacon.
Harik, P., Clauser, B.E., Grabovsky, I., Nungester, R.J., Swanson, D., & Nandakumar, R. (2009). An examination of rater drift within a generalizability theory framework. Journal of Educational Measurement, 46(1), 43-58. https://doi.org/10.1111/j.1745-3984.2009.01068.x
Hauenstein, N.M., & McCusker, M.E. (2017). Rater training: Understanding effects of training content, practice ratings, and feedback. International Journal of Selection and Assessment, 25(3), 253-266. https://doi.org/10.1111/ijsa.12177
Howitt, D., & Cramer, D. (2008). Introduction to statistics in psychology. Pearson Education.
Hughes, A. (2003). Testing for language teachers. Cambridge University Press.
İlhan, M. (2015). Standart ve SOLO taksonomisine dayalı rubrikler ile puanlanan açık uçlu matematik sorularında puanlayıcı etkilerinin çok yüzeyli Rasch modeli ile incelenmesi [The identifıcation of rater effects on open-ended math questions rated through standard rubrics and rubrics based on the SOLO taxonomy in reference to the many facet rasch model] [Doctoral dissertation, Gaziantep University]. https://tez.yok.gov.tr/UlusalTezMerkezi/
İlhan, M., & Çetin, B. (2014). Rater training as a means of decreasing interfering rater effects related to performance assessment. Journal of European Education, 4(2), 29-38. https://doi.org/10.18656/jee.77087
Johnson, R.L., Penny, J.A., & Gordon, B. (2008). Assessing performance: Designing, scoring, and validating performance tasks. Guilford Press.
Kane, J., Bernardin, H., Villanueva, J., & Peyrefitte, J. (1995). Stability of rater leniency: Three studies. Academy of Management Journal, 38, 1036-1051.
Khaatri, N., Kane, M.B., & Reeve, A.L. (1995). How performance assessments affect teaching and learning. Educational Leadership, 53(3), 80-83.
Kim, Y.K. (2009). Combining constructed response items and multiple choice items using a hierarchical rater model [Doctoral dissertation, Columbia University]. https://www.proquest.com/
Knoch, U., Fairbairn, J., Myford, C., & Huisman, A. (2018). Evaluating the relative effectiveness of online and face-to-face training for new writing raters. Papers in Language Testing and Assessment, 7(1), 61-86.
Knoch, U., Read, J., & von Randow, T. (2007). Re-training writing raters online: How does compare with face to face training?, Assessing Writing, 12(2), 26 43. https://doi.org/10.1016/j.asw.2007.04.001
Kondo, Y. (2010). Examination of rater training effect and rater eligibility in L2 performance assessment. Journal of Pan-Pacific Association of Applied Linguistics, 14(2), 1-23.
Kubiszyn, T., & Borich, G. (2013). Educational testing and measurement. John Wiley & Sons Incorporated.
Kutlu, Ö., Doğan, C.D., & Karaya, İ. (2014). Öğrenci başarısının belirlenmesi: Performansa ve portfolyoya dayalı durum belirleme [Determining student success: Determining the situation based on performance and portfolio]. Pegem Akademi
Landauer, T.K., Laham, D., & Foltz, P.W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87–112). Lawrence Erlbaum Associates, Inc.
Lawshe, C.H. (1975). A quantitative approach to content validity. Personnel psychology, 28(4), 563-575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
Lawshe, C.H. (1985). Inferences from personnel tests and their validity. Journal of Applied Psychology, 70(1), 237-238. https://doi.org/10.1037/0021-9010.70.1.237
Leckie, G., & Baird, J.A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48(4), 399-418. https://doi.org/10.1111/j.1745-3984.2011.00152.x
Linacre, J.M. (1993). Rasch-based generalizability theory. Rasch Measurement Transaction, 7(1), 283-284.
Linacre, J.M. (1994). Many-facet Rasch measurement. Mesa Press.
Linacre, J.M. (2017). A user’s guide to FACETS: Rasch-model computer programs. MESA Press.
Lumley, T., & McNamara, T.F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. https://doi.org/10.1177/026553229501200104
Lunz, M.E., Wright, B.D. & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331 345. https://doi.org/10.1207/s15324818ame0304_3
May, G.L. (2008). The effect of rater training on reducing social style bias in peer evaluation. Business Communication Quarterly, 71(3), 297 313. https://doi.org/10.1177/1080569908321431
McDonald, R.P. (1999). Test theory: A unified approach. Erlbaum.
McNamara, T.F. (1996). Measuring second language performance. Longman.
Moore, B.B. (2009). Consideration of rater effects and rater design via signal detection theory [Doctoral dissertation, Columbia University]. https://www.proquest.com/
Moser, K., Kemter, V., Wachsmann, K., Köver, N.Z., & Soucek, R. (2016). Evaluating rater training with double-pretest one-posttest designs: an analysis of testing effects and the moderating role of rater self-efficacy. The International Journal of Human Resource Management, 1-23. https://doi.org/10.1080/09585192.2016.1254102
Moskal, B.M. (2000). Scoring rubrics: What, when and how?
Murphy, K.R., & Balzer, W.K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74(4), 619-624. https://doi.org/10.1037/0021-9010.74.4.619
Myford, C.M., & Wolfe, E.W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale use. Journal of Educational Measurement, 46(4), 371-389. https://doi.org/10.1111/j.1745-3984.2009.00088.x
Myford, C.M., & Wolfe, E.W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
Oosterhof, A. (2003). Developing and using classroom assessments. Merrill-Prentice Hall Press.
Osburn, H.G. (2000). Coefficient alpha and related internal consistency reliability coefficients. Psychological Methods, 5(3), 343. http://dx.doi.org/10.1037/1082-989X.5.3.343
Pallant, J. (2007). SPSS survival manual, a step by step guide to data analysis using SPSS for Windows. McGraw-Hill.
Pulakos, E.D. (1984). A comparison of rater training programs: Error training and accuracy training. Journal of Applied Psychology, 69(4), 581-588. http://psycnet.apa.org/doi/10.1037/0021-9010.69.4.581
Roch, S.G., Woehr, D.J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta‐analytic review of frame‐of‐reference training. Journal of Occupational and Organizational Psychology, 85(2), 370-395. https://doi.org/10.1111/j.2044-8325.2011.02045.x
Romagnano, L. (2001). The myth of objectivity in mathematics assessment. Mathematics Teacher, 94(1), 31-37.
Royal, K.D., & Hecker, K.G. (2016). Rater errors in clinical performance assessments. Journal of Veterinary Medical Education, 43(1), 5-8. https://doi.org/10.3138/jvme.0715-112R
Sarıtaş-Akyol, S., & Karakaya, İ. (2021). Investigating the consistency between students’ and teachers’ ratings for the assessment of problem-solving skills with many-facet Rasch measurement model. Eurasian Journal of Educational Research, 91, 281-300. https://doi.org/10.14689/ejer.2021.91.13
Shale, D. (1996). Essay reliability: Form and meaning. In E. White, W. Lutz, & S. Kamusikiri (Eds.), Assessment of writing: Politics, policies, practices (pp. 76–96). MLAA.
Stamoulis, D.T. & Hauenstein, N.M.A. (1993). Rater training and rating accuracy: Training for dimensional accuracy versus training for ratee differentiation. Journal of Applied Psychology, 78(6), 994-1003. https://doi.org/10.1037/0021-9010.78.6.994
Sudweeks, R.R., Reeve, S. & Bradshaw, W.S. (2005). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9, 239-261. https://doi.org/10.1016/j.asw.2004.11.001
Sulsky, L.M., & Day, D.V. (1992). Frame-of-reference training and cognitive categorization: An empirical investigation of rater memory issues. Journal of Applied Psychology, 77(4), 501-510. https://doi.org/10.1037/0021-9010.77.4.501
Weigle, S.C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287. https://doi.org/10.1177/026553229801500205
Weigle, S.C. (2002). Assessing writing. Cambridge University Press. https://doi.org/10.1017/CBO9780511732997
Weitz, G., Vinzentius, C., Twesten, C., Lehnert, H., Bonnemeier, H., & König, I.R. (2014). Effects of a rater training on rating accuracy in a physical examination skills assessment. GMS Zeitschrift für Medizinische Ausbildung, 31(4), 1-17.
Wilson, F.R., Pan, W., & Schumsky, D.A. (2012). Recalculation of the critical values for Lawshe’s content validity ratio. Measurement and Evaluation in Counseling and Development, 45(3), 197-210. https://doi.org/10.1177/0748175612440286
Woehr, D.J., & Huffcutt, A.I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67(3), 189-205. https://doi.org/10.1111/j.2044-8325.1994.tb00562.x
Wu, S.M., & Tan, S. (2016). Managing rater effects through the use of FACETS analysis: the case of a university placement test. Higher Education Research & Development, 35(2), 380-394. https://doi.org/10.1080/07294360.2015.1087381
Zedeck, S., & Cascio, W.F. (1982). Performance appraisal decisions as a function of rater training and purpose of the appraisal. Journal of Applied Psychology, 67(6), 752-758. https://doi.org/10.1037/0021-9010.67.6.752

There are 94 citations in total.

Details

Primary Language	English
Subjects	Studies on Education
Journal Section	Articles
Authors	Mehmet Şata 0000-0003-2683-4997 İsmail Karakaya 0000-0003-4308-6919
Early Pub Date	April 28, 2022
Publication Date	June 26, 2022
Submission Date	February 8, 2021
Published in Issue	Year 2022 Volume: 9 Issue: 2

Cite

APA	Şata, M., & Karakaya, İ. (2022). Investigating the Impact of Rater Training on Rater Errors in the Process of Assessing Writing Skill. International Journal of Assessment Tools in Education, 9(2), 492-514. https://doi.org/10.21449/ijate.877035

Cited By

The Role of Time on Performance Assessment (Self, Peer and Teacher) in Higher Education: Rater Drift

Participatory Educational Research

https://doi.org/10.17275/per.23.77.10.5

Enhancing Kendall’s W using genetic algorithm: A computational approach to inter-rater reliability optimization

Expert Systems with Applications

https://doi.org/10.1016/j.eswa.2025.127320

Development and validation of a GPT-based rater for assessing communication skills using the Gap-Kalamazoo Communication Skills Assessment Form

Medical Teacher

https://doi.org/10.1080/0142159X.2025.2532783
