Research Article

Investigating the Impact of Rater Training on Rater Errors in the Process of Assessing Writing Skill

Year 2022, Volume: 9 Issue: 2, 492 - 514, 26.06.2022
https://doi.org/10.21449/ijate.877035

Abstract

When high-level cognitive skills are measured and assessed, the intrusion of rater errors into the measurements is a persistent concern that lowers objectivity. The main purpose of this study was to investigate the impact of rater training on rater errors in the process of assessing individual performance. The study was conducted with a pretest-posttest control group quasi-experimental design involving 45 raters: 23 in the control group and 22 in the experimental group. The data collection tools were a writing task developed by IELTS and an analytical rubric developed to assess academic writing skills. As the experimental procedure, rater training combining rater error training and frame-of-reference training was provided. The findings showed that the control and experimental groups were similar before the experiment; after the experimental process, however, the experimental group produced more valid and reliable measurements. It was therefore concluded that the rater training had an impact on rater errors such as rater severity, rater leniency, central tendency, and the halo effect. Based on these findings, suggestions were offered for researchers and future studies.
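
The abstract does not name the measurement model used to quantify these rater errors, but the reference list centres on many-facet Rasch measurement (e.g., Linacre, 1994; Eckes, 2015). As an illustrative sketch only, not a statement of the authors' exact analysis, a common three-facet rating-scale formulation that separates examinee ability, criterion difficulty, and rater severity is

\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k

where P_{nijk} is the probability that rater j awards examinee n category k on criterion i, \theta_n is examinee ability, \delta_i is criterion difficulty, \alpha_j is rater severity (leniency corresponds to a low severity estimate), and \tau_k is the threshold between categories k-1 and k. In such a model, central tendency and halo effects are typically diagnosed from rater fit statistics and unexpectedly uniform ratings across criteria rather than from the severity parameter itself.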

References

  • Abu Kassim, N.L. (2011). Judging behaviour and rater errors: an application of the many-facet Rasch model. GEMA Online Journal of Language Studies, 11(3), 179-197.
  • Abu Kassim, N.L. (2007). Exploring rater judging behaviour using the many-facet Rasch model. Paper presented at the Second Biennial International Conference on Teaching and Learning of English in Asia: Exploring New Frontiers (TELiA2), Universiti Utara Malaysia.
  • Aguinis, H., Mazurkiewicz, M.D., & Heggestad, E.D. (2009). Using web‐based frame‐of‐reference training to decrease biases in personality‐based job analysis: An experimental field study. Personnel Psychology, 62(2), 405-438. https://doi.org/10.1111/j.1744-6570.2009.01144.x
  • Athey, T.R., & McIntyre, R.M. (1987). Effect of rater training on rater accuracy: Levels–of–processing theory and social facilitation theory perspectives. Journal of Applied Psychology, 72, 567–572. https://doi.org/10.1037/0021-9010.72.4.567
  • Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a generic approach in automated essay scoring. Journal of Technology, Learning, and Assessment, 10(3), 1-16.
  • Baird, J.A., Hayes, M., Johnson, R., Johnson, S., & Lamprianou, I. (2013). Marker effects and examination reliability: A comparative exploration from the perspectives of generalisability theory, Rasch model and multilevel modelling. Oxford University Centre for Educational Assessment.
  • Bennet, J. (1998). Human resources management. Singapore: Prentice Hall.
  • Bernardin, H.J. (1978). Effects of rater training on leniency and halo errors in student ratings of instructors. Journal of Applied Psychology, 63(3), 301-308. https://doi.org/10.1037/0021-9010.63.3.301
  • Bernardin, H.J., & Buckley, M.R. (1981). Strategies in rater training. Academy of Management Review, 6(2), 205-212.
  • Bernardin, H.J., & Pence, E.C. (1980). Effects of rater training: New response sets and decreasing accuracy. Journal of Applied Psychology, 65, 60-66. https://doi.org/10.1037/0021-9010.65.1.60
  • Bijani, H. (2018). Investigating the validity of oral assessment rater training program: A mixed-methods study of raters’ perceptions and attitudes before and after training. Cogent Education, 5(1), 1-20. https://doi.org/10.1080/2331186X.2018.1460901
  • Bond, T., & Fox, C.M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. Routledge. https://doi.org/10.4324/9781315814698
  • Borman, W.C. (1975). Effects of instructions to avoid halo error on reliability and validity of performance evaluation ratings. Journal of Applied Psychology, 60(5), 556-560. https://doi.org/10.1037/0021-9010.60.5.556
  • Brennan, R.L., Gao, X., & Colton, D.A. (1995). Generalizability analyses of work key listening and writing tests. Educational and Psychological Measurement, 55(2), 157-176. https://doi.org/10.1177/0013164495055002001
  • Brijmohan, A. (2016). A many-facet RASCH measurement analysis to explore rater effects and rater training in medical school admissions [Doctoral dissertation]. https://hdl.handle.net/1807/74534
  • Brookhart, S.M. (2013). How to create and use rubrics for formative assessment and grading. ASCD.
  • Brown, H.D. (2004). Language assessment: Principles and classroom practices. Pearson Education.
  • Brown, H.D. (2007). Teaching by principles: An interactive approach to language pedagogy. Pearson Education.
  • Brown, J.D., & Hudson, T. (1998). The alternatives in language assessment. TESOL Quarterly, 32(4), 653-675. https://doi.org/10.2307/3587999
  • Burstein, J., Kukich, K., Wolff, S., Lu, C., Chodorow, M., Braden-Harder, L., & Harris, M.D. (1998). Automated scoring using a hybrid feature identification technique. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Quebec, Canada. https://doi.org/10.3115/980845.980879
  • Büyüköztürk, Ş. (2011). Deneysel desenler- öntest-sontest kontrol grubu desen ve veri analizi [Experimental designs-pretest-posttest control group design and data analysis]. Pegem Akademi.
  • Chen, W.H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289. https://doi.org/10.3102/10769986022003265
  • Congdon, P., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163-178. https://doi.org/10.1111/j.1745-3984.2000.tb01081.x
  • Cronbach, L.J. (1990). Essentials of psychological testing. Harper and Row.
  • Çokluk, Ö., Şekercioğlu, G., & Büyüköztürk, Ş. (2012). Sosyal bilimler için çok değişkenli istatistik: SPSS ve LISREL uygulamaları [Multivariate statistics for social sciences: SPSS and LISREL applications]. Pegem Akademi.
  • Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135. https://doi.org/10.1177/0265532215582282
  • Dunbar, N.E., Brooks, C.F., & Miller, T.K. (2006). Oral communication skills in higher education: Using a performance-based evaluation rubric to assess communication skills. Innovative Higher Education, 31(2), 115-128. https://doi.org/10.1007/s10755-006-9012-x
  • Ebel, R.L. (1965). Measuring educational achievement. Prentice- Hall Press.
  • Ebel, R.L., & Frisbie, D.A. (1991). Essentials of educational measurement. Prentice Hall Press.
  • Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155-185. https://doi.org/10.1177/0265532207086780
  • Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Peter Lang.
  • Ellis, R.O.D., Johnson, K.E., & Papajohn, D. (2002). Concept mapping for rater training. TESOL Quarterly, 36(2), 219-233. https://doi.org/10.2307/3588333
  • Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many‐faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112. https://doi.org/10.1111/j.1745-3984.1994.tb00436.x
  • Esfandiari, R. (2015). Rater errors among peer-assessors: applying the many-facet Rasch measurement model. Iranian Journal of Applied Linguistics, 18(2), 77-107. https://doi.org/10.18869/acadpub.ijal.18.2.77
  • Fahim, M., & Bijani, H. (2011). The effects of rater training on raters’ severity and bias in second language writing assessment. Iranian Journal of Language Testing, 1(1), 1-16.
  • Farrokhi, F., & Esfandiari, R. (2011). A many-facet Rasch model to detect halo effect in three types of raters. Theory & Practice in Language Studies, 1(11), 1531-1540. https://doi.org/10.4304/tpls.1.11.1531-1540
  • Farrokhi, F., Esfandiari, R., & Schaefer, E. (2012). A many-facet Rasch measurement of differential rater severity/leniency in three types of assessment. JALT Journal, 34(1), 79-101.
  • Farrokhi, F., Esfandiari, R., & Vaez Dalili, M. (2011). Applying the many-facet Rasch model to detect centrality in self-assessment, peer-assessment and teacher assessment. World Applied Sciences Journal, 15(11), 76-83.
  • Feldman, M., Lazzara, E.H., Vanderbilt, A.A., & DiazGranados, D. (2012). Rater training to support high‐stakes simulation‐based assessments. Journal of Continuing Education in the Health Professions, 32(4), 279-286. https://doi.org/10.1002/chp.21156
  • Goodrich, H. (1997). Understanding rubrics: The dictionary may define "rubric," but these models provide more clarity. Educational Leadership, 54(4), 14-17.
  • Gronlund, N.E. (1977). Constructing achievement tests. Prentice-Hall Press.
  • Haladyna, T.M. (1997). Writing test items to evaluate higher order thinking. Allyn & Bacon.
  • Harik, P., Clauser, B.E., Grabovsky, I., Nungester, R.J., Swanson, D., & Nandakumar, R. (2009). An examination of rater drift within a generalizability theory framework. Journal of Educational Measurement, 46(1), 43-58. https://doi.org/10.1111/j.1745-3984.2009.01068.x
  • Hauenstein, N.M., & McCusker, M.E. (2017). Rater training: Understanding effects of training content, practice ratings, and feedback. International Journal of Selection and Assessment, 25(3), 253-266. https://doi.org/10.1111/ijsa.12177
  • Howitt, D., & Cramer, D. (2008). Introduction to statistics in psychology. Pearson Education.
  • Hughes, A. (2003). Testing for language teachers. Cambridge University Press.
  • İlhan, M. (2015). Standart ve SOLO taksonomisine dayalı rubrikler ile puanlanan açık uçlu matematik sorularında puanlayıcı etkilerinin çok yüzeyli Rasch modeli ile incelenmesi [The identifıcation of rater effects on open-ended math questions rated through standard rubrics and rubrics based on the SOLO taxonomy in reference to the many facet rasch model] [Doctoral dissertation, Gaziantep University]. https://tez.yok.gov.tr/UlusalTezMerkezi/
  • İlhan, M., & Çetin, B. (2014). Rater training as a means of decreasing interfering rater effects related to performance assessment. Journal of European Education, 4(2), 29-38. https://doi.org/10.18656/jee.77087
  • Johnson, R.L., Penny, J.A., & Gordon, B. (2008). Assessing performance: Designing, scoring, and validating performance tasks. Guilford Press.
  • Kane, J., Bernardin, H., Villanueva, J., & Peyrefitte, J. (1995). Stability of rater leniency: Three studies. Academy of Management Journal, 38, 1036-1051.
  • Khattri, N., Kane, M.B., & Reeve, A.L. (1995). How performance assessments affect teaching and learning. Educational Leadership, 53(3), 80-83.
  • Kim, Y.K. (2009). Combining constructed response items and multiple choice items using a hierarchical rater model [Doctoral dissertation, Columbia University]. https://www.proquest.com/
  • Knoch, U., Fairbairn, J., Myford, C., & Huisman, A. (2018). Evaluating the relative effectiveness of online and face-to-face training for new writing raters. Papers in Language Testing and Assessment, 7(1), 61-86.
  • Knoch, U., Read, J., & von Randow, T. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(2), 26-43. https://doi.org/10.1016/j.asw.2007.04.001
  • Kondo, Y. (2010). Examination of rater training effect and rater eligibility in L2 performance assessment. Journal of Pan-Pacific Association of Applied Linguistics, 14(2), 1-23.
  • Kubiszyn, T., & Borich, G. (2013). Educational testing and measurement. John Wiley & Sons Incorporated.
  • Kutlu, Ö., Doğan, C.D., & Karaya, İ. (2014). Öğrenci başarısının belirlenmesi: Performansa ve portfolyoya dayalı durum belirleme [Determining student success: Determining the situation based on performance and portfolio]. Pegem Akademi
  • Landauer, T.K., Laham, D., & Foltz, P.W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87–112). Lawrence Erlbaum Associates, Inc.
  • Lawshe, C.H. (1975). A quantitative approach to content validity. Personnel psychology, 28(4), 563-575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
  • Lawshe, C.H. (1985). Inferences from personnel tests and their validity. Journal of Applied Psychology, 70(1), 237-238. https://doi.org/10.1037/0021-9010.70.1.237
  • Leckie, G., & Baird, J.A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48(4), 399-418. https://doi.org/10.1111/j.1745-3984.2011.00152.x
  • Linacre, J.M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284.
  • Linacre, J.M. (1994). Many-facet Rasch measurement. Mesa Press.
  • Linacre, J.M. (2017). A user’s guide to FACETS: Rasch-model computer programs. MESA Press.
  • Lumley, T., & McNamara, T.F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. https://doi.org/10.1177/026553229501200104
  • Lunz, M.E., Wright, B.D., & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345. https://doi.org/10.1207/s15324818ame0304_3
  • May, G.L. (2008). The effect of rater training on reducing social style bias in peer evaluation. Business Communication Quarterly, 71(3), 297-313. https://doi.org/10.1177/1080569908321431
  • McDonald, R.P. (1999). Test theory: A unified approach. Erlbaum.
  • McNamara, T.F. (1996). Measuring second language performance. Longman.
  • Moore, B.B. (2009). Consideration of rater effects and rater design via signal detection theory [Doctoral dissertation, Columbia University]. https://www.proquest.com/
  • Moser, K., Kemter, V., Wachsmann, K., Köver, N.Z., & Soucek, R. (2016). Evaluating rater training with double-pretest one-posttest designs: an analysis of testing effects and the moderating role of rater self-efficacy. The International Journal of Human Resource Management, 1-23. https://doi.org/10.1080/09585192.2016.1254102
  • Moskal, B.M. (2000). Scoring rubrics: What, when and how? Practical Assessment, Research, and Evaluation, 7(3).
  • Murphy, K.R. & Balzer, W.K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74, 619-624. https://doi.org/10.1037/0021-9010.74.4.619
  • Myford, C.M., & Wolfe, E.W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale use. Journal of Educational Measurement, 46(4), 371-389. https://doi.org/10.1111/j.1745-3984.2009.00088.x
  • Myford, C.M., & Wolfe, E.W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
  • Oosterhof, A. (2003). Developing and using classroom assessments. Merrill-Prentice Hall Press.
  • Osburn, H.G. (2000). Coefficient alpha and related internal consistency reliability coefficients. Psychological Methods, 5(3), 343-355. https://doi.org/10.1037/1082-989X.5.3.343
  • Pallant, J. (2007). SPSS survival manual: A step by step guide to data analysis using SPSS for Windows. McGraw-Hill.
  • Pulakos, E.D. (1984). A comparison of rater training programs: Error training and accuracy training. Journal of Applied Psychology, 69(4), 581-588. https://doi.org/10.1037/0021-9010.69.4.581
  • Roch, S.G., Woehr, D.J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta‐analytic review of frame‐of‐reference training. Journal of Occupational and Organizational Psychology, 85(2), 370-395. https://doi.org/10.1111/j.2044-8325.2011.02045.x
  • Romagnano, L. (2001). The myth of objectivity in mathematics assessment. Mathematics Teacher, 94(1), 31-37.
  • Royal, K.D., & Hecker, K.G. (2016). Rater errors in clinical performance assessments. Journal of Veterinary Medical Education, 43(1), 5-8. https://doi.org/10.3138/jvme.0715-112R
  • Sarıtaş-Akyol, S., & Karakaya, İ. (2021). Investigating the consistency between students’ and teachers’ ratings for the assessment of problem-solving skills with many-facet Rasch measurement model. Eurasian Journal of Educational Research, 91, 281-300. https://doi.org/10.14689/ejer.2021.91.13
  • Shale, D. (1996). Essay reliability: Form and meaning. In E. White, W. Lutz, & S. Kamusikiri (Eds.), Assessment of writing: Politics, policies, practices (pp. 76-96). MLA.
  • Stamoulis, D.T. & Hauenstein, N.M.A. (1993). Rater training and rating accuracy: Training for dimensional accuracy versus training for ratee differentiation. Journal of Applied Psychology, 78(6), 994-1003. https://doi.org/10.1037/0021-9010.78.6.994
  • Sudweeks, R.R., Reeve, S. & Bradshaw, W.S. (2005). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing, 9, 239-261. https://doi.org/10.1016/j.asw.2004.11.001
  • Sulsky, L.M., & Day, D.V. (1992). Frame-of-reference training and cognitive categorization: An empirical investigation of rater memory issues. Journal of Applied Psychology, 77(4), 501-510. https://doi.org/10.1037/0021-9010.77.4.501
  • Weigle, S.C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287. https://doi.org/10.1177/026553229801500205
  • Weigle, S.C. (2002). Assessing writing. Cambridge University Press. https://doi.org/10.1017/CBO9780511732997
  • Weitz, G., Vinzentius, C., Twesten, C., Lehnert, H., Bonnemeier, H., & König, I.R. (2014). Effects of a rater training on rating accuracy in a physical examination skills assessment. GMS Zeitschrift für Medizinische Ausbildung, 31(4), 1-17.
  • Wilson, F.R., Pan, W., & Schumsky, D.A. (2012). Recalculation of the critical values for Lawshe’s content validity ratio. Measurement and Evaluation in Counseling and Development, 45(3), 197-210. https://doi.org/10.1177/0748175612440286
  • Woehr, D.J., & Huffcutt, A.I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67(3), 189-205. https://doi.org/10.1111/j.2044-8325.1994.tb00562.x
  • Wu, S.M., & Tan, S. (2016). Managing rater effects through the use of FACETS analysis: the case of a university placement test. Higher Education Research & Development, 35(2), 380-394. https://doi.org/10.1080/07294360.2015.1087381
  • Zedeck, S., & Cascio, W.F. (1982). Performance appraisal decisions as a function of rater training and purpose of the appraisal. Journal of Applied Psychology, 67(6), 752-758. https://doi.org/10.1037/0021-9010.67.6.752


Details

Primary Language English
Subjects Studies on Education
Journal Section Articles
Authors

Mehmet Şata (ORCID: 0000-0003-2683-4997)

İsmail Karakaya (ORCID: 0000-0003-4308-6919)

Early Pub Date April 28, 2022
Publication Date June 26, 2022
Submission Date February 8, 2021
Published in Issue Year 2022 Volume: 9 Issue: 2

Cite

APA Şata, M., & Karakaya, İ. (2022). Investigating the Impact of Rater Training on Rater Errors in the Process of Assessing Writing Skill. International Journal of Assessment Tools in Education, 9(2), 492-514. https://doi.org/10.21449/ijate.877035
