Rating Performance among Raters of Different Experience Through Multi-Facet Rasch Measurement (MFRM) Model

Muhamad Firdaus Mohd Noh; Mohd Effendi Ewan Mohd Matore

doi:10.21031/epod.662964

Research Article

Year 2020, Volume: 11 Issue: 2, 147 - 162, 13.06.2020


Abstract

A rater’s experience can contribute substantially to variation in rating performance in educational scoring, and heterogeneous ratings can negatively affect examinees’ results. This study examines raters’ rating performance, indicated by rater severity, in assessing oral tests among lower secondary school students using the Multi-facet Rasch Measurement (MFRM) model. The respondents were thirty English Language teachers clustered into two groups based on their rating experience in high-stakes assessment. They listened to ten examinees’ recorded answers to three oral test items and provided their ratings. The instruments comprised the items, the examinees’ answers, a scoring rubric, and a scoring sheet used to appraise examinees’ competence in three domains: vocabulary, grammar, and communicative competence. The MFRM analysis showed that raters differed in their severity levels, with chi-square χ² = 2.661. Raters’ severity measures ranged from 2.13 to -1.45 logits. An independent t-test indicated a significant difference between the ratings provided by the inexperienced and the experienced raters, t-value = -0.96, df = 28, p<0.01. The findings suggest that assessment developers must ensure that raters are well versed, through assessment practice or rater training, before they rate examinees in operational settings. Further research is needed to account for the varying effects of rating experience in other assessment contexts and for the effects of interactions between facets on estimates of examinees’ measures. The present study provides additional evidence on the role of rating experience in helping raters provide accurate ratings.

Keywords

rating performance, rater-mediated assessment, Multi-faceted Rasch Measurement model, oral test, rating experience
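
The group comparison described in the abstract can be illustrated with a minimal sketch of an independent-samples t-test using a pooled variance, where the degrees of freedom are n1 + n2 - 2. The function name `pooled_t_test`, the equal split of 15 raters per group, and the severity values below are assumptions made purely for illustration; they are not the study's data, and the abstract does not state the group sizes or the software used for this test.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_test(group_a, group_b):
    """Return (t, df) for an independent-samples t-test with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / std_err
    return t, df


# Hypothetical rater severity measures in logits (NOT the study's data).
# An equal split of 15 raters per group is assumed for illustration only.
inexperienced = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.3,
                 0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 1.9]
experienced = [-1.4, -1.0, -0.7, -0.4, -0.2, 0.0, 0.1, 0.3,
               0.4, 0.6, 0.9, 1.1, 1.4, 1.7, 2.1]

t_stat, df = pooled_t_test(inexperienced, experienced)
print(f"t = {t_stat:.2f}, df = {df}")
```

With thirty raters in two groups, any split gives df = 28, which matches the degrees of freedom reported in the abstract. The p-value would then be read from a t distribution with that many degrees of freedom.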

Details

Primary Language	English
Journal Section	Articles
Authors	Muhamad Firdaus Mohd Noh (ORCID 0000-0002-5429-6789); Mohd Effendi Ewan Mohd Matore (ORCID 0000-0002-6369-8501)
Publication Date	June 13, 2020
Acceptance Date	May 9, 2020
Published in Issue	Year 2020 Volume: 11 Issue: 2

Cite

APA	Mohd Noh, M. F., & Mohd Matore, M. E. E. (2020). Rating Performance among Raters of Different Experience Through Multi-Facet Rasch Measurement (MFRM) Model. Journal of Measurement and Evaluation in Education and Psychology, 11(2), 147-162. https://doi.org/10.21031/epod.662964
