Research Article

Rating Performance among Raters of Different Experience Through Multi-Facet Rasch Measurement (MFRM) Model

Year 2020, Volume: 11 Issue: 2, 147 - 162, 13.06.2020
https://doi.org/10.21031/epod.662964

Abstract

A rater’s experience can contribute substantially to variation in rating performance in educational scoring, and heterogeneous ratings can negatively affect examinees’ results. This study examines raters’ rating performance, as indicated by rater severity, in assessing oral tests among lower secondary school students, using the Multi-facet Rasch Measurement (MFRM) model. The respondents were thirty English language teachers clustered into two groups according to their rating experience in high-stakes assessment. The respondents listened to ten examinees’ recorded answers to three oral test items and provided their ratings. The instruments comprised the test items, the examinees’ answers, a scoring rubric, and a scoring sheet used to appraise examinees’ competence in three domains: vocabulary, grammar, and communicative competence. MFRM analysis showed that the raters varied in their severity (χ² = 2.661), with severity measures ranging from 2.13 to −1.45 logits. An independent t-test indicated a significant difference between the ratings of the inexperienced and the experienced raters, t(28) = −0.96, p < .01. The findings suggest that assessment developers must ensure raters are well versed, whether through assessment practice or rater training, before they rate examinees in operational settings. Further research is needed to account for the varying effects of rating experience in other assessment contexts and for the effects of interactions between facets on estimates of examinees’ measures. The present study provides additional evidence on the role of rating experience in enabling raters to provide accurate ratings.
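
For readers unfamiliar with the model named in the abstract, it can be stated compactly. The following is a minimal sketch of the standard rating-scale formulation of the many-facet Rasch model, following Linacre (2014b) and Eckes (2015) as cited below, written here with three facets (examinees, raters, items) to match the design described above; the full article may parameterize the facets differently:

\log\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = \theta_n - \alpha_j - \beta_i - \tau_k

where P_{njik} is the probability that examinee n receives score category k (rather than k−1) from rater j on item i, θ_n is the examinee’s ability, α_j the rater’s severity, β_i the item’s difficulty, and τ_k the difficulty of scale step k, all expressed in logits. Under this parameterization, the severity range reported above (2.13 to −1.45 logits) is the spread of the estimated α_j values, with more positive values indicating harsher raters.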

References

  • Ahmadi Shirazi, M. (2019). For a greater good: Bias analysis in writing assessment. SAGE Open, 9(1), 1–14. https://doi.org/10.1177/2158244018822377
  • Albano, A. D., & Rodriguez, M. C. (2018). Item development research and practice. In Handbook of accessible instruction and testing practices: Issues, innovations, and applications (pp. 181–198). Springer. https://doi.org/10.1007/978-3-319-71126-3_12
  • Alp, P., Epner, A., & Pajupuu, H. (2018). The influence of rater empathy, age and experience on writing performance assessment. Linguistics Beyond and Within, 3, 7–19. Retrieved from https://www.ceeol.com/search/article-detail?id=716601
  • Anthony, L., & Miriam, S. (2019). Drill in English Skills Practice: CEFR-Aligned Curriculum. Selangor: Oxford Fajar.
  • Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115. https://doi.org/10.1177/0265532215582283
  • Bond, T. G., & Fox, C. M. (2015). Applying the Rasch Model: Fundamental Measurement in the Human Sciences. New Jersey: Lawrence Erlbaum Associates.
  • Creswell, J. W., & Creswell, J. D. (2018). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches (5th ed.). California: Sage Publications.
  • Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117–135. https://doi.org/10.1177/0265532215582282
  • Eckes, T. (2015). Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2nd ed.). Peter Lang.
  • Eckes, T. (2019). Implications for rater-mediated language assessment. In Aryadoust, V., & Raquel, M. (Eds.), Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques (pp. 153-175). London & New York: Routledge.
  • Engelhard, G., & Wind, S. A. (2018). Invariant Measurement with Raters and Rating Scales: Rasch Models for Rater-Mediated Assessments. New York & London: Routledge.
  • Fisher, W. P., Jr. (2007). Rating scale instrument quality criteria. Rasch Measurement Transactions, 21(1), 1095.
  • Govindasamy, P., Salazar, M. D. C., Lerner, J., & Green, K. E. (2019). Assessing the reliability of the framework for equitable and effective teaching with the many-facet Rasch model. Frontiers in Psychology, 10, 1–10. https://doi.org/10.3389/fpsyg.2019.01363
  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and Validating Test Items. New York: Routledge.
  • Huang, L., Kubelec, S., Keng, N., & Hsu, L. (2018). Evaluating CEFR rater performance through the analysis of spoken learner corpora. Language Testing in Asia, 8(1), 1–17. https://doi.org/10.1186/s40468-018-0069-0
  • Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135–159. https://doi.org/10.1080/15434303.2013.769545
  • Jeong, H. (2017). Narrative and expository genre effects on students, raters, and performance criteria. Assessing Writing, 31, 113–125. https://doi.org/10.1016/j.asw.2016.08.006
  • Jones, E., & Wind, S. A. (2018). Using repeated ratings to improve measurement precision in incomplete rating designs. Journal of Applied Measurement, 19(2), 148–161. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/29894984
  • Kang, O., Rubin, D., & Kermad, A. (2019). The effect of training and rater differences on oral proficiency assessment. Language Testing, 36(4), 481–504. https://doi.org/10.1177/0265532219849522
  • Kementerian Pendidikan Malaysia. (2019a). Quick Facts 2018: Malaysia Education Statistics. Retrieved from https://www.moe.gov.my/en/muat-turun/laporan-dan-statistik/quick-facts-malaysia-education-statistics/563-quick-facts-2018-malaysia-educational-statistics/file
  • Kementerian Pendidikan Malaysia. (2019b). Pengumuman Analisis Keputusan Sijil Pelajaran Malaysia (SPM) 2018. Retrieved from http://lp.moe.gov.my/images/bahan/spm/2019/14032019/Laporan%20Analisis%20Keputusan%20SPM%202018%20-%20Upload.pdf
  • Kim, H. J. (2015). A qualitative analysis of rater behavior on an L2 speaking assessment. Language Assessment Quarterly, 12(3), 239–261. https://doi.org/10.1080/15434303.2015.1049353
  • Koizumi, R., Okabe, Y., & Kashimada, Y. (2017). A multi-faceted Rasch analysis of rater reliability of the Speaking section of the GTEC CBT. ARELE: Annual Review of English Language Education in Japan, 28, 241–256. Retrieved from https://www.jstage.jst.go.jp/article/arele/28/0/28_241/_article/-char/ja/
  • Lembaga Peperiksaan. (2019). Instruction to Speaking Examiners (Pentaksiran Tingkatan 3). Retrieved from http://lp.moe.gov.my/images/bahan/pt3/2019/21082019/S1%20MES%20PT3%20Instructions%20to%20Speaking%20%20Examiners_Revised%20version.pdf
  • Linacre, J. M. (2002). Understanding Rasch measurement: Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85–106. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.424.2811&rep=rep1&type=pdf
  • Linacre, J. M. (2005). Standard errors: means, measures, origins and anchor values. Rasch Measurement Transactions, 19(3), 1030.
  • Linacre, J. M. (2014a). Facets Rasch measurement computer program (Version 3.71.4) [Computer software]. Chicago: Winsteps.com.
  • Linacre, J. M. (2014b). A user’s guide to FACETS: Rasch-model computer programs. Chicago: Winsteps.com. Retrieved from http://www.winsteps.com/facets.htm
  • Myers, J. L., Well, A. D., & Lorch, R. F. (2010). Research design and statistical analysis (3rd ed.). New York, NY: Routledge.
  • Myford, C., & Wolfe, E. W. (2003). Detecting and measuring rater effects using Many-Facet Rasch Measurement: Part I. Journal of Applied Measurement, 4, 386–422. Retrieved from https://www.researchgate.net/profile/Carol_Myford/publication/9069043_Detecting_and_Measuring_Rater_Effects_Using_Many-Facet_Rasch_Measurement_Part_I/links/54cba70e0cf298d6565848ee.pdf
  • Şahan, Ö., & Razı, S. (2020). Do experience and text quality matter for raters’ decision-making behaviors? Language Testing, 1-22. https://doi.org/10.1177/0265532219900228
  • Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85(6), 956–970. https://doi.org/10.1037/0021-9010.85.6.956
  • Weilie, L. (2018). To what extent do non-teacher raters differ from teacher raters on assessing story-retelling. Journal of Language Testing & Assessment, 1, 1–13. Retrieved from http://clausiuspress.com/assets/default/article/2018/08/29/article_1535590233.pdf
  • Wesolowski, B. C., & Wind, S. A. (2019). Pedagogical considerations for examining rater variability in rater‐mediated assessments: A three‐model framework. Journal of Educational Measurement, 56(3), 521–546. https://doi.org/10.1111/jedm.12224
  • Wind, S. A., & Sebok-Syer, S. S. (2019). Examining differential rater functioning using a between-subgroup outfit approach. Journal of Educational Measurement, 56(2), 217–250. https://doi.org/10.1111/jedm.12198
  • Wind, S. A. (2018). Examining the impacts of rater effects in performance assessments. Applied Psychological Measurement, 43(2), 159–171. https://doi.org/10.1177/0146621618789391
  • Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 1–26. https://doi.org/10.1177/0013164419834613
  • Wu, M. (2017). Some IRT-based analyses for interpreting rater effects. Psychological Test and Assessment Modeling, 59(4), 453–470. Retrieved from https://www.psychologie-aktuell.com/fileadmin/download/ptam/4-2017_20171218/04_Wu.pdf
  • Wu, M., & Adams, R. (2007). Applying the Rasch model to psycho-social measurement: A practical approach. Melbourne: Educational Measurement Solutions.
  • Wu, S. M., & Tan, S. (2016). Managing rater effects through the use of FACETS analysis: The case of a university placement test. Higher Education Research and Development, 35(2), 380–394. https://doi.org/10.1080/07294360.2015.1087381
There are 40 citations in total.

Details

Primary Language English
Journal Section Articles
Authors

Muhamad Firdaus Mohd Noh 0000-0002-5429-6789

Mohd Effendi Ewan Mohd Matore 0000-0002-6369-8501

Publication Date June 13, 2020
Acceptance Date May 9, 2020
Published in Issue Year 2020 Volume: 11 Issue: 2

Cite

APA Mohd Noh, M. F., & Mohd Matore, M. E. E. (2020). Rating Performance among Raters of Different Experience Through Multi-Facet Rasch Measurement (MFRM) Model. Journal of Measurement and Evaluation in Education and Psychology, 11(2), 147-162. https://doi.org/10.21031/epod.662964