Validation of Writing Scales for Turkish as a Second Language through Many-Facet Rasch Measurement

Fatma Küçük Üçpınar; Aylin Ünaldı

Research Article

Year 2017, Volume: 34, Issue: 1, pp. 23-48, 17.12.2018

Abstract

Rating scales and the extent to which raters use them effectively are two important factors that influence the scoring validity of language tests that use open-ended writing tasks. Research on rating scale development and validation in the assessment of English is ample; however, to date there has been no research on scale validation in the assessment of Turkish as a Second Language (TSL). This study reports on the development of two analytic rating scales used to assess the academic writing skills of TSL test takers and presents quantitative evidence for the validation of these scales. For this purpose, texts written by 39 TSL students were scored by three raters. The analyses were conducted using many-facet Rasch measurement. The results indicate that the raters used the empirically developed analytic rating scales consistently and appropriately, which provides evidence for the reliability and effectiveness of the scales.
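
For readers unfamiliar with the method, a common rating-scale formulation of the many-facet Rasch model for a design with test takers, raters, and scale criteria is sketched below. This is the generic form found in the measurement literature and may differ from the exact specification used in this study; the facet labels and all symbols are illustrative assumptions.

```latex
% Generic three-facet rating-scale model (assumed notation, not taken from the article)
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \beta_i - \alpha_j - \tau_k
% P_{nijk}     : probability that test taker n receives category k from rater j on criterion i
% P_{nij(k-1)} : probability of receiving the adjacent lower category k-1
% \theta_n     : ability of test taker n
% \beta_i      : difficulty of criterion i
% \alpha_j     : severity of rater j
% \tau_k       : difficulty of category k relative to category k-1
```

Under this formulation all facets are placed on a single logit scale, which is what allows rater severity, criterion difficulty, and category functioning to be examined separately from test-taker ability.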

Keywords

Scale development, writing assessment of Turkish as a second language, many-facet Rasch measurement

Details

Primary Language	English
Journal Section	Original Articles
Authors	Fatma Küçük Üçpınar; Aylin Ünaldı
Publication Date	December 17, 2018
Published in Issue	Year 2017 Volume: 34 Issue: 1

Cite

APA	Küçük Üçpınar, F., & Ünaldı, A. (2018). Validation of Writing Scales for Turkish as a Second Language through Many-Facet Rasch Measurement. Bogazici University Journal of Education, 34(1), 23-48.

Unless otherwise stated, all content on this site is licensed under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0).