Research Article

Investigating the quality of a high-stakes EFL writing assessment procedure in the Turkish higher education context

Year 2024, Volume: 11 Issue: 4, 660 - 674
https://doi.org/10.21449/ijate.1384824

Abstract

Employing generalizability (G-) theory and rater interviews, this study investigated how a high-stakes writing assessment procedure (i.e., a single-task, single-rater, holistic scoring procedure) affected the variability and reliability of its scores in the Turkish higher education context. Thirty-two essays written on two different writing tasks (i.e., narrative and opinion) by 16 EFL students at a Turkish state university were scored both holistically and analytically by 10 instructor raters. After the raters completed the scoring, semi-structured individual interviews were conducted with them to gain insight into their views on the quality of the current scoring procedure. The G-theory results showed that the reliability coefficients obtained from the current scoring procedure were not high enough to support sound conclusions. The quantitative results were partly supported by the qualitative data. Implications for improving the quality of the current high-stakes EFL writing assessment policy are discussed.
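For readers who want to see the mechanics behind such reliability estimates, the sketch below illustrates the kind of G-theory computation the abstract describes: variance components for a fully crossed persons × tasks × raters design estimated from ANOVA expected mean squares (as in Brennan, 2001, and Shavelson & Webb, 1991, cited below), followed by D-study projections of the relative (G) and absolute (phi) coefficients for different numbers of tasks and raters. This is a minimal illustration only; the simulated scores, function names, and design sizes (16 students, 2 tasks, 10 raters, mirroring the study's design) are assumptions, not the study's actual data or analysis code.

```python
# Minimal G-theory sketch for a fully crossed p x t x r design.
# Simulated placeholder scores only -- NOT the study's data.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_t, n_r = 16, 2, 10                      # students, tasks, raters (as in the study)
scores = rng.normal(70, 10, (n_p, n_t, n_r))   # placeholder holistic scores

def variance_components(x):
    """Estimate variance components from ANOVA expected mean squares."""
    n_p, n_t, n_r = x.shape
    m = x.mean()
    mp, mt, mr = x.mean((1, 2)), x.mean((0, 2)), x.mean((0, 1))
    mpt, mpr, mtr = x.mean(2), x.mean(1), x.mean(0)
    # Mean squares for each main effect and interaction.
    ms = {
        "p": n_t * n_r * ((mp - m) ** 2).sum() / (n_p - 1),
        "t": n_p * n_r * ((mt - m) ** 2).sum() / (n_t - 1),
        "r": n_p * n_t * ((mr - m) ** 2).sum() / (n_r - 1),
        "pt": n_r * ((mpt - mp[:, None] - mt[None, :] + m) ** 2).sum()
              / ((n_p - 1) * (n_t - 1)),
        "pr": n_t * ((mpr - mp[:, None] - mr[None, :] + m) ** 2).sum()
              / ((n_p - 1) * (n_r - 1)),
        "tr": n_p * ((mtr - mt[:, None] - mr[None, :] + m) ** 2).sum()
              / ((n_t - 1) * (n_r - 1)),
        "ptr": ((x - mpt[:, :, None] - mpr[:, None, :] - mtr[None, :, :]
                 + mp[:, None, None] + mt[None, :, None] + mr[None, None, :]
                 - m) ** 2).sum()
               / ((n_p - 1) * (n_t - 1) * (n_r - 1)),
    }
    # Solve the expected-mean-square equations (negative estimates set to 0).
    v = {"ptr": ms["ptr"]}
    v["pt"] = max((ms["pt"] - ms["ptr"]) / n_r, 0)
    v["pr"] = max((ms["pr"] - ms["ptr"]) / n_t, 0)
    v["tr"] = max((ms["tr"] - ms["ptr"]) / n_p, 0)
    v["p"] = max((ms["p"] - ms["pt"] - ms["pr"] + ms["ptr"]) / (n_t * n_r), 0)
    v["t"] = max((ms["t"] - ms["pt"] - ms["tr"] + ms["ptr"]) / (n_p * n_r), 0)
    v["r"] = max((ms["r"] - ms["pr"] - ms["tr"] + ms["ptr"]) / (n_p * n_t), 0)
    return v

def d_study(v, nt, nr):
    """Relative (E rho^2) and absolute (phi) coefficients for nt tasks, nr raters."""
    rel_err = v["pt"] / nt + v["pr"] / nr + v["ptr"] / (nt * nr)
    abs_err = rel_err + v["t"] / nt + v["r"] / nr + v["tr"] / (nt * nr)
    return v["p"] / (v["p"] + rel_err), v["p"] / (v["p"] + abs_err)

v = variance_components(scores)
for nt, nr in [(1, 1), (1, 2), (2, 2)]:        # single-task/single-rater vs. richer designs
    g, phi = d_study(v, nt, nr)
    print(f"tasks={nt} raters={nr}: G={g:.2f}, phi={phi:.2f}")
```

Under a single-task, single-rater design (nt = nr = 1), all task- and rater-related variance loads onto the error term, which is why such procedures typically yield low G and phi coefficients.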

Supporting Institution

This study was not supported by any institution.

References

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  • Attali, Y. (2020). Effect of immediate elaborated feedback on rater accuracy. ETS Research Report Series, 2020(1), 1-15.
  • Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford University Press.
  • Barkaoui, K. (2007). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12(2), 86-107. https://doi.org/10.1016/j.asw.2007.07.001
  • Barkaoui, K. (2008). Effects of scoring method and rater experience on ESL essay rating processes and outcomes [Unpublished doctoral dissertation, University of Toronto, Canada].
  • Barkaoui, K. (2010). Do ESL essay raters’ evaluation criteria change with experience? A mixed-methods, cross-sectional study. TESOL Quarterly, 44(1), 31-57.
  • Brennan, R.L. (2001). Generalizability theory: Statistics for social science and public policy. Springer-Verlag.
  • Briesch, A.M., Swaminathan, H., Welsh, M., & Chafouleas, S.M. (2014). Generalizability theory: A practical guide to study design, implementation, and interpretation. Journal of School Psychology, 52(1), 13-35. http://dx.doi.org/10.1016/j.jsp.2013.11.008
  • Cheong, S.H. (2012). Native-and nonnative-English-speaking raters’ assessment behavior in the evaluation of NEAT essay writing samples. 영어교육연구, 24(2), 49-73.
  • Creswell, J.W. (2012). Educational research: Planning, conducting, and evaluating quantitative and qualitative research (4th ed.). Pearson Education.
  • Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. Wiley.
  • Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86(1), 67-96.
  • Elorbany, R., & Huang, J. (2012). Examining the impact of rater educational background on ESL writing assessment: A generalizability theory approach. Language and Communication Quarterly, 1(1), 2-24.
  • Fahim, M., & Bijani, H. (2011). The effects of rater training on raters’ severity and bias in second language writing assessment. International Journal of Language Testing, 1(1), 1-16.
  • Gebril, A. (2009). Score generalizability of academic writing tasks: Does one test method fit it all? Language Testing, 26(4), 507-531.
  • Güler, N., Uyanık, G.K., & Teker, G.T. (2012). Genellenebilirlik kuramı [Generalizability theory]. Pegem Akademi Yayınları.
  • Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In B. Kroll (Ed.), Second language writing (pp. 69-87). United Kingdom: Cambridge University Press. https://doi.org/10.1017/CBO9781139524551.009
  • Hamp-Lyons, L., & Mathias, S.P. (1994). Examining expert judgments of task difficulty on essay tests. Journal of Second Language Writing, 3(1), 49-68. https://doi.org/10.1016/1060-3743(94)90005-1
  • Han, T., & Huang, J. (2017). Examining the impact of scoring methods on the institutional EFL writing assessment: A Turkish perspective. PASAA: Journal of Language Teaching and Learning in Thailand, 53, 112-147.
  • Huang, J. (2008). How accurate are ESL students’ holistic writing scores on large-scale assessments? - A generalizability theory approach. Assessing Writing, 13(3), 201-218. http://dx.doi.org/10.1016/j.asw.2008.10.002
  • Huang, J. (2011). Generalizability theory as evidence of concerns about fairness in large-scale ESL writing assessments. TESOL Journal, 2(4), 423-443. https://doi.org/10.5054/tj.2011.269751
  • Huang, J., Han, T., Tavano, H., & Hairston, L. (2014). Using generalizability theory to examine the impact of essay quality on rating variability and reliability of ESOL writing. In J. Huang & T. Han (Eds.), Empirical quantitative research in social sciences: Examining significant differences and relationships (pp. 127-149). Untested Ideas Research Center.
  • Huot, B.A. (1990). Reliability, validity and holistic scoring: What we know and what we need to know. College Composition and Communication, 41, 201-213. https://www.jstor.org/stable/358160
  • Huot, B. (2002). (Re)Articulating writing assessment: Writing assessment for teaching and learning. Logan, Utah: Utah State University Press.
  • Jacobs, H.L., Zingraf, S.A., Wormuth, D.R., Hartfiel, V.F., & Hughey, J.B. (1981). Testing ESL composition: A practical approach. Massachusetts: Newbury House.
  • Johnson, R.L., Penny, J.A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. The Guilford Press.
  • Kane, M. (2010). Validity and fairness. Language Testing, 27(2), 177-182.
  • Kenyon, D. (1992, February). Introductory remarks at symposium on development and use of rating scales in language testing. Paper presented at the 14th Language Testing Research Colloquium, Vancouver, British Columbia.
  • Kieffer, K.M. (1998). Why generalizability theory is essential and classical test theory is often inadequate. Paper presented at the Annual Meeting of the Southwestern Psychological Association, New Orleans, LA.
  • Kim, A.Y., & Gennaro, D.K. (2012). Scoring behavior of native vs. non-native speaker raters of writing exams. Language Research, 48(2), 319-342.
  • Lee, Y.-W., Kantor, R., & Mollaun, P. (2002). Score dependability of the writing and speaking sections of new TOEFL. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA. Abstract retrieved December 11, 2012, from ERIC (ED464962).
  • Liu, Y., & Huang, J. (2020). The quality assurance of a national English writing assessment: Policy implications for quality improvement. Studies in Educational Evaluation, 67, 100941.
  • McNamara, T.F. (1996). Measuring second language performance. Addison Wesley Longman.
  • Popham, W.J. (1981). Modern educational measurement. Englewood Cliffs, NJ: Prentice-Hall.
  • Rinnert, C., & Kobayashi, H. (2001). Differing perceptions of EFL writing among readers in Japan. The Modern Language Journal, 85, 189-209.
  • Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Sage.
  • Shi, L. (2001). Native- and nonnative-speaking EFL teachers’ evaluation of Chinese students’ English writing. Language Testing, 18(3), 303-325. https://doi.org/10.1177/026553220101800303
  • Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking and ESL students? Journal of Second Language Writing, 5, 163-182.
  • Şahan, Ö., & Razı, S. (2020). Do experience and text quality matter for raters’ decision-making behaviors? Language Testing, 37(3), 311-332.
  • Weigle, S.C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197-223. http://dx.doi.org/10.1177/026553229401100206
  • Weigle, S.C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6(2), 145-178.
  • Weigle, S.C. (2002). Assessing writing. United Kingdom: Cambridge University Press.
  • Weigle, S.C., Boldt, H., & Valsecchi, M. I. (2003). Effects of task and rater background on the evaluation of ESL writing: A pilot study. TESOL Quarterly, 37(2), 345-354.
  • Zhao, C., & Huang, J. (2020). The impact of the scoring system of a large-scale standardized EFL writing assessment on its score variability and reliability: Implications for assessment policy makers. Studies in Educational Evaluation, 67, 100911.


Details

Primary Language English
Subjects Measurement and Evaluation in Education (Other)
Journal Section Articles
Authors

Elif Sarı 0000-0002-3597-7212

Early Pub Date October 21, 2024
Publication Date
Submission Date November 1, 2023
Acceptance Date August 26, 2024
Published in Issue Year 2024 Volume: 11 Issue: 4

Cite

APA Sarı, E. (2024). Investigating the quality of a high-stakes EFL writing assessment procedure in the Turkish higher education context. International Journal of Assessment Tools in Education, 11(4), 660-674. https://doi.org/10.21449/ijate.1384824
