Research Article

Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability

Volume: 8 Number: 2 December 31, 2025

Abstract

This study investigates the reliability of large language models (LLMs) in assessing English as a Foreign Language (EFL) writing compared to human raters. Specifically, the performances of ChatGPT 4.0 and DeepSeek R1 were examined across three genres (argumentative, opinion, and persuasive essays) under rubric-free and rubric-based scoring conditions. Participants were 65 undergraduate ELT students at a Turkish university who produced a total of 162 essays. Two experienced human raters scored all essays, and their evaluations demonstrated near-perfect inter-rater reliability, providing a stable benchmark for comparison. The same essays were then rated by ChatGPT and DeepSeek under both scoring conditions. Statistical analyses included intraclass correlation coefficients (ICC), Pearson correlations, paired-samples t-tests, and ANOVAs. Findings revealed that rubric integration substantially improved alignment between AI and human scores, particularly for ChatGPT, which showed stronger sensitivity to rubric criteria than DeepSeek. Genre effects were also evident: opinion essays yielded the highest AI-human agreement, persuasive essays moderate agreement, and argumentative essays the weakest. Both AI tools produced more centralized scores with less variability than human raters, and they exhibited risk-averse tendencies, especially without rubric guidance. The results indicate that AI-based scoring can complement, but not replace, human evaluation, especially in cognitively demanding genres. The study highlights the importance of rubric clarity, prompt design, and genre awareness in maximizing the educational value of AI-assisted writing assessment.
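The two agreement statistics named in the abstract, Pearson correlation and the ICC, can be illustrated with a minimal sketch. The scores below are invented for illustration and are not data from the study. The abstract does not say which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here. The function names `pearson_r` and `icc2_1` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings is a list of rows, one row per essay, one score per rater.
    This ICC form is an assumption; the abstract does not specify one.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares for essays (rows), raters (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Invented scores: the AI rater ranks essays exactly like the human rater
# but scores every essay 5 points higher.
human = [60, 70, 80, 90]
ai = [65, 75, 85, 95]

r = pearson_r(human, ai)
icc = icc2_1([list(pair) for pair in zip(human, ai)])
print(f"Pearson r = {r:.3f}, ICC(2,1) = {icc:.3f}")
```

In this toy case the correlation is perfect, while the absolute-agreement ICC falls below 1 because of the constant 5-point offset. This is one reason to report both statistics: correlation captures whether raters rank essays alike, whereas an absolute-agreement ICC also penalizes systematic leniency or severity.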

Ethical Statement

This research was conducted with the permission granted by the Nevşehir Hacı Bektaş Veli University Scientific Research and Publication Ethics Committee, based on the decision dated 05/02/2025 and numbered 2025.01.42.

Thanks

We are grateful to the students who participated in this study and to Instructor Uğur Ünalır for his invaluable assistance in evaluating the student essays.

Details

Primary Language

English

Subjects

Measurement and Evaluation in Education (Other)

Journal Section

Research Article

Publication Date

December 31, 2025

Submission Date

September 16, 2025

Acceptance Date

November 4, 2025

Published in Issue

Year 2025 Volume: 8 Number: 2

APA
Taşçı, S. (2025). Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi, 8(2), 191-210. https://doi.org/10.52974/jena.1785369
AMA
1. Taşçı S. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025;8(2):191-210. doi:10.52974/jena.1785369
Chicago
Taşçı, Samet. 2025. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi 8 (2): 191-210. https://doi.org/10.52974/jena.1785369.
EndNote
Taşçı S (December 31, 2025) Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi 8(2), 191–210.
IEEE
[1] S. Taşçı, “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”, Eğitim ve Yeni Yaklaşımlar Dergisi, vol. 8, no. 2, pp. 191–210, Dec. 2025, doi: 10.52974/jena.1785369.
ISNAD
Taşçı, Samet. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi 8/2 (December 31, 2025): 191-210. https://doi.org/10.52974/jena.1785369.
JAMA
1. Taşçı S. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025;8:191–210.
MLA
Taşçı, Samet. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi, vol. 8, no. 2, Dec. 2025, pp. 191-210, doi:10.52974/jena.1785369.
Vancouver
1. Samet Taşçı. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025 Dec. 31;8(2):191-210. doi:10.52974/jena.1785369