Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability

Samet Taşçı

doi:10.52974/jena.1785369

Abstract

This study investigates the reliability of large language models (LLMs) in assessing English as a Foreign Language (EFL) writing compared to human raters. Specifically, the performance of ChatGPT 4.0 and DeepSeek R1 was examined across three genres (argumentative, opinion, and persuasive essays) under rubric-free and rubric-based scoring conditions. Participants were 65 undergraduate English Language Teaching (ELT) students at a Turkish university, who produced a total of 162 essays. Two experienced human raters scored all essays, and their evaluations demonstrated near-perfect inter-rater reliability, providing a stable benchmark for comparison. The same essays were then rated by ChatGPT and DeepSeek under both scoring conditions. Statistical analyses included intraclass correlation coefficients (ICC), Pearson correlations, paired-samples t-tests, and ANOVAs. Findings revealed that rubric integration substantially improved alignment between AI and human scores, particularly for ChatGPT, which showed stronger sensitivity to rubric criteria than DeepSeek. Genre effects were also evident: opinion essays yielded the highest AI-human agreement, persuasive essays moderate agreement, and argumentative essays the lowest. Both AI tools produced more centralized scores with less variability than human raters, and both exhibited risk-averse tendencies, especially without rubric guidance. The results indicate that AI-based scoring can complement, but not replace, human evaluation, especially in cognitively demanding genres. The study highlights the importance of rubric clarity, prompt design, and genre awareness in maximizing the educational value of AI-assisted writing assessment.
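
To illustrate the kind of agreement statistics named in the abstract, the following is a minimal sketch of a Pearson correlation and an intraclass correlation between one human rater and one AI rater. It is not the author's analysis code. The abstract does not say which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, and the function names and all scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per essay and one column per rater.
    The choice of this ICC form is an assumption, not taken from the study.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: essays (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical scores: the AI scores are deliberately more compressed
# toward the middle of the scale than the human scores.
human = [62, 70, 75, 81, 88, 93]
ai = [68, 72, 74, 79, 83, 86]

print("Pearson r:", round(pearson_r(human, ai), 3))
print("ICC(2,1): ", round(icc_2_1(list(zip(human, ai))), 3))
```

Reporting both statistics is a reasonable design choice because they answer different questions: Pearson's r only measures whether two raters rank essays similarly, while an absolute-agreement ICC also penalizes systematic differences in level or spread, such as the score centralization described above.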

Ethical Statement

This research was conducted with the permission granted by the Nevşehir Hacı Bektaş Veli University Scientific Research and Publication Ethics Committee, based on the decision dated 05/02/2025 and numbered 2025.01.42.

Thanks

We are grateful to the students who participated in this study and to Instructor Uğur Ünalır for his invaluable assistance in evaluating the student essays.

Details

Primary Language

English

Subjects

Measurement and Evaluation in Education (Other)

Journal Section

Research Article

Authors

Samet Taşçı*
ORCID: 0000-0003-3925-3825
Türkiye

Publication Date

December 31, 2025

Submission Date

September 16, 2025

Acceptance Date

November 4, 2025

Published in Issue

Year: 2025, Volume: 8, Number: 2

DOI

https://doi.org/10.52974/jena.1785369

IZ

https://izlik.org/JA36GH24ZE

Cite


APA

Taşçı, S. (2025). Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi, 8(2), 191-210. https://doi.org/10.52974/jena.1785369

AMA

1. Taşçı S. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025;8(2):191-210. doi:10.52974/jena.1785369

Chicago

Taşçı, Samet. 2025. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi 8 (2): 191-210. https://doi.org/10.52974/jena.1785369.

EndNote

Taşçı S (December 1, 2025) Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi 8(2), 191–210.

IEEE

[1] S. Taşçı, “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”, Eğitim ve Yeni Yaklaşımlar Dergisi, vol. 8, no. 2, pp. 191–210, Dec. 2025, doi: 10.52974/jena.1785369.

ISNAD

Taşçı, Samet. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi 8/2 (December 1, 2025): 191-210. https://doi.org/10.52974/jena.1785369.

JAMA

1. Taşçı S. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025;8(2):191–210.

MLA

Taşçı, Samet. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi, vol. 8, no. 2, Dec. 2025, pp. 191-210, doi:10.52974/jena.1785369.

Vancouver

1. Samet Taşçı. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025 Dec. 1;8(2):191-210. doi:10.52974/jena.1785369