Research Article

Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability

Volume: 8, Issue: 2, 31 December 2025

Abstract

This study investigates the reliability of large language models (LLMs) in assessing English as a Foreign Language (EFL) writing compared to human raters. Specifically, the performances of ChatGPT 4.0 and DeepSeek R1 were examined across three genres (argumentative, opinion, and persuasive essays) under rubric-free and rubric-based scoring conditions. Participants were 65 undergraduate ELT students at a Turkish university, who produced a total of 162 essays. Two experienced human raters scored all essays, and their evaluations demonstrated near-perfect inter-rater reliability, providing a stable benchmark for comparison. The same essays were then rated by ChatGPT and DeepSeek under both scoring conditions. Statistical analyses included intraclass correlation coefficients (ICC), Pearson correlations, paired-samples t-tests, and ANOVAs. Findings revealed that rubric integration substantially improved alignment between AI and human scores, particularly for ChatGPT, which showed stronger sensitivity to rubric criteria than DeepSeek. Genre effects were also evident: opinion essays yielded the highest AI-human agreement, persuasive texts moderate alignment, and argumentative essays the weakest consistency. While both AI tools produced more centralized scores with less variability than human raters, they also exhibited risk-averse tendencies, especially without rubric guidance. The results indicate that AI-based scoring can complement, but not replace, human evaluation, especially in cognitively demanding genres. The study highlights the importance of rubric clarity, prompt design, and genre awareness in maximizing the educational value of AI-assisted writing assessment.
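The reliability analyses named in the abstract (ICC, Pearson correlation, paired-samples t-test) can be sketched in Python as follows. The scores below are hypothetical, and the ICC variant shown, a two-way ANOVA-based consistency ICC for single ratings (ICC(3,1)), is an assumption: the abstract does not specify which ICC model was used in the study.

```python
import numpy as np
from scipy import stats

def icc_3_1(ratings):
    """Consistency ICC(3,1) for a (subjects x raters) score matrix,
    computed from the two-way ANOVA mean squares."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols                        # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Hypothetical essay scores from one human rater and one AI rater
human = np.array([70, 82, 65, 90, 74, 88], dtype=float)
ai = np.array([72, 80, 68, 87, 75, 85], dtype=float)

icc = icc_3_1(np.column_stack([human, ai]))  # rater agreement
r, _ = stats.pearsonr(human, ai)             # linear association
t, p = stats.ttest_rel(human, ai)            # mean-level difference
print(f"ICC(3,1)={icc:.3f}, Pearson r={r:.3f}, paired t p={p:.3f}")
```

A high ICC and Pearson r with a non-significant paired t-test would correspond to the kind of AI-human alignment the study reports under rubric-based conditions; diverging means despite high correlation would indicate the centralized, risk-averse AI scoring the abstract describes.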

Keywords

Ethics Statement

This research was conducted with permission granted under decision no. 2025.01.42, dated 05/02/2025, of the Scientific Research and Publication Ethics Committee of Nevşehir Hacı Bektaş Veli University.

Acknowledgements

We thank the students who participated in this study, and Lecturer Uğur Ünalır for his valuable assistance in evaluating the student essays.

Details

Primary Language

English

Subjects

Measurement and Evaluation in Education (Other)

Section

Research Article

Publication Date

31 December 2025

Submission Date

16 September 2025

Acceptance Date

4 November 2025

Published Issue

Year 2025, Volume: 8, Issue: 2

Cite

APA
Taşçı, S. (2025). Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi, 8(2), 191-210. https://doi.org/10.52974/jena.1785369
AMA
1. Taşçı S. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025;8(2):191-210. doi:10.52974/jena.1785369
Chicago
Taşçı, Samet. 2025. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi 8 (2): 191-210. https://doi.org/10.52974/jena.1785369.
EndNote
Taşçı S (01 December 2025) Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi 8 2 191–210.
IEEE
[1] S. Taşçı, “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”, Eğitim ve Yeni Yaklaşımlar Dergisi, vol. 8, no. 2, pp. 191–210, Dec. 2025, doi: 10.52974/jena.1785369.
ISNAD
Taşçı, Samet. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi 8/2 (01 December 2025): 191-210. https://doi.org/10.52974/jena.1785369.
JAMA
1. Taşçı S. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 2025;8:191–210.
MLA
Taşçı, Samet. “Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability”. Eğitim ve Yeni Yaklaşımlar Dergisi, vol. 8, no. 2, December 2025, pp. 191-210, doi:10.52974/jena.1785369.
Vancouver
1. Taşçı S. Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi. 01 December 2025;8(2):191-210. doi:10.52974/jena.1785369