Research Article

ChatGPT-4o as an automated scoring tool for writing assessment: Strengths and weaknesses

Volume: 13, Number: 1, January 2, 2026

Abstract

ChatGPT is widely used for many educational purposes, such as content generation and language translation; however, its role as an automated scoring tool requires further empirical investigation. This mixed-methods study explores the effectiveness of ChatGPT-4o as an automated scoring tool for English as a Foreign Language (EFL) learners' written output. In particular, it aims to discover to what extent ChatGPT-4o can produce reliable and accurate scores in writing assessment and whether it can serve as an alternative to traditional human scoring. A total of 240 argumentative essays were first scored by 13 human raters working in pairs. Twenty-eight of these were selected as model essays, and the remaining 212 essays were subsequently scored by ChatGPT-4o alone. The quantitative analysis employed the Quadratic Weighted Kappa statistic to measure inter-rater reliability, focusing on the agreement between the human raters and ChatGPT-4o. Findings suggest that ChatGPT-4o demonstrates only fair agreement with human raters, producing significantly lower and less consistent scores. In light of this discrepancy, five experienced human raters were interviewed about the strengths and weaknesses of ChatGPT as a scoring tool, and their perspectives and practices were thematically analyzed to triangulate the quantitative findings. The key differences were classified under themes such as rubric adherence, scoring bias, and sensitivity to nuance. Owing to AI-enabled automation, ChatGPT exhibits pragmatic dualities in practicality, feedback provision, and linguistic capacity. Its notable strengths include reduced manual effort, faster and more detailed scoring feedback, and a broader linguistic dataset. However, human-driven optimization through constant supervision, care, and pedagogical expertise remains essential for more nuanced scoring.
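The agreement analysis described above relies on the Quadratic Weighted Kappa (QWK) statistic. As a minimal illustrative sketch (not the authors' analysis code), QWK between two raters' ordinal scores can be computed with scikit-learn's cohen_kappa_score; the score vectors below are invented for demonstration.

```python
# Minimal sketch: Quadratic Weighted Kappa (QWK) between two raters.
# Not the study's actual analysis code; the scores below are invented.
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer essay scores (e.g., on a 0-5 rubric scale)
# assigned to the same ten essays by a human rater and by ChatGPT-4o.
human_scores   = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
chatgpt_scores = [3, 3, 4, 2, 3, 2, 5, 1, 3, 2]

# QWK weights each disagreement by the squared distance between score
# categories, so large score gaps are penalized more than near-misses.
qwk = cohen_kappa_score(human_scores, chatgpt_scores, weights="quadratic")
print(f"Quadratic Weighted Kappa: {qwk:.3f}")
```

On the commonly cited Landis and Koch benchmarks, "fair" agreement corresponds to kappa values of roughly 0.21 to 0.40, the range implied by the abstract's findings.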

Ethical Statement

Istanbul University-Cerrahpaşa Ethics Committee for Social and Human Sciences Research, 5.11.2024-1149624.

Details

Primary Language

English

Subjects

Computer Based Exam Applications, Measurement and Evaluation in Education (Other)

Journal Section

Research Article

Publication Date

January 2, 2026

Submission Date

May 23, 2025

Acceptance Date

September 27, 2025

Published in Issue

Year 2026 Volume: 13 Number: 1

APA
Karaçeper, R. D., & Kıray, G. (2026). ChatGPT-4o as an automated scoring tool for writing assessment: Strengths and weaknesses. International Journal of Assessment Tools in Education, 13(1), 66-94. https://doi.org/10.21449/ijate.1701871
