Research Article

Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT

Volume: 12 Number: 1 February 20, 2025

Abstract

This study explores the effectiveness of ChatGPT, an Artificial Intelligence (AI) language model, as an Automated Essay Scoring (AES) tool for grading essays by English as a Foreign Language (EFL) learners. The corpus consists of 50 essays of various types, including analysis, compare and contrast, descriptive, narrative, and opinion essays, written by 10 EFL learners at the B2 level. Human raters and ChatGPT (GPT-4o mini version) scored the essays using the International English Language Testing System (IELTS) Task 2 Writing band descriptors. Within a quantitative approach, Wilcoxon signed-rank and Spearman correlation tests were employed to compare the two sets of scores, revealing a significant difference between the two scoring methods, with human raters assigning higher scores than ChatGPT. Significant differences of varying magnitude were also evident for each essay type, suggesting that genre was not a parameter affecting the agreement between human raters and ChatGPT. It is therefore argued that, while ChatGPT shows promise as an AES tool, the observed disparities indicate that it has not yet reached sufficient proficiency for practical use. The study emphasizes the need for improvements in AI language models to capture the nuanced nature of essay evaluation in EFL contexts.

Ethical Statement

Sivas Cumhuriyet University, Educational Sciences Ethics Committee, 24.05.2024-431192.

Details

Primary Language

English

Subjects

Measurement and Evaluation in Education (Other)

Journal Section

Research Article

Early Pub Date

January 9, 2025

Publication Date

February 20, 2025

Submission Date

July 18, 2024

Acceptance Date

October 7, 2024

Published in Issue

Year 2025 Volume: 12 Number: 1

APA
Uyar, A. C., & Büyükahıska, D. (2025). Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT. International Journal of Assessment Tools in Education, 12(1), 20-32. https://doi.org/10.21449/ijate.1517994
