Automated Assessment of Students' Critical Writing Skills with ChatGPT
Year: 2025, Volume: 6, Issue: 2, Pages: 343-357
Serdar Tekin, Şeyhmus Aydoğdu
Abstract
Critical writing, a subskill of critical thinking, is a crucial skill for students to develop during their education. Because it draws on higher-order cognitive skills such as analysis and evaluation, it is typically assessed through open-ended questions, which are demanding to score. Automated essay scoring (AES) tools can help overcome the difficulties of evaluating such questions. This study investigates the reliability of ChatGPT 3.5 as an AES tool for evaluating critical writing. It examines differences in average scores between a human rater and ChatGPT across several critical writing criteria, using 59 essays written by tertiary-level students majoring in teaching English as a foreign language. Inter-rater reliability was estimated with intraclass correlation coefficients, and differences in average scores between the raters were tested with repeated-measures ANOVA. The findings indicate that ChatGPT demonstrates low reliability as an AES tool for assessing critical writing skills, suggesting that its current role is that of a supplementary tool rather than a replacement for human raters. ChatGPT also tended to give higher scores than the human rater. The discussion relates these results to the existing literature and proposes avenues for future research into leveraging ChatGPT's potential to support the development of critical writing skills.
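The reliability analysis described above rests on two standard procedures: intraclass correlation coefficients for agreement between the human rater and ChatGPT, and repeated-measures ANOVA for differences in mean scores. The Python sketch below illustrates how such statistics can be computed with the pingouin library; it is not the authors' analysis script, and the column names and toy scores are hypothetical.

```python
# Illustrative sketch only (not the authors' analysis code): ICC and
# repeated-measures ANOVA for paired human vs. ChatGPT scores using pingouin.
# Column names and the toy values below are hypothetical.
import pandas as pd
import pingouin as pg

# Long-format data: one row per essay-rater pair.
scores = pd.DataFrame({
    "essay": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater": ["human", "chatgpt"] * 5,
    "score": [70, 82, 65, 75, 80, 85, 55, 68, 74, 79],
})

# Intraclass correlation coefficients (ICC1..ICC3k) across the two raters.
icc = pg.intraclass_corr(data=scores, targets="essay", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Repeated-measures ANOVA: do mean scores differ between the two raters?
anova = pg.rm_anova(data=scores, dv="score", within="rater", subject="essay")
print(anova)
```

With only two raters the ANOVA reduces to a paired comparison, but the same calls extend directly to designs with additional automated or human raters.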
Ethical Statement
The research was carried out with the approval of the Nevşehir Hacı Bektaş Veli University Ethics Commission (decision no. 2023.11.269, dated 27.09.2023).