Research Article

ChatGPT-4o as an automated scoring tool for writing assessment: Strengths and weaknesses

Year 2026, Volume: 13 Issue: 1, 66 - 94, 02.01.2026
https://doi.org/10.21449/ijate.1701871

Abstract

ChatGPT is widely used for many educational purposes, such as content generation and language translation; however, its role as an automated scoring tool requires further empirical investigation. This mixed-methods study explores the effectiveness of ChatGPT-4o as an automated scoring tool for English as a Foreign Language (EFL) learners’ written output. It aims in particular to determine to what extent ChatGPT-4o can produce reliable and accurate scores in writing assessment and whether it can serve as an alternative to traditional human scoring. A total of 240 argumentative essays were first scored by 13 human raters working in pairs. Twenty-eight of these were selected as model essays, and the remaining 212 essays were then scored by ChatGPT-4o alone. Quantitative analysis employed the Quadratic Weighted Kappa statistic to measure inter-rater reliability, focusing on the agreement between the human raters and ChatGPT-4o. The findings suggest that ChatGPT-4o demonstrates only fair agreement with human raters, producing significantly lower and inconsistent scores. To explore this discrepancy, five experienced human raters were interviewed about the strengths and weaknesses of ChatGPT as a scoring tool, and their perspectives and practices were thematically analyzed to triangulate the quantitative findings. The key differences were classified under themes such as rubric adherence, scoring bias, and sensitivity to nuances. Owing to AI-enabled automation, ChatGPT exhibits pragmatic dualities in practicality, feedback provision, and linguistic capacity. Its notable strengths include reduced manual effort, faster and more detailed scoring feedback, and a broader linguistic dataset. However, human-driven optimization through constant supervision, care, and pedagogical expertise remains essential for more nuanced scoring.
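
A note on the agreement statistic: Quadratic Weighted Kappa can be reproduced with standard tooling. The following is a minimal sketch (not the authors' code), assuming integer scores on a shared ordinal rubric and using scikit-learn's cohen_kappa_score with quadratic weights; the score values shown are hypothetical.

    # Minimal sketch: Quadratic Weighted Kappa between human and ChatGPT-4o scores.
    # Assumes both raters assign scores on the same discrete (ordinal) scale.
    from sklearn.metrics import cohen_kappa_score

    human_scores = [4, 3, 5, 2, 4, 3]     # hypothetical human rater scores
    chatgpt_scores = [3, 3, 4, 2, 3, 2]   # hypothetical ChatGPT-4o scores

    qwk = cohen_kappa_score(human_scores, chatgpt_scores, weights="quadratic")
    print(f"Quadratic Weighted Kappa: {qwk:.2f}")
    # On the commonly used Landis & Koch benchmarks, values of 0.21-0.40
    # correspond to "fair" agreement, the band reported in this study.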

Ethical Statement

Istanbul University-Cerrahpaşa Ethics Committee for Social and Human Sciences Research, 5.11.2024-1149624.


Details

Primary Language English
Subjects Computer Based Exam Applications, Measurement and Evaluation in Education (Other)
Journal Section Research Article
Authors

Rabia Damla Karaçeper 0000-0001-6155-9973

Gülay Kıray 0000-0003-2045-8636

Submission Date May 23, 2025
Acceptance Date September 27, 2025
Publication Date January 2, 2026
Published in Issue Year 2026 Volume: 13 Issue: 1

Cite

APA Karaçeper, R. D., & Kıray, G. (2026). ChatGPT-4o as an automated scoring tool for writing assessment: Strengths and weaknesses. International Journal of Assessment Tools in Education, 13(1), 66-94. https://doi.org/10.21449/ijate.1701871
