Research Article

ChatGPT-4o as an automated scoring tool for writing assessment: Strengths and weaknesses

Volume: 13, Number: 1, January 2, 2026

Abstract

ChatGPT is widely used for many educational purposes, such as content generation and language translation; however, its role as an automated scoring tool requires further empirical investigation. This mixed-methods study explores the effectiveness of ChatGPT-4o as an automated scoring tool for English as a Foreign Language (EFL) learners' written output. In particular, it aims to discover to what extent ChatGPT-4o can produce reliable and accurate scores in writing assessment and whether it can serve as an alternative to traditional human scoring. A total of 240 argumentative essays were first scored by 13 human raters working in pairs. Twenty-eight of these were selected as model essays, and the remaining 212 essays were subsequently scored by ChatGPT-4o alone. The quantitative analysis employed the Quadratic Weighted Kappa statistic to measure inter-rater reliability, focusing on the agreement between the human raters and ChatGPT-4o. Findings suggest that ChatGPT-4o demonstrates only fair agreement with human raters, producing significantly lower and less consistent scores. In light of this discrepancy, five experienced human raters were interviewed about the strengths and weaknesses of ChatGPT as a scoring tool, and their perspectives and practices were thematically analyzed to triangulate the quantitative findings. The key differences were classified under themes such as rubric adherence, scoring bias, and sensitivity to nuance. Owing to AI-enabled automation, ChatGPT exhibits pragmatic dualities in practicality, feedback provision, and linguistic capacity. Its notable strengths include reduced manual effort, faster and more detailed scoring feedback, and a broader linguistic dataset. However, human-driven optimization through constant supervision, care, and pedagogical expertise remains essential for more nuanced scoring.
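The agreement analysis described above relies on the Quadratic Weighted Kappa (QWK) statistic. As a minimal illustrative sketch (not the authors' analysis code), QWK between two raters' ordinal scores can be computed with scikit-learn's cohen_kappa_score; the score vectors below are invented for demonstration.

```python
# Minimal sketch: Quadratic Weighted Kappa (QWK) between two raters.
# Not the study's actual analysis code; the scores below are invented.
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer essay scores (e.g., on a 0-5 rubric scale)
# assigned to the same ten essays by a human rater and by ChatGPT-4o.
human_scores   = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
chatgpt_scores = [3, 3, 4, 2, 3, 2, 5, 1, 3, 2]

# QWK weights each disagreement by the squared distance between score
# categories, so large score gaps are penalized more than near-misses.
qwk = cohen_kappa_score(human_scores, chatgpt_scores, weights="quadratic")
print(f"Quadratic Weighted Kappa: {qwk:.3f}")
```

On the commonly cited Landis and Koch benchmarks, "fair" agreement corresponds to kappa values of roughly 0.21 to 0.40, the range implied by the abstract's findings.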

Ethical Statement

Istanbul University-Cerrahpaşa Ethics Committee for Social and Human Sciences Research, 5.11.2024-1149624.

Details

Primary Language

English

Subjects

Computer Based Exam Applications, Measurement and Evaluation in Education (Other)

Journal Section

Research Article

Publication Date

January 2, 2026

Submission Date

May 23, 2025

Acceptance Date

September 27, 2025

Published in Issue

Year 2026 Volume: 13 Number: 1

APA
Karaçeper, R. D., & Kıray, G. (2026). ChatGPT-4o as an automated scoring tool for writing assessment: Strengths and weaknesses. International Journal of Assessment Tools in Education, 13(1), 66-94. https://doi.org/10.21449/ijate.1701871
