Research Article

Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment

Year 2025, Volume: 15 Issue: 3, 315 - 335, 25.12.2025
https://doi.org/10.19126/suje.1713879

Abstract

As artificial intelligence (AI) becomes increasingly integrated into education, its role in the assessment of writing skills has emerged as a significant area of research. This study investigates the reliability and consistency of a custom-built, GPT-based model specifically designed to evaluate B1-level English opinion paragraphs. A dataset consisting of 40 texts (20 written by students and 20 generated by AI) was evaluated across six separate sessions. In the first three sessions, the model assigned scores based on a standard rubric. In the remaining three sessions, it was prompted to re-evaluate its own scores in response to reflective questions suggesting that it might have “overestimated,” “underestimated,” or “overestimated or underestimated” the original score, thereby introducing positive, negative, and neutral reflective prompting. Guided by six research questions, the study examines the internal consistency of the model and its responsiveness to such reflective prompting. Intraclass correlation coefficients (ICCs) indicated excellent reliability across all conditions (ICC > .94). Predictable changes in scoring behavior were observed depending on prompt direction, particularly in rubric components involving higher-order cognitive skills (i.e., content and organization), whereas grammar and vocabulary scores remained stable. Although the limited sample size constrains generalizability, the findings demonstrate that AI-based scoring is not only reliable but also adaptable to metacognitive prompts, offering valuable insights for scalable, rubric-aligned assessment models. The study highlights that GPT-based tools can serve not only as dependable evaluators of student writing but also as instruments that promote self-assessment, support rater training, and foster more equitable feedback in educational settings.
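The abstract reports ICC > .94 across all conditions. As a rough illustration of how a two-way consistency ICC can be computed from a texts-by-sessions score matrix, here is a minimal sketch using the Shrout–Fleiss ICC(3,1) and ICC(3,k) forms discussed in Koo and Li (2016); the exact ICC model and data used in the study are not stated on this page, so both the model choice and the sample scores below are assumptions for illustration only.

```python
# Illustrative sketch: two-way mixed-effects, consistency-type ICC,
# single-measure ICC(3,1) and average-measure ICC(3,k).
# Rows = texts being scored, columns = scoring sessions (raters).
# NOTE: the study's actual ICC model and data are not given here;
# this is a generic demonstration, not the authors' analysis.

def icc_consistency(ratings):
    n = len(ratings)          # number of texts
    k = len(ratings[0])       # number of sessions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between texts
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_r = ss_rows / (n - 1)                 # mean square, rows
    ms_e = ss_err / ((n - 1) * (k - 1))      # mean square, error

    icc_single = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)  # ICC(3,1)
    icc_average = (ms_r - ms_e) / ms_r                    # ICC(3,k)
    return icc_single, icc_average

# Hypothetical rubric totals for 5 texts across 3 sessions (not study data):
scores = [
    [18, 19, 18],
    [12, 13, 12],
    [15, 15, 16],
    [20, 20, 19],
    [10, 11, 10],
]
single, average = icc_consistency(scores)
print(round(single, 3), round(average, 3))  # → 0.983 0.994
```

Because the abstract reports consistency across repeated sessions of the same model, a consistency-type (rather than absolute-agreement) ICC is the natural reading, but that remains an inference from this page.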

References

  • Albakkosh, I. (2024). Using Fleiss’ Kappa coefficient to measure the intra and inter-rater reliability of three AI software programs in the assessment of EFL learners’ story writing. International Journal of Educational Sciences and Arts, 3(1), 69-96. https://doi.org/IJESA.2023.v3n1p4
  • Awagu, I. V. (2021). Language in academic writing: Features and topical issues. Linguistics and Literature Studies, 9(2), 49-56. https://doi.org/10.13189/lls.2021.090201
  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V. 2. The Journal of Technology, Learning and Assessment, 4(3), 1-31.
  • Barkaoui, K. (2007). Teaching writing to second language learners: Insights from theory and research. TESL Reporter, 40(1), 35-48.
  • Borah, M. A. R., Dev, R. S., Suprathik, B. M., Harshini, A. R., Boggula, Y., & Charitha, V. (2024, December). Automated models in educational assessment: A comprehensive survey. In 2024 9th International Conference on Communication and Electronics Systems (ICCES) (pp. 1029-1034). IEEE.
  • Bray, M., Adamson, B., & Mason, M. (Eds.). (2014). Comparative education research: Approaches and methods (Vol. 19). Springer.
  • Brown, H. D. (2004). Language assessment: Principles and classroom practices. New York: Longman.
  • Choudhury, A. S. (2013). Of speaking, writing, and developing writing skills in English. Language in India, 13(9), 27-32.
  • Cumming, A. (2009). Assessing academic writing in foreign and second languages. Language Teaching, 42(1), 95-107. https://doi.org/10.1017/S0261444808005430
  • Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9(3), 270-292. https://doi.org/10.1080/15434303.2011.649381
  • Eskin, D. (2023). Writing task performance and first language background on an ESL placement exam: A many-facets Rasch analysis of facet main effects and differential facet functioning. Studies in Applied Linguistics & TESOL, 23(1), 37-65. https://doi.org/10.52214/salt.v23i1.11805
  • Geçkin, V., Kızıltaş, E., & Çınar, Ç. (2023). Assessing second-language academic writing: AI vs. Human raters. Journal of Educational Technology and Online Learning, 6(4), 1096-1108. https://doi.org/10.31681/jetol.1336599
  • Glass, K. T., & Marzano, R. J. (2018). The new art and science of teaching writing. Solution Tree Press.
  • Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
  • Moses, R., & Mohamad, M. (2019). Challenges faced by students and teachers on writing skills in ESL contexts: A literature review. Creative Education, 10(13), 3385–3391. https://doi.org/10.4236/ce.2019.1013260
  • Nayak, N. K. S. (2021). Importance of academic writing skills for students & research scholars with respect to English language while writing research paper. Royal Book Publishing.
  • Pack, A., Barrett, A., & Escalante, J. (2024). Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability. Computers and Education: Artificial Intelligence, 6, 100234. https://doi.org/10.1016/j.caeai.2024.100234
  • Page, E. B. (2003). Project Essay Grade: PEG. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43–54). Lawrence Erlbaum Associates Publishers.
  • Portney, L. G., & Watkins, M. P. (2000). Foundations of clinical research: Applications to practice. New Jersey: Prentice Hall.
  • R Core Team. (2025). R: A language and environment for statistical computing (Version 4.4.1) [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
  • Seßler, K., Fürstenberg, M., Bühler, B., & Kasneci, E. (2025, March). Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (pp. 462-472).
  • Shabara, R., ElEbyary, K., & Boraie, D. (2024). Teachers or ChatGPT: The issue of accuracy and consistency in L2 assessment. Teaching English with Technology, 24(2), 71-92. https://doi.org/10.56297/vaca6841/LRDX3699/XSEZ5215
  • Uyar, A. C., & Büyükahıska, D. (2025). Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT. International Journal of Assessment Tools in Education, 12(1), 20-32. https://doi.org/10.21449/ijate.1517994
  • Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education: Research, 2(1), 319-330.
  • Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
  • Wetzler, E. L., Cassidy, K. S., Jones, M. J., Frazier, C. R., Korbut, N. A., Sims, C. M., Bowen, S. S., & Wood, M. (2024). Grading the graders: Comparing generative AI and human assessment in essay evaluation. Teaching of Psychology, 52(3), 298-304. https://doi.org/10.1177/00986283241282696
  • Whittington, D., & Hunt, H. (1999, January). Approaches to the computerized assessment of free text responses. In Proceedings of the 3rd CAA Conference (pp. 1-14). Loughborough.
  • Wolf, K., & Stevens, E. (2007). The role of rubrics in advancing and assessing student learning. Journal of Effective Teaching, 7(1), 3-14.
  • Zare-ee, A., & Kaur, S. (2013). Undergraduate argumentative writing in English as a foreign language: A gendered perspective. Journal of Language, Culture, and Translation, 2(1), 123-145.
There are 29 citations in total.

Details

Primary Language English
Subjects Information Systems Education, Classroom Measurement Practices, Computer Based Exam Applications
Journal Section Research Article
Authors

Hüseyin Ataseven 0000-0001-9992-4518

Ömay Çokluk Bökeoğlu 0000-0002-3879-9204

Fazilet Taşdemir 0000-0002-0430-9094

Submission Date June 4, 2025
Acceptance Date October 24, 2025
Early Pub Date December 9, 2025
Publication Date December 25, 2025
Published in Issue Year 2025 Volume: 15 Issue: 3

Cite

APA Ataseven, H., Çokluk Bökeoğlu, Ö., & Taşdemir, F. (2025). Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment. Sakarya University Journal of Education, 15(3), 315-335. https://doi.org/10.19126/suje.1713879
AMA Ataseven H, Çokluk Bökeoğlu Ö, Taşdemir F. Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment. SUJE. December 2025;15(3):315-335. doi:10.19126/suje.1713879
Chicago Ataseven, Hüseyin, Ömay Çokluk Bökeoğlu, and Fazilet Taşdemir. “Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment”. Sakarya University Journal of Education 15, no. 3 (December 2025): 315-35. https://doi.org/10.19126/suje.1713879.
EndNote Ataseven H, Çokluk Bökeoğlu Ö, Taşdemir F (December 1, 2025) Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment. Sakarya University Journal of Education 15 3 315–335.
IEEE H. Ataseven, Ö. Çokluk Bökeoğlu, and F. Taşdemir, “Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment”, SUJE, vol. 15, no. 3, pp. 315–335, 2025, doi: 10.19126/suje.1713879.
ISNAD Ataseven, Hüseyin et al. “Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment”. Sakarya University Journal of Education 15/3 (December 2025), 315-335. https://doi.org/10.19126/suje.1713879.
JAMA Ataseven H, Çokluk Bökeoğlu Ö, Taşdemir F. Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment. SUJE. 2025;15:315–335.
MLA Ataseven, Hüseyin et al. “Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment”. Sakarya University Journal of Education, vol. 15, no. 3, 2025, pp. 315-335, doi:10.19126/suje.1713879.
Vancouver Ataseven H, Çokluk Bökeoğlu Ö, Taşdemir F. Hey AI, Are You Sure? Analyzing Reflective Prompting Attempts in AI-Based Writing Assessment. SUJE. 2025;15(3):315-335.