Research Article

Can AI Assess Writing Skills Like a Human? A Reliability Analysis

Year 2025, Volume: 18, Issue: 4, 757-775, 28.10.2025
https://doi.org/10.30831/akukeg.1718511

Abstract

This study investigates the reliability and consistency of a custom GPT-based scoring system in comparison with trained human raters, focusing on B1-level opinion paragraphs written by English preparatory students. Addressing the limited evidence on how AI scoring systems align with human evaluations in foreign language contexts, the study provides insights into both the strengths and limitations of automated writing assessment. A total of 175 student writings were evaluated twice by human raters and twice by the AI system using an analytic rubric. Findings indicate excellent agreement among human raters and high consistency across AI-generated scores, but only moderate alignment between human and AI evaluations, with the AI showing a tendency to assign higher scores and to overlook off-topic content. These results suggest that while AI scoring systems offer efficiency and consistency, they still lack the interpretive depth of human judgment. The study highlights the potential of AI as a complementary tool in writing assessment, with practical implications for language testing policy and classroom pedagogy.
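For illustration, the kind of reliability analysis summarised above can be sketched in R, the software cited in the reference list (R Core Team, 2025). The snippet below is a minimal, hypothetical example only: the simulated scores, the column names, and the use of the irr package for intraclass correlation coefficients (ICC; Koo & Li, 2016) are assumptions for demonstration, not details reported in the study.

  # Minimal sketch (not the authors' script): reliability of human vs. AI ratings
  # for 175 writing samples, each scored twice by humans and twice by the AI.
  # All data are simulated and all column names are hypothetical, so the resulting
  # ICC values are meaningless; only the workflow is of interest.
  library(irr)  # assumed ICC implementation; the article cites only base R and ggplot2

  set.seed(123)
  n <- 175
  scores <- data.frame(
    human_1 = round(rnorm(n, mean = 70, sd = 10)),  # first human rating (simulated)
    human_2 = round(rnorm(n, mean = 70, sd = 10)),  # second human rating (simulated)
    ai_1    = round(rnorm(n, mean = 75, sd = 8)),   # first AI rating (simulated)
    ai_2    = round(rnorm(n, mean = 75, sd = 8))    # second AI rating (simulated)
  )

  # Agreement between the two human ratings (inter-rater reliability)
  icc(scores[, c("human_1", "human_2")], model = "twoway", type = "agreement", unit = "single")

  # Consistency between the two AI-generated ratings
  icc(scores[, c("ai_1", "ai_2")], model = "twoway", type = "consistency", unit = "single")

  # Alignment between human and AI scores, averaging each source's two ratings
  human_ai <- data.frame(
    human = rowMeans(scores[, c("human_1", "human_2")]),
    ai    = rowMeans(scores[, c("ai_1", "ai_2")])
  )
  icc(human_ai, model = "twoway", type = "agreement", unit = "single")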

References

  • Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford University Press.
  • Brown, H. D. (2018). Language assessment: Principles and classroom practices (3rd ed.). Pearson Education.
  • Bui, N. M., & Barrot, J. S. (2024). ChatGPT as an automated essay scoring tool in the writing classrooms: How it compares with human scoring. Education and Information Technologies, 1-18. https://doi.org/10.1007/s10639-024-12891-w
  • Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and mixed methods approaches. Sage.
  • Dazzeo, R. (2024). AI-enhanced writing self-assessment: Empowering student revision with AI tools. Journal of Technology-Integrated Lessons and Teaching, 3(2), 80–85. https://doi.org/10.13001/jtilt.v3i2.9119
  • DiSabito, D., Hansen, L., Mennella, T., & Rodriguez, J. (2025). Exploring the frontiers of generative AI in assessment: Is there potential for a human-AI partnership? New Directions for Teaching and Learning, 2025(182), 81-96. https://doi.org/10.1002/tl.20630
  • Geçkin, V., Kızıltaş, E., & Çınar, Ç. (2023). Assessing second-language academic writing: AI vs. Human raters. Journal of Educational Technology and Online Learning, 6(4), 1096-1108. https://doi.org/10.31681/jetol.1336599
  • Hannah, L., Jang, E. E., Shah, M., & Gupta, V. (2023). Validity arguments for automated essay scoring of young students’ writing traits. Language Assessment Quarterly, 20(4–5), 399–420. https://doi.org/10.1080/15434303.2023.2288253
  • Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity, and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
  • Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
  • Li, C., Long, J., Guan, N., Wang, Y., Liang, W., & Li, C. (2024, August). AI-Assisted College Students' English Writing Scoring with Hierarchical LSTM for Enhanced Contextual Understanding and Grading Accuracy. In 2024 7th International Conference on Education, Network and Information Technology (ICENIT) (pp. 175-180). IEEE. https://doi.org/10.1109/ICENIT61951.2024.00039
  • Lundgren, M. (2024). Large language models in student assessment: Comparing ChatGPT and human graders. arXiv preprint arXiv:2406.16510.
  • Perelman, L. (2020). The BABEL generator and E-rater: 21st century writing constructs and automated essay scoring (AES). Journal of Writing Assessment, 13(1).
  • Portney, L. G., & Watkins, M. P. (2000). Foundations of clinical research: Applications to practice. Prentice Hall.
  • R Core Team. (2025). R: A language and environment for statistical computing (Version 4.4.1) [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
  • Ragolane, M., Patel, S., & Salikram, P. (2024). AI versus human graders: Assessing the role of large language models in higher education. Asian Journal of Education and Social Studies, 50(10), 244-263. https://doi.org/10.9734/ajess/2024/v50i101616
  • Rahman, N. A. A., Zulkornain, L. H., & Hamzah, N. H. (2022). Exploring artificial intelligence using automated writing evaluation for writing skills. Environment-Behaviour Proceedings Journal, 7(SI9), 547–553. https://doi.org/10.21834/ebpj.v7iSI9.4304
  • Shermis, M. D., & Burstein, J. (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge.
  • Sun, Y. (2024). AI in teaching English writing: Automatic scoring and feedback system. Applied Mathematics and Nonlinear Sciences, 9(1), 1–17. https://doi.org/10.2478/amns-2024-3262
  • Tang, X., Chen, H., Lin, D., & Li, K. (2024). Incorporating Fine-Grained Linguistic Features and Explainable AI into Multi-Dimensional Automated Writing Assessment. Applied Sciences, 14(10), 4182. https://doi.org/10.3390/app14104182
  • Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
  • Wetzler, E. L., Cassidy, K. S., Jones, M. J., Frazier, C. R., Korbut, N. A., Sims, C. M., & Wood, M. (2024). Grading the graders: Comparing generative AI and human assessment in essay evaluation. Teaching of Psychology, 52(3), 298-304. https://doi.org/10.1177/00986283241282696
  • Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-24277-4
  • Wolf, K., & Stevens, E. (2007). The role of rubrics in advancing and assessing student learning. Journal of Effective Teaching, 7(1), 3-14.
  • Wu, X., Saraf, P. P., Lee, G., Latif, E., Liu, N., & Zhai, X. (2025). Unveiling scoring processes: Dissecting the differences between LLMs and human graders in automatic scoring. Technology, Knowledge and Learning, 1-16. https://doi.org/10.1007/s10758-025-09836-8


Details

Primary Language: English
Subjects: Educational Technology and Computing
Section: Articles
Authors

Hüseyin Ataseven 0000-0001-9992-4518

Ömay Çokluk Bökeoğlu 0000-0002-3879-9204

Fazilet Taşdemir 0000-0002-0430-9094

Publication Date: October 28, 2025
Submission Date: June 12, 2025
Acceptance Date: October 22, 2025
Published in Issue: Year 2025, Volume: 18, Issue: 4

Cite

APA Ataseven, H., Çokluk Bökeoğlu, Ö., & Taşdemir, F. (2025). Can AI Assess Writing Skills Like a Human? A Reliability Analysis. Journal of Theoretical Educational Sciences, 18(4), 757-775. https://doi.org/10.30831/akukeg.1718511