Research Article

Assessing second-language academic writing: AI vs. Human raters

Abstract

The quality of writing in a second language (L2) is one of the indicators of the proficiency college students must demonstrate to be eligible for departmental studies. Although software programs such as the Intelligent Essay Assessor and IntelliMetric have been introduced to evaluate second-language writing quality, the overall assessment of writing proficiency is still largely carried out by trained human raters. The question that needs to be addressed today is whether generative artificial intelligence (AI) built on large language models (LLMs) could facilitate, and possibly replace, human raters in the burdensome task of assessing student-written academic work. For this purpose, first-year college students (n=43) were given a paragraph-writing task, which was evaluated against the same writing criteria by the generative pre-trained transformer ChatGPT-3.5 and by five human raters. The scores assigned by the five human raters showed statistically significant positive correlations ranging from low to high. A slight-to-fair but significant level of agreement was observed between the scores assigned by ChatGPT-3.5 and those of two of the human raters. The findings suggest that reliable results can be obtained when the scores of an application and multiple human raters are considered together, and that ChatGPT may potentially assist human raters in assessing L2 college writing.
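
The abstract reports two kinds of statistics: correlations among the five human raters and chance-corrected agreement between ChatGPT-3.5 and each human rater. As a minimal sketch of how such an analysis is typically computed, the Python snippet below uses Spearman correlation and quadratic-weighted Cohen's kappa on invented rubric scores; the paper's actual data, scale, and choice of coefficients are not reproduced here and may differ.

# Hypothetical sketch of an inter-rater analysis; all scores are
# randomly generated stand-ins for illustration only.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(7)
n_essays = 43  # matches the sample size reported in the abstract

# Assumed ordinal rubric scores (8-20), one array per rater.
human_raters = {f"R{i}": rng.integers(8, 21, size=n_essays) for i in range(1, 6)}
chatgpt = rng.integers(8, 21, size=n_essays)

# Pairwise correlations among the five human raters.
for (a, sa), (b, sb) in combinations(human_raters.items(), 2):
    rho, p = spearmanr(sa, sb)
    print(f"{a} vs {b}: rho = {rho:.2f} (p = {p:.3f})")

# Agreement between ChatGPT and each human rater; quadratic weights
# give partial credit for near-misses on an ordinal scale.
for name, scores in human_raters.items():
    kappa = cohen_kappa_score(chatgpt, scores, weights="quadratic")
    print(f"ChatGPT vs {name}: weighted kappa = {kappa:.2f}")

Under the widely used Landis and Koch benchmarks, kappa values of 0.00-0.20 count as "slight" agreement and 0.21-0.40 as "fair", which is the range the abstract reports for ChatGPT-human agreement.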

Details

Primary Language

English

Subjects

Instructional Technologies

Journal Section

Research Article

Publication Date

December 31, 2023

Submission Date

August 2, 2023

Acceptance Date

October 21, 2023

Published in Issue

Year 2023 Volume: 6 Number: 4

APA
Geckin, V., Kızıltaş, E., & Çınar, Ç. (2023). Assessing second-language academic writing: AI vs. Human raters. Journal of Educational Technology and Online Learning, 6(4), 1096-1108. https://doi.org/10.31681/jetol.1336599
