Research Article

Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability

Year 2025, Volume: 8, Issue: 2, 191-210, 31.12.2025
https://doi.org/10.52974/jena.1785369

Abstract

This study investigates the reliability of large language models (LLMs) in assessing English as a Foreign Language (EFL) writing compared with human raters. Specifically, the performance of ChatGPT 4.0 and DeepSeek R1 was examined across three genres (argumentative, opinion, and persuasive essays) under rubric-free and rubric-based scoring conditions. Participants were 65 undergraduate ELT students at a Turkish university who produced a total of 162 essays. Two experienced human raters scored all essays, and their evaluations demonstrated near-perfect inter-rater reliability, providing a stable benchmark for comparison. The same essays were then rated by ChatGPT and DeepSeek under both scoring conditions. Statistical analyses included intraclass correlation coefficients (ICC), Pearson correlations, paired-samples t-tests, and ANOVAs. Findings revealed that rubric integration substantially improved alignment between AI and human scores, particularly for ChatGPT, which showed stronger sensitivity to rubric criteria than DeepSeek. Genre effects were also evident: opinion essays yielded the highest AI-human agreement, persuasive texts moderate alignment, and argumentative essays the weakest consistency. While both AI tools produced more centralized scores with less variability than human raters, they also exhibited risk-averse tendencies, especially without rubric guidance. The results indicate that AI-based scoring can complement, but not replace, human evaluation, especially in cognitively demanding genres. The study highlights the importance of rubric clarity, prompt design, and genre awareness in maximizing the educational value of AI-assisted writing assessment.
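The abstract names the agreement statistics used to compare AI and human scores. As a purely illustrative sketch (not the authors' analysis code), the snippet below shows how ICC(2,1) and ICC(3,1), a Pearson correlation, and a paired-samples t-test could be computed in Python for one human-AI rater pair; the score arrays and the helper name icc_two_way are invented placeholders, not data from the study.

    # Illustrative sketch only: invented scores, not data from the study.
    import numpy as np
    from scipy.stats import pearsonr, ttest_rel

    def icc_two_way(scores: np.ndarray) -> dict:
        """Shrout & Fleiss ICC(2,1) and ICC(3,1) for an (n_essays x n_raters) matrix."""
        n, k = scores.shape
        grand = scores.mean()
        row_means = scores.mean(axis=1)    # per-essay means
        col_means = scores.mean(axis=0)    # per-rater means
        ss_total = ((scores - grand) ** 2).sum()
        ss_rows = k * ((row_means - grand) ** 2).sum()   # between-essay sum of squares
        ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater sum of squares
        ss_err = ss_total - ss_rows - ss_cols
        msr = ss_rows / (n - 1)
        msc = ss_cols / (k - 1)
        mse = ss_err / ((n - 1) * (k - 1))
        icc_2_1 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # absolute agreement
        icc_3_1 = (msr - mse) / (msr + (k - 1) * mse)                        # consistency
        return {"ICC(2,1)": icc_2_1, "ICC(3,1)": icc_3_1}

    # Hypothetical holistic scores assigned to the same eight essays.
    human = np.array([78, 85, 62, 90, 74, 81, 69, 88], dtype=float)
    ai = np.array([80, 83, 65, 88, 76, 79, 72, 86], dtype=float)

    print(icc_two_way(np.column_stack([human, ai])))
    r, p_r = pearsonr(human, ai)      # linear association between the two raters
    t, p_t = ttest_rel(human, ai)     # mean score difference between the two raters
    print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); paired t = {t:.2f} (p = {p_t:.3f})")

In a design like the one described, such indices would presumably be computed separately for each scoring condition (rubric-free vs. rubric-based) and genre before the condition and genre comparisons reported above.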

Ethics Statement

This research was conducted with the permission granted by the Nevşehir Hacı Bektaş Veli University Scientific Research and Publication Ethics Committee, based on the decision dated 05/02/2025 and numbered 2025.01.42.

Acknowledgments

We are grateful to the students who participated in this study and to Instructor Uğur Ünalır for his invaluable assistance in evaluating the student essays.

References

  • Ahmadi Shirazi, M. (2019). For a greater good: Bias analysis in writing assessment. Sage Open, 9(1), 1-14. https://doi.org/10.1177/2158244018822377
  • Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54-74. https://doi.org/10.1080/15434300903464418
  • Bond, M., Khosravi, H., De Laat, M., Bergdahl, N., Negrea, V., Oxley, E., Pham, P., Chong, S. W., & Siemens, G. (2024). A meta systematic review of artificial intelligence in higher education: a call for increased ethics, collaboration, and rigour. International Journal of Educational Technology in Higher Education, 21(1). https://doi.org/10.1186/s41239-023-00436-z
  • Bouziane, K., & Bouziane, A. (2024). AI versus human effectiveness in essay evaluation. Discover Education, 3(1), 201. https://doi.org/10.1007/s44217-024-00320-6
  • Bucol, J. L., & Sangkawong, N. (2024). Exploring ChatGPT as a writing assessment tool. Innovations in Education and Teaching International, 1-16. https://doi.org/10.1080/14703297.2024.2363901
  • Bui, N. M., & Barrot, J. S. (2024). ChatGPT as an automated essay scoring tool in the writing classrooms: how it compares with human scoring. Education and Information Technologies, 1-18. https://doi.org/10.1007/s10639-024-12891-w
  • Chapelle, C. A., & Douglas, D. (2006). Assessing language through computer technology. Cambridge University Press.
  • Crossley, S. (2020). Linguistic features in writing quality and development: An overview. Journal of Writing Research, 11(3), 415-443. https://doi.org/10.17239/jowr-2020.11.03.01
  • Crusan, D., Plakans, L., & Gebril, A. (2016). Writing assessment literacy: Surveying second language teachers’ knowledge, beliefs, and practices. Assessing Writing, 28, 43-56. https://doi.org/10.1016/j.asw.2016.03.001
  • Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86(1), 67-96. https://doi.org/10.1111/1540-4781.00137
  • Dempsey, M. S., PytlikZillig, L. M., & Bruning, R. H. (2009). Helping preservice teachers learn to assess writing: Practice and feedback in a Web-based environment. Assessing Writing, 14(1), 38-61. https://doi.org/10.1016/j.asw.2008.12.003
  • Eckes, T. (2015). Introduction to many-facet Rasch measurement. Peter Lang.
  • Geçkin, V., Kızıltaş, E., & Çınar, Ç. (2023). Assessing second-language academic writing: AI vs. human raters. Journal of Educational Technology & Online Learning, 6(4), 1096-1108. https://doi.org/10.31681/jetol.1336599
  • González-Calatayud, V., Prendes-Espinosa, P., & Roig-Vila, R. (2021). Artificial intelligence for student assessment: A systematic review. Applied Sciences, 11(12), 5467, 1-15. https://doi.org/10.3390/app11125467
  • Huang, J. (2008). How accurate are ESL students’ holistic writing scores on large-scale assessments? A generalizability theory approach. Assessing Writing, 13(3), 201-218. https://doi.org/10.1016/j.asw.2008.10.002
  • Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208. https://doi.org/10.7717/peerj-cs.208
  • Hyland, K. (2019). Second language writing. Cambridge University Press. https://doi.org/10.1017/9781108635547
  • Jackaria, P. M., Hajan, B. H., & Mastul, A. H. (2024). A comparative analysis of the rating of college students’ essays by ChatGPT versus human raters. International Journal of Learning, Teaching and Educational Research, 23(2), 478-492. https://doi.org/10.26803/ijlter.23.2.23
  • Khosravi, H., Viberg, O., Kovanovic, V., & Ferguson, R. (2023). Generative AI and learning analytics. Journal of Learning Analytics, 10(3), 1-6. https://doi.org/10.18608/jla.2023.8333
  • Kim, H., Baghestani, Sh., Yin, Sh., Karatay, Y., Kurt, S., Beck, J., & Karatay, L. (2024). ChatGPT for writing evaluation: Examining the accuracy and reliability of AI-generated scores compared to human raters. In C. A. Chapelle, G. H. Beckett, & J. Ranalli (Eds.), Exploring artificial intelligence in applied linguistics (pp. 73-95). Iowa State University Digital Press. https://doi.org/10.31274/isudp.2024.154.06
  • Knoch, U. (2011). Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from? Assessing Writing, 16(2), 81-96. https://doi.org/10.1016/j.asw.2011.02.003
  • Koltovskaia, S. (2020). Student engagement with automated written corrective feedback (AWCF) provided by Grammarly: A multiple case study. Assessing Writing, 44, 100450. https://doi.org/10.1016/j.asw.2020.100450
  • Korkmaz, H., & Akbıyık, M. (2024). Unlocking the potential: Attitudes of tertiary level EFL learners towards using AI in language learning. Participatory Educational Research, 11(6), 1-19. https://doi.org/10.17275/per.24.76.11.6
  • Lantolf, J. (Ed.) (2000). Sociocultural theory and second language learning. Oxford University Press.
  • Leow, R. P., & Suh, B-R. (2022). Theoretical perspectives on writing, corrective feedback, and language learning in individual writing conditions. In R. M. Manchón & C. Polio (Eds.), Routledge handbook of second language acquisition and writing (pp. 9-21). Routledge. https://doi.org/10.4324/9780429199691-3
  • Li, J., Jangamreddy, N. K., Hisamoto, R., Bhansali, R., Dyda, A., Zaphir, L., & Glencross, M. (2024). AI-assisted marking: Functionality and limitations of ChatGPT in written assessment evaluation. Australasian Journal of Educational Technology, 40(4), 56-72. https://doi.org/10.14742/ajet.9463
  • Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543-560. https://doi.org/10.1177/0265532211406422
  • Lu, X. (2011). A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writers’ language development. TESOL Quarterly, 45(1), 36-62. https://doi.org/10.5054/tq.2011.240859
  • Lundgren, M. (2024). Large language models in student assessment: Comparing ChatGPT and human graders. arXiv preprint arXiv:2406.16510.
  • Mahshanian, A., & Shahnazari, M. (2020). The effect of raters’ fatigue on scoring EFL writing tasks. Indonesian Journal of Applied Linguistics, 10(1), 1-13. https://doi.org/10.17509/ijal.v10i1.24956
  • Manchón, R. M. (2011). Writing to learn the language: Issues in theory and research. In R. M. Manchón (Ed.), Learning-to-write and writing-to-learn in an additional language (pp. 61-82). John Benjamins Publishing Company.
  • Manning, J., Baldwin, J., & Powell, N. (2025). Human versus machine: The effectiveness of ChatGPT in automated essay scoring. Innovations in Education and Teaching International, 1-14. https://doi.org/10.1080/14703297.2025.2469089
  • McConlogue, T. (2012). But is it fair? Developing students’ understanding of grading complex written work through peer assessment. Assessment & Evaluation in Higher Education, 37(1), 113-123. https://doi.org/10.1080/02602938.2010.515010
  • Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2), 100050. https://doi.org/10.1016/j.rmal.2023.100050
  • Ragupathi, K., & Lee, A. (2020). Beyond fairness and consistency in grading: The role of rubrics in higher education. In C. S. Sanger & N. W. Gleason (Eds.), Diversity and inclusion in global higher education: Lessons from across Asia (pp. 73–95). Palgrave Macmillan.
  • Shin, D., & Lee, J. H. (2024). Exploratory study on the potential of ChatGPT as a rater of second language writing. Education and Information Technologies, 29, 24735-24757. https://doi.org/10.1007/s10639-024-12817-6
  • Stiggins, R. J. (1995). Assessment literacy for the 21st century. Phi Delta Kappan, 77(3), 238-245.
  • Tömen, M. (2022). Automated essay scoring feedback in foreign language writing: Does it coincide with instructor feedback? Disiplinler Arası Dil Araştırmaları, 4(4), 53-62. https://doi.org/10.48147/dada.60
  • Weigle, S. C. (2013). English as a second language writing and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), The handbook of automated essay evaluation: Current applications and new directions (pp. 36-54). Routledge.
  • Williams, J. (2012). The potential role(s) of writing in second language development. Journal of Second Language Writing, 21, 321-331. https://doi.org/10.1016/j.jslw.2012.09.007
  • Wood, E. H., & Henderson, S. (2010). Large cohort assessment: depth, interaction and manageable marking. Marketing Intelligence & Planning, 28(7), 898-907. https://doi.org/10.1108/02634501011086481
  • Yavuz, F., Çelik, Ö., & Çelik, G. Y. (2024). Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric‐based assessments. British Journal of Educational Technology, 56(1), 150-166. https://doi.org/10.1111/bjet.13494
  • Yue, X. (2024). A comparative study on ERNIE Bot 4.0 Turbo and ChatGPT 4O’s performance in evaluating First-Year undergraduate persuasive essays. Arts Culture and Language, 1(9). https://doi.org/10.61173/nk1ywa21
  • Zhang, J. (2016). Same text different processing? Exploring how raters’ cognitive and meta-cognitive strategies influence rating accuracy in essay scoring. Assessing Writing, 27, 37-53. https://doi.org/10.1016/j.asw.2015.11.001
  • Zhao, C., & Huang, J. (2020). The impact of the scoring system of a large-scale standardized EFL writing assessment on its score variability and reliability: Implications for assessment policy makers. Studies in Educational Evaluation, 67, 100911. https://doi.org/10.1016/j.stueduc.2020.100911
There are 45 references in total.

Details

Primary Language: English
Subjects: Measurement and Evaluation in Education (Other)
Section: Research Article
Authors

Samet Taşçı (ORCID: 0000-0003-3925-3825)

Submission Date: 16 September 2025
Acceptance Date: 4 November 2025
Publication Date: 31 December 2025
Published Issue: Year 2025, Volume: 8, Issue: 2

How to Cite

APA: Taşçı, S. (2025). Human and AI Scoring of EFL Writing: The Influence of Rubrics and Genre on Reliability. Eğitim ve Yeni Yaklaşımlar Dergisi, 8(2), 191-210. https://doi.org/10.52974/jena.1785369
