Reliability of Human Expert and AI Raters in Translation Assessment

Yasemin Uzun

doi:10.26466/opusjsr.1821518

Research Article

Turkish title: Çeviri Değerlendirmesinde İnsan Uzman ve Yapay Zekâ Puanlayıcılarının Güvenirliği

Year 2025, Volume: 22, Issue: 6, pp. 1305-1317, 17.12.2025

Abstract

Although AI-based assessment systems offer new opportunities in education, their consistency with human judgment in measuring complex cognitive skills such as translation remains contested. This study examines inter-rater reliability between a domain expert and two AI raters (ChatGPT-5 and Gemini 1.5 Pro) in evaluating C2-level Turkish translations. In a convergent mixed-methods design, translations by 14 students were scored with a 5-point analytic rubric. Krippendorff's alpha indicated low overall agreement (α = .392), with particularly weak agreement on the "Semantic Accuracy" dimension (α = .288). Qualitative analysis identified three main sources of divergence: task fidelity, perception of error severity, and variability in criterion interpretation. The findings show that the AI models were partially consistent on formal accuracy but diverged systematically from the human expert on semantic nuance, style, and contextual appropriateness. The expert adopted a "task-oriented" approach, whereas the AI models' evaluations were more "form-focused" (Gemini) or "surface coherence-oriented" (ChatGPT). Although AI systems are useful auxiliary tools in translation assessment, they cannot replace expert judgment.

Keywords

Artificial intelligence, inter-rater reliability, teaching Turkish as a foreign language, translation assessment

Ethical Statement

The study was conducted with the approval of the Social and Human Sciences Ethics Committee of a state university (decision no. 21/114, dated 24.10.2025).

Supporting Institution

Çanakkale Onsekiz Mart University

Project Number

2025-YÖNP-2114

Details

Primary Language	English
Subjects	Internet, New Communication Technologies
Section	Research Article
Authors	Yasemin Uzun (ORCID: 0000-0001-8995-772X)
Project Number	2025-YÖNP-2114
Submission Date	November 11, 2025
Acceptance Date	December 12, 2025
Publication Date	December 17, 2025
Published in Issue	Year 2025, Volume: 22, Issue: 6

Cite

APA	Uzun, Y. (2025). Reliability of Human Expert and AI Raters in Translation Assessment. OPUS Journal of Society Research, 22(6), 1305-1317. https://doi.org/10.26466/opusjsr.1821518
