Research Article

Reliability of Human Expert and AI Raters in Translation Assessment

Volume: 22 Number: 6 December 17, 2025

Abstract

Although AI-based assessment systems offer new opportunities in education, their consistency with human judgment in measuring complex cognitive skills such as translation remains debatable. This study examines inter-rater reliability between a domain expert and two AI raters (ChatGPT-5 and Gemini 1.5 Pro) in evaluating C2-level Turkish translations. Using a convergent mixed-methods design, translations from 14 students were scored with a 5-point analytic rubric. Krippendorff's alpha revealed low overall agreement (α = .392), with the weakest agreement on "Semantic Accuracy" (α = .288). Qualitative analysis identified three key divergences: task fidelity, error severity perception, and criterion interpretation variability. The findings show that the AI models are partially consistent with the expert on formal accuracy but diverge systematically on semantic nuance, style, and contextual appropriateness. The expert adopted a "task-oriented" approach, while the AI models were more "form-focused" (Gemini) or "surface coherence-oriented" (ChatGPT). Although AI systems can serve as useful auxiliary tools in translation assessment, they cannot replace expert judgment.
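For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Krippendorff's alpha can be computed for complete data (every rater scores every item). It is an illustration under stated assumptions, not the study's own procedure: the abstract does not say which distance metric was used, so the interval metric (squared score difference) is assumed here, and the function name and the scores in the usage example are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (an assumed choice).

    `units` is a list of lists; each inner list holds the scores that all
    raters gave to one item (e.g. [expert, chatgpt, gemini]).
    """
    # Items with fewer than two scores contribute no pairable values.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences between raters within each item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pairs of scores,
    # regardless of which item they belong to.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        raise ValueError("All scores are identical; alpha is undefined.")
    return 1.0 - d_o / d_e


if __name__ == "__main__":
    # Hypothetical 1-5 rubric scores for four translations, three raters each.
    hypothetical_scores = [
        [4, 5, 5],
        [2, 4, 3],
        [3, 3, 4],
        [5, 4, 5],
    ]
    print(round(krippendorff_alpha_interval(hypothetical_scores), 3))
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because a 5-point rubric is arguably ordinal, an ordinal distance metric would be an equally defensible choice and may yield different values.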

Keywords

Artificial intelligence, inter-rater reliability, teaching Turkish as a foreign language, translation assessment

Supporting Institution

Canakkale Onsekiz Mart University

Project Number

2025-YÖNP-2114

Ethical Statement

The study was approved by the Social and Human Sciences Ethics Committee of a state university (decision no. 21/114, dated 24.10.2025).

APA
Uzun, Y. (2025). Reliability of Human Expert and AI Raters in Translation Assessment. OPUS Journal of Society Research, 22(6), 1305-1317. https://doi.org/10.26466/opusjsr.1821518