TY - JOUR
T1 - Reliability of Human Expert and AI Raters in Translation Assessment
TT - Çeviri Değerlendirmesinde İnsan Uzman ve Yapay Zekâ Puanlayıcılarının Güvenirliği
AU - Uzun, Yasemin
PY - 2025
DA - December
Y2 - 2025
DO - 10.26466/opusjsr.1821518
JF - OPUS Journal of Society Research
JO - OPUS TAD
PB - İdeal Kent Yayınları
WT - DergiPark
SN - 2791-9862
SP - 1305
EP - 1317
VL - 22
IS - 6
LA - en
AB - Although AI-based assessment systems offer new opportunities in education, their consistency with human judgment in measuring complex cognitive skills such as translation remains debatable. This study examines inter-rater reliability between a domain expert and AI raters (ChatGPT-5 and Gemini 1.5 Pro) in evaluating C2-level Turkish translations. Using a convergent mixed-methods design, translations from 14 students were scored with a 5-point analytic rubric. Krippendorff's alpha revealed low overall agreement (α = .392), particularly weak in "Semantic Accuracy" (α = .288). Qualitative analysis identified three key divergences: task fidelity, error severity perception, and criterion interpretation variability. Findings show AI models exhibit partial consistency in formal accuracy but systematically diverge from human experts in semantic nuance, style, and contextual appropriateness. The expert adopted a "task-oriented" approach, while AI models were more "form-focused" (Gemini) or "surface coherence-oriented" (ChatGPT). Although AI systems serve as useful auxiliary tools in translation assessment, they cannot replace expert judgment.
KW - Artificial intelligence
KW - inter-rater reliability
KW - teaching Turkish as a foreign language
KW - translation assessment
N2 - Yapay zekâ tabanlı değerlendirme sistemleri eğitimde yeni olanaklar sunsa da çeviri gibi karmaşık bilişsel becerilerin ölçümünde bu değerlendirme sistemlerinin insan yargısıyla tutarlılığı tartışmalıdır. Bu çalışma, C2 düzeyinde Türkçe çevirilerin değerlendirilmesinde alan uzmanı ile yapay zekâ puanlayıcıları (ChatGPT-5 ve Gemini 1.5 Pro) arasındaki puanlayıcılar arası güvenirliği incelemektedir. Yakınsak karma yöntem tasarımı kullanılarak, 14 öğrencinin çevirileri 5'li analitik rubrikle puanlanmıştır. Krippendorff alfa, düşük genel uyum (α = .392) ortaya koymuş, özellikle "Anlamsal Doğruluk" boyutunda uyum zayıf bulunmuştur (α = .288). Nitel analiz üç temel farklılık belirlemiştir: görev sadakati, hata ciddiyeti algısı ve kriter yorumlama çeşitliliği. Bulgular, yapay zekâ modellerinin biçimsel doğrulukta kısmi tutarlılık gösterdiğini ancak anlamsal nüans, üslup ve bağlamsal uygunlukta insan uzmanından sistematik olarak ayrıştığını ortaya koymaktadır. Uzman "görev odaklı" bir yaklaşım benimserken, yapay zekâ modelleri daha "biçim odaklı" (Gemini) veya "yüzeysel tutarlılık odaklı" (ChatGPT) değerlendirmeler yapmıştır. Yapay zekâ sistemleri çeviri değerlendirmesinde yararlı yardımcı araçlar olsa da uzman yargısının yerini alamamaktadır.
CR - Bassnett, S. (2002). Translation studies. Routledge.
CR - Büyüköztürk, Ş., Çakmak, E. K., Akgün, Ö. E., Karadeniz, Ş., & Demirel, F. (2020). Bilimsel araştırma yöntemleri (27th ed.). Pegem Akademi.
CR - Doewes, A., & Pechenizkiy, M. (2021). On the limitations of human-computer agreement in automated essay scoring. Proceedings of the 2021 Educational Data Mining Conference.
CR - Fahmy, Y. (2024). Student perception on AI-driven assessment: Motivation, engagement and feedback capabilities [Master's thesis, University of Twente]. University of Twente Student Theses. https://essay.utwente.nl/91297/
CR - Farrokhnia, M., Banihashem, S. K., Noroozi, O., & Wals, A. (2024). A SWOT analysis of ChatGPT: Implications for educational practice and research. Innovations in Education and Teaching International, 61(3), 460-474. https://doi.org/10.1080/14703297.2023.2195846
CR - İşcan, A. (2011). Türkçenin yabancı dil olarak önemi. International Journal of Eurasia Social Sciences, 2(4), 29-36.
CR - Kaleli, S., & Özdemir, A. (2025). Artificial intelligence and its role in teaching Turkish as a foreign language. Turkish Linguistics Journal.
CR - Kocmi, T., & Federmann, C. (2023). Large language models are state-of-the-art evaluators of translation quality. Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 193-203. https://aclanthology.org/2023.eamt-1.19/
CR - Kotlyar, I., & Krasman, J. (2022). Virtual simulation: New method for assessing teamwork skills. International Journal of Selection and Assessment, 30(3), 344-360. https://doi.org/10.1111/ijsa.12368
CR - Kotlyar, I., & Krasman, J. (2025). Student reactions to AI versus human feedback in teamwork skills assessment. International Journal of Educational Technology in Higher Education, 22(1), 1-34. https://doi.org/10.1186/s41239-025-00555-9
CR - Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Sage Publications.
CR - Lommel, A., Burchardt, A., & Uszkoreit, H. (2014). Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: Tecnologies de la Traducció, 12, 455-463. https://doi.org/10.5565/rev/tradumatica.77
CR - Luo, J., Zheng, C., Yin, J., & Teo, H. H. (2025). Design and assessment of AI-based learning tools in higher education: A systematic review. International Journal of Educational Technology in Higher Education, 22, 42. https://doi.org/10.1186/s41239-025-00540-2
CR - Munday, J. (2016). Introducing translation studies: Theories and applications. Routledge.
CR - Özdemir, C. (2018). Günümüzde yabancı dil olarak Türkçe öğretiminin durumu. Alatoo Academic Studies, 18(1), 11-19.
CR - Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A neural framework for MT evaluation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2685-2702. https://doi.org/10.18653/v1/2020.emnlp-main.213
CR - Reiss, K., & Vermeer, H. J. (1984). Grundlegung einer allgemeinen Translationstheorie. Niemeyer.
CR - Snell-Hornby, M. (1988). Translation studies: An interdisciplinary approach. John Benjamins.
CR - Tang, X., Chen, H., & Lin, D. (2024). Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments. Frontiers in Education, 9, Article 11305227. https://doi.org/10.3389/feduc.2024.11305227
CR - Uyar, A. C., & Büyükahıska, D. (2025). Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT. International Journal of Assessment Tools in Education, 12(1), 20-32. https://doi.org/10.21449/ijate.1517994
CR - Venuti, L. (Ed.). (2012). The translation studies reader (3rd ed.). Routledge.
CR - Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1-27. https://doi.org/10.1186/s41239-019-0171-0
UR - https://doi.org/10.26466/opusjsr.1821518
L1 - https://dergipark.org.tr/en/download/article-file/5410734
ER -