Research Article

Reliability of Human Expert and AI Raters in Translation Assessment

Year 2025, Volume: 22, Issue: 6, 1305-1317, 17.12.2025
https://doi.org/10.26466/opusjsr.1821518

Abstract

Although AI-based assessment systems offer new opportunities in education, their consistency with human judgment in measuring complex cognitive skills such as translation remains contested. This study examines inter-rater reliability between a domain expert and AI raters (ChatGPT-5 and Gemini 1.5 Pro) in evaluating C2-level Turkish translations. Using a convergent mixed-methods design, translations by 14 students were scored with a 5-point analytic rubric. Krippendorff's alpha revealed low overall agreement (α = .392), with agreement particularly weak in the "Semantic Accuracy" dimension (α = .288). Qualitative analysis identified three key divergences: task fidelity, perception of error severity, and variability in criterion interpretation. The findings show that the AI models exhibit partial consistency in formal accuracy but diverge systematically from the human expert in semantic nuance, style, and contextual appropriateness. The expert adopted a "task-oriented" approach, whereas the AI models produced more "form-focused" (Gemini) or "surface coherence-oriented" (ChatGPT) evaluations. Although AI systems are useful auxiliary tools in translation assessment, they cannot replace expert judgment.
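The agreement statistic reported above, Krippendorff's alpha, can be sketched in a few lines. This is a minimal illustration only: it uses the interval distance metric (squared difference) rather than the ordinal weighting a 5-point rubric would strictly call for, and the rater names and scores below are invented for demonstration, not the study's data.

```python
from itertools import permutations

def krippendorff_alpha(ratings, delta=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha for fully crossed reliability data (no missing scores).

    ratings: one list of scores per rater, all rating the same units.
    delta:   distance metric; squared difference corresponds to the interval scale.
    """
    n_units = len(ratings[0])
    # Regroup scores by unit (one column per translation being rated).
    units = [[rater[u] for rater in ratings] for u in range(n_units)]
    values = [v for unit in units for v in unit]
    n = len(values)

    # Observed disagreement: pairwise distances within each unit.
    d_o = sum(
        sum(delta(a, b) for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in units
    ) / n

    # Expected disagreement: pairwise distances with all scores pooled,
    # i.e. what disagreement would look like if scores were assigned by chance.
    d_e = sum(delta(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))

    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Hypothetical 5-point rubric scores for six translations from three raters
# (expert, ChatGPT, Gemini) -- illustrative values, not the study's data.
expert  = [5, 3, 4, 2, 5, 3]
chatgpt = [4, 4, 4, 3, 5, 4]
gemini  = [5, 4, 3, 3, 4, 4]
print(krippendorff_alpha([expert, chatgpt, gemini]))
```

Alpha runs from 1 (perfect agreement) through 0 (agreement at chance level) into negative values for systematic disagreement; values such as the reported α = .392 fall well below the commonly cited ≥ .667 threshold for tentative reliability.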

Ethics Statement

The study was conducted under approval decision 21/114, dated 24.10.2025, of the Social Sciences and Humanities Ethics Committee of a state university.

Supporting Institution

Çanakkale Onsekiz Mart University

Project Number

2025-YÖNP-2114

References

  • Bassnett, S. (2002). Translation studies. Routledge.
  • Büyüköztürk, Ş., Çakmak, E. K., Akgün, Ö. E., Karadeniz, Ş., & Demirel, F. (2020). Bilimsel araştırma yöntemleri (27th ed.). Pegem Akademi.
  • Doewes, A., & Pechenizkiy, M. (2021). On the limitations of human-computer agreement in automated essay scoring. Proceedings of the 2021 Educational Data Mining Conference.
  • Fahmy, Y. (2024). Student perception on AI-driven assessment: Motivation, engagement and feedback capabilities [Master's thesis, University of Twente]. University of Twente Student Theses. https://essay.utwente.nl/91297/
  • Farrokhnia, M., Banihashem, S. K., Noroozi, O., & Wals, A. (2024). A SWOT analysis of ChatGPT: Implications for educational practice and research. Innovations in Education and Teaching International, 61(3), 460-474. https://doi.org/10.1080/14703297.2023.2195846
  • İşcan, A. (2011). Türkçenin yabancı dil olarak önemi. International Journal of Eurasia Social Sciences, 2(4), 29-36.
  • Kaleli, S., & Özdemir, A. (2025). Artificial intelligence and its role in teaching Turkish as a foreign language. Turkish Linguistics Journal.
  • Kocmi, T., & Federmann, C. (2023). Large language models are state-of-the-art evaluators of translation quality. Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 193-203. https://aclanthology.org/2023.eamt-1.19/
  • Kotlyar, I., & Krasman, J. (2022). Virtual simulation: New method for assessing teamwork skills. International Journal of Selection and Assessment, 30(3), 344-360. https://doi.org/10.1111/ijsa.12368
  • Kotlyar, I., & Krasman, J. (2025). Student reactions to AI versus human feedback in teamwork skills assessment. International Journal of Educational Technology in Higher Education, 22(1), 1-34. https://doi.org/10.1186/s41239-025-00555-9
  • Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Sage Publications.
  • Lommel, A., Burchardt, A., & Uszkoreit, H. (2014). Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: Tecnologies de la Traducció, 12, 455-463. https://doi.org/10.5565/rev/tradumatica.77
  • Luo, J., Zheng, C., Yin, J., & Teo, H. H. (2025). Design and assessment of AI-based learning tools in higher education: A systematic review. International Journal of Educational Technology in Higher Education, 22, 42. https://doi.org/10.1186/s41239-025-00540-2
  • Munday, J. (2016). Introducing translation studies: Theories and applications. Routledge.
  • Özdemir, C. (2018). Günümüzde yabancı dil olarak Türkçe öğretiminin durumu. Alatoo Academic Studies, 18(1), 11-19.
  • Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A neural framework for MT evaluation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2685-2702. https://doi.org/10.18653/v1/2020.emnlp-main.213
  • Reiss, K., & Vermeer, H. J. (1984). Grundlegung einer allgemeinen Translationstheorie. Niemeyer.
  • Snell-Hornby, M. (1988). Translation studies: An interdisciplinary approach. John Benjamins.
  • Tang, X., Chen, H., & Lin, D. (2024). Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments. Frontiers in Education, 9, Article 11305227. https://doi.org/10.3389/feduc.2024.11305227
  • Uyar, A. C., & Büyükahıska, D. (2025). Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT. International Journal of Assessment Tools in Education, 12(1), 20-32. https://doi.org/10.21449/ijate.1517994
  • Venuti, L. (Ed.). (2012). The translation studies reader (3rd ed.). Routledge.
  • Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1-27. https://doi.org/10.1186/s41239-019-0171-0


Details

Primary Language: English
Subjects: Internet, New Communication Technologies
Section: Research Article
Authors

Yasemin Uzun 0000-0001-8995-772X

Project Number: 2025-YÖNP-2114
Submission Date: November 11, 2025
Acceptance Date: December 12, 2025
Publication Date: December 17, 2025
Published in Issue: Year 2025, Volume: 22, Issue: 6

Cite

APA Uzun, Y. (2025). Reliability of Human Expert and AI Raters in Translation Assessment. OPUS Journal of Society Research, 22(6), 1305-1317. https://doi.org/10.26466/opusjsr.1821518