Research Article

Does Artificial Intelligence Speak the Same in Every Language? A Cross-Linguistic Study on Generative AI Responses

Year 2025, Volume: 7, Issue: 2, 623–641, 27.12.2025

Abstract

The aim of this study is to examine, through a multi-metric approach, the similarities and differences in the content that generative artificial intelligence (GenAI) systems produce across languages in educational contexts. Responses generated by the large language models ChatGPT and Copilot in Turkish, English, Spanish, and Arabic were subjected to comparative content analysis in terms of toxicity, sentiment valence, stereotyping, factual accuracy, and safety/refusal behavior. A total of 192 responses were collected using eight prompt types for each model–language combination, and the data were coded with both qualitative and quantitative techniques. The findings indicate that the models exhibit significantly divergent response strategies across languages. Risk-prone patterns such as increased stereotyping, a higher tendency toward toxicity, and deviations in factual accuracy were notable in low- and mid-resource languages. In addition, safety filters were found to operate more restrictively in certain languages, which may have important consequences for educational equity. These results point to the need for language-aware design and governance policies so that GenAI tools can be deployed in a fair, reliable, and pedagogically appropriate manner in multilingual settings.
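The sampling design reported above can be made concrete with a short sketch. The Python below is illustrative only: the three repetitions per model–language–prompt cell are inferred from 192 / (2 models × 4 languages × 8 prompt types) = 3, and the label sets and the cohens_kappa helper are hypothetical placeholders, not the authors' published instruments.

    from itertools import product

    # Design described in the abstract: 2 models x 4 languages x 8 prompt types.
    models = ["ChatGPT", "Copilot"]
    languages = ["Turkish", "English", "Spanish", "Arabic"]
    prompt_types = [f"prompt_type_{i}" for i in range(1, 9)]

    # 64 cells; the reported total of 192 responses implies 3 per cell (assumption).
    REPETITIONS = 3
    cells = list(product(models, languages, prompt_types))
    assert len(cells) * REPETITIONS == 192  # matches the reported sample size

    # Hypothetical label sets for the five coded dimensions.
    coding_scheme = {
        "toxicity": ("none", "low", "high"),
        "sentiment_valence": ("negative", "neutral", "positive"),
        "stereotyping": ("absent", "present"),
        "factual_accuracy": ("accurate", "partial", "inaccurate"),
        "safety_refusal": ("answered", "hedged", "refused"),
    }

    def cohens_kappa(coder_a, coder_b):
        """Two-coder agreement on one dimension (lists of equal length)."""
        n = len(coder_a)
        # Observed agreement: share of items both coders labeled identically.
        p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
        # Chance agreement from each coder's marginal label distribution.
        cats = set(coder_a) | set(coder_b)
        p_e = sum((coder_a.count(c) / n) * (coder_b.count(c) / n) for c in cats)
        return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

Under the Landis and Koch (1977) benchmarks cited below, kappa values above 0.61 would indicate substantial inter-coder agreement.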

References

  • Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and machine learning. http://fairmlbook.org
  • Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT’21) (pp. 610–623). Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922
  • Blasi, D. E., Anastasopoulos, A., & Neubig, G. (2022). Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (pp. 5486–5510). https://doi.org/10.18653/v1/2022.acl-long.376
  • Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5454–5476). https://doi.org/10.18653/v1/2020.acl-main.485
  • Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research, 81, 77–91. https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
  • Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186. https://doi.org/10.1126/science.aal4230
  • Dhamala, J., Sun, T., Kumar, V., Krishna, S., Li, S., Pruksachatkun, Y., ... & Chang, K.-W. (2021). BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 862–872). https://doi.org/10.1145/3442188.3445924
  • Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3356–3369). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301
  • Hartvigsen, T., et al. (2022). ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
  • House, J. (2015). Translation quality assessment: Past and present. Routledge.
  • Hovy, D., & Spruit, S. L. (2016). The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 591–598). https://doi.org/10.18653/v1/P16-2096
  • Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M. (2020). XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. In Proceedings of the 37th International Conference on Machine Learning (pp. 4411–4421). https://proceedings.mlr.press/v119/hu20b.html
  • Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 6282–6293). Association for Computational Linguistics.
  • Krippendorff, K. (2018). Content analysis: An introduction to its methodology (4th ed.). Sage Publications.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310
  • Liang, P., Bommasani, R., et al. (2022). Holistic evaluation of language models (HELM). arXiv. https://arxiv.org/abs/2211.09110
  • Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6), 1–35. https://doi.org/10.1145/3457607
  • Microsoft. (2023). Microsoft Copilot overview. https://www.microsoft.com
  • Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics.
  • Nangia, N., et al. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Neuendorf, K. A. (2017). The content analysis guidebook (2nd ed.). SAGE Publications. https://doi.org/10.4135/9781071802878
  • OpenAI. (2023). GPT-4 technical report. arXiv. https://arxiv.org/abs/2303.08774
  • Parrish, A., et al. (2022). BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: NAACL 2022. Association for Computational Linguistics.
  • UNESCO. (2023). Guidance for generative AI in education and research. https://unesdoc.unesco.org/ark:/48223/pf0000386691
  • Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., … Gabriel, I. (2021). Ethical and social risks of harm from language models. arXiv. https://arxiv.org/abs/2112.04359
  • Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., … Gabriel, I. (2022). Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22) (pp. 214–229). Association for Computing Machinery. https://doi.org/10.1145/3531146.3533088


Details

Primary Language: Turkish
Subjects: Specialist Studies in Education (Other)
Section: Research Article
Authors

Veysel Bilal Arslankara 0000-0002-9062-9210

Ertuğrul Usta 0000-0001-6112-9965

Submission Date: October 14, 2025
Acceptance Date: December 17, 2025
Publication Date: December 27, 2025
Published in Issue: Year 2025, Volume: 7, Issue: 2

Cite

APA Arslankara, V. B., & Usta, E. (2025). Yapay Zeka Her Dilde Aynı mı Konuşur? Üretken Yapay Zeka Yanıtlarının Dile Göre Farklılaşması Üzerine Bir İnceleme. Necmettin Erbakan Üniversitesi Ereğli Eğitim Fakültesi Dergisi, 7(2), 623-641. https://izlik.org/JA57WD24LR