Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi ve Kontrolü

Müge Akbulut

doi:10.24146/tk.1765562

Persona Vectors in Turkish Language Models: Monitoring and Controlling Character Traits

Abstract

Objective: Efforts to understand and control the behavior of large language models are critical for AI safety. This study adapts the methodology developed by Chen et al. (2025) to Turkish. The aim is to extract persona vectors, which represent specific personality traits, from the activation space of a generative language model trained on Turkish. The research seeks to demonstrate the cross-lingual transferability of these vectors and to highlight their potential for safety applications in Turkish language models.

Method: For seven personas (evil, sycophancy, hallucination, optimism, impoliteness, apathy, humor), 63 contrastive prompt pairs were created, each consisting of one positive and one negative prompt. Using the response averaging strategy, vectors were extracted from layer 32 of the model. Their effectiveness was evaluated with the Vector Effectiveness Score (VES), and their behavioral validity was assessed through steering tests.
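The extraction and steering steps described above can be illustrated with a minimal sketch. It assumes the difference-of-means formulation commonly used for persona vectors: token activations are averaged over each response, and the mean of the trait-suppressing responses is subtracted from the mean of the trait-eliciting ones. The function names, the three-dimensional toy activations, and the steering coefficient are hypothetical and are not taken from the study; a real run would read hidden states from the chosen layer of the model.

```python
from math import sqrt

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(pos_responses: list[list[Vector]],
                   neg_responses: list[list[Vector]]) -> Vector:
    """Difference of means between trait-eliciting and trait-suppressing runs.

    Each response is a list of per-token activation vectors; tokens are
    averaged first (response averaging), then responses are averaged.
    """
    mu_pos = mean_vector([mean_vector(tokens) for tokens in pos_responses])
    mu_neg = mean_vector([mean_vector(tokens) for tokens in neg_responses])
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def projection(h: Vector, v: Vector) -> float:
    """Scalar projection of activation h onto the persona direction (monitoring)."""
    norm = sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(h, v)) / norm


def steer(h: Vector, v: Vector, alpha: float) -> Vector:
    """Activation steering: shift h along v by a coefficient alpha."""
    return [a + alpha * b for a, b in zip(h, v)]


# Hypothetical toy activations: two responses per condition, two tokens each.
pos = [[[1.0, 0.0, 2.0], [1.0, 0.0, 2.0]],
       [[2.0, 0.0, 2.0], [4.0, 0.0, 2.0]]]
neg = [[[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]],
       [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]]

v = persona_vector(pos, neg)
h = [0.5, 1.0, -1.0]          # an activation to monitor or steer
steered = steer(h, v, 1.5)    # pushes h toward the trait direction
```

In this toy setup the projection of the steered activation onto the persona direction is larger than that of the original one, which is the effect a steering test looks for in the model's generated text.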

Findings: The extracted persona vectors successfully encode the targeted personality traits (mean VES = 0.183 ± 0.069). A moderate-to-strong positive correlation (r = 0.576) was found between the geometric VES and the observed behavioral performance. The humor persona showed the highest performance on both the geometric (VES = 0.277) and behavioral (effect = 0.300) metrics.

Implications: The findings confirm the cross-lingual transferability of persona vectors and provide a solid foundation for behavioral monitoring, control, and dataset screening in Turkish language models. The correlation between VES and behavioral performance (r = 0.576) supports the validity of the methodology, while also indicating the need for more comprehensive validation.
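For readers who want to reproduce this kind of check, a correlation between a geometric score and a behavioral effect can be computed as a Pearson coefficient over one pair of values per persona. The sketch below assumes that Pearson's r is the statistic reported; the two input lists are made-up placeholders, not the study's measurements.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder values only: one geometric score and one behavioral effect per persona.
geometric = [0.10, 0.20, 0.30, 0.40]
behavioral = [0.05, 0.15, 0.10, 0.30]
r = pearson_r(geometric, behavioral)
```

With only a handful of personas, such an estimate has wide uncertainty, which is consistent with the call above for more comprehensive validation.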

Originality: This is the first study to apply this methodology to Turkish and to extract persona vectors from Turkish language models. It therefore makes a concrete contribution to the literature on cross-lingual transferability and pioneers safety research in Turkish natural language processing.

Keywords

Large language models, persona vectors, activation steering, cross-lingual transferability, AI safety

References

Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A. and Hooker, S. (2024). Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv. https://doi.org/10.48550/arXiv.2402.14740
Alain, G. and Bengio, Y. (2018). Understanding intermediate layers using linear classifier probes. arXiv. https://doi.org/10.48550/arXiv.1610.01644
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... and Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv. https://doi.org/10.48550/arXiv.2204.05862
Barnhart, L., Bafghi, R. A., Becker, S. and Raissi, M. (2025). Aligning to what? Limits to RLHF based alignment. arXiv. https://doi.org/10.48550/arXiv.2503.09025
Bereska, L. and Gavves, E. (2024). Mechanistic interpretability for AI safety--a review. arXiv. https://doi.org/10.48550/arXiv.2404.14082
Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... and Evans, O. (2025). Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv. https://doi.org/10.48550/arXiv.2502.17424
Chen, R., Arditi, A., Sleight, H., Evans, O. and Lindsey, J. (2025). Persona vectors: Monitoring and controlling character traits in language models. arXiv. https://doi.org/10.48550/arXiv.2507.21509
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S. and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf
Farrell, H. and Han, H. (2025). AI and democratic publics. Artificial Intelligence. https://knightcolumbia.org/content/ai-and-democratic-publics
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J. (2021). Aligning AI with shared human values. arXiv. https://doi.org/10.48550/arXiv.2008.02275
Hofmann, V., Kalluri, P. R., Jurafsky, D. and King, S. (2024). Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. arXiv. https://doi.org/10.48550/arXiv.2403.00742
Marks, S. and Tegmark, M. (2024). The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv. https://doi.org/10.48550/arXiv.2310.06824
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119. https://papers.nips.cc/paper_files/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html
Olah, C. (2022, June 27). Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits. https://transformer-circuits.pub/2022/mech-interp-essay/index.html
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... and Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
Park, K., Choe, Y. J. and Veitch, V. (2024). The linear representation hypothesis and the geometry of large language models. Proceedings of the 41st International Conference on Machine Learning, 39643–39666. https://doi.org/10.5555/3692070.3693675
Perez, E., Huang, S., Song, H. F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., ... and Irving, G. (2022). Discovering language model behaviors with model-written evaluations. CoRR, abs/2202.03286. https://aclanthology.org/2023.findings-acl.847/
Perrigo, B. (2023, February 17). Bing’s AI is threatening users. That’s no laughing matter. Time. https://time.com/6256529/bing-openai-chatgpt-danger-alignment/
Rogers, A., Kovaleva, O. and Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842-866. https://doi.org/10.48550/arXiv.2002.12327
Sorokovikova, A., Chizhov, P., Eremenko, I. and Yamshchikov, I. P. (2025). Surface fairness, deep bias: A comparative study of bias in language models. In Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP) (pp. 206–227). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.gebnlp-1.20
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., ... and Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, 3008-3021. https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf
Trendyol AI Team. (2024). Trendyol-LLM-7b-chat-v0.1: Turkish Language Model for Conversational AI. Hugging Face Model Hub. https://huggingface.co/Trendyol/Trendyol-LLM-7b-chat-v0.1
Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U. and MacDiarmid, M. (2024). Activation addition: Steering language models without optimization. arXiv. https://doi.org/10.48550/arXiv.2308.10248
Wiggins, W. F. and Tejani, A. S. (2022). On the opportunities and risks of foundation models for natural language processing in radiology. Radiology: Artificial Intelligence, 4(4), e220119. https://doi.org/10.1148/ryai.220119
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P. and Irving, G. (2020). Fine-tuning language models from human preferences. arXiv. https://doi.org/10.48550/arXiv.1909.08593
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z. and Hendrycks, D. (2023). Representation engineering: A top-down approach to AI transparency. arXiv. https://doi.org/10.48550/arXiv.2310.01405

Details

Primary Language

Turkish

Subjects

Information Retrieval

Journal Section

Research Article

Authors

Müge Akbulut *
ORCID: 0000-0003-0026-6485
Türkiye

Early Pub Date

January 5, 2026

Publication Date

January 5, 2026

Submission Date

August 16, 2025

Acceptance Date

November 26, 2025

Published in Issue

Year 2026 Volume: 40 Number: 1

DOI

https://doi.org/10.24146/tk.1765562

IZ

https://izlik.org/JA69AM43TU

APA

Akbulut, M. (2026). Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi ve Kontrolü. Türk Kütüphaneciliği, 40(1), 142-166. https://doi.org/10.24146/tk.1765562

AMA

1. Akbulut M. Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi ve Kontrolü. TL. 2026;40(1):142-166. doi:10.24146/tk.1765562

Chicago

Akbulut, Müge. 2026. “Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi Ve Kontrolü”. Türk Kütüphaneciliği 40 (1): 142-66. https://doi.org/10.24146/tk.1765562.

EndNote

Akbulut M (March 1, 2026) Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi ve Kontrolü. Türk Kütüphaneciliği 40 1 142–166.

IEEE

[1] M. Akbulut, “Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi ve Kontrolü”, TL, vol. 40, no. 1, pp. 142–166, Mar. 2026, doi: 10.24146/tk.1765562.

ISNAD

Akbulut, Müge. “Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi Ve Kontrolü”. Türk Kütüphaneciliği 40/1 (March 1, 2026): 142-166. https://doi.org/10.24146/tk.1765562.

JAMA

1. Akbulut M. Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi ve Kontrolü. TL. 2026;40:142–166.

MLA

Akbulut, Müge. “Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi Ve Kontrolü”. Türk Kütüphaneciliği, vol. 40, no. 1, Mar. 2026, pp. 142-66, doi:10.24146/tk.1765562.

Vancouver

1. Müge Akbulut. Türkçe Dil Modellerinde Persona Vektörleri: Karakter Özelliklerinin İzlenmesi ve Kontrolü. TL. 2026 Mar. 1;40(1):142-66. doi:10.24146/tk.1765562
