Research Article

Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages

Volume: 13 Number: 1 March 31, 2026

Abstract

Embedding models are a critical component of Retrieval-Augmented Generation (RAG) systems, but creating efficient, high-performance models for low-resource languages like Azerbaijani presents a significant challenge. To address this, we propose a multi-task distillation framework that improves a target low-resource language by distilling knowledge from English, while crucially preserving the model's original English language performance. Our approach combines cross-lingual knowledge distillation with a bilingual training objective, controlled by a single hyperparameter that balances performance across languages. We release a large-scale English-Azerbaijani parallel corpus of 300,000 Wikipedia articles to facilitate further research. Our experimental results demonstrate that models trained with our framework achieve state-of-the-art performance on Azerbaijani benchmarks: our BGE M3 model improves Azerbaijani STS from 79.3 to 85.0 (+5.7 points) while retaining 98.6% of its original English MTEB performance. In a real-world RAG application for Azerbaijani legal document retrieval, our best model achieves 89% top-1 retrieval accuracy and a 94% "Excellent" rating from domain experts, significantly outperforming baseline models. This framework provides a generalizable blueprint for enhancing embedding models for other low-resource languages, particularly within the Turkic language family.
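The abstract describes a training objective that combines cross-lingual distillation with a bilingual term, balanced by a single hyperparameter. A minimal sketch of one way such an objective could be written is given below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss, so the mean-squared-error form, the convex weighting, and the names alpha, teacher_en, student_en, and student_az are all hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions must match")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_distillation_loss(
    teacher_en: Sequence[Vector],
    student_en: Sequence[Vector],
    student_az: Sequence[Vector],
    alpha: float,
) -> float:
    """Weighted sum of two distillation terms over a batch of parallel sentences.

    ASSUMPTION: the abstract only says one hyperparameter balances the two
    languages; the MSE terms and the alpha / (1 - alpha) split are illustrative.

    cross    : pulls the student's Azerbaijani embedding toward the teacher's
               English embedding of the parallel sentence.
    preserve : keeps the student's English embedding close to the teacher's
               English embedding, protecting English performance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n = len(teacher_en)
    if n == 0 or len(student_en) != n or len(student_az) != n:
        raise ValueError("batches must be non-empty and of equal size")

    cross = sum(mse(s, t) for s, t in zip(student_az, teacher_en)) / n
    preserve = sum(mse(s, t) for s, t in zip(student_en, teacher_en)) / n
    return alpha * cross + (1.0 - alpha) * preserve


# Toy batch with one parallel sentence pair and 2-dimensional embeddings.
teacher_en = [[1.0, 0.0]]
student_en = [[1.0, 0.0]]   # already matches the teacher on English
student_az = [[0.0, 0.0]]   # not yet aligned on Azerbaijani
loss = combined_distillation_loss(teacher_en, student_en, student_az, alpha=0.5)
```

In this sketch, alpha = 1 would train only the Azerbaijani alignment term and alpha = 0 only the English preservation term; intermediate values trade one against the other.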

Supporting Institution

Baku Higher Oil School

Ethical Statement

All datasets have been released as open source.

Details

Primary Language

English

Subjects

Natural Language Processing

Journal Section

Research Article

Publication Date

March 31, 2026

Submission Date

December 18, 2025

Acceptance Date

January 26, 2026

Published in Issue

Year 2026 Volume: 13 Number: 1

APA
Aghalarov, M., Mammadov, E., & Huseynova, K. (2026). Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages. Gazi University Journal of Science Part A: Engineering and Innovation, 13(1), 334-347. https://doi.org/10.54287/gujsa.1844025
AMA
1. Aghalarov M, Mammadov E, Huseynova K. Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages. GU J Sci, Part A. 2026;13(1):334-347. doi:10.54287/gujsa.1844025
Chicago
Aghalarov, Mirakram, Elvin Mammadov, and Kavsar Huseynova. 2026. “Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages”. Gazi University Journal of Science Part A: Engineering and Innovation 13 (1): 334-47. https://doi.org/10.54287/gujsa.1844025.
EndNote
Aghalarov M, Mammadov E, Huseynova K (March 31, 2026) Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages. Gazi University Journal of Science Part A: Engineering and Innovation 13 1 334–347.
IEEE
[1] M. Aghalarov, E. Mammadov, and K. Huseynova, “Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages”, GU J Sci, Part A, vol. 13, no. 1, pp. 334–347, Mar. 2026, doi: 10.54287/gujsa.1844025.
ISNAD
Aghalarov, Mirakram - Mammadov, Elvin - Huseynova, Kavsar. “Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages”. Gazi University Journal of Science Part A: Engineering and Innovation 13/1 (March 31, 2026): 334-347. https://doi.org/10.54287/gujsa.1844025.
JAMA
1. Aghalarov M, Mammadov E, Huseynova K. Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages. GU J Sci, Part A. 2026;13:334–347.
MLA
Aghalarov, Mirakram, et al. “Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages”. Gazi University Journal of Science Part A: Engineering and Innovation, vol. 13, no. 1, Mar. 2026, pp. 334-47, doi:10.54287/gujsa.1844025.
Vancouver
1. Mirakram Aghalarov, Elvin Mammadov, Kavsar Huseynova. Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages. GU J Sci, Part A. 2026 Mar. 31;13(1):334-47. doi:10.54287/gujsa.1844025