Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages
Abstract
Embedding models are a critical component of Retrieval-Augmented Generation (RAG) systems, but building efficient, high-performance models for low-resource languages such as Azerbaijani remains a significant challenge. To address this, we propose a multi-task distillation framework that improves embedding quality in a target low-resource language by distilling knowledge from English while preserving the model's original English performance. Our approach combines cross-lingual knowledge distillation with a bilingual training objective, controlled by a single hyperparameter that balances performance across the two languages. We release a large-scale English-Azerbaijani parallel corpus of 300,000 Wikipedia articles to facilitate further research. Experiments show that models trained with our framework achieve state-of-the-art performance on Azerbaijani benchmarks: our BGE M3 model improves Azerbaijani STS from 79.3 to 85.0 (+5.7 points) while retaining 98.6% of its original English MTEB performance. In a real-world RAG application for Azerbaijani legal document retrieval, our best model achieves 89% top-1 retrieval accuracy and a 94% "Excellent" rating from domain experts, significantly outperforming baseline models. The framework provides a generalizable blueprint for enhancing embedding models for other low-resource languages, particularly within the Turkic language family.
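To make the idea of a single balancing hyperparameter concrete, the following is a minimal sketch of one way such an objective could be written. It is an assumption for illustration and not the paper's actual loss: the abstract does not give the formula, so the mean-squared-error form, the name `alpha`, and the function names are all hypothetical. The student is pulled toward the teacher's English embedding from both the Azerbaijani translation (cross-lingual term) and the English sentence itself (English-retention term).

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_en: Vector,
    student_en: Vector,
    student_az: Vector,
    alpha: float,
) -> float:
    """Hypothetical bilingual distillation loss for one parallel sentence pair.

    alpha (assumed name) weights the cross-lingual term; (1 - alpha) weights
    the term that keeps the student's English embedding close to the teacher.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    cross_lingual = mse(student_az, teacher_en)      # Azerbaijani -> teacher English
    english_retention = mse(student_en, teacher_en)  # English -> teacher English
    return alpha * cross_lingual + (1.0 - alpha) * english_retention


# Toy 2-dimensional embeddings, invented purely for illustration.
teacher = [1.0, 0.0]
loss = distillation_loss(teacher, student_en=[1.0, 0.0], student_az=[0.0, 0.0], alpha=0.5)
print(loss)
```

Under this sketch, setting `alpha` close to 1 favors the target language, while values close to 0 favor retaining English behavior; the paper's own weighting scheme may differ.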
Details
Primary Language
English
Subjects
Natural Language Processing
Journal Section
Research Article
Authors
Mirakram Aghalarov * (ORCID: 0009-0008-1551-7797), Azerbaijan
Elvin Mammadov (ORCID: 0009-0005-9237-9736), Azerbaijan
Kavsar Huseynova (ORCID: 0009-0007-0362-9591), Azerbaijan
Publication Date
March 31, 2026
Submission Date
December 18, 2025
Acceptance Date
January 26, 2026
Published in Issue
Year 2026 Volume: 13 Number: 1