Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages
Year 2026, Volume: 13, Issue: 1, 334-347, 31.03.2026
Mirakram Aghalarov, Elvin Mammadov, Kavsar Huseynova
Abstract
Embedding models are a critical component of Retrieval-Augmented Generation (RAG) systems, but building efficient, high-performance models for low-resource languages such as Azerbaijani remains a significant challenge. To address this, we propose a multi-task distillation framework that improves embedding quality in a target low-resource language by distilling knowledge from English while, crucially, preserving the model's original English performance. Our approach combines cross-lingual knowledge distillation with a bilingual training objective, controlled by a single hyperparameter that balances performance across the two languages. We also release a large-scale English-Azerbaijani parallel corpus of 300,000 Wikipedia articles to facilitate further research. Our experimental results demonstrate that models trained with our framework achieve state-of-the-art performance on Azerbaijani benchmarks: our BGE M3 model improves Azerbaijani STS from 79.3 to 85.0 (+5.7 points) while retaining 98.6% of its original English MTEB performance. In a real-world RAG application for Azerbaijani legal document retrieval, our best model achieves 89% top-1 retrieval accuracy and a 94% "Excellent" rating from domain experts, significantly outperforming baseline models. The framework provides a generalizable blueprint for enhancing embedding models in other low-resource languages, particularly within the Turkic language family.
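The bilingual objective described in the abstract can be made concrete with a short sketch. The following is a minimal illustration under stated assumptions, not the authors' actual implementation: it assumes PyTorch, precomputed sentence embeddings for a parallel English-Azerbaijani batch, MSE as the distillation distance, and an illustrative hyperparameter name `alpha` for the single balancing knob mentioned above.

```python
import torch
import torch.nn.functional as F

def bilingual_distillation_loss(
    student_az: torch.Tensor,  # student embeddings of Azerbaijani sentences
    student_en: torch.Tensor,  # student embeddings of the parallel English sentences
    teacher_en: torch.Tensor,  # frozen teacher embeddings of the same English sentences
    alpha: float = 0.5,        # illustrative name for the single balancing hyperparameter
) -> torch.Tensor:
    # Cross-lingual distillation: pull the student's Azerbaijani embeddings
    # toward the frozen teacher's embeddings of the parallel English sentences,
    # transferring the teacher's semantic space to the low-resource language.
    loss_crosslingual = F.mse_loss(student_az, teacher_en)

    # English preservation: keep the student's English embeddings close to
    # the teacher's, so the original English performance is retained.
    loss_english = F.mse_loss(student_en, teacher_en)

    # A single hyperparameter trades off the two languages.
    return alpha * loss_crosslingual + (1.0 - alpha) * loss_english
```

In a sketch like this, pushing `alpha` toward 1 favors the low-resource language, while values near 0 prioritize retaining English behavior; the abstract's reported English retention (98.6% of MTEB) corresponds to tuning this trade-off.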
Ethical Statement
All datasets have been released as open source.
Supporting Institution
Baku Higher Oil School
References
- Alizada, T., Suleymanov, U., & Rustamov, Z. (2024). Contextualized Word Embeddings in Azerbaijani Language. In 2024 IEEE 18th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1–6). IEEE. https://doi.org/10.1109/aict61888.2024.10740448
- Bandarkar, L., Liang, D., Muller, B., Artetxe, M., Shukla, S. N., Husa, D., Goyal, N., Krishnan, A., Zettlemoyer, L., & Khabsa, M. (2024). The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 749–775). https://doi.org/10.18653/v1/2024.acl-long.44
- Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1508.05326
- Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation (Version 5). arXiv. https://doi.org/10.48550/ARXIV.2402.03216
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Version 2). arXiv. https://doi.org/10.48550/ARXIV.1810.04805
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2106.09685
- Isbarov, J., Huseynova, K., Mammadov, E., Hajili, M., & Ataman, D. (2024). Open foundation models for Azerbaijani language. In Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024) (pp. 18–28). https://aclanthology.org/2024.sigturk-1.2
- Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. de las, Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., & Sayed, W. E. (2023). Mistral 7B (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2310.06825
- Kabir, M. R., Nabil, Md. M. R., & Khan, M. A. (2024). BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2411.15270
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Version 4). arXiv. https://doi.org/10.48550/ARXIV.2005.11401
- Liu, H., Cui, C., Du, Y., Liu, Y., & Pan, G. (2025). PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2503.18382
- LocalDoc. (2024). TEmA-small (Revision 5a04b7f). Hugging Face. https://huggingface.co/LocalDoc/TEmA-small
- Miao, Z., Wu, Q., Zhao, K., Wu, Z., & Tsuruoka, Y. (2024). Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2404.02490
- Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2022). MTEB: Massive Text Embedding Benchmark (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2210.07316
- NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Gonzalez, G. M., Hansanti, P., … Wang, J. (2022). No Language Left Behind: Scaling Human-Centered Machine Translation (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2207.04672
- Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). https://aclanthology.org/P02-1040/
- Smith, R. (2007). An Overview of the Tesseract OCR Engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2 (pp. 629–633). IEEE. https://doi.org/10.1109/icdar.2007.4376991
- Sturua, S., Mohr, I., Akram, M. K., Günther, M., Wang, B., Krimmel, M., Wang, F., Mastrapas, G., Koukounas, A., Wang, N., & Xiao, H. (2024). jina-embeddings-v3: Multilingual Embeddings With Task LoRA (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2409.10173
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2302.13971
- Wang, B., Xu, C., Zhao, X., Ouyang, L., Wu, F., Zhao, Z., Xu, R., Liu, K., Qu, Y., Shang, F., Zhang, B., Wei, L., Sui, Z., Li, W., Shi, B., Qiao, Y., Lin, D., & He, C. (2024). MinerU: An Open-Source Solution for Precise Document Content Extraction (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2409.18839