Knowledge Distillation for Embeddings of Low-Resource Turkic Family Languages
Year 2026, Volume: 13, Issue: 1, 334-347, 31.03.2026
Mirakram Aghalarov, Elvin Mammadov, Kavsar Huseynova
Abstract
Embedding models are a critical component of Retrieval-Augmented Generation (RAG) systems, but building efficient, high-performance models for low-resource languages such as Azerbaijani remains a significant challenge. To address this, we propose a multi-task distillation framework that improves embedding quality in a target low-resource language by distilling knowledge from English while, crucially, preserving the model's original English performance. Our approach combines cross-lingual knowledge distillation with a bilingual training objective, controlled by a single hyperparameter that balances performance across the two languages. We also release a large-scale English-Azerbaijani parallel corpus of 300,000 Wikipedia articles to facilitate further research. Our experimental results demonstrate that models trained with our framework achieve state-of-the-art performance on Azerbaijani benchmarks: our BGE M3 model improves Azerbaijani STS from 79.3 to 85.0 (+5.7 points) while retaining 98.6% of its original English MTEB performance. In a real-world RAG application for Azerbaijani legal document retrieval, our best model achieves 89% top-1 retrieval accuracy and a 94% "Excellent" rating from domain experts, significantly outperforming baseline models. The framework provides a generalizable blueprint for enhancing embedding models in other low-resource languages, particularly within the Turkic language family.
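The bilingual objective described in the abstract can be made concrete with a short sketch. The following is a minimal illustration under stated assumptions, not the authors' actual implementation: it assumes PyTorch, precomputed sentence embeddings for a parallel English-Azerbaijani batch, MSE as the distillation distance, and an illustrative hyperparameter name `alpha` for the single balancing knob mentioned above.

```python
import torch
import torch.nn.functional as F

def bilingual_distillation_loss(
    student_az: torch.Tensor,  # student embeddings of Azerbaijani sentences
    student_en: torch.Tensor,  # student embeddings of the parallel English sentences
    teacher_en: torch.Tensor,  # frozen teacher embeddings of the same English sentences
    alpha: float = 0.5,        # illustrative name for the single balancing hyperparameter
) -> torch.Tensor:
    # Cross-lingual distillation: pull the student's Azerbaijani embeddings
    # toward the frozen teacher's embeddings of the parallel English sentences,
    # transferring the teacher's semantic space to the low-resource language.
    loss_crosslingual = F.mse_loss(student_az, teacher_en)

    # English preservation: keep the student's English embeddings close to
    # the teacher's, so the original English performance is retained.
    loss_english = F.mse_loss(student_en, teacher_en)

    # A single hyperparameter trades off the two languages.
    return alpha * loss_crosslingual + (1.0 - alpha) * loss_english
```

In a sketch like this, pushing `alpha` toward 1 favors the low-resource language, while values near 0 prioritize retaining English behavior; the abstract's reported English retention (98.6% of MTEB) corresponds to tuning this trade-off.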
Ethical Statement
All datasets have been released as open source.
Supporting Institution
Baku Higher Oil School
References
- Alizada, T., Suleymanov, U., & Rustamov, Z. (2024). Contextualized Word Embeddings in Azerbaijani Language. In 2024 IEEE 18th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1–6). IEEE. https://doi.org/10.1109/aict61888.2024.10740448
- Bandarkar, L., Liang, D., Muller, B., Artetxe, M., Shukla, S. N., Husa, D., Goyal, N., Krishnan, A., Zettlemoyer, L., & Khabsa, M. (2024). The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 749–775). https://doi.org/10.18653/v1/2024.acl-long.44
- Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference (Version 1). arXiv. https://doi.org/10.48550/ARXIV.1508.05326
- Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation (Version 5). arXiv. https://doi.org/10.48550/ARXIV.2402.03216
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Version 2). arXiv. https://doi.org/10.48550/ARXIV.1810.04805
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2106.09685
- Isbarov, J., Huseynova, K., Mammadov, E., Hajili, M., & Ataman, D. (2024). Open foundation models for Azerbaijani language. In Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024) (pp. 18–28). https://aclanthology.org/2024.sigturk-1.2
- Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. de las, Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., & Sayed, W. E. (2023). Mistral 7B (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2310.06825
- Kabir, M. R., Nabil, Md. M. R., & Khan, M. A. (2024). BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2411.15270
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Version 4). arXiv. https://doi.org/10.48550/ARXIV.2005.11401
- Liu, H., Cui, C., Du, Y., Liu, Y., & Pan, G. (2025). PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2503.18382
- LocalDoc. (2024). TEmA-small (Revision 5a04b7f). Hugging Face. https://huggingface.co/LocalDoc/TEmA-small
- Miao, Z., Wu, Q., Zhao, K., Wu, Z., & Tsuruoka, Y. (2024). Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2404.02490
- Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2022). MTEB: Massive Text Embedding Benchmark (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2210.07316
- NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Gonzalez, G. M., Hansanti, P., … Wang, J. (2022). No Language Left Behind: Scaling Human-Centered Machine Translation (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2207.04672
- Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). https://aclanthology.org/P02-1040/
- Smith, R. (2007). An Overview of the Tesseract OCR Engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2 (pp. 629–633). IEEE. https://doi.org/10.1109/icdar.2007.4376991
- Sturua, S., Mohr, I., Akram, M. K., Günther, M., Wang, B., Krimmel, M., Wang, F., Mastrapas, G., Koukounas, A., Wang, N., & Xiao, H. (2024). jina-embeddings-v3: Multilingual Embeddings With Task LoRA (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2409.10173
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2302.13971
- Wang, B., Xu, C., Zhao, X., Ouyang, L., Wu, F., Zhao, Z., Xu, R., Liu, K., Qu, Y., Shang, F., Zhang, B., Wei, L., Sui, Z., Li, W., Shi, B., Qiao, Y., Lin, D., & He, C. (2024). MinerU: An Open-Source Solution for Precise Document Content Extraction (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2409.18839