Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments
Year: 2026, Volume: 10, Issue: 2, Pages: 370-377, Published: 01.05.2026
Vinuja G, Niyas Ahamed A
Abstract
Recent Vision–Language Models (VLMs) degrade sharply when either the visual or the textual input is corrupted by noise such as blur, occlusion, or garbled text. This paper presents a Dynamic Modality Reweighting (DMR) framework that rebalances the contributions of visual and textual features at inference time according to their estimated reliability. The architecture comprises a Confidence Estimation Network (CEN), which assigns a trust score to each modality, followed by a Dynamic Fusion Layer (DFL), which combines the modality embeddings using data-driven weights. On noisy versions of the MS-COCO, Flickr30k, and Visual Genome datasets, DMR yields up to a 23% improvement in multimodal consistency and a 17% reduction in semantic drift compared with baseline CLIP and BLIP-2 models.
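To make the abstract's pipeline concrete, below is a minimal PyTorch-style sketch of how a CEN and DFL of this kind could fit together. The class names, layer sizes, and the sigmoid/softmax normalization are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceEstimationNetwork(nn.Module):
    """Hypothetical CEN: maps a modality embedding to a scalar
    trust score in (0, 1). The MLP shape is an assumption."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) embedding -> (batch, 1) reliability score
        return torch.sigmoid(self.mlp(x))

class DynamicFusionLayer(nn.Module):
    """Hypothetical DFL: fuses visual and textual embeddings with
    weights derived from per-modality trust scores."""
    def __init__(self, dim: int):
        super().__init__()
        self.vis_cen = ConfidenceEstimationNetwork(dim)
        self.txt_cen = ConfidenceEstimationNetwork(dim)

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Trust scores for each modality, normalized to sum to 1
        scores = torch.cat([self.vis_cen(v), self.txt_cen(t)], dim=-1)
        w = F.softmax(scores, dim=-1)  # (batch, 2)
        # Reliability-weighted combination of the two embeddings
        return w[:, :1] * v + w[:, 1:] * t

# Illustrative usage with CLIP-sized 512-d embeddings
if __name__ == "__main__":
    fusion = DynamicFusionLayer(dim=512)
    v = torch.randn(4, 512)  # image embeddings (e.g., frozen vision encoder)
    t = torch.randn(4, 512)  # text embeddings (e.g., frozen text encoder)
    print(fusion(v, t).shape)  # torch.Size([4, 512])
```

In this sketch, a noisy image (low visual trust score) automatically shifts fusion weight toward the text embedding, and vice versa, which is the reweighting behavior the abstract describes.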
References
- Menachery, N., Deepanraj, B., & Thomas, S. (2025). Characterization and analysis of carbon fiber and nano hBN reinforced hybrid aluminium metal matrix composites by conventional sintering. Turkish Journal of Engineering, 9(2), 378-384.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
- Gnanasekaran, K., Rajesh, M., & Hariram, V. (2025). Optimizing abrasive water jet parameters for enhanced interactivity in metal-stacked hybrid fiber laminates. Turkish Journal of Engineering, 9(1), 28-36.
- Li, J., Li, D., Xiong, C., & Hoi, S. (2022, June). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (pp. 12888-12900). PMLR.
- Li, J., Li, D., Savarese, S., & Hoi, S. (2023, July). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (pp. 19730-19742). PMLR.
- Soydan, Z., Şahin, F. İ., & Acaralı, N. (2024). Advancements in polymeric matrix composite production: A review on methods and approaches. Turkish Journal of Engineering, 8(4), 677-686.
- Dai, Y., Chen, H., Du, J., Wang, R., Chen, S., Wang, H., & Lee, C. H. (2024). A study of dropout-induced modality bias on robustness to missing video frames for audio-visual speech recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 27445-27455).
- Wang, H., Du, J., Dai, Y., Lee, C. H., Ren, Y., & Liu, Y. (2024, April). Improving multi-modal emotion recognition using entropy-based fusion and pruning-based network architecture optimization. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 11766-11770). IEEE.
- Khurshid, F., & Günal, A. Y. (2024). Harnessing earthquake generated glass and plastic waste for sustainable construction. Turkish Journal of Engineering, 8(2), 394-402.
- Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., & Bai, X. (2022, October). A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In European Conference on Computer Vision (pp. 736-753). Cham: Springer Nature Switzerland.
- Kılınçarslan, Ş., & Türker, Y. Ş. (2022). Strengthening of solid beam with fiber reinforced polymers. Turkish Journal of Engineering, 7(3), 166-171.
- Li, M., Xu, R., Wang, S., Zhou, L., Lin, X., Zhu, C., ... & Chang, S. F. (2022). CLIP-Event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16420-16429).
- Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., ... & Duerig, T. (2021, July). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (pp. 4904-4916). PMLR.
- Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... & Simonyan, K. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716-23736.
- Kim, K., Park, S., et al. (2023). Multimodal robustness via modality dropout and contrastive consistency. IEEE Transactions on Multimedia, 25, 5678-5692.
- Shen, P., Wu, Y., et al. (2024). Entropy-guided attention pruning for multimodal transformers. IEEE Transactions on Neural Networks and Learning Systems, 35, 1123-1135.
- Wang, X., Zhou, M., et al. (2023). Uncertainty-gated multimodal fusion for noisy visual-audio tasks. IEEE Signal Processing Letters, 30, 1012-1016.
- Chen, J., Yang, L., et al. (2024). Reliability-guided transformers for cross-domain sentiment fusion. IEEE Transactions on Knowledge and Data Engineering, 36, 1456-1468.
- Yang, L., Zhang, F., et al. (2024). Entropy-guided fusion for multimodal representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Advance online publication.
- Zhang, Y., Li, X., et al. (2024). Robust-ALBEF: Enhancing multimodal fusion under noisy conditions. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Dosovitskiy, A., Beyer, L., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR).
- Zhang, Y., Liu, X., et al. (2022). Noisy supervision for webly-supervised visual grounding. In Proceedings of the European Conference on Computer Vision (ECCV).
- Jia, R., Zhou, X., et al. (2023). Certified robustness to text adversarial attacks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Li, J., Chen, H., et al. (2021). ALBEF: Align before fuse for multimodal learning. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS).