Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments
Year: 2026, Volume: 10, Issue: 2, Pages: 370-377, Published: 01.05.2026
Vinuja G, Niyas Ahamed A
Abstract
Recent Vision–Language Models (VLMs) degrade sharply when either the visual or the textual input is corrupted by noise such as blur, occlusion, or garbled text. This paper presents a Dynamic Modality Reweighting (DMR) framework that rebalances the contributions of visual and textual features at inference time according to their estimated reliability. The architecture comprises a Confidence Estimation Network (CEN), which assigns a trust score to each modality, followed by a Dynamic Fusion Layer (DFL), which combines the modality embeddings using data-driven weights. On noisy versions of the MS-COCO, Flickr30k, and Visual Genome datasets, DMR yields up to a 23% improvement in multimodal consistency and a 17% reduction in semantic drift compared with baseline CLIP and BLIP-2 models.
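To make the abstract's pipeline concrete, below is a minimal PyTorch-style sketch of how a CEN and DFL of this kind could fit together. The class names, layer sizes, and the sigmoid/softmax normalization are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceEstimationNetwork(nn.Module):
    """Hypothetical CEN: maps a modality embedding to a scalar
    trust score in (0, 1). The MLP shape is an assumption."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) embedding -> (batch, 1) reliability score
        return torch.sigmoid(self.mlp(x))

class DynamicFusionLayer(nn.Module):
    """Hypothetical DFL: fuses visual and textual embeddings with
    weights derived from per-modality trust scores."""
    def __init__(self, dim: int):
        super().__init__()
        self.vis_cen = ConfidenceEstimationNetwork(dim)
        self.txt_cen = ConfidenceEstimationNetwork(dim)

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Trust scores for each modality, normalized to sum to 1
        scores = torch.cat([self.vis_cen(v), self.txt_cen(t)], dim=-1)
        w = F.softmax(scores, dim=-1)  # (batch, 2)
        # Reliability-weighted combination of the two embeddings
        return w[:, :1] * v + w[:, 1:] * t

# Illustrative usage with CLIP-sized 512-d embeddings
if __name__ == "__main__":
    fusion = DynamicFusionLayer(dim=512)
    v = torch.randn(4, 512)  # image embeddings (e.g., frozen vision encoder)
    t = torch.randn(4, 512)  # text embeddings (e.g., frozen text encoder)
    print(fusion(v, t).shape)  # torch.Size([4, 512])
```

In this sketch, a noisy image (low visual trust score) automatically shifts fusion weight toward the text embedding, and vice versa, which is the reweighting behavior the abstract describes.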
References
- Menachery, N., Deepanraj, B., & Thomas, S. (2025). Characterization and analysis of carbon fiber and nano hBN reinforced hybrid aluminium metal matrix composites by conventional sintering. Turkish Journal of Engineering, 9(2), 378-384.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
- Gnanasekaran, K., Rajesh, M., & Hariram, V. (2025). Optimizing abrasive water jet parameters for enhanced interactivity in metal-stacked hybrid fiber laminates. Turkish Journal of Engineering, 9(1), 28-36.
- Li, J., Li, D., Xiong, C., & Hoi, S. (2022, June). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (pp. 12888-12900). PMLR.
- Li, J., Li, D., Savarese, S., & Hoi, S. (2023, July). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (pp. 19730-19742). PMLR.
- Soydan, Z., Şahin, F. İ., & Acaralı, N. (2024). Advancements in polymeric matrix composite production: A review on methods and approaches. Turkish Journal of Engineering, 8(4), 677-686.
- Dai, Y., Chen, H., Du, J., Wang, R., Chen, S., Wang, H., & Lee, C. H. (2024). A study of dropout-induced modality bias on robustness to missing video frames for audio-visual speech recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 27445-27455).
- Wang, H., Du, J., Dai, Y., Lee, C. H., Ren, Y., & Liu, Y. (2024, April). Improving multi-modal emotion recognition using entropy-based fusion and pruning-based network architecture optimization. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 11766-11770). IEEE.
- Khurshid, F., & Günal, A. Y. (2024). Harnessing earthquake generated glass and plastic waste for sustainable construction. Turkish Journal of Engineering, 8(2), 394-402.
- Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., & Bai, X. (2022, October). A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In European Conference on Computer Vision (pp. 736-753). Cham: Springer Nature Switzerland.
- Kılınçarslan, Ş., & Türker, Y. Ş. (2022). Strengthening of solid beam with fiber reinforced polymers. Turkish Journal of Engineering, 7(3), 166-171.
- Li, M., Xu, R., Wang, S., Zhou, L., Lin, X., Zhu, C., ... & Chang, S. F. (2022). CLIP-Event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16420-16429).
- Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., ... & Duerig, T. (2021, July). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (pp. 4904-4916). PMLR.
- Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... & Simonyan, K. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716-23736.
- Kim, K., Park, S., et al. (2023). Multimodal robustness via modality dropout and contrastive consistency. IEEE Transactions on Multimedia, 25, 5678-5692.
- Shen, P., Wu, Y., et al. (2024). Entropy-guided attention pruning for multimodal transformers. IEEE Transactions on Neural Networks and Learning Systems, 35, 1123-1135.
- Wang, X., Zhou, M., et al. (2023). Uncertainty-gated multimodal fusion for noisy visual-audio tasks. IEEE Signal Processing Letters, 30, 1012-1016.
- Chen, J., Yang, L., et al. (2024). Reliability-guided transformers for cross-domain sentiment fusion. IEEE Transactions on Knowledge and Data Engineering, 36, 1456-1468.
- Yang, L., Zhang, F., et al. (2024). Entropy-guided fusion for multimodal representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Advance online publication.
- Zhang, Y., Li, X., et al. (2024). Robust-ALBEF: Enhancing multimodal fusion under noisy conditions. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Dosovitskiy, A., Beyer, L., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR).
- Zhang, Y., Liu, X., et al. (2022). Noisy supervision for webly-supervised visual grounding. In Proceedings of the European Conference on Computer Vision (ECCV).
- Jia, R., Zhou, X., et al. (2023). Certified robustness to text adversarial attacks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Li, J., Chen, H., et al. (2021). ALBEF: Align before fuse for multimodal learning. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS).