Research Article

Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments

Year 2026, Volume: 10, Issue: 2, 370–377, 01.05.2026
https://doi.org/10.31127/tuje.1813692
https://izlik.org/JA46ZY37UM

Abstract

Recent Vision–Language Models (VLMs) degrade sharply when either the visual or the textual input is corrupted by noise such as blur, occlusion, or garbled text. This paper presents a Dynamic Modality Reweighting (DMR) framework that rebalances the contributions of visual and textual features at inference time according to their estimated reliability. The architecture comprises a Confidence Estimation Network (CEN) that assigns a trust score to each modality, followed by a Dynamic Fusion Layer (DFL) that combines the embeddings using data-driven weights. Experiments on noisy versions of the MS-COCO, Flickr30k, and Visual Genome datasets show up to a 23% improvement in multimodal consistency and a 17% reduction in semantic drift compared with baseline CLIP and BLIP-2 models.
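The confidence-weighted fusion described in the abstract can be sketched as follows. This is a minimal, illustrative example only: the paper's CEN is a learned network, whereas here the trust scores are passed in as plain scalars, and the function names (`dynamic_fusion`, `v_conf`, `t_conf`) are hypothetical stand-ins, not identifiers from the paper.

```python
import math

def dynamic_fusion(v_emb, t_emb, v_conf, t_conf, temperature=1.0):
    """Fuse visual and textual embeddings with confidence-derived weights.

    v_conf / t_conf stand in for the Confidence Estimation Network's
    per-modality trust scores; in the actual framework they would be
    predicted from the inputs rather than supplied by the caller.
    """
    # Softmax over the two confidence scores yields the modality weights
    ev = math.exp(v_conf / temperature)
    et = math.exp(t_conf / temperature)
    w_v, w_t = ev / (ev + et), et / (ev + et)
    # Convex combination of same-dimensional embeddings (the DFL step)
    fused = [w_v * v + w_t * t for v, t in zip(v_emb, t_emb)]
    return fused, (w_v, w_t)

# Example: the text side is noisier, so its confidence is lower and
# the fused vector leans toward the visual embedding.
fused, (w_v, w_t) = dynamic_fusion([1.0, 0.0], [0.0, 1.0],
                                   v_conf=2.0, t_conf=0.5)
```

The `temperature` knob controls how aggressively the fusion down-weights the less reliable modality: low temperatures approach a hard modality switch, high temperatures approach uniform averaging.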

References

  • Menachery, N., Deepanraj, B., & Thomas, S. (2025). Characterization and analysis of carbon fiber and nano hBN reinforced hybrid Aluminium Metal Matrix Composites by conventional sintering. Turkish Journal of Engineering, 9(2), 378-384.
  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
  • Gnanasekaran, K., Rajesh, M., & Hariram, V. (2025). Optimizing abrasive water jet parameters for enhanced interactivity in metal-stacked hybrid fiber laminates. Turkish Journal of Engineering, 9(1), 28-36.
  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022, June). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning (pp. 12888-12900). PMLR.
  • Li, J., Li, D., Savarese, S., & Hoi, S. (2023, July). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning (pp. 19730-19742). PMLR.
  • Soydan, Z., Şahin, F. İ., & Acaralı, N. (2024). Advancements in polymeric matrix composite production: A review on methods and approaches. Turkish Journal of Engineering, 8(4), 677-686.
  • Dai, Y., Chen, H., Du, J., Wang, R., Chen, S., Wang, H., & Lee, C. H. (2024). A study of dropout-induced modality bias on robustness to missing video frames for audio-visual speech recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 27445-27455).
  • Wang, H., Du, J., Dai, Y., Lee, C. H., Ren, Y., & Liu, Y. (2024, April). Improving Multi-Modal Emotion Recognition Using Entropy-Based Fusion and Pruning-Based Network Architecture Optimization. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 11766-11770). IEEE.
  • Khurshid, F., & Günal, A. Y. (2024). Harnessing earthquake generated glass and plastic waste for sustainable construction. Turkish Journal of Engineering, 8(2), 394-402.
  • Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., & Bai, X. (2022, October). A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In European Conference on Computer Vision (pp. 736-753). Cham: Springer Nature Switzerland.
  • Kılınçarslan, Ş., & Türker, Y. Ş. (2022). Strengthening of solid beam with fiber reinforced polymers. Turkish Journal of Engineering, 7(3), 166-171.
  • Li, M., Xu, R., Wang, S., Zhou, L., Lin, X., Zhu, C., ... & Chang, S. F. (2022). Clip-event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16420-16429).
  • Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., ... & Duerig, T. (2021, July). Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning (pp. 4904-4916). PMLR.
  • Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... & Simonyan, K. (2022). Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35, 23716-23736.
  • Kim, K., Park, S., et al. (2023). Multimodal robustness via modality dropout and contrastive consistency. IEEE Transactions on Multimedia, 25, 5678–5692.
  • Shen, P., Wu, Y., et al. (2024). Entropy-guided attention pruning for multimodal transformers. IEEE Transactions on Neural Networks and Learning Systems, 35, 1123–1135.
  • Wang, X., Zhou, M., et al. (2023). Uncertainty-gated multimodal fusion for noisy visual-audio tasks. IEEE Signal Processing Letters, 30, 1012–1016.
  • Chen, J., Yang, L., et al. (2024). Reliability-guided transformers for cross-domain sentiment fusion. IEEE Transactions on Knowledge and Data Engineering, 36, 1456–1468.
  • Yang, L., Zhang, F., et al. (2024). Entropy-guided fusion for multimodal representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Advance online publication.
  • Zhang, Y., Li, X., et al. (2024). Robust-ALBEF: Enhancing multimodal fusion under noisy conditions. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Zhai, X., Yuan, L., et al. (2022). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Dosovitskiy, A., Beyer, L., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Zhang, Y., Liu, X., et al. (2022). Noisy supervision for webly-supervised visual grounding. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Jia, R., Zhou, X., et al. (2023). Certified robustness to text adversarial attacks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Li, J., Chen, H., et al. (2021). ALBEF: Align before fuse for multimodal learning. In Proceedings of NeurIPS.
There are 25 citations in total.

Details

Primary Language English
Subjects Network Engineering
Journal Section Research Article
Authors

Vinuja G (ORCID: 0000-0001-8109-988X)

Niyas Ahamed A (ORCID: 0009-0004-7271-1781)

Submission Date October 30, 2025
Acceptance Date December 18, 2025
Publication Date May 1, 2026
DOI https://doi.org/10.31127/tuje.1813692
IZ https://izlik.org/JA46ZY37UM
Published in Issue Year 2026 Volume: 10 Issue: 2

Cite

APA G, V., & A, N. A. (2026). Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments. Turkish Journal of Engineering, 10(2), 370-377. https://doi.org/10.31127/tuje.1813692
AMA 1. G V, A NA. Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments. TUJE. 2026;10(2):370-377. doi:10.31127/tuje.1813692
Chicago G, Vinuja, and Niyas Ahamed A. 2026. “Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments”. Turkish Journal of Engineering 10 (2): 370-77. https://doi.org/10.31127/tuje.1813692.
EndNote G V, A NA (May 1, 2026) Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments. Turkish Journal of Engineering 10 2 370–377.
IEEE [1]V. G and N. A. A, “Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments”, TUJE, vol. 10, no. 2, pp. 370–377, May 2026, doi: 10.31127/tuje.1813692.
ISNAD G, Vinuja - A, Niyas Ahamed. “Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments”. Turkish Journal of Engineering 10/2 (May 1, 2026): 370-377. https://doi.org/10.31127/tuje.1813692.
JAMA 1. G V, A NA. Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments. TUJE. 2026;10:370–377.
MLA G, Vinuja, and Niyas Ahamed A. “Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments”. Turkish Journal of Engineering, vol. 10, no. 2, May 2026, pp. 370-7, doi:10.31127/tuje.1813692.
Vancouver 1. Vinuja G, Niyas Ahamed A. Dynamic Modality Reweighting for Robust Vision–Language Models under Noisy Multimodal Environments. TUJE. 2026 May 1;10(2):370-7. doi:10.31127/tuje.1813692