Research Article

PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION

Year 2025, Volume: 9 Issue: 2, 247 - 262, 30.08.2025
https://doi.org/10.46519/ij3dptdi.1693015

Abstract

Vision-Language Models (VLMs) have introduced a paradigm shift in image classification by integrating visual and textual modalities. While these models have demonstrated strong performance on multimodal tasks, their effectiveness in purely visual classification remains underexplored. This study presents a comprehensive, metric-driven comparative analysis of eight state-of-the-art VLMs—GPT-4o-latest, GPT-4o-mini, Gemini-flash-1.5-8b, LLaMA-3.2-90B-vision-instruct, Grok-2-vision-1212, Qwen2.5-vl-7b-instruct, Claude-3.5-sonnet, and Pixtral-large-2411—across four datasets: CIFAR-10, ImageNet, COCO, and the domain-specific New Plant Diseases dataset. Model performance was evaluated using accuracy, precision, recall, F1-score, and robustness under zero-shot and few-shot settings. Quantitative results indicate that GPT-4o-latest consistently achieves the highest performance on typical benchmarks (accuracy: 0.91, F1-score: 0.91 on CIFAR-10), substantially surpassing lightweight models such as Pixtral-large-2411 (accuracy: 0.13, F1-score: 0.13). Near-perfect results on ImageNet and COCO likely reflect pre-training overlap, whereas notable performance degradation on the New Plant Diseases dataset underscores domain adaptation challenges. Our findings emphasize the need for robust, parameter-efficient, and domain-adaptive fine-tuning strategies to advance VLMs in real-world image classification.
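As an illustration of the evaluation described above, the sketch below shows how zero-shot classification metrics of this kind can be computed. The query_vlm() helper, the prompt wording, and the label-matching step are hypothetical placeholders (the article's exact prompts and client code are not reproduced here); the metrics themselves use the standard scikit-learn implementations of accuracy and macro-averaged precision, recall, and F1-score.

# Illustrative sketch only: zero-shot VLM classification scored with accuracy,
# macro-averaged precision, recall, and F1 (the metrics reported in the abstract).
# query_vlm() is a hypothetical stand-in for whichever VLM client is benchmarked.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]  # CIFAR-10 label set

PROMPT = ("Classify the image into exactly one of these categories: "
          + ", ".join(CLASSES) + ". Answer with the category name only.")

def query_vlm(image_path, prompt):
    """Hypothetical call to a vision-language model; returns its raw text answer."""
    raise NotImplementedError("Wire this to the VLM API being evaluated.")

def evaluate(samples):
    """samples: iterable of (image_path, true_label) pairs."""
    y_true, y_pred = [], []
    for image_path, label in samples:
        answer = query_vlm(image_path, PROMPT).strip().lower()
        # Map the free-form answer back onto the label set; anything else counts as an error.
        prediction = next((c for c in CLASSES if c in answer), "unknown")
        y_true.append(label)
        y_pred.append(prediction)
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=CLASSES, average="macro", zero_division=0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

A few-shot variant would only change the prompt, for example by prepending a handful of labelled example images where the model's API supports it; the metric computation itself stays the same.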

Ethical Statement

This study does not require ethics committee approval or any other special permission.

Supporting Institution

This research was supported by The Scientific and Technological Research Council of Türkiye (TÜBİTAK) under the 1005-National New Ideas and Products Research Support Program (Project No: 124E769).

Project Number

124E769

References

  • 1. LeCun, Y., Bengio, Y., and Hinton, G. “Deep learning”, Nature, Vol. 521, Issue 7553, Pages 436-444, 2015.
  • 2. Yao, G., Lei, T., and Zhong, J. “A review of convolutional‐neural‐network‐based action recognition”, Pattern Recognition Letters, Vol. 118, Pages 14-22, 2019.
  • 3. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and Bengio, Y. “Generative adversarial networks”, Communications of the ACM, Vol. 63, Issue 11, Pages 139-144, 2020.
  • 4. Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., … and Farhan, L. “Review of deep learning: concepts, CNN architectures, challenges, applications, future directions”, Journal of Big Data, Vol. 8, Issue 53, 2021.
  • 5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... and Polosukhin, I. “Attention is all you need”, arXiv preprint arXiv: 1706.03762v7, 2017.
  • 6. Krizhevsky, A., and Hinton, G. “Learning multiple layers of features from tiny images”, https://www.cs.toronto.edu/~kriz/cifar.html, 2009.
  • 7. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-Fei, L. “ImageNet: A large-scale hierarchical image database”, IEEE Conference on Computer Vision and Pattern Recognition, Pages 248-255, 2009.
  • 8. LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, Vol. 86, Issue 11, Pages 2278-2324, 1998.
  • 9. Liu, Z., Luo, P., Wang, X., and Tang, X. "Deep learning face attributes in the wild," IEEE International Conference on Computer Vision, Pages 3730-3738, 2015.
  • 10. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... and Dollar, P. “Microsoft COCO: Common objects in context”, arXiv preprint arXiv: 1405.0312v3, 2014.
  • 11. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., … and Sutskever, I. “Learning transferable visual models from natural language supervision”, arXiv preprint arXiv: 2103.00020, 2021.
  • 12. Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., ... and Duerig, T. “Scaling up visual and vision-language representation learning with noisy text supervision”, arXiv preprint arXiv: 2102.05918v2, 2021.
  • 13. Lai, Z., Saveris, V., Chen, C., Chen, H. Y., Zhang, H., … and Yang Y. “Revisit large-scale image-caption data in pre-training multimodal foundation models”, arXiv preprint arXiv: 2410.02740v1, 2024.
  • 14. Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... and Simonyan, K. “Flamingo: A visual language model for few-shot learning”, arXiv preprint arXiv: 2204.14198v2, 2022.
  • 15. Simonyan, K. and Zisserman, A. “Very deep convolutional networks for large-scale image recognition”, arXiv preprint arXiv: 1409.1556v6, 2015.
  • 16. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... and Rabinovich, A. “Going deeper with convolutions”, arXiv preprint arXiv: 1409.4842v1, 2015.
  • 17. He, K., Zhang, X., Ren, S., and Sun, J. “Deep residual learning for image recognition”, arXiv preprint arXiv: 1512.03385v1, 2016.
  • 18. Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. “Densely connected convolutional networks”, arXiv preprint arXiv: 1608.06993v5, 2017.
  • 19. Wang, Z., Lu, Y., Li, W., Wang, S., Wang, X., and Chen, X. “Single image super-resolution with attention-based densely connected module”, Neurocomputing, Vol. 453, Pages 876-884, 2021.
  • 20. Yildiz, E., Yuksel, M. E., and Sevgen, S. “A single-image GAN model using self-attention mechanism and DenseNets”, Neurocomputing, Vol. 596, Issue 127873, 2024.
  • 21. Tan, M. and Le, Q. V. “EfficientNet: Rethinking model scaling for convolutional neural networks”, arXiv preprint arXiv: 1905.11946v5, 2019.
  • 22. Lu, J., Batra, D., Parikh, D., and Lee, S. “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks”, arXiv preprint arXiv: 1908.02265v1, 2019.
  • 23. Li, L. H., Yatskar, M., Yin, D., Hsieh, C. J., and Chang, K. W. “VisualBERT: A simple and performant baseline for vision and language”, arXiv preprint arXiv: 1908.03557v1, 2019.
  • 24. Tan, H. and Bansal, M. “LXMERT: Learning cross-modality encoder representations from transformers”, arXiv preprint arXiv: 1908.07490v3, 2019.
  • 25. Chen, Y. C., Li, L., Yu, L., Kholy, A. E., Ahmed, F., and Liu, J. “UNITER: Universal image-text representation learning”, arXiv preprint arXiv: 1909.11740v3, 2020.
  • 26. Li, X., Yin, X., Li, C., Zhang, P., Hu, X., … and Gao, J. “Oscar: Object-semantics aligned pre-training for vision-language tasks”, arXiv preprint arXiv: 2004.06165v5, 2020.
  • 27. Li, J., Selvaraju, R. R., Gotmare, A. D., … and Hoi, S. “Align before fuse: Vision and language representation learning with momentum distillation”, arXiv preprint arXiv: 2107.07651v2, 2021.
  • 28. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... and Houlsby, N. “An image is worth 16x16 words: Transformers for image recognition at scale”, arXiv preprint arXiv: 2010.11929v2, 2021.
  • 29. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., ... and Guo, B. “Swin Transformer: Hierarchical vision transformer using shifted windows”, arXiv preprint arXiv: 2103.14030v2, 2021.
  • 30. Kim, W., Son, B., and Kim, I. “ViLT: Vision-and-Language Transformer without Convolution or Region Supervision”, arXiv preprint arXiv: 2102.03334v2, 2021.
  • 31. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. “Training data-efficient image transformers & distillation through attention”, arXiv preprint arXiv: 2012.12877v2, 2021.
  • 32. Shen, S., Li, L. H., Tan, H., Bansal, M., Rohrbach, A., …, and Keutzer, K., “How much can CLIP benefit vision-and-language tasks?”, arXiv preprint arXiv: 2107.06383, 2021.
  • 33. Li, J., Li, D., Xiong, C., and Hoi, S. “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation”, arXiv preprint arXiv: 2201.12086v2, 2022.
  • 34. Caffagni, D., Cocchi, F., Barsellotti, L., Moratelli, N., Sarto, S., ... and Cucchiara, R. “The revolution of multimodal large language models: a survey”, arXiv preprint arXiv: 2402.12451v2, 2024.
  • 35. Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., … and Ji, R. “MME: A comprehensive evaluation benchmark for multimodal large language models”, arXiv preprint arXiv: 2306.13394v4, 2024.
  • 36. Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., … and Chen, K. “MultiModal-GPT: A vision and language model for dialogue with humans”, arXiv preprint arXiv: 2305.04790v3, 2023.
  • 37. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., … and Zhou, J. “Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond.” arXiv preprint arXiv: 2308.12966v3, 2023.
  • 38. Mohanty, S. P., Hughes, D. P., and Salathé, M. “Using deep learning for image-based plant disease detection”, Frontiers in Plant Science, Vol. 7, Issue 215232, 2016.
  • 39. Dosovitskiy, A. and Brox, T., “Discriminative unsupervised feature learning with exemplar convolutional neural networks”, arXiv preprint arXiv: 1406.6909v2, 2015.
  • 40. Chollet, F. “Xception: Deep Learning with Depthwise Separable Convolutions”, arXiv preprint arXiv: 1610.02357v3, 2017.
  • 41. Zeiler, M. D. and Fergus, R. “Visualizing and understanding convolutional networks”, arXiv preprint arXiv: 1311.2901v3, 2014.
  • 42. Chen, F., Zhang, D., Han, M., Chen, X., … and Xu, B. “VLP: A survey on vision-language pre-training”, arXiv preprint arXiv: 2202.09061v4, 2022.
  • 43. Long, S., Cao, F., Han, S. C., and Yang, H. “Vision-and-language pretrained models: A survey”, arXiv preprint arXiv: 2204.07356v5, 2022.
  • 44. Gao, P., Geng, S., Zhang, R., Ma, T., … and Qiao, Y. “CLIP-adapter: Better vision-language models with feature adapters”, arXiv preprint arXiv: 2110.04544v2, 2025.
  • 45. Gou, J., Yu, B., Maybank, S. J., and Tao, D. “Knowledge distillation: A survey”, International Journal of Computer Vision, Vol. 129, Pages 1789-1819, 2021.
  • 46. Sanh, V., Debut, L., Chaumond, J., and Wolf, T. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, arXiv preprint arXiv: 1910.01108v4, 2020.
  • 47. Sanh, V., Wolf, T., and Ruder, S. “Movement pruning: Adaptive sparsity by fine-tuning”, arXiv preprint arXiv: 2005.07683v2, 2020.
  • 48. Zhou, K., Yang, J., Loy, C. C., and Liu, Z. “Conditional prompt learning for vision-language models”, arXiv preprint arXiv: 2203.05557v2, 2022.
  • 49. Yu, J., Wang, Z., Vasudevan, V., Yeung, L., … and Wu, Y. “CoCa: Contrastive captioners are image-text foundation models”, arXiv preprint arXiv: 2205.01917v2, 2022.
  • 50. Liu, M., Li, B., and Yu, Y. “Fully fine-tuned CLIP models are efficient few-shot learners”, arXiv preprint arXiv: 2407.04003v1, 2024.
  • 51. Wang, S., Wang, J., Wang, G., Zhang, B., Zhou, K., and Wei, H., “Open-vocabulary calibration for fine-tuned CLIP”, arXiv preprint arXiv: 2402.04655v4, 2024.
  • 52. Chen, J., Yang, D., Jiang, Y., Li, M., … and Zhang, L. “Efficiency in focus: LayerNorm as a catalyst for fine-tuning medical visual language pre-trained models”, arXiv preprint arXiv: 2404.16385v1, 2024.
  • 53. Duan, Z., Cheng, H., Xu, D., Wu, X., Zhang, X., … and Xie, Z. “CityLLaVA: Efficient fine-tuning for VLMs in city scenario”, arXiv preprint arXiv: 2405.03194v1, 2024.
  • 54. Zhao, Y., Pang, T., Du, C., Yang, X., Li, C., Cheung, N. M. M., and Lin, M. “On evaluating adversarial robustness of large vision-language models”, arXiv preprint arXiv: 2305.16934v2, 2023.
  • 55. Zhou, W., Bai, S., Mandic, D. P., Zhao, Q., and Chen, B. “Revisiting the adversarial robustness of vision language models: a multimodal perspective”, arXiv preprint arXiv: 2404.19287, 2024.
  • 56. Li, L., Guan, H., Qiu, J., and Spratling, M. “One prompt word is enough to boost adversarial robustness for pre-trained vision-language models”, arXiv preprint arXiv: 2403.01849v1, 2024.

Details

Primary Language English
Subjects Software Engineering (Other)
Journal Section Research Article
Authors

Doğukan Özeren 0009-0008-3801-3057

Erkan Yüksel 0000-0001-8976-9964

Asım Sinan Yüksel 0000-0003-1986-5269

Project Number 124E769
Publication Date August 30, 2025
Submission Date May 6, 2025
Acceptance Date June 28, 2025
Published in Issue Year 2025 Volume: 9 Issue: 2

Cite

APA Özeren, D., Yüksel, E., & Yüksel, A. S. (2025). PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. International Journal of 3D Printing Technologies and Digital Industry, 9(2), 247-262. https://doi.org/10.46519/ij3dptdi.1693015
AMA Özeren D, Yüksel E, Yüksel AS. PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. International Journal of 3D Printing Technologies and Digital Industry. August 2025;9(2):247-262. doi:10.46519/ij3dptdi.1693015
Chicago Özeren, Doğukan, Erkan Yüksel, and Asım Sinan Yüksel. “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”. International Journal of 3D Printing Technologies and Digital Industry 9, no. 2 (August 2025): 247-62. https://doi.org/10.46519/ij3dptdi.1693015.
EndNote Özeren D, Yüksel E, Yüksel AS (August 1, 2025) PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. International Journal of 3D Printing Technologies and Digital Industry 9 2 247–262.
IEEE D. Özeren, E. Yüksel, and A. S. Yüksel, “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”, International Journal of 3D Printing Technologies and Digital Industry, vol. 9, no. 2, pp. 247–262, 2025, doi: 10.46519/ij3dptdi.1693015.
ISNAD Özeren, Doğukan et al. “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”. International Journal of 3D Printing Technologies and Digital Industry 9/2 (August 2025), 247-262. https://doi.org/10.46519/ij3dptdi.1693015.
JAMA Özeren D, Yüksel E, Yüksel AS. PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. International Journal of 3D Printing Technologies and Digital Industry. 2025;9:247–262.
MLA Özeren, Doğukan et al. “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”. International Journal of 3D Printing Technologies and Digital Industry, vol. 9, no. 2, 2025, pp. 247-62, doi:10.46519/ij3dptdi.1693015.
Vancouver Özeren D, Yüksel E, Yüksel AS. PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. International Journal of 3D Printing Technologies and Digital Industry. 2025;9(2):247-62.


International Journal of 3D Printing Technologies and Digital Industry is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.