Research Article

PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION

Year 2025, Volume: 9, Issue: 2, 247-262, 30.08.2025
https://doi.org/10.46519/ij3dptdi.1693015

Abstract

Vision-Language Models (VLMs) have introduced a paradigm shift in image classification by integrating visual and textual modalities. While these models have demonstrated strong performance on multimodal tasks, their effectiveness in purely visual classification remains underexplored. This study presents a comprehensive, metric-driven comparative analysis of eight state-of-the-art VLMs (GPT-4o-latest, GPT-4o-mini, Gemini-flash-1.5-8b, LLaMA-3.2-90B-vision-instruct, Grok-2-vision-1212, Qwen2.5-vl-7b-instruct, Claude-3.5-sonnet, and Pixtral-large-2411) across four datasets: CIFAR-10, ImageNet, COCO, and the domain-specific New Plant Diseases dataset. Model performance was evaluated using accuracy, precision, recall, F1-score, and robustness under zero-shot and few-shot settings. Quantitative results indicate that GPT-4o-latest consistently achieves the highest performance on standard benchmarks (accuracy: 0.91, F1-score: 0.91 on CIFAR-10), substantially surpassing lightweight models such as Pixtral-large-2411 (accuracy: 0.13, F1-score: 0.13). Near-perfect results on ImageNet and COCO likely reflect pre-training overlap, whereas notable performance degradation on the New Plant Diseases dataset underscores domain adaptation challenges. Our findings emphasize the need for robust, parameter-efficient, and domain-adaptive fine-tuning strategies to advance VLMs in real-world image classification.
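
To make the evaluation protocol described in the abstract concrete, the sketch below shows one plausible way to run zero-shot image classification through a chat-style VLM endpoint and score the predictions with the reported metrics (accuracy, precision, recall, macro F1). It is a minimal illustration under stated assumptions, not the authors' pipeline: the OpenAI-compatible client, the "gpt-4o" model name, the CIFAR-10 label set, the prompt wording, and the classify/evaluate helpers are all introduced here for the example.

```python
# Illustrative sketch only (not the authors' code): zero-shot classification of
# images via a chat-style VLM endpoint, scored with accuracy/precision/recall/F1.
# Model name, prompt wording, and dataset layout are assumptions for this example.
import base64
from pathlib import Path

from openai import OpenAI  # any OpenAI-compatible client/endpoint
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# CIFAR-10 label set, used here as the candidate classes offered in the prompt.
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

client = OpenAI()  # reads OPENAI_API_KEY; set base_url to target other providers


def classify(image_path: str, model: str = "gpt-4o") -> str:
    """Ask the VLM for exactly one label from CLASSES for a single image."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this image. Reply with exactly one label "
                         f"from: {', '.join(CLASSES)}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    # If the model replies verbosely, keep the first known label it mentions.
    return next((c for c in CLASSES if c in answer), answer)


def evaluate(samples: list[tuple[str, str]], model: str = "gpt-4o") -> dict:
    """samples = [(image_path, true_label), ...]; returns the four metrics."""
    y_true = [label for _, label in samples]
    y_pred = [classify(path, model) for path, _ in samples]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}
```

A few-shot variant of this loop would prepend a handful of labelled example images to the message list, and pointing the client at a different base_url and model name lets the same evaluation run against the other hosted VLMs compared in the study.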

Ethics Statement

This study does not require ethics committee permission or any special permission.

Supporting Institution

This research was supported by The Scientific and Technological Research Council of Türkiye (TÜBİTAK) under the 1005-National New Ideas and Products Research Support Program (Project No: 124E769).

Project Number

124E769

Details

Primary Language English
Subjects Software Engineering (Other)
Section Research Article
Authors

Doğukan Özeren 0009-0008-3801-3057

Erkan Yüksel 0000-0001-8976-9964

Asım Sinan Yüksel 0000-0003-1986-5269

Project Number 124E769
Publication Date 30 August 2025
Submission Date 6 May 2025
Acceptance Date 28 June 2025
Published in Issue Year 2025, Volume: 9, Issue: 2

How to Cite

APA Özeren, D., Yüksel, E., & Yüksel, A. S. (2025). PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. International Journal of 3D Printing Technologies and Digital Industry, 9(2), 247-262. https://doi.org/10.46519/ij3dptdi.1693015
AMA Özeren D, Yüksel E, Yüksel AS. PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. IJ3DPTDI. August 2025;9(2):247-262. doi:10.46519/ij3dptdi.1693015
Chicago Özeren, Doğukan, Erkan Yüksel, and Asım Sinan Yüksel. “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”. International Journal of 3D Printing Technologies and Digital Industry 9, no. 2 (August 2025): 247-62. https://doi.org/10.46519/ij3dptdi.1693015.
EndNote Özeren D, Yüksel E, Yüksel AS (01 August 2025) PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. International Journal of 3D Printing Technologies and Digital Industry 9 2 247–262.
IEEE D. Özeren, E. Yüksel, and A. S. Yüksel, “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”, IJ3DPTDI, vol. 9, no. 2, pp. 247–262, 2025, doi: 10.46519/ij3dptdi.1693015.
ISNAD Özeren, Doğukan et al. “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”. International Journal of 3D Printing Technologies and Digital Industry 9/2 (August 2025), 247-262. https://doi.org/10.46519/ij3dptdi.1693015.
JAMA Özeren D, Yüksel E, Yüksel AS. PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. IJ3DPTDI. 2025;9:247–262.
MLA Özeren, Doğukan et al. “PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION”. International Journal of 3D Printing Technologies and Digital Industry, vol. 9, no. 2, 2025, pp. 247-62, doi:10.46519/ij3dptdi.1693015.
Vancouver Özeren D, Yüksel E, Yüksel AS. PERFORMANCE COMPARISON OF VISION-LANGUAGE MODELS IN IMAGE CLASSIFICATION. IJ3DPTDI. 2025;9(2):247-62.


The International Journal of 3D Printing Technologies and Digital Industry is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.