Context-aware CLIP for Enhanced Food Recognition
Abstract
Generalizing food image recognition frameworks is difficult due to the wide variety of food categories across cuisines and cultures, and the performance of deep neural network models depends heavily on the training dataset. To address this problem, we propose extracting context information from images to increase the discriminative capacity of the network. In this work, we use the CLIP architecture together with ingredient context derived automatically from food images. A list of ingredients is associated with each food category, modeled as text after a voting process, and fed to the CLIP architecture together with the input image. Experimental results on the Food101 dataset show that this approach significantly improves performance, yielding a 2% overall increase in accuracy. The improvement varies across food classes, ranging from 0.5% to as much as 22%. The proposed framework, CLIP fed with ingredient text, achieves 81.80% top-1 overall accuracy over 101 classes, outperforming YOLOv8 (81.46%).
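The pipeline described in the abstract (per-class ingredient lists, a voting step, then a text prompt paired with the image in CLIP) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the ingredient detections, the `top_k` cutoff, and the prompt template are all assumptions for demonstration.

```python
from collections import Counter

# Hypothetical ingredient detections across a class's training images
# (names and lists are illustrative, not taken from the paper).
detections = {
    "margherita_pizza": [
        ["tomato", "mozzarella", "basil", "dough"],
        ["tomato", "mozzarella", "dough", "olive oil"],
        ["mozzarella", "tomato", "basil"],
    ],
}

def vote_ingredients(per_image_lists, top_k=3):
    """Majority-vote ingredients detected across a class's images,
    keeping the top_k most frequent ones."""
    counts = Counter(ing for lst in per_image_lists for ing in lst)
    return [ing for ing, _ in counts.most_common(top_k)]

def build_prompt(food_class, ingredients):
    """Compose a CLIP text prompt from the class name plus voted ingredients."""
    name = food_class.replace("_", " ")
    return f"a photo of {name}, a dish containing {', '.join(ingredients)}"

# One prompt per class; these would be encoded by CLIP's text tower and
# matched against the image embedding at inference time.
prompts = {
    cls: build_prompt(cls, vote_ingredients(lists))
    for cls, lists in detections.items()
}
print(prompts["margherita_pizza"])
```

In a full system, each prompt would be passed through CLIP's text encoder and compared (via cosine similarity) against the encoded input image to rank the 101 Food101 classes.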
Details
Primary Language
English
Subjects
Computer Vision, Artificial Intelligence (Other)
Journal Section
Research Article
Authors
Övgü Öztürk Ergün
Early Pub Date
June 16, 2025
Publication Date
June 16, 2025
Submission Date
May 28, 2025
Acceptance Date
June 9, 2025
Published in Issue
Year 2025 Volume: 5 Number: 1
APA
Öztürk Ergün, Ö. (2025). Context-aware CLIP for Enhanced Food Recognition. Advances in Artificial Intelligence Research, 5(1), 7-13. https://doi.org/10.54569/aair.1707867
AMA
1. Öztürk Ergün Ö. Context-aware CLIP for Enhanced Food Recognition. Adv. Artif. Intell. Res. 2025;5(1):7-13. doi:10.54569/aair.1707867
Chicago
Öztürk Ergün, Övgü. 2025. “Context-Aware CLIP for Enhanced Food Recognition”. Advances in Artificial Intelligence Research 5 (1): 7-13. https://doi.org/10.54569/aair.1707867.
EndNote
Öztürk Ergün Ö (June 1, 2025) Context-aware CLIP for Enhanced Food Recognition. Advances in Artificial Intelligence Research 5 1 7–13.
IEEE
[1] Ö. Öztürk Ergün, “Context-aware CLIP for Enhanced Food Recognition”, Adv. Artif. Intell. Res., vol. 5, no. 1, pp. 7–13, June 2025, doi: 10.54569/aair.1707867.
ISNAD
Öztürk Ergün, Övgü. “Context-Aware CLIP for Enhanced Food Recognition”. Advances in Artificial Intelligence Research 5/1 (June 1, 2025): 7-13. https://doi.org/10.54569/aair.1707867.
JAMA
1. Öztürk Ergün Ö. Context-aware CLIP for Enhanced Food Recognition. Adv. Artif. Intell. Res. 2025;5:7–13.
MLA
Öztürk Ergün, Övgü. “Context-Aware CLIP for Enhanced Food Recognition”. Advances in Artificial Intelligence Research, vol. 5, no. 1, June 2025, pp. 7-13, doi:10.54569/aair.1707867.
Vancouver
1. Övgü Öztürk Ergün. Context-aware CLIP for Enhanced Food Recognition. Adv. Artif. Intell. Res. 2025 Jun. 1;5(1):7-13. doi:10.54569/aair.1707867
