Research Article

Context-aware CLIP for Enhanced Food Recognition

Year 2025, Volume: 5 Issue: 1, 7 - 13, 16.06.2025
https://doi.org/10.54569/aair.1707867

Abstract

Generalization of food image recognition frameworks is difficult due to the wide variety of food categories across cuisines and cultures. The performance of deep neural network models depends heavily on the training dataset. To overcome this problem, we propose extracting context information from images in order to increase the discriminative capacity of the networks. In this work, we use the CLIP architecture with ingredient context automatically derived from food images. A list of ingredients is associated with each food category, modeled as text after a voting process, and fed to the CLIP architecture together with the input image. Experimental results on the Food-101 dataset show that this approach significantly improves the model's performance, achieving a 2% overall increase in accuracy. The improvement varies across food classes, ranging from 0.5% to as much as 22%. The proposed framework, CLIP fed with ingredient text, achieves 81.80% top-1 overall accuracy over 101 classes, outperforming YOLOv8 (81.46%).
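The ingredient-voting step described in the abstract can be sketched in plain Python. The page does not specify the exact voting rule or the text template, so the majority-vote threshold (`min_fraction`) and the "a photo of …" prompt format below are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter

def vote_ingredients(detections, min_fraction=0.5):
    """Keep ingredients detected in at least `min_fraction` of a
    category's images (assumed majority-vote rule)."""
    counts = Counter(ing for dets in detections for ing in set(dets))
    n = len(detections)
    return sorted(ing for ing, c in counts.items() if c / n >= min_fraction)

def build_prompt(category, ingredients):
    """Turn a food category and its voted ingredient list into a CLIP
    text prompt (template is an illustrative assumption)."""
    if not ingredients:
        return f"a photo of {category}"
    return f"a photo of {category} with " + ", ".join(ingredients)

# Toy per-image ingredient detections for one category.
detections = [
    ["tomato", "cheese", "basil"],
    ["tomato", "cheese"],
    ["cheese", "olive"],
]
voted = vote_ingredients(detections)           # ['cheese', 'tomato']
prompt = build_prompt("margherita pizza", voted)
print(prompt)  # a photo of margherita pizza with cheese, tomato
```

The resulting prompt would then be encoded by CLIP's text tower and matched against the image embedding, as in standard CLIP zero-shot classification.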

References

  • Chen X, Kamavuako EN. “Vision-based methods for food and fluid intake monitoring: A literature review”, Sensors, (2023) 23(13).
  • Ponte D et al. “Ontology-driven deep learning model for multi-task visual food analysis”, VISIGRAPP, (2024) 624–631.
  • Zhang Y et al. “Deep learning in food category recognition”, Information Fusion, (2023) 98:101859.
  • Zhao H et al. “Fusion learning using semantics and graph convolutional network for visual food recognition”, In WACV, (2021) 1710–1719.
  • Liu C et al. “Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment”, In International Conference on Smart Homes and Health Telematics, Springer Intl. Publishing, (2016) 37–48.
  • Jiang S et al. “Few-shot food recognition via multi-view representation learning”, ACM Transactions on Multimedia Computing, Communications and Applications, (2020) 1–4.
  • Yang J et al. “Learning to classify new foods incrementally via compressed exemplars”, CVPRW, (2024) 3695-3704.
  • Ergun OO, Ozturk B. “An ontology based semantic representation for turkish cuisine”, In 26th Signal Processing and Communications Applications Conference (SIU), (2018) 1–4.
  • Morales R et al. “Robust deep neural network for learning in noisy multilabel food images”, Sensors, (2024) 24(7).
  • Jocher G et al. “YOLOv8”, Ultralytics, 2023.
  • Mao R et al. “Visual aware hierarchy-based food recognition”, In ICPR, (2021) 571–598.
  • Morales R, Quispe J, Aguilar E. “Exploring multi-food detection using deep learning-based algorithms”, In IEEE 13th International Conference on Pattern Recognition Systems (ICPRS), (2023) 1–7.
  • Radford A et al. “Learning transferable visual models from natural language supervision”, arXiv, (2021). Available: https://arxiv.org/abs/2103.00020 (accessed: May 5, 2025).
  • De-Vera R et al. “Lofi: Long-tailed fine-grained network for food recognition”, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, (2024) 3750–3760.
  • Zhang W et al. “Hsnn: A subnetwork-based encoding structure for dimension reduction and food classification via harnessing multi-cnn model high-level features”, Neurocomputing, (2020) 57–66.
  • Aguilar E, Nagarajan B, Radeva P. “Uncertainty-aware selecting for an ensemble of deep food recognition models”, Computers in Biology and Medicine, (2022) 146:105645.
  • Alahmari S et al. “Segment anything in food images”, In CVPRW, (2024) 3715–3720.
  • Bossard L et al. “Food-101 – mining discriminative components with random forests”, In ECCV, (2014) 446–461.
  • Ponte D et al. “Multi-task visual food recognition by integrating an ontology supported with llm”, In SSRN, (2024) 3695–3704.
  • Che C et al. “Enhancing multimodal understanding with clip-based image-to-text transformation”, In Proc. of the 6th International Conference on Big Data Technologies. Association for Computing Machinery, (2023).
  • Min W et al. “Ingredient-guided cascaded multi-attention network for food recognition”, In Proc. of the 27th ACM International Conference on Multimedia, (2019) 1331–1339.
  • Chen J, Ngo C. “Deep-based ingredient recognition for cooking recipe retrieval”, In Proceedings of the 24th ACM International Conference on Multimedia, (2016) 32–41.
  • Cozzolino D et al. “Raising the bar of ai-generated image detection with clip”, In CVPRW, (2024).
  • Li M et al. “Clip-event: Connecting text and images with event structures”, 2022.
  • Ganz R, Elad M. “Text-to-image generation via energy-based clip”, 2024.
  • Zhang Z et al. “Dual-image enhanced clip for zero-shot anomaly detection”, 2024.
  • Sain A et al. “Clip for all things zero-shot sketch based image retrieval, fine-grained or not”, CVPR, (2023) 2765–2775.
  • Rawlekar S et al. “Prior-aware multilabel food recognition using graph convolutional networks”, In Extended Abstract in MetaFood, CVPRW, (2024) 3695–3704.
  • Wu Y et al. “Few-shot food recognition with pre-trained model”, In Proc. of the 1st Intl. Workshop on Multimedia for Cooking, Eating, and Related APPlications, ACM, (2022) 45–48.

There are 29 citations in total.

Details

Primary Language English
Subjects Computer Vision, Artificial Intelligence (Other)
Journal Section Research Article
Authors

Övgü Öztürk Ergün 0009-0007-6273-4877

Early Pub Date June 16, 2025
Publication Date June 16, 2025
Submission Date May 28, 2025
Acceptance Date June 9, 2025
Published in Issue Year 2025 Volume: 5 Issue: 1

Cite

IEEE Ö. Öztürk Ergün, “Context-aware CLIP for Enhanced Food Recognition”, Adv. Artif. Intell. Res., vol. 5, no. 1, pp. 7–13, 2025, doi: 10.54569/aair.1707867.

Advances in Artificial Intelligence Research is an open access journal which means that the content is freely available without charge to the user or his/her institution. All papers are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which allows users to distribute, remix, adapt, and build upon the material in any medium or format for non-commercial purposes only, and only so long as attribution is given to the creator.

Graphic design @ Özden Işıktaş