Context-aware CLIP for Enhanced Food Recognition

Övgü Öztürk Ergün

doi:10.54569/aair.1707867


Abstract

Generalizing food image recognition frameworks is difficult because food categories vary widely across cuisines and cultures. The performance of deep neural network models depends heavily on the training dataset. To address this problem, we propose extracting context information from images to increase the discriminative capacity of networks. In this work, we use the CLIP architecture together with ingredient context derived automatically from food images. A list of ingredients is associated with each food category; after a voting process, this list is modeled as text and fed to a CLIP architecture together with the input image. Experimental results on the Food101 dataset show that this approach significantly improves the model's performance, achieving a 2% overall increase in accuracy. The improvement varies across food classes, with increases ranging from 0.5% to as much as 22%. The proposed framework, CLIP fed with ingredient text, reaches 81.80% top-1 overall accuracy over 101 classes, outperforming YOLOv8 (81.46%).
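The voting and text-modeling steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule or the prompt wording, so the majority-style vote, the `top_k` cutoff, the prompt template, and the function names `vote_ingredients` and `build_prompt` below are all assumptions made for illustration, not the author's implementation.

```python
from collections import Counter


def vote_ingredients(per_image_ingredients, top_k=3):
    """Keep the top_k ingredients that appear in the most images of one category.

    Assumed voting rule: each image casts one vote per distinct ingredient.
    """
    votes = Counter()
    for ingredients in per_image_ingredients:
        votes.update(set(ingredients))
    # Sort by vote count (descending), then alphabetically for a stable result.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]


def build_prompt(category, ingredients):
    """Turn a category name and its voted ingredients into one text prompt.

    The template is hypothetical; any wording could be substituted.
    """
    label = category.replace("_", " ")
    if not ingredients:
        return f"a photo of {label}"
    return f"a photo of {label}, made with {', '.join(ingredients)}"


# Hypothetical ingredient predictions for three images of one category.
detections = [
    ["tomato", "mozzarella", "basil"],
    ["tomato", "mozzarella", "olive"],
    ["tomato", "basil", "mushroom"],
]
voted = vote_ingredients(detections, top_k=3)
prompt = build_prompt("pizza", voted)
print(prompt)
```

In a CLIP-style pipeline, one such prompt per category would be passed through the text encoder, and the category whose text embedding is most similar to the image embedding would be taken as the prediction.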



Details

Primary Language

English

Subjects

Computer Vision, Artificial Intelligence (Other)

Section

Research Article

Authors

Övgü Öztürk Ergün *
0009-0007-6273-4877
Türkiye

Early View Date

June 16, 2025

Publication Date

June 16, 2025

Submission Date

May 28, 2025

Acceptance Date

June 9, 2025

Published in Issue

Year 2025 Volume: 5 Issue: 1

DOI

https://doi.org/10.54569/aair.1707867

IZ

https://izlik.org/JA89FA64WY

Cite


APA

Öztürk Ergün, Ö. (2025). Context-aware CLIP for Enhanced Food Recognition. Advances in Artificial Intelligence Research, 5(1), 7-13. https://doi.org/10.54569/aair.1707867

AMA

1. Öztürk Ergün Ö. Context-aware CLIP for Enhanced Food Recognition. Adv. Artif. Intell. Res. 2025;5(1):7-13. doi:10.54569/aair.1707867

Chicago

Öztürk Ergün, Övgü. 2025. “Context-aware CLIP for Enhanced Food Recognition”. Advances in Artificial Intelligence Research 5 (1): 7-13. https://doi.org/10.54569/aair.1707867.

EndNote

Öztürk Ergün Ö (June 1, 2025) Context-aware CLIP for Enhanced Food Recognition. Advances in Artificial Intelligence Research 5 1 7–13.

IEEE

[1] Ö. Öztürk Ergün, “Context-aware CLIP for Enhanced Food Recognition”, Adv. Artif. Intell. Res., vol. 5, no. 1, pp. 7–13, Jun. 2025, doi: 10.54569/aair.1707867.

ISNAD

Öztürk Ergün, Övgü. “Context-aware CLIP for Enhanced Food Recognition”. Advances in Artificial Intelligence Research 5/1 (June 1, 2025): 7-13. https://doi.org/10.54569/aair.1707867.

JAMA

1. Öztürk Ergün Ö. Context-aware CLIP for Enhanced Food Recognition. Adv. Artif. Intell. Res. 2025;5:7–13.

MLA

Öztürk Ergün, Övgü. “Context-aware CLIP for Enhanced Food Recognition”. Advances in Artificial Intelligence Research, vol. 5, no. 1, June 2025, pp. 7-13, doi:10.54569/aair.1707867.

Vancouver

1. Övgü Öztürk Ergün. Context-aware CLIP for Enhanced Food Recognition. Adv. Artif. Intell. Res. June 1, 2025;5(1):7-13. doi:10.54569/aair.1707867