Research Article

Fusion of High-Level Visual Attributes for Image Captioning

Issue: 52, 15 December 2023

Abstract

Image captioning aims to generate a natural language description that accurately conveys the content of an image. Recently, deep learning models have been used to extract visual attributes from images, enhancing the accuracy of captions. However, it is essential to assess these visual attributes to ensure optimal performance and avoid incorporating redundant or misleading information. In this study, we employ the visual attributes of semantic segmentation, object detection, instance segmentation, keypoint detection, and their fusion. Experimental evaluations on the commonly used datasets VizWiz and MSCOCO Captions demonstrate that the fusion of visual attributes improves the accuracy of caption generation. Furthermore, the image captioning model, which utilizes the fusion of visual attributes, has been embedded into our custom-designed Android application, named NObstacle, enabling captioning without the need for an internet connection.
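The abstract does not specify the fusion operator used to combine the four attribute streams. As a minimal illustrative sketch (not the authors' implementation), fusion is often realized by pooling each task's features to a fixed-length vector and concatenating them into a single visual input for the caption decoder; the extractor below is a hypothetical stand-in for the pretrained segmentation/detection networks:

```python
import numpy as np

# Hypothetical stand-in for a pretrained network head: in practice each vector
# would be pooled features from semantic segmentation, object detection,
# instance segmentation, or keypoint detection.
def attribute_vector(seed: int, dim: int = 4) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return rng.random(dim)

def fuse_attributes(vectors: list[np.ndarray]) -> np.ndarray:
    """Fuse per-task attribute vectors by simple concatenation (one common choice)."""
    return np.concatenate(vectors)

sem_seg = attribute_vector(0)
obj_det = attribute_vector(1)
inst_seg = attribute_vector(2)
keypoints = attribute_vector(3)

# The fused vector would serve as the visual input to the caption decoder.
fused = fuse_attributes([sem_seg, obj_det, inst_seg, keypoints])
print(fused.shape)  # four 4-dim vectors -> (16,)
```

Concatenation preserves each stream's information and lets the decoder learn its own weighting; weighted summation or attention over the streams are common alternatives.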

Keywords

Supporting Institutions

TUBITAK ve İKCU BAP

Project Numbers

120N995, 2021-ÖDL-MÜMF-0006, 2022-TYL-FEBE-0012

Acknowledgements

This research was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) and the British Council (Newton-Katip Celebi Fund Institutional Links, Turkey-UK project: 120N995), and by the Scientific Research Projects Coordination Unit of Izmir Katip Celebi University (project nos. 2021-ÖDL-MÜMF-0006 and 2022-TYL-FEBE-0012).


Details

Primary Language

English

Subjects

Computer Vision, Natural Language Processing

Section

Research Article

Early View Date

5 December 2023

Publication Date

15 December 2023

Submission Date

18 August 2023

Acceptance Date

10 September 2023

Published Issue

Year 2023, Issue: 52

Cite

APA
Kılcı, M., Çaylı, Ö., & Kılıç, V. (2023). Fusion of High-Level Visual Attributes for Image Captioning. Avrupa Bilim ve Teknoloji Dergisi, 52, 161-168. https://izlik.org/JA87DT39DN