Research Article

From Pixels to Paragraphs: Exploring Enhanced Image-to-Text Generation using Inception v3 and Attention Mechanisms

Year 2023, Volume: 14, Issue: 4, pp. 603-610, 31.12.2023
https://doi.org/10.24012/dumf.1340656

Abstract

Processing visual data and converting it into text plays a crucial role in fields such as information retrieval and data analysis in the digital world. In this context, "image-to-text" conversion, which bridges the gap between visual and textual data, has attracted significant interest from researchers and industry practitioners. This article presents a study on generating text from images. The study aims to measure the contribution of adding an attention mechanism to an encoder-decoder architecture built on the Inception v3 deep learning model for image-to-text generation. In the proposed model, Inception v3 is trained on the Flickr8k dataset to extract image features, and an encoder-decoder structure with an attention mechanism performs next-word prediction; the model is trained on the Flickr8k training images for the performance evaluation. Experimental results demonstrate the model's satisfactory ability to accurately detect objects in images.
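Since the abstract only outlines the architecture, the sketch below illustrates the kind of pipeline it describes: Inception v3 extracts a grid of image features, and a recurrent decoder applies additive (Bahdanau-style) attention over those features when predicting each next word. This is a minimal sketch in TensorFlow/Keras under assumed settings; the GRU cell, layer sizes, vocabulary size, and all names are illustrative, as the excerpt does not state the authors' exact configuration.

```python
# Minimal sketch of an Inception v3 + attention captioning pipeline.
# All hyperparameters and names are illustrative assumptions, not the
# authors' reported configuration.
import tensorflow as tf

# Pretrained Inception v3 backbone; with include_top=False, a 299x299 input
# yields an 8x8x2048 feature map, i.e. 64 spatial locations to attend over.
backbone = tf.keras.applications.InceptionV3(include_top=False,
                                             weights="imagenet")

class Encoder(tf.keras.Model):
    """Projects the flattened Inception v3 features to the embedding size."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embed_dim, activation="relu")

    def call(self, features):                  # (batch, 64, 2048)
        return self.fc(features)               # (batch, 64, embed_dim)

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over the 64 spatial image features."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, embed_dim); hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features)
                                   + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)          # (batch, 64, 1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

class Decoder(tf.keras.Model):
    """Predicts the next word from the previous word and an
    attention-weighted image context vector."""
    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, word_ids, features, hidden):
        # word_ids: (batch, 1), the previously generated word.
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_ids)                     # (batch, 1, embed_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        x = self.fc1(output)                             # (batch, 1, units)
        logits = self.fc2(tf.reshape(x, (-1, x.shape[2])))
        return logits, state, weights                    # logits: (batch, vocab)

# Usage sketch (tokenizer, vocabulary, and training loop omitted):
#   imgs = tf.keras.applications.inception_v3.preprocess_input(images)
#   feats = tf.reshape(backbone(imgs), (tf.shape(imgs)[0], -1, 2048))
#   feats = Encoder()(feats)
#   logits, state, _ = Decoder(vocab_size=5000)(start_ids, feats, init_state)
```

At inference, decoding starts from a start token and feeds each predicted word back into the decoder until an end token or a maximum caption length is reached; the returned attention weights show which of the 64 image regions influenced each predicted word.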


Details

Primary Language English
Subjects Natural Language Processing
Journal Section Articles
Authors

Zeynep Karaca (ORCID: 0000-0002-7751-8567)

Bihter Daş (ORCID: 0000-0002-2498-3297)

Early Pub Date December 31, 2023
Publication Date December 31, 2023
Submission Date August 10, 2023
Published in Issue Year 2023 Volume: 14 Issue: 4

Cite

IEEE Z. Karaca and B. Daş, “From Pixels to Paragraphs: Exploring Enhanced Image-to-Text Generation using Inception v3 and Attention Mechanisms”, DUJE, vol. 14, no. 4, pp. 603–610, 2023, doi: 10.24012/dumf.1340656.
All articles published by DUJE are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit, and adapt the work, provided the original work and its source are appropriately credited.