From Pixels to Paragraphs: Exploring Enhanced Image-to-Text Generation using Inception v3 and Attention Mechanisms

Zeynep Karaca; Bihter Daş

doi:10.24012/dumf.1340656

Abstract

Converting visual data into text plays a crucial role in fields such as information retrieval and data analysis. Image-to-text transformation, which bridges the gap between visual and textual data, has therefore attracted significant interest from researchers and industry practitioners. This article presents a study on generating text from images. The study measures the contribution of adding an attention mechanism to an encoder-decoder architecture based on Inception v3 for image-to-text generation. In the model, Inception v3 is trained on the Flickr8k dataset to extract image features, and an encoder-decoder structure with an attention mechanism predicts the next word of the generated text. The model is trained on the training images of the Flickr8k dataset, and its performance is then evaluated. Experimental results indicate that the model perceives objects in images with satisfactory accuracy.
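The abstract does not say which attention formulation the authors use. As an illustration only, the sketch below assumes additive (Bahdanau-style) attention, a common choice in captioning models: at each decoding step, every image region's feature vector is scored against the decoder's hidden state, the scores are normalized with a softmax, and the weighted sum of region features becomes the context used to predict the next word. The function names, toy dimensions, and random weights are all hypothetical. A real setup would typically use a much larger grid of region features, for example the 8 x 8 x 2048 map from Inception v3's last convolutional layer, and learned projections.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(features, hidden, w_feat, w_hid, v):
    """Return (context, weights) for one decoding step.

    features: list of L region feature vectors, each of length D
    hidden:   decoder hidden state of length H
    w_feat:   A x D projection; w_hid: A x H projection; v: length-A vector
    (all three would be learned in a real model; here they are placeholders)
    """
    hid_proj = matvec(w_hid, hidden)
    scores = []
    for region in features:
        feat_proj = matvec(w_feat, region)
        # score = v . tanh(W_feat * region + W_hid * hidden)
        energy = [math.tanh(f + h) for f, h in zip(feat_proj, hid_proj)]
        scores.append(sum(vi * e for vi, e in zip(v, energy)))
    weights = softmax(scores)
    dim = len(features[0])
    # Context vector: attention-weighted sum of the region features.
    context = [
        sum(w * region[d] for w, region in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random values (stand-in weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
L, D, H, A = 4, 6, 5, 3  # toy sizes: regions, feature dim, hidden dim, attention dim
features = random_matrix(L, D, rng)
hidden = [rng.uniform(-1, 1) for _ in range(H)]
w_feat = random_matrix(A, D, rng)
w_hid = random_matrix(A, H, rng)
v = [rng.uniform(-1, 1) for _ in range(A)]

context, weights = additive_attention(features, hidden, w_feat, w_hid, v)
print("attention weights:", [round(w, 3) for w in weights])
```

The weights form a probability distribution over image regions, so they can be inspected to see which parts of the image the decoder attended to when it produced a given word.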

Details

Primary Language

English

Subjects

Natural Language Processing

Section

Research Article

Authors

Zeynep Karaca
0000-0002-7751-8567
Türkiye

Bihter Daş*
0000-0002-2498-3297
Türkiye

Early View Date

December 31, 2023

Publication Date

December 31, 2023

Submission Date

August 10, 2023

Acceptance Date

November 11, 2023

Published in Issue

Year 2023, Volume 14, Issue 4

DOI

https://doi.org/10.24012/dumf.1340656

IZ

https://izlik.org/JA65SP82JC

Cite

APA

Karaca, Z., & Daş, B. (2023). From Pixels to Paragraphs: Exploring Enhanced Image-to-Text Generation using Inception v3 and Attention Mechanisms. Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi, 14(4), 603-610. https://doi.org/10.24012/dumf.1340656
