A Benchmark for Feature-injection Architectures in Image Captioning

Rumeysa Keskin; Özkan Çaylı; Özge Taylan Moral; Volkan Kılıç; Aytuğ Onan

doi:10.31590/ejosat.1013329


Abstract

Image captioning, the task of describing an image with a grammatically and semantically correct sentence, has improved significantly with recent advances in computer vision (CV) and natural language processing (NLP). Combining these two fields has led to the development of feature-injection architectures, which define how extracted image features are used in caption generation. This paper reports a benchmark of feature-injection architectures that apply CV and NLP techniques to encoder-decoder based image captioning. In the benchmark, the Inception-v3 convolutional neural network extracts image features in the encoder, while the feature-injection architectures init-inject, pre-inject, par-inject, and merge are applied with a multi-layer gated recurrent unit (GRU) in the decoder to generate captions. The architectures are evaluated extensively on the MSCOCO dataset across eight performance metrics. The results show that the init-inject architecture with a 3-layer GRU outperforms the other architectures in terms of captioning accuracy.
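The four architecture names refer to the point at which the image feature vector enters the decoder. The following is a minimal sketch of that difference, based on how these terms are commonly used in the captioning literature rather than on the authors' implementation. The toy `rnn_step` function stands in for a GRU cell, the function names `run_rnn` and `decoder_state` are hypothetical, and the element-wise sum used for par-inject is a simplifying assumption (concatenating the image and word vectors is also common).

```python
import math


def rnn_step(hidden, x):
    # Toy recurrent update standing in for a GRU cell (an assumption for
    # illustration; a real GRU has update and reset gates).
    return [math.tanh(h + xi) for h, xi in zip(hidden, x)]


def run_rnn(inputs, h0):
    # Feed the input vectors one by one, starting from hidden state h0.
    hidden = list(h0)
    for x in inputs:
        hidden = rnn_step(hidden, x)
    return hidden


def decoder_state(mode, image, words):
    # Return the vector that would feed the word-prediction layer.
    zeros = [0.0] * len(image)
    if mode == "init-inject":
        # The image feature is the initial hidden state of the RNN.
        return run_rnn(words, h0=image)
    if mode == "pre-inject":
        # The image feature is fed as the first input, before any word.
        return run_rnn([image] + words, h0=zeros)
    if mode == "par-inject":
        # The image feature accompanies every word input
        # (element-wise sum here; an assumption to keep dimensions equal).
        mixed = [[w + v for w, v in zip(word, image)] for word in words]
        return run_rnn(mixed, h0=zeros)
    if mode == "merge":
        # The RNN sees only words; the image is joined to its output afterwards.
        return run_rnn(words, h0=zeros) + list(image)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical 2-dimensional image feature and two word embeddings.
image = [0.5, -0.2]
words = [[0.1, 0.3], [0.0, -0.4]]

modes = ("init-inject", "pre-inject", "par-inject", "merge")
states = {m: decoder_state(m, image, words) for m in modes}
for mode, state in states.items():
    print(mode, [round(v, 3) for v in state])
```

In the three inject variants the recurrent state itself is conditioned on the image, whereas in merge the recurrent network encodes only the text and the image is combined with its output in a later layer.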

Supporting Institution

TÜBİTAK

Project Number

120N995

Acknowledgments

This research was supported by the Scientific and Technological Research Council of Turkey (TUBITAK)-British Council (The Newton-Katip Celebi Fund Institutional Links, Turkey-UK project: 120N995).

Details

Primary Language

English

Subjects

Engineering

Section

Research Article

Authors

Rumeysa Keskin
0000-0001-8452-8221
Türkiye

Özkan Çaylı
0000-0002-3389-3867
Türkiye

Özge Taylan Moral *
0000-0003-0482-267X
Türkiye

Volkan Kılıç
0000-0002-3164-1981
Türkiye

Aytuğ Onan
0000-0002-9434-5880
Türkiye

Publication Date

31 December 2021

Submission Date

22 October 2021

Acceptance Date

6 December 2021

Published in Issue

Year 2021, Issue 31

DOI

https://doi.org/10.31590/ejosat.1013329

IZ

https://izlik.org/JA36PZ28EE

Cite As

APA

Keskin, R., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). A Benchmark for Feature-injection Architectures in Image Captioning. Avrupa Bilim ve Teknoloji Dergisi, 31, 461-468. https://doi.org/10.31590/ejosat.1013329

AMA

1. Keskin R, Çaylı Ö, Moral ÖT, Kılıç V, Onan A. A Benchmark for Feature-injection Architectures in Image Captioning. EJOSAT. 2021;(31):461-468. doi:10.31590/ejosat.1013329

Chicago

Keskin, Rumeysa, Özkan Çaylı, Özge Taylan Moral, Volkan Kılıç, and Aytuğ Onan. 2021. “A Benchmark for Feature-injection Architectures in Image Captioning”. Avrupa Bilim ve Teknoloji Dergisi, no. 31: 461-68. https://doi.org/10.31590/ejosat.1013329.

EndNote

Keskin R, Çaylı Ö, Moral ÖT, Kılıç V, Onan A (01 December 2021) A Benchmark for Feature-injection Architectures in Image Captioning. Avrupa Bilim ve Teknoloji Dergisi 31 461–468.

IEEE

[1] R. Keskin, Ö. Çaylı, Ö. T. Moral, V. Kılıç, and A. Onan, “A Benchmark for Feature-injection Architectures in Image Captioning”, EJOSAT, no. 31, pp. 461–468, Dec. 2021, doi: 10.31590/ejosat.1013329.

ISNAD

Keskin, Rumeysa - Çaylı, Özkan - Moral, Özge Taylan - Kılıç, Volkan - Onan, Aytuğ. “A Benchmark for Feature-injection Architectures in Image Captioning”. Avrupa Bilim ve Teknoloji Dergisi. 31 (01 December 2021): 461-468. https://doi.org/10.31590/ejosat.1013329.

JAMA

1. Keskin R, Çaylı Ö, Moral ÖT, Kılıç V, Onan A. A Benchmark for Feature-injection Architectures in Image Captioning. EJOSAT. 2021;(31):461–468.

MLA

Keskin, Rumeysa, et al. “A Benchmark for Feature-injection Architectures in Image Captioning”. Avrupa Bilim ve Teknoloji Dergisi, no. 31, Dec. 2021, pp. 461-8, doi:10.31590/ejosat.1013329.

Vancouver

1. Rumeysa Keskin, Özkan Çaylı, Özge Taylan Moral, Volkan Kılıç, Aytuğ Onan. A Benchmark for Feature-injection Architectures in Image Captioning. EJOSAT. 01 December 2021;(31):461-8. doi:10.31590/ejosat.1013329

An efficient activity recognition model integrating object recognition and image captioning with deep learning techniques for the visually impaired

Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi

https://doi.org/10.17341/gazimmfd.1245400

Cited By

Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units

Resnet based Deep Gated Recurrent Unit for Image Captioning on Smartphone

Artificial Intelligence Based Instance-Aware Semantic Lobe Segmentation on Chest Computed Tomography Images

Artificial Intelligence-Based Detection of Cerebrovascular Diseases in Brain Computed Tomography Images

Deep Learning-Based Ischemic Stroke Segmentation on Brain Computed Tomography Images
