Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units

Selman Aydın; Özkan Çaylı; Volkan Kılıç; Aytuğ Onan

doi:10.31590/ejosat.1071835

Abstract

Recurrent neural networks have recently emerged as a useful tool in computer vision and language modeling tasks such as image and video captioning. Their main limitation is that gradient flow becomes harder to preserve as the network gets deeper. We propose a video captioning approach that uses residual connections to overcome this limitation, maintaining the gradient flow by carrying information through the layers from bottom to top as additive features. Experimental evaluations on the MSVD dataset indicate that the proposed approach generates accurate captions when compared with state-of-the-art results. In addition, the proposed approach is integrated into our custom-designed Android application, WeCapV2, which can generate captions without an internet connection.
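The residual idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the layer count, feature sizes, or exact gate formulation, so the code below assumes equal-width layers, omits biases, uses random weights, and relies on hypothetical names (`GRUCell`, `residual_gru_stack`). It only shows how each layer's input is added to its output so that features travel from the bottom layer to the top along an additive path.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Hypothetical minimal GRU cell (biases omitted for brevity)."""

    def __init__(self, size, rng, scale=0.1):
        # One input matrix (W) and one recurrent matrix (U) per gate:
        # update (z), reset (r), candidate (n).
        self.W = {g: rand_matrix(size, size, rng, scale) for g in "zrn"}
        self.U = {g: rand_matrix(size, size, rng, scale) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # Blend the candidate state with the previous hidden state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def residual_gru_stack(cells, frames):
    """Run a stack of GRU layers over a sequence of frame feature vectors.

    After every layer, the layer's input is added to its output (the
    residual connection), so the features passed upward are additive.
    """
    size = len(frames[0])
    states = [[0.0] * size for _ in cells]
    outputs = []
    for x in frames:
        layer_input = x
        for i, cell in enumerate(cells):
            states[i] = cell.step(layer_input, states[i])
            # Residual connection: input + layer output feeds the next layer.
            layer_input = [a + b for a, b in zip(layer_input, states[i])]
        outputs.append(layer_input)
    return outputs


# Usage: two made-up 3-dimensional "frame features" through three layers.
rng = random.Random(0)
demo_frames = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
demo_cells = [GRUCell(3, rng) for _ in range(3)]
demo_outputs = residual_gru_stack(demo_cells, demo_frames)
```

One property worth noting in this sketch: if every weight is zero, each layer outputs a zero hidden state and the stack returns its input unchanged, which is the identity path that residual connections are meant to keep open as depth grows.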

Details

Primary Language

English

Subjects

Engineering

Section

Research Article

Authors

Selman Aydın ^*
0000-0002-2851-6303
Türkiye

Özkan Çaylı
0000-0002-3389-3867
Türkiye

Volkan Kılıç
0000-0002-3164-1981
Türkiye

Aytuğ Onan
0000-0002-9434-5880
Türkiye

Publication Date

May 7, 2022

Submission Date

February 11, 2022

Acceptance Date

March 24, 2022

Published in Issue

Year 2022, Issue 35

DOI

https://doi.org/10.31590/ejosat.1071835

IZ

https://izlik.org/JA94XE87ZK

Cite

APA

Aydın, S., Çaylı, Ö., Kılıç, V., & Onan, A. (2022). Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units. Avrupa Bilim ve Teknoloji Dergisi, 35, 380-386. https://doi.org/10.31590/ejosat.1071835

AMA

1. Aydın S, Çaylı Ö, Kılıç V, Onan A. Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units. EJOSAT. 2022;(35):380-386. doi:10.31590/ejosat.1071835

Chicago

Aydın, Selman, Özkan Çaylı, Volkan Kılıç, and Aytuğ Onan. 2022. “Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units”. Avrupa Bilim ve Teknoloji Dergisi, no. 35: 380-86. https://doi.org/10.31590/ejosat.1071835.

EndNote

Aydın S, Çaylı Ö, Kılıç V, Onan A (01 May 2022) Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units. Avrupa Bilim ve Teknoloji Dergisi 35 380–386.

IEEE

[1] S. Aydın, Ö. Çaylı, V. Kılıç, and A. Onan, “Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units”, EJOSAT, no. 35, pp. 380–386, May 2022, doi: 10.31590/ejosat.1071835.

ISNAD

Aydın, Selman - Çaylı, Özkan - Kılıç, Volkan - Onan, Aytuğ. “Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units”. Avrupa Bilim ve Teknoloji Dergisi. 35 (01 May 2022): 380-386. https://doi.org/10.31590/ejosat.1071835.

JAMA

1. Aydın S, Çaylı Ö, Kılıç V, Onan A. Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units. EJOSAT. 2022;(35):380–386.

MLA

Aydın, Selman, et al. “Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units”. Avrupa Bilim ve Teknoloji Dergisi, no. 35, May 2022, pp. 380-6, doi:10.31590/ejosat.1071835.

Vancouver

1. Selman Aydın, Özkan Çaylı, Volkan Kılıç, Aytuğ Onan. Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units. EJOSAT. 01 May 2022;(35):380-6. doi:10.31590/ejosat.1071835

Cited By

Resnet based Deep Gated Recurrent Unit for Image Captioning on Smartphone

Artificial Intelligence Based Instance-Aware Semantic Lobe Segmentation on Chest Computed Tomography Images

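Beyin Bilgisayarlı Tomografi Görüntülerinde Yapay Zeka Tabanlı Beyin Damar Hastalıkları Tespiti [Artificial Intelligence-Based Detection of Cerebrovascular Diseases on Brain Computed Tomography Images]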

Deep Learning-Based Ischemic Stroke Segmentation on Brain Computed Tomography Images