Research Article

Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units

Year 2022, Issue: 35, 380 - 386, 07.05.2022
https://doi.org/10.31590/ejosat.1071835

Abstract

Recurrent neural networks have recently emerged as a useful tool in computer vision and language modeling tasks such as image and video captioning. A key limitation of these networks is that gradient flow degrades as the network gets deeper. We propose a video captioning approach that uses residual connections to overcome this limitation, maintaining the gradient flow by carrying information through the layers from bottom to top via additive features. Experimental evaluations on the MSVD dataset indicate that the proposed approach generates captions that are competitive with state-of-the-art results. In addition, the proposed approach is integrated into our custom-designed Android application, WeCapV2, which can generate captions without an internet connection.
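The core idea described in the abstract is to add each recurrent layer's output to its input, so that information (and gradients) can also flow through an identity path as the stack gets deeper. The snippet below is a minimal sketch of such a residual-connected GRU encoder in PyTorch (a library the article cites); the class name `ResidualGRUEncoder`, the layer count, the hidden size, and the 2048-dimensional CNN frame features are illustrative assumptions and not the paper's exact configuration.

```python
# Minimal sketch: a stack of GRU layers with additive (residual) skip
# connections, assuming PyTorch. Dimensions and names are hypothetical.
import torch
import torch.nn as nn


class ResidualGRUEncoder(nn.Module):
    """Stack of single-layer GRUs with additive skip connections between layers."""

    def __init__(self, input_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        # Project CNN frame features (e.g., 2048-d) to the hidden size so the
        # residual addition is dimensionally consistent at every layer.
        self.proj = nn.Linear(input_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [nn.GRU(hidden_dim, hidden_dim, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, input_dim)
        x = self.proj(frame_feats)
        for gru in self.layers:
            out, _ = gru(x)
            x = x + out  # residual connection carries features from bottom to top
        return x  # (batch, num_frames, hidden_dim)


if __name__ == "__main__":
    # Example: 16 frames of hypothetical 2048-d CNN features for one video.
    feats = torch.randn(1, 16, 2048)
    encoder = ResidualGRUEncoder(input_dim=2048, hidden_dim=512)
    print(encoder(feats).shape)  # torch.Size([1, 16, 512])
```

In a full sequence-to-sequence captioning pipeline, the encoded frame representation would then condition a recurrent decoder that emits the caption word by word; the sketch only illustrates how the additive skip connections are wired.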

References

  • Amirian, S., Rasheed, K., Taha, T. R., & Arabnia, H. R. (2020). Automatic image and video caption generation with deep learning: A concise review and algorithmic overlap. IEEE Access, 8, 218386-218400.
  • Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propositional image caption evaluation. Paper presented at the European Conference on Computer Vision.
  • Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Paper presented at the Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
  • Baran, M., Moral, Ö. T., & Kılıç, V. (2021). Akıllı telefonlar için birleştirme modeli tabanlı görüntü altyazılama. European Journal of Science and Technology(26), 191-196.
  • Chen, D., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. Paper presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Çaylı, Ö., Makav, B., Kılıç, V., & Onan, A. (2020). Mobile application based automatic caption generation for visually impaired. Paper presented at the International Conference on Intelligent and Fuzzy Systems.
  • Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Fetiler, B., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Video captioning based on multi-layer gated recurrent unit for smartphones. European Journal of Science and Technology(32), 221-226.
  • Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M. A., & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems, 26.
  • Gan, C., Yao, T., Yang, K., Yang, Y., & Mei, T. (2016). You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Gao, L., Guo, Z., Zhang, H., Xu, X., & Shen, H. T. (2017). Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 19(9), 2045-2055.
  • Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., & Saenko, K. (2013). Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Keskin, R., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). A benchmark for feature-injection architectures in image captioning. European Journal of Science and Technology(31), 461-468.
  • Keskin, R., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Multi-GRU based automated image captioning for smartphones. Paper presented at the 2021 29th Signal Processing and Communications Applications Conference.
  • Khan, M. U. G., Zhang, L., & Gotoh, Y. (2011). Human focused video description. Paper presented at the 2011 IEEE International Conference on Computer Vision Workshops.
  • Kılıç, V. (2021). Deep gated recurrent unit for smartphone-based image captioning. Sakarya University Journal of Computer and Information Sciences, 4(2), 181-191.
  • Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. Paper presented at the Text Summarization Branches Out.
  • Liu, S., Zhu, Z., Ye, N., Guadarrama, S., & Murphy, K. (2017). Improved image captioning via policy gradient optimization of spider. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Liu, W., Wang, Q., Zhu, Y., & Chen, H. (2020). GRU: optimization of NPI performance. The Journal of Supercomputing, 76(5), 3542-3554.
  • Makav, B., & Kılıç, V. (2019a). A new image captioning approach for visually impaired people. Paper presented at the 2019 11th International Conference on Electrical and Electronics Engineering.
  • Makav, B., & Kılıç, V. (2019b). Smartphone-based image captioning for visually and hearing impaired. Paper presented at the 2019 11th International Conference on Electrical and Electronics Engineering.
  • Pan, Y., Yao, T., Li, H., & Mei, T. (2017). Video captioning with transferred semantic attributes. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. Paper presented at the Proceedings of the 40th annual meeting of the Association for Computational Linguistics.
  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., . . . Antiga, L. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
  • Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., & Schiele, B. (2013). Translating video content to natural language descriptions. Paper presented at the Proceedings of the IEEE international Conference on Computer Vision.
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029.
  • Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence-video to text. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., & Saenko, K. (2014). Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729.
  • Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., . . . Macherey, K. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging video and language. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2017). Boosting image captioning with attributes. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.


Details

Primary Language: English
Subjects: Engineering
Journal Section: Articles
Authors

Selman Aydın (ORCID: 0000-0002-2851-6303)

Özkan Çaylı (ORCID: 0000-0002-3389-3867)

Volkan Kılıç (ORCID: 0000-0002-3164-1981)

Aytuğ Onan (ORCID: 0000-0002-9434-5880)

Publication Date: May 7, 2022
Published in: Year 2022, Issue: 35

Cite

APA: Aydın, S., Çaylı, Ö., Kılıç, V., & Onan, A. (2022). Sequence-to-Sequence Video Captioning with Residual Connected Gated Recurrent Units. Avrupa Bilim Ve Teknoloji Dergisi(35), 380-386. https://doi.org/10.31590/ejosat.1071835