Research Article

Video Captioning Based on Multi-layer Gated Recurrent Unit for Smartphones

Year 2021, Issue: 32, 221 - 226, 31.12.2021
https://doi.org/10.31590/ejosat.1039242

Abstract

Video captioning is a visual understanding task that generates grammatically and semantically meaningful descriptions, attracting interest in the fields of computer vision (CV) and natural language processing (NLP). Recent advances in the computing power of mobile platforms have enabled many video captioning applications that use CV and NLP techniques. These applications mostly rely on an encoder-decoder approach that runs over an internet connection, employing convolutional neural networks (CNNs) in the encoder and recurrent neural networks (RNNs) in the decoder. However, this approach is not robust enough to deliver accurate captions and fast responses because of the online data transfer. In this paper, therefore, the encoder-decoder approach is extended with a sequence-to-sequence model under a multi-layer gated recurrent unit (GRU) to generate semantically more coherent captions. Visual information is extracted from the image features of each video frame with a ResNet-101 CNN in the encoder and fed to the multi-layer GRU-based decoder for caption generation. The proposed approach is compared with state-of-the-art approaches in experiments on the MSVD dataset under eight performance metrics. In addition, the proposed approach is embedded into our custom-designed Android application, called WeCap, which is capable of faster caption generation without an internet connection.
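
The abstract describes an encoder-decoder pipeline: a ResNet-101 CNN encodes each sampled video frame into a feature vector, and a multi-layer GRU decodes the resulting feature sequence into a caption in a sequence-to-sequence fashion. The minimal sketch below illustrates this idea in PyTorch; the framework choice, class names (FrameEncoder, GRUDecoder), layer sizes, vocabulary size, and number of GRU layers are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the encoder-decoder idea from the abstract, assuming PyTorch.
import torch
import torch.nn as nn
from torchvision import models


class FrameEncoder(nn.Module):
    """Extracts one feature vector per video frame with a pretrained ResNet-101."""

    def __init__(self, feature_dim=512):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        # Drop the classification head; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.project = nn.Linear(2048, feature_dim)

    def forward(self, frames):                      # frames: (T, 3, 224, 224)
        feats = self.backbone(frames).flatten(1)    # (T, 2048)
        return self.project(feats)                  # (T, feature_dim)


class GRUDecoder(nn.Module):
    """Multi-layer GRU that generates caption tokens from frame features."""

    def __init__(self, vocab_size, feature_dim=512, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feature_dim)
        self.gru = nn.GRU(feature_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # Sequence-to-sequence style: read the frame features first,
        # then continue over the embedded caption tokens.
        inputs = torch.cat([frame_feats.unsqueeze(0),
                            self.embed(captions)], dim=1)
        hidden_states, _ = self.gru(inputs)
        return self.out(hidden_states)              # per-step token logits


# Usage sketch with dummy data (hypothetical sizes):
encoder = FrameEncoder()
decoder = GRUDecoder(vocab_size=10000)
frames = torch.randn(16, 3, 224, 224)               # 16 sampled frames
caption = torch.randint(0, 10000, (1, 12))          # 12 ground-truth tokens
logits = decoder(encoder(frames), caption)
```

At inference time the decoder would instead be run step by step, feeding back its own predicted token at each step until an end-of-sentence token is produced.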

Supporting Institution

TÜBİTAK, BAP

Project Number

120N995, 2021-ÖDL-MÜMF-0006

References

  • Amaresh, M., & Chitrakala, S. (2019). Video captioning using deep learning: An overview of methods, datasets and metrics. Paper presented at the 2019 International Conference on Communication and Signal Processing.
  • Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propositional image caption evaluation. Paper presented at the European Conference on Computer Vision.
  • Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Paper presented at the Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
  • Baraldi, L., Grana, C., & Cucchiara, R. (2017). Hierarchical boundary-aware neural encoder for video captioning. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Çaylı, Ö., Makav, B., Kılıç, V., & Onan, A. (2020). Mobile Application Based Automatic Caption Generation for Visually Impaired. Paper presented at the International Conference on Intelligent and Fuzzy Systems.
  • Chen, D., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. Paper presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Gan, C., Yao, T., Yang, K., Yang, Y., & Mei, T. (2016). You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., & Saenko, K. (2013). Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Guo, Z., Gao, L., Song, J., Xu, X., Shao, J., & Shen, H. T. (2016). Attention-based LSTM with semantic consistency for videos captioning. Paper presented at the Proceedings of the 24th ACM International Conference on Multimedia.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Keskin, R., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Multi-GRU Based Automated Image Captioning for Smartphones. Paper presented at the 2021 29th Signal Processing and Communications Applications Conference.
  • Kılıç, V. (2021). Deep Gated Recurrent Unit for Smartphone-Based Image Captioning. Sakarya University Journal of Computer Information Sciences, 4(2), 181-191.
  • Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. Paper presented at the Text Summarization Branches Out workshop.
  • Makav, B., & Kılıç, V. (2019). Smartphone-based image captioning for visually and hearing impaired. Paper presented at the 11th International Conference on Electrical and Electronics Engineering.
  • Pan, P., Xu, Z., Yang, Y., Wu, F., & Zhuang, Y. (2016). Hierarchical recurrent neural encoder for video representation with application to captioning. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Pan, Y., Mei, T., Yao, T., Li, H., & Rui, Y. (2016). Jointly modeling embedding and translation to bridge video and language. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. Paper presented at the Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
  • Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015). A dataset for movie description. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Shen, F., Shen, C., Shi, Q., Van Den Hengel, A., & Tang, Z. (2013). Inductive hashing on manifolds. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029.
  • Torabi, A., Pal, C., Larochelle, H., & Courville, A. (2015). Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070.
  • Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence-video to text. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., & Saenko, K. (2014). Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729.
  • Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging video and language. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Xu, R., Xiong, C., Chen, W., & Corso, J. (2015). Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. Paper presented at the Proceedings of the AAAI Conference on Artificial Intelligence.
  • Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Yu, H., Wang, J., Huang, Z., Yang, Y., & Xu, W. (2016). Video paragraph captioning using hierarchical recurrent neural networks. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Details

Primary Language English
Subjects Engineering
Journal Section Articles
Authors

Bengü Fetiler 0000-0002-2761-7751

Özkan Çaylı 0000-0002-3389-3867

Özge Taylan Moral 0000-0003-0482-267X

Volkan Kılıç 0000-0002-3164-1981

Aytuğ Onan 0000-0002-9434-5880

Project Number 120N995, 2021-ÖDL-MÜMF-0006
Publication Date December 31, 2021
Published in Issue Year 2021 Issue: 32

Cite

APA Fetiler, B., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Video Captioning Based on Multi-layer Gated Recurrent Unit for Smartphones. Avrupa Bilim Ve Teknoloji Dergisi(32), 221-226. https://doi.org/10.31590/ejosat.1039242