Video Captioning Based on Multi-layer Gated Recurrent Unit for Smartphones
Year 2021, 221-226, 31.12.2021
Bengü Fetiler, Özkan Çaylı, Özge Taylan Moral, Volkan Kılıç, Aytuğ Onan
Abstract
Video captioning is a visual understanding task that generates grammatically and semantically meaningful descriptions of a video, and it has attracted interest in the fields of computer vision (CV) and natural language processing (NLP). Recent advances in the computing power of mobile platforms have led to many video captioning applications that use CV and NLP techniques. These applications mostly depend on an encoder-decoder approach that runs over an internet connection, employing convolutional neural networks (CNNs) in the encoder and recurrent neural networks (RNNs) in the decoder. However, this approach is not powerful enough to deliver accurate captions and fast responses, because it depends on online data transfer. In this paper, therefore, the encoder-decoder approach is extended with a sequence-to-sequence model under a multi-layer gated recurrent unit (GRU) to generate semantically more coherent captions. Visual information is extracted from the image features of each video frame with a ResNet-101 CNN in the encoder and fed to the multi-layer GRU-based decoder for caption generation. The proposed approach is compared with state-of-the-art approaches through experiments on the MSVD dataset under eight performance metrics. In addition, the proposed approach is embedded into our custom-designed Android application, called WeCap, which can generate captions faster and without an internet connection.
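To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of the described encoder-decoder: a ResNet-101 backbone extracts one feature vector per sampled video frame, and a multi-layer GRU first reads the frame sequence, then decodes a caption in sequence-to-sequence fashion. All layer sizes, the shared encoder/decoder GRU, and every identifier (FrameEncoder, GRUDecoder, hidden, vocab_size) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models


class FrameEncoder(nn.Module):
    """Extracts a 2048-d ResNet-101 feature vector for each video frame."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=None)  # in practice, pretrained weights
        # Keep everything up to global average pooling; drop the classifier head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, frames):
        # frames: (batch, time, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (b*t, 2048, 1, 1)
        return feats.flatten(1).view(b, t, -1)       # (b, t, 2048)


class GRUDecoder(nn.Module):
    """Multi-layer GRU that reads the frame sequence, then emits caption tokens."""

    def __init__(self, vocab_size, feat_dim=2048, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.feat_proj = nn.Linear(feat_dim, hidden)  # visual features -> token space
        self.gru = nn.GRU(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # Encode: run the projected frame features through the GRU and keep
        # only the final hidden state as a summary of the whole video.
        _, state = self.gru(self.feat_proj(frame_feats))
        # Decode: condition the same GRU on that state to predict each token.
        logits, _ = self.gru(self.embed(captions), state)
        return self.out(logits)  # (batch, caption_len, vocab_size)


# Hypothetical usage with a 5000-word vocabulary and 8 sampled frames per clip.
encoder, decoder = FrameEncoder(), GRUDecoder(vocab_size=5000)
video = torch.randn(2, 8, 3, 224, 224)    # two clips, eight frames each
tokens = torch.randint(0, 5000, (2, 12))  # teacher-forced caption tokens
scores = decoder(encoder(video), tokens)  # (2, 12, 5000)
```

Sharing one GRU stack between the video-reading and caption-writing phases mirrors the sequence-to-sequence design of Venugopalan et al. (2015) cited below; a separate decoder GRU would be an equally plausible reading of the architecture.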
Supporting Institution
TÜBİTAK, BAP
Project Number
120N995, 2021-ÖDL-MÜMF-0006
References
- Amaresh, M., & Chitrakala, S. (2019). Video captioning using deep learning: An overview of methods, datasets and metrics. Paper presented at the 2019 International Conference on Communication and Signal Processing.
- Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. Paper presented at the European Conference on Computer Vision.
- Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Paper presented at the Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
- Baraldi, L., Grana, C., & Cucchiara, R. (2017). Hierarchical boundary-aware neural encoder for video captioning. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Çaylı, Ö., Makav, B., Kılıç, V., & Onan, A. (2020). Mobile Application Based Automatic Caption Generation for Visually Impaired. Paper presented at the International Conference on Intelligent and Fuzzy Systems.
- Chen, D., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. Paper presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
- Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Gan, C., Yao, T., Yang, K., Yang, Y., & Mei, T. (2016). You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., & Saenko, K. (2013). Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
- Guo, Z., Gao, L., Song, J., Xu, X., Shao, J., & Shen, H. T. (2016). Attention-based LSTM with semantic consistency for videos captioning. Paper presented at the Proceedings of the 24th ACM International Conference on Multimedia.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Keskin, R., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Multi-GRU Based Automated Image Captioning for Smartphones. Paper presented at the 2021 29th Signal Processing and Communications Applications Conference.
- Kılıç, V. (2021). Deep Gated Recurrent Unit for Smartphone-Based Image Captioning. Sakarya University Journal of Computer and Information Sciences, 4(2), 181-191.
- Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Paper presented at the Text Summarization Branches Out workshop.
- Makav, B., & Kılıç, V. (2019). Smartphone-based image captioning for visually and hearing impaired. Paper presented at the 11th International Conference on Electrical and Electronics Engineering.
- Pan, P., Xu, Z., Yang, Y., Wu, F., & Zhuang, Y. (2016). Hierarchical recurrent neural encoder for video representation with application to captioning. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Pan, Y., Mei, T., Yao, T., Li, H., & Rui, Y. (2016). Jointly modeling embedding and translation to bridge video and language. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Paper presented at the Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015). A dataset for movie description. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Shen, F., Shen, C., Shi, Q., Van Den Hengel, A., & Tang, Z. (2013). Inductive hashing on manifolds. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029.
- Torabi, A., Pal, C., Larochelle, H., & Courville, A. (2015). Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070.
- Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence - video to text. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
- Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., & Saenko, K. (2014). Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729.
- Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging video and language. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Xu, R., Xiong, C., Chen, W., & Corso, J. (2015). Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. Paper presented at the Proceedings of the AAAI Conference on Artificial Intelligence.
- Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
- Yu, H., Wang, J., Huang, Z., Yang, Y., & Xu, W. (2016). Video paragraph captioning using hierarchical recurrent neural networks. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.