Research Article

Leveraging Pre-trained 3D-CNNs for Video Captioning

Year 2024, Issue: 53, pp. 58-63, 15.02.2024

Abstract

Video captioning is a visual understanding task that aims to generate grammatically and semantically accurate descriptions of videos. One of its main challenges is capturing the complex temporal dynamics present in video. This study addresses that challenge by leveraging pre-trained 3D Convolutional Neural Networks (3D-CNNs), which are particularly effective at modeling such dynamics and thereby enhance contextual understanding of video content. We evaluated the approach on the Microsoft Research Video Description (MSVD) dataset, a widely used benchmark for video captioning, using the standard performance metrics BLEU-1 through BLEU-4, CIDEr, ROUGE-L, METEOR, and SPICE. The results show significant improvements across all of these metrics, demonstrating the advantage of pre-trained 3D-CNNs for improving video captioning accuracy.
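For illustration, the core idea of the approach, using a pre-trained 3D-CNN as a frozen spatio-temporal feature extractor whose output conditions a caption decoder, can be sketched in a few lines of PyTorch. This is a minimal, hypothetical sketch, not the authors' exact pipeline: the backbone (torchvision's R3D-18 with Kinetics-400 weights), the clip size, and the decoder interface are all assumptions made for the example.

import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

# Assumed backbone: R3D-18 pre-trained on Kinetics-400 (an illustrative
# choice; the paper may use a different pre-trained 3D-CNN).
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = torch.nn.Identity()  # drop the action classifier; keep 512-d features
model.eval()

# Dummy clip in the (batch, channels, frames, height, width) layout that
# torchvision video models expect: 16 frames of 112x112 RGB.
clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    features = model(clip)  # shape (1, 512)

# In a full captioning system, these clip-level features would condition a
# sequence decoder (e.g., an LSTM/GRU or transformer) that generates the caption.

Features extracted this way encode motion as well as appearance, which is what distinguishes 3D-CNN encoders from frame-wise 2D-CNN features in the captioning setting.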

References

  • Akosman, Ş. A., Öktem, M., Moral, Ö. T., & Kılıç, V. (2021). Deep Learning-based Semantic Segmentation for Crack Detection on Marbles. 2021 29th Signal Processing and Communications Applications Conference (SIU),
  • Amaresh, M., & Chitrakala, S. (2019). Video captioning using deep learning: an overview of methods, datasets and metrics. 2019 International Conference on Communication and Signal Processing (ICCSP),
  • Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. European Conference on Computer Vision (ECCV),
  • Aydın, S., Çaylı, Ö., Kılıç, V., & Onan, A. (2022). Sequence-to-sequence video captioning with residual connected gated recurrent units. Avrupa Bilim ve Teknoloji Dergisi(35), 380-386.
  • Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization,
  • Baraldi, L., Grana, C., & Cucchiara, R. (2017). Hierarchical boundary-aware neural encoder for video captioning. Conference on Computer Vision and Pattern Recognition (CVPR),
  • Baran, M., Moral, Ö. T., & Kılıç, V. (2021). Akıllı telefonlar için birleştirme modeli tabanlı görüntü altyazılama. Avrupa Bilim ve Teknoloji Dergisi(26), 191-196.
  • Chen, D., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies,
  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition,
  • Çaylı, Ö., Kılıç, V., Onan, A., & Wang, W. (2022). Auxiliary classifier based residual rnn for image captioning. 2022 30th European Signal Processing Conference (EUSIPCO),
  • Çaylı, Ö., Liu, X., Kılıç, V., & Wang, W. (2023). Knowledge Distillation for Efficient Audio-Visual Video Captioning. arXiv preprint arXiv:2306.09947.
  • Çaylı, Ö., Makav, B., Kılıç, V., & Onan, A. (2021). Mobile application based automatic caption generation for visually impaired. Intelligent and Fuzzy Techniques: Smart and Innovative Solutions: Proceedings of the INFUS 2020 Conference, Istanbul, Turkey, July 21-23, 2020,
  • Doğan, V., Işık, T., Kılıç, V., & Horzum, N. (2022). A field-deployable water quality monitoring with machine learning-based smartphone colorimetry. Analytical Methods, 14(35), 3458-3466.
  • Doğan, V., Evliya, M., Kahyaoglu, L. N., & Kılıç, V. (2024). On-site colorimetric food spoilage monitoring with smartphone embedded machine learning. Talanta, 266, 125021.
  • Fetiler, B., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Video captioning based on multi-layer gated recurrent unit for smartphones. Avrupa Bilim ve Teknoloji Dergisi(32), 221-226.
  • Gan, C., Yao, T., Yang, K., Yang, Y., & Mei, T. (2016). You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
  • Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., & Saenko, K. (2013). YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. Proceedings of the IEEE international conference on computer vision,
  • Guo, Z., Gao, L., Song, J., Xu, X., Shao, J., & Shen, H. T. (2016). Attention-based LSTM with semantic consistency for videos captioning. Proceedings of the 24th ACM international conference on Multimedia,
  • Keskin, R., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). A benchmark for feature-injection architectures in image captioning. Avrupa Bilim ve Teknoloji Dergisi(31), 461-468.
  • Keskin, R., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Multi-gru based automated image captioning for smartphones. 2021 29th Signal Processing and Communications Applications Conference (SIU),
  • Kılcı, M., Çaylı, Ö., & Kılıç, V. (2023). Fusion of High-Level Visual Attributes for Image Captioning. Avrupa Bilim ve Teknoloji Dergisi(52), 161-168.
  • Kılıç, V. (2021). Deep gated recurrent unit for smartphone-based image captioning. Sakarya University Journal of Computer and Information Sciences, 4(2), 181-191.
  • Kılıç, V., Mercan, Ö. B., Tetik, M., Kap, Ö., & Horzum, N. (2022). Non-enzymatic colorimetric glucose detection based on Au/Ag nanoparticles using smartphone and machine learning. Analytical Sciences, 38(2), 347-358.
  • Kılıç, V., Zhong, X., Barnard, M., Wang, W., & Kittler, J. (2014). Audio-visual tracking of a variable number of speakers with a random finite set approach. 17th International Conference on Information Fusion (FUSION),
  • Koca, Ö. A., & Kılıç, V. (2023). Multi-Parametric Glucose Prediction Using Multi-Layer LSTM. Avrupa Bilim ve Teknoloji Dergisi(52), 169-175.
  • Koca, Ö. A., Türköz, A., & Kılıç, V. (2023). Tip 1 Diyabette Çok Katmanlı GRU Tabanlı Glikoz Tahmini. Avrupa Bilim ve Teknoloji Dergisi(52), 80-86.
  • Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text summarization branches out,
  • Makav, B., & Kılıç, V. (2019). Smartphone-based image captioning for visually and hearing impaired. 2019 11th international conference on electrical and electronics engineering (ELECO),
  • Mercan, Ö. B., Doğan, V., & Kılıç, V. (2020). Time Series Analysis based Machine Learning Classification for Blood Sugar Levels. 2020 Medical Technologies Congress (TIPTEKNO),
  • Mercan, Ö. B., & Kılıç, V. (2020). Deep learning based colorimetric classification of glucose with au-ag nanoparticles using smartphone. 2020 Medical Technologies Congress (TIPTEKNO),
  • Moral, Ö. T., Kılıç, V., Onan, A., & Wang, W. (2022). Automated Image Captioning with Multi-layer Gated Recurrent Unit. 2022 30th European Signal Processing Conference (EUSIPCO),
  • Palaz, Z., Doğan, V., & Kılıç, V. (2021). Smartphone-based Multi-parametric Glucose Prediction using Recurrent Neural Networks. Avrupa Bilim ve Teknoloji Dergisi(32), 1168-1174.
  • Pan, P., Xu, Z., Yang, Y., Wu, F., & Zhuang, Y. (2016). Hierarchical recurrent neural encoder for video representation with application to captioning. Proceedings of the IEEE conference on computer vision and pattern recognition,
  • Pan, Y., Mei, T., Yao, T., Li, H., & Rui, Y. (2016). Jointly modeling embedding and translation to bridge video and language. Conference on Computer Vision and Pattern Recognition (CVPR),
  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Annual Meeting of the Association for Computational Linguistics (ACL),
  • Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015). A dataset for movie description. Proceedings of the IEEE conference on computer vision and pattern recognition,
  • Sayracı, B., Ağralı, M., & Kılıç, V. (2023). Artificial Intelligence Based Instance-Aware Semantic Lobe Segmentation on Chest Computed Tomography Images. Avrupa Bilim ve Teknoloji Dergisi(46), 109-115.
  • Shen, F., Shen, C., Shi, Q., Van Den Hengel, A., & Tang, Z. (2013). Inductive hashing on manifolds. Proceedings of the IEEE conference on computer vision and pattern recognition,
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern recognition,
  • Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029.
  • Torabi, A., Pal, C., Larochelle, H., & Courville, A. (2015). Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070.
  • Uslu, B., Çaylı, Ö., Kılıç, V., & Onan, A. (2022). Resnet based deep gated recurrent unit for image captioning on smartphone. Avrupa Bilim ve Teknoloji Dergisi(35), 610-615.
  • Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. Conference on Computer Vision and Pattern Recognition (CVPR),
  • Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence – video to text. Proceedings of the IEEE international conference on computer vision,
  • Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., & Saenko, K. (2014). Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729.
  • Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). MSR-VTT: A large video description dataset for bridging video and language. Proceedings of the IEEE conference on computer vision and pattern recognition,
  • Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. International Conference on Computer Vision (ICCV),
  • Yu, H., Wang, J., Huang, Z., Yang, Y., & Xu, W. (2016). Video paragraph captioning using hierarchical recurrent neural networks. Conference on Computer Vision and Pattern Recognition (CVPR),

Details

Primary Language Turkish
Subjects Computer Vision, Pattern Recognition, Video Processing, Natural Language Processing
Journal Section Articles
Authors

Bengü Fetiler 0000-0002-2761-7751

Özkan Çaylı 0000-0002-3389-3867

Volkan Kılıç 0000-0002-3164-1981

Early Pub Date February 6, 2024
Publication Date February 15, 2024
Published in Issue Year 2024 Issue: 53

Cite

APA Fetiler, B., Çaylı, Ö., & Kılıç, V. (2024). Video Altyazılama için Önceden Eğitilmiş 3B-CNN’lerden Yararlanma. Avrupa Bilim Ve Teknoloji Dergisi(53), 58-63.