Leveraging Pre-trained 3D-CNNs for Video Captioning

Bengü Fetiler; Özkan Çaylı; Volkan Kılıç

Abstract

Video captioning is a visual understanding task that aims to generate descriptions that are both grammatically and semantically accurate. One of the main challenges in video captioning is capturing the complex dynamics in videos. This study addresses that challenge by using pre-trained 3D Convolutional Neural Networks (3D-CNNs). These networks are particularly effective at modeling such dynamics, which improves the contextual understanding of videos. The proposed approach was evaluated on the Microsoft Research Video Description (MSVD) dataset, a widely recognized benchmark for video captioning. To assess performance, we used standard metrics, including BLEU-1 through BLEU-4, CIDEr, ROUGE-L, METEOR, and SPICE. The results show significant improvements across all of these metrics, highlighting that pre-trained 3D-CNNs improve video captioning accuracy.

Keywords

Details

Primary Language

Turkish

Subjects

Computer Vision, Pattern Recognition, Video Processing, Natural Language Processing

Section

Research Article

Authors

Bengü Fetiler *
0000-0002-2761-7751
Türkiye

Özkan Çaylı
0000-0002-3389-3867
Türkiye

Volkan Kılıç
0000-0002-3164-1981
Türkiye

Early View Date

February 6, 2024

Publication Date

February 15, 2024

Submission Date

October 6, 2023

Acceptance Date

November 19, 2023

Published in Issue

Year: 2024, Issue: 53

IZ

https://izlik.org/JA63GK83AG

Cite

APA

Fetiler, B., Çaylı, Ö., & Kılıç, V. (2024). Video Altyazılama için Önceden Eğitilmiş 3B-CNN’lerden Yararlanma. Avrupa Bilim ve Teknoloji Dergisi, 53, 58-63. https://izlik.org/JA63GK83AG
