VİDEO ETİKETLEME UYGULAMALARINDA DERİN ÖĞRENME YAKLAŞIMLARININ KULLANILMASI ÜZERİNE KAPSAMLI BİR İNCELEME

Özlem Alpay; M. Ali Akcayol

doi:10.21923/jesd.830587

Abstract

Video captioning is defined as the automatic generation of captions for videos. Because it combines computer vision and natural language approaches, it is attracting growing interest. Generating expressions in natural language and combining them with image frames is a challenging process. Various approaches have been developed to address this problem. This study presents a literature review of developments in video captioning research. The reviewed studies are grouped into categories according to the methods they use. The methods are summarized, and their strengths and limitations are analyzed. Deep learning is one of the most widely used methods in this area. The applicability of deep learning approaches to video captioning systems is examined. The datasets and performance evaluation criteria used in this area are compared and analyzed. Advances in deep learning methods have enabled new approaches to video captioning, and studies that apply deep learning methods to video captioning have achieved successful results.

Details

Primary Language

Turkish

Subjects

Computer Software

Journal Section

Research Article

Authors

Özlem Alpay*
0000-0002-5432-4102
Türkiye

M. Ali Akcayol
0000-0002-6615-1237
Türkiye

Publication Date

December 29, 2020

Submission Date

November 24, 2020

Acceptance Date

December 28, 2020

Published in Issue

Year: 2020, Volume: 8, Number: 5

DOI

https://doi.org/10.21923/jesd.830587

IZ

https://izlik.org/JA85LG84FZ

Cite

APA

Alpay, Ö., & Akcayol, M. A. (2020). VİDEO ETİKETLEME UYGULAMALARINDA DERİN ÖĞRENME YAKLAŞIMLARININ KULLANILMASI ÜZERİNE KAPSAMLI BİR İNCELEME. Mühendislik Bilimleri Ve Tasarım Dergisi, 8(5), 271-289. https://doi.org/10.21923/jesd.830587

A COMPREHENSIVE REVIEW ON USING OF DEEP LEARNING APPROACHES IN VIDEO CAPTIONING APPLICATIONS
