Research Article

Transforming Frame-Based Video Descriptions into Text Using Large Language Models

Year 2026, Volume: 13, Issue: 1, 229–241, 31.03.2026
https://doi.org/10.54287/gujsa.1839173
https://izlik.org/JA69ZP37XX

Abstract

This study proposes a novel framework that integrates frame-based video description techniques with transformer-based large language models (LLMs) to automatically generate coherent and contextually accurate narratives from video content. In the first stage, each frame of a video is processed by a vision-language model, specifically BLIP, to produce preliminary captions describing the visual elements in individual frames. In the second stage, these frame-level captions are provided sequentially to a fine-tuned LLM, FLAN-T5, which leverages temporal coherence and contextual information across frames to generate fluent, semantically consistent text. The proposed method is applied to Turkish video datasets, highlighting its contribution to low-resource language processing within the field of Natural Language Processing (NLP). Experimental results demonstrate that the system achieves a BLEU-4 score of 0.67 and a ROUGE-L score of 0.72, significantly improving overall narrative quality over baseline captioning methods. These results indicate that combining frame-based video analysis with advanced LLMs not only enhances contextual understanding but also opens new opportunities for video indexing, accessibility solutions, and multimodal content analysis in Turkish and other underrepresented languages.
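As a concrete illustration of the two-stage pipeline the abstract describes, the following is a minimal Python sketch that captions sampled frames with BLIP and then fuses those captions with FLAN-T5, assuming the Hugging Face transformers and OpenCV libraries. The checkpoint names, frame-sampling stride, and fusion prompt are illustrative assumptions, not the authors' configuration; in particular, the paper fine-tunes FLAN-T5 and targets Turkish data, whereas this sketch prompts the off-the-shelf English checkpoint.

```python
# Minimal sketch of the two-stage pipeline: BLIP frame captions -> FLAN-T5 fusion.
# Checkpoints, sampling stride, and prompt are illustrative assumptions; the
# paper's system uses a fine-tuned FLAN-T5 on Turkish video data.
import cv2  # pip install opencv-python
from PIL import Image
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          BlipForConditionalGeneration, BlipProcessor)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
t5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def sample_frames(video_path, every_n=30):
    """Yield every n-th frame of the video as a PIL image (BGR -> RGB)."""
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()

def caption_frame(image):
    """Stage 1: produce a frame-level caption with BLIP."""
    inputs = blip_proc(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def fuse_captions(captions):
    """Stage 2: feed the captions in temporal order to FLAN-T5 for one narrative."""
    prompt = ("Rewrite these frame descriptions, given in temporal order, "
              "as one coherent paragraph:\n"
              + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions)))
    ids = t5_tok(prompt, return_tensors="pt", truncation=True).input_ids
    out = t5.generate(ids, max_new_tokens=128)
    return t5_tok.decode(out[0], skip_special_tokens=True)

captions = [caption_frame(f) for f in sample_frames("video.mp4")]
print(fuse_captions(captions))
```

The generated narrative would then be scored against reference descriptions with BLEU-4 and ROUGE-L, the metrics reported in the abstract.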

References

  • Demiral, Y., & Sayar, A. (2022). Real-time video stream processing: Spark-based sub-stream generation for scalable analytics. International Conference on Advanced Information Networking and Applications.
  • Denizgez, T. M., Kamiloğlu, O., Kul, S., & Sayar, A. (2021). Guiding visually impaired people to find an object by using image to speech over the smart phone cameras. 2021 International Conference on Innovations in Intelligent Systems and Applications (ASYU). https://doi.org/10.1109/ASYU52992.2021.9599013
  • Dikilitaş, Y., & Sayar, A. (2023). An intelligent framework for secure object detection and image transmission in UAV systems. International Conference on Intelligent and Fuzzy Systems, 374–381.
  • Eken, S., & Sayar, A. (2015). An automated technique to determine spatio-temporal changes in satellite island images with vectorization and spatial queries. Sadhana, 40(1), 121–137.
  • Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., Schmid, C., & Malik, J. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4565–4574.
  • Krishna, R., Hata, K., Ren, F., Fei-Fei, L., & Niebles, J. C. (2017). Dense-captioning events in videos. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 706–715.
  • Kul, S., Eken, S., & Sayar, A. (2016). Evaluation of real-time performance for BGSLibrary algorithms: A case study on traffic surveillance video. 2016 6th International Conference on IT Convergence and Security (ICITCS), 1–4.
  • Kul, S., Eken, S., & Sayar, A. (2017). Distributed and collaborative real-time vehicle detection and classification over the video streams. International Journal of Advanced Robotic Systems, 14(4), 1–11. https://doi.org/10.1177/1729881417720782
  • Kul, S., & Sayar, A. (2020). A smart recipe recommendation system based on image processing and deep learning. Proceedings of the International Conference on Smart City Applications.
  • Lei, J., Yu, L., Bansal, M., & Berg, T. L. (2018). TVQA: Localized, compositional video question answering. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1369–1379.
  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Proceedings of the 39th International Conference on Machine Learning, 12888–12900.
  • Omurca, S. I., Ekinci, E., Sevim, S., Edinç, E. B., Eken, S., & Sayar, A. (2023). A document image classification system fusing deep and machine learning models. Applied Intelligence, 53(12), 15295–15310.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67.
  • Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015). A dataset for movie description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A., & Schiele, B. (2017). Movie description. International Journal of Computer Vision, 123(1), 94–120.
  • Sayar, A. (2021). A distributed framework for measuring average vehicle speed using real-time traffic camera videos. Proceedings of Third International Conference on Sustainable Expert Systems.
  • Sayar, A., Eken, S., & Mert, U. (2015). Tiling of satellite images to capture an island object. International Conference on Engineering Applications of Neural Networks, 195–204.
  • Sayar, A., & Mustacoglu, A. F. (2024). Street-based parking lot detection with image processing and deep learning. Signal, Image and Video Processing, 18(Suppl. 1), 945–952.
  • Şentaş, A., Kul, S., & Sayar, A. (2019). Real-time traffic rules infringing determination over the video stream: Wrong way and clearway violation detection. 2019 International Artificial Intelligence and Data Processing Symposium.
  • Sevim, S., Ekinci, E., Omurca, S. I., Edinç, E. B., Eken, S., Erdem, T., & Sayar, A. (2022). Multi-class document image classification using deep visual and textual features. International Journal of Computational Intelligence and Applications, 21(2).
  • Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., & Gupta, A. (2016). Hollywood in homes: Crowdsourcing data collection for activity understanding. Proceedings of the European Conference on Computer Vision (ECCV).
  • Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). MovieQA: Understanding stories in movies through question-answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4631–4640.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.
  • Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence – video to text. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 4534–4542.
  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156–3164.
  • Wang, X., Chen, W., Wu, J., Wang, Y.-F., & Wang, W. Y. (2018). Video captioning via hierarchical reinforcement learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4213–4222.
  • Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). MSR-VTT: A large video description dataset for bridging video and language. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5288–5296.
  • Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. Proceedings of the IEEE International Conference on Computer Vision, 4507–4515.
  • Zhang, K., Chao, W.-L., Sha, F., & Grauman, K. (2016). Video summarization with long short-term memory. Proceedings of the European Conference on Computer Vision (ECCV), 766–782.
  • Zhao, B., Li, X., & Lu, X. (2018). HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7005–7014.
  • Zhou, L., Kalantidis, Y., Chen, X., Corso, J. J., & Rohrbach, M. (2020). End-to-end dense video captioning with parallel decoding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6847–6857.
  • Zhou, L., Zhou, Y., Corso, J. J., Socher, R., & Xiong, C. (2018). End-to-end dense video captioning with masked transformer. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 8739–8748.
There are 33 citations in total.

Details

Primary Language English
Subjects Image Processing, Modelling and Simulation
Journal Section Research Article
Authors

Özge Soğukpınar Gül (ORCID: 0000-0002-8348-1887)

Ahmet Sayar (ORCID: 0000-0002-6335-459X)

Submission Date December 9, 2025
Acceptance Date January 27, 2026
Publication Date March 31, 2026
DOI https://doi.org/10.54287/gujsa.1839173
IZ https://izlik.org/JA69ZP37XX
Published in Issue Year 2026 Volume: 13 Issue: 1

Cite

APA Soğukpınar Gül, Ö., & Sayar, A. (2026). Transforming Frame-Based Video Descriptions into Text Using Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation, 13(1), 229–241. https://doi.org/10.54287/gujsa.1839173