Research Article

Transforming Frame-Based Video Descriptions into Text Using Large Language Models

Year 2026, Volume: 13, Issue: 1, 229–241, 31.03.2026
https://doi.org/10.54287/gujsa.1839173
https://izlik.org/JA69ZP37XX

Abstract

This study proposes a novel framework that integrates frame-based video description techniques with transformer-based large language models (LLMs) to automatically generate coherent and contextually accurate narratives from video content. In the first stage, each frame of a video is processed by a vision-language model, specifically BLIP, to produce preliminary captions describing the visual elements in individual frames. In the second stage, these frame-level captions are provided sequentially to a fine-tuned LLM, FLAN-T5, which leverages temporal coherence and contextual information across frames to generate fluent, semantically consistent text. The proposed method is applied to Turkish video datasets, highlighting its contribution to low-resource language processing within the field of Natural Language Processing (NLP). Experimental results demonstrate that the system achieves a BLEU-4 score of 0.67 and a ROUGE-L score of 0.72, significantly improving overall narrative quality over baseline captioning methods. These results indicate that combining frame-based video analysis with advanced LLMs not only enhances contextual understanding but also opens new opportunities for video indexing, accessibility solutions, and multimodal content analysis in Turkish and other underrepresented languages.
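As a concrete illustration of the two-stage pipeline the abstract describes, the following is a minimal Python sketch that captions sampled frames with BLIP and then fuses those captions with FLAN-T5, assuming the Hugging Face transformers and OpenCV libraries. The checkpoint names, frame-sampling stride, and fusion prompt are illustrative assumptions, not the authors' configuration; in particular, the paper fine-tunes FLAN-T5 and targets Turkish data, whereas this sketch prompts the off-the-shelf English checkpoint.

```python
# Minimal sketch of the two-stage pipeline: BLIP frame captions -> FLAN-T5 fusion.
# Checkpoints, sampling stride, and prompt are illustrative assumptions; the
# paper's system uses a fine-tuned FLAN-T5 on Turkish video data.
import cv2  # pip install opencv-python
from PIL import Image
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          BlipForConditionalGeneration, BlipProcessor)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
t5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def sample_frames(video_path, every_n=30):
    """Yield every n-th frame of the video as a PIL image (BGR -> RGB)."""
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()

def caption_frame(image):
    """Stage 1: produce a frame-level caption with BLIP."""
    inputs = blip_proc(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def fuse_captions(captions):
    """Stage 2: feed the captions in temporal order to FLAN-T5 for one narrative."""
    prompt = ("Rewrite these frame descriptions, given in temporal order, "
              "as one coherent paragraph:\n"
              + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions)))
    ids = t5_tok(prompt, return_tensors="pt", truncation=True).input_ids
    out = t5.generate(ids, max_new_tokens=128)
    return t5_tok.decode(out[0], skip_special_tokens=True)

captions = [caption_frame(f) for f in sample_frames("video.mp4")]
print(fuse_captions(captions))
```

The generated narrative would then be scored against reference descriptions with BLEU-4 and ROUGE-L, the metrics reported in the abstract.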

References

  • Demiral, Y., & Sayar, A. (2022). Real-time video stream processing: Spark-based sub-stream generation for scalable analytics. International Conference on Advanced Information Networking and Applications.
  • Denizgez, T. M., Kamiloğlu, O., Kul, S., & Sayar, A. (2021). Guiding visually impaired people to find an object by using image to speech over the smart phone cameras. 2021 International Conference on Innovations in Intelligent Systems and Applications (ASYU). https://doi.org/10.1109/ASYU52992.2021.9599013
  • Dikilitaş, Y., & Sayar, A. (2023). An intelligent framework for secure object detection and image transmission in UAV systems. International Conference on Intelligent and Fuzzy Systems, 374–381.
  • Eken, S., & Sayar, A. (2015). An automated technique to determine spatio-temporal changes in satellite island images with vectorization and spatial queries. Sadhana, 40(1), 121–137.
  • Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., Schmid, C., & Malik, J. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4565–4574.
  • Krishna, R., Hata, K., Ren, F., Fei-Fei, L., & Niebles, J. C. (2017). Dense-captioning events in videos. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 706–715.
  • Kul, S., Eken, S., & Sayar, A. (2016). Evaluation of real-time performance for BGSLibrary algorithms: A case study on traffic surveillance video. 2016 6th International Conference on IT Convergence and Security (ICITCS), 1–4.
  • Kul, S., Eken, S., & Sayar, A. (2017). Distributed and collaborative real-time vehicle detection and classification over the video streams. International Journal of Advanced Robotic Systems, 14(4), 1–11. https://doi.org/10.1177/1729881417720782
  • Kul, S., & Sayar, A. (2020). A smart recipe recommendation system based on image processing and deep learning. Proceedings of the International Conference on Smart City Applications.
  • Lei, J., Yu, L., Bansal, M., & Berg, T. L. (2018). TVQA: Localized, compositional video question answering. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1369–1379.
  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Proceedings of the 39th International Conference on Machine Learning, 12888–12900.
  • Omurca, S. I., Ekinci, E., Sevim, S., Edinç, E. B., Eken, S., & Sayar, A. (2023). A document image classification system fusing deep and machine learning models. Applied Intelligence, 53(12), 15295–15310.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67.
  • Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015). A dataset for movie description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A., & Schiele, B. (2017). Movie description. International Journal of Computer Vision, 123(1), 94–120.
  • Sayar, A. (2021). A distributed framework for measuring average vehicle speed using real-time traffic camera videos. Proceedings of Third International Conference on Sustainable Expert Systems.
  • Sayar, A., Eken, S., & Mert, U. (2015). Tiling of satellite images to capture an island object. International Conference on Engineering Applications of Neural Networks, 195–204.
  • Sayar, A., & Mustacoglu, A. F. (2024). Street-based parking lot detection with image processing and deep learning. Signal, Image and Video Processing, 18(Suppl. 1), 945–952.
  • Şentaş, A., Kul, S., & Sayar, A. (2019). Real-time traffic rules infringing determination over the video stream: Wrong way and clearway violation detection. 2019 International Artificial Intelligence and Data Processing Symposium.
  • Sevim, S., Ekinci, E., Omurca, S. I., Edinç, E. B., Eken, S., Erdem, T., & Sayar, A. (2022). Multi-class document image classification using deep visual and textual features. International Journal of Computational Intelligence and Applications, 21(2).
  • Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., & Gupta, A. (2016). Hollywood in homes: Crowdsourcing data collection for activity understanding. Proceedings of the European Conference on Computer Vision (ECCV).
  • Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). MovieQA: Understanding stories in movies through question-answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4631–4640.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.
  • Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence – video to text. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 4534–4542.
  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156–3164.
  • Wang, X., Chen, W., Wu, J., Wang, Y.-F., & Wang, W. Y. (2018). Video captioning via hierarchical reinforcement learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4213–4222.
  • Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). MSR-VTT: A large video description dataset for bridging video and language. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5288–5296.
  • Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. Proceedings of the IEEE International Conference on Computer Vision, 4507–4515.
  • Zhang, K., Chao, W.-L., Sha, F., & Grauman, K. (2016). Video summarization with long short-term memory. Proceedings of the European Conference on Computer Vision (ECCV), 766–782.
  • Zhao, B., Li, X., & Lu, X. (2018). HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7005–7014.
  • Zhou, L., Kalantidis, Y., Chen, X., Corso, J. J., & Rohrbach, M. (2020). End-to-end dense video captioning with parallel decoding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6847–6857.
  • Zhou, L., Zhou, Y., Corso, J. J., Socher, R., & Xiong, C. (2018). End-to-end dense video captioning with masked transformer. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 8739–8748.
There are 33 citations in total.

Details

Primary Language English
Subjects Image Processing, Modelling and Simulation
Journal Section Research Article
Authors

Özge Soğukpınar Gül (ORCID: 0000-0002-8348-1887)

Ahmet Sayar (ORCID: 0000-0002-6335-459X)

Submission Date December 9, 2025
Acceptance Date January 27, 2026
Publication Date March 31, 2026
DOI https://doi.org/10.54287/gujsa.1839173
IZ https://izlik.org/JA69ZP37XX
Published in Issue Year 2026 Volume: 13 Issue: 1

Cite

APA Soğukpınar Gül, Ö., & Sayar, A. (2026). Transforming Frame-Based Video Descriptions into Text Using Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation, 13(1), 229–241. https://doi.org/10.54287/gujsa.1839173