Transforming Frame-Based Video Descriptions into Text Using Large Language Models
Abstract
This study proposes a framework that combines frame-based video description with transformer-based large language models (LLMs) to automatically generate coherent, contextually accurate narratives from video content. In the first stage, a vision-language model, BLIP, processes each video frame and produces a preliminary caption describing the visual elements in that frame. In the second stage, the frame-level captions are passed in sequence to a fine-tuned LLM, FLAN-T5, which uses temporal coherence and contextual information across frames to generate fluent and semantically consistent text. The method is applied to Turkish video datasets, contributing to natural language processing (NLP) for low-resource languages. Experimental results show that the system achieves a BLEU-4 score of 0.67 and a ROUGE-L score of 0.72, significantly improving overall narrative quality over baseline captioning methods. These results indicate that combining frame-based video analysis with LLMs enhances contextual understanding and opens new opportunities for video indexing, accessibility solutions, and multimodal content analysis in Turkish and other underrepresented languages.
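The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the captioner and the text generator are passed in as plain callables (in practice these would wrap BLIP and a fine-tuned FLAN-T5), and the frame sampling step, the removal of repeated captions, and the prompt wording are all hypothetical choices that the abstract does not specify.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Sequence[Frame], step: int) -> List[Frame]:
    """Keep every `step`-th frame (assumed sampling strategy, not from the paper)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(frames[::step])


def collapse_repeats(captions: Iterable[str]) -> List[str]:
    """Drop empty captions and consecutive duplicates, preserving temporal order."""
    collapsed: List[str] = []
    for caption in captions:
        text = caption.strip()
        if text and (not collapsed or collapsed[-1] != text):
            collapsed.append(text)
    return collapsed


def build_prompt(captions: Sequence[str]) -> str:
    """Number the captions so the language model sees the frame order.

    The instruction wording is a hypothetical placeholder.
    """
    lines = [f"{i}. {c}" for i, c in enumerate(captions, start=1)]
    header = "Combine these frame captions into one coherent narrative:"
    return header + "\n" + "\n".join(lines)


def describe_video(
    frames: Sequence[Frame],
    caption_frame: Callable[[Frame], str],   # stage 1: stand-in for a BLIP captioner
    generate_text: Callable[[str], str],     # stage 2: stand-in for fine-tuned FLAN-T5
    step: int = 1,
) -> str:
    """Run stage 1 on sampled frames, then hand the ordered captions to stage 2."""
    captions = collapse_repeats(caption_frame(f) for f in sample_frames(frames, step))
    return generate_text(build_prompt(captions))
```

Keeping both models behind callables means the same pipeline can be exercised with stub functions before any model weights are loaded, and the captioner or language model can be swapped without touching the orchestration logic.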
Details
Primary Language: English
Subjects: Image Processing, Modelling and Simulation
Journal Section: Research Article
Publication Date: March 31, 2026
Submission Date: December 9, 2025
Acceptance Date: January 27, 2026
Published in Issue: Year 2026, Volume 13, Number 1