Research Article

Transforming Frame-Based Video Descriptions into Text Using Large Language Models

Volume: 13 Number: 1 March 31, 2026

Abstract

This study proposes a framework that integrates frame-based video description with transformer-based large language models (LLMs) to automatically generate coherent and contextually accurate narratives from video content. In the first stage, a vision-language model, BLIP, processes each video frame and produces a preliminary caption describing the visual elements in that frame. In the second stage, these frame-level captions are passed in temporal order to a fine-tuned LLM, FLAN-T5, which draws on temporal coherence and context across frames to generate fluent and semantically consistent text. The method is applied to Turkish video datasets, contributing to low-resource language processing within Natural Language Processing (NLP). In the experiments, the system achieves a BLEU-4 score of 0.67 and a ROUGE-L score of 0.72, significantly improving overall narrative quality over baseline captioning methods. These results indicate that combining frame-based video analysis with LLMs enhances contextual understanding and opens opportunities for video indexing, accessibility solutions, and multimodal content analysis in Turkish and other underrepresented languages.
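
The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The captioner and narrator are passed in as plain callables standing in for the BLIP and FLAN-T5 wrappers, and the frame sampling step, the removal of repeated captions, and the English prompt wording are all assumptions made for illustration (the abstract specifies none of them). Every function name below is hypothetical.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical stand-ins for the two stages named in the abstract.
Captioner = Callable[[object], str]  # stage 1: frame -> caption (e.g. a BLIP wrapper)
Narrator = Callable[[str], str]      # stage 2: prompt -> narrative (e.g. a FLAN-T5 wrapper)


def sample_frames(frames: Sequence[object], step: int) -> List[object]:
    """Keep every `step`-th frame. The sampling rate is an assumption."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(frames[::step])


def collapse_repeats(captions: Iterable[str]) -> List[str]:
    """Drop empty captions and consecutive duplicates (assumed preprocessing)."""
    kept: List[str] = []
    for caption in captions:
        text = caption.strip()
        if text and (not kept or kept[-1] != text):
            kept.append(text)
    return kept


def build_prompt(captions: Sequence[str]) -> str:
    """Number the captions in temporal order. The wording is illustrative only."""
    lines = [f"{i}. {c}" for i, c in enumerate(captions, start=1)]
    return (
        "Combine these frame descriptions, given in temporal order, "
        "into one coherent paragraph:\n" + "\n".join(lines)
    )


def describe_video(
    frames: Sequence[object],
    captioner: Captioner,
    narrator: Narrator,
    step: int = 1,
) -> str:
    """Stage 1: caption sampled frames. Stage 2: turn the captions into a narrative."""
    raw_captions = (captioner(frame) for frame in sample_frames(frames, step))
    return narrator(build_prompt(collapse_repeats(raw_captions)))
```

Passing the models in as callables keeps the orchestration independent of any particular model-loading code, so the same function works with real model wrappers or with simple stubs during testing.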

Details

Primary Language

English

Subjects

Image Processing, Modelling and Simulation

Journal Section

Research Article

Publication Date

March 31, 2026

Submission Date

December 9, 2025

Acceptance Date

January 27, 2026

Published in Issue

Year 2026 Volume: 13 Number: 1

APA
Soğukpınar Gül, Ö., & Sayar, A. (2026). Transforming Frame-Based Video Descriptions into Text Using Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation, 13(1), 229-241. https://doi.org/10.54287/gujsa.1839173
AMA
1. Soğukpınar Gül Ö, Sayar A. Transforming Frame-Based Video Descriptions into Text Using Large Language Models. GU J Sci, Part A. 2026;13(1):229-241. doi:10.54287/gujsa.1839173
Chicago
Soğukpınar Gül, Özge, and Ahmet Sayar. 2026. “Transforming Frame-Based Video Descriptions into Text Using Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation 13 (1): 229-41. https://doi.org/10.54287/gujsa.1839173.
EndNote
Soğukpınar Gül Ö, Sayar A (March 1, 2026) Transforming Frame-Based Video Descriptions into Text Using Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation 13 1 229–241.
IEEE
[1] Ö. Soğukpınar Gül and A. Sayar, “Transforming Frame-Based Video Descriptions into Text Using Large Language Models”, GU J Sci, Part A, vol. 13, no. 1, pp. 229–241, Mar. 2026, doi: 10.54287/gujsa.1839173.
ISNAD
Soğukpınar Gül, Özge - Sayar, Ahmet. “Transforming Frame-Based Video Descriptions into Text Using Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation 13/1 (March 1, 2026): 229-241. https://doi.org/10.54287/gujsa.1839173.
JAMA
1. Soğukpınar Gül Ö, Sayar A. Transforming Frame-Based Video Descriptions into Text Using Large Language Models. GU J Sci, Part A. 2026;13:229–241.
MLA
Soğukpınar Gül, Özge, and Ahmet Sayar. “Transforming Frame-Based Video Descriptions into Text Using Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation, vol. 13, no. 1, Mar. 2026, pp. 229-41, doi:10.54287/gujsa.1839173.
Vancouver
1. Özge Soğukpınar Gül, Ahmet Sayar. Transforming Frame-Based Video Descriptions into Text Using Large Language Models. GU J Sci, Part A. 2026 Mar. 1;13(1):229-41. doi:10.54287/gujsa.1839173