This study proposes a novel framework that integrates frame-based video description techniques with transformer-based large language models (LLMs) to automatically generate coherent, contextually accurate narratives from video content. In the first stage, each frame of a video is processed by a vision-language model, specifically BLIP, to produce preliminary captions describing the visual elements present in individual frames. In the second stage, these frame-level captions are provided in sequence to a fine-tuned LLM, FLAN-T5, which leverages temporal coherence and contextual information across frames to generate a fluent, semantically consistent narrative. The proposed method is applied to Turkish video datasets, highlighting its contribution to low-resource language processing within the field of Natural Language Processing (NLP). Experimental results show that the system achieves a BLEU-4 score of 0.67 and a ROUGE-L score of 0.72, significantly improving overall narrative quality compared to baseline captioning methods. These results indicate that combining frame-based video analysis with advanced LLMs not only enhances contextual understanding but also opens new opportunities for video indexing, accessibility solutions, and multimodal content analysis in Turkish and other underrepresented languages.
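The two-stage pipeline described above can be sketched in outline. The sketch below covers only the orchestration logic: sampling frames at a fixed interval and assembling the frame-level captions into a prompt for the narrative-generation stage. The function names (`sample_frame_indices`, `build_narrative_prompt`, `describe_video`), the one-frame-per-second sampling rate, and the prompt wording are illustrative assumptions, not the authors' implementation; the actual BLIP captioner and fine-tuned FLAN-T5 model are abstracted behind a `generate` callback.

```python
from typing import Callable, List


def sample_frame_indices(total_frames: int, fps: float,
                         every_sec: float = 1.0) -> List[int]:
    """Indices of frames sampled at a fixed interval (assumed: 1 frame/sec)."""
    step = max(1, int(round(fps * every_sec)))
    return list(range(0, total_frames, step))


def build_narrative_prompt(captions: List[str]) -> str:
    """Assemble ordered frame-level captions into a single instruction prompt.

    Preserving frame order is what lets the second-stage LLM exploit
    temporal coherence across the video.
    """
    numbered = "\n".join(f"Frame {i + 1}: {c}" for i, c in enumerate(captions))
    return ("Below are captions describing consecutive frames of a video.\n"
            f"{numbered}\n"
            "Write one coherent paragraph narrating the video.")


def describe_video(captions: List[str],
                   generate: Callable[[str], str]) -> str:
    """Stage 2: pass the assembled prompt to a seq2seq LLM (e.g. FLAN-T5).

    `generate` stands in for tokenizing the prompt, running the model,
    and decoding its output.
    """
    return generate(build_narrative_prompt(captions))
```

In a full implementation, `generate` would wrap the fine-tuned FLAN-T5 model (for example via the Hugging Face `transformers` library), and the captions would come from running BLIP on each sampled frame.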
Video Processing, LLM, Video Description, Video Analysis with LLM, Large Language Models, Turkish NLP
| Primary Language | English |
|---|---|
| Subjects | Image Processing, Modelling and Simulation |
| Journal Section | Research Article |
| Authors | |
| Submission Date | December 9, 2025 |
| Acceptance Date | January 27, 2026 |
| Publication Date | March 31, 2026 |
| DOI | https://doi.org/10.54287/gujsa.1839173 |
| IZ | https://izlik.org/JA69ZP37XX |
| Published in Issue | Year 2026 Volume: 13 Issue: 1 |