📝 Original Info
- Title: SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models
- ArXiv ID: 2601.01062
- Date: 2026-01-03
- Authors: Yunlin Zeng (Georgia Institute of Technology, yzeng@gatech.edu)
📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives, specifically multi-speaker podcast dialogues, remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model's ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).
📄 Full Content
SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models
Yunlin Zeng
Georgia Institute of Technology
yzeng@gatech.edu
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives, specifically multi-speaker podcast dialogues, remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model's ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).
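For context on the CLIPScore figure above: CLIPScore (Hessel et al., 2021) is commonly computed as 2.5 * max(cosine(image embedding, text embedding), 0). The paper's exact setup is not shown in this excerpt, so the snippet below is a minimal sketch assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers and the standard 2.5 scaling; the authors may use a different CLIP variant or score per-turn rather than per-script.

```python
# Minimal CLIPScore sketch; NOT the paper's exact implementation.
# Assumptions: openai/clip-vit-base-patch32 and the w=2.5 scaling
# from Hessel et al. (2021).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str, w: float = 2.5) -> float:
    """CLIPScore = w * max(cosine(image_emb, text_emb), 0)."""
    # Note: CLIP's text encoder truncates at 77 tokens, so a full podcast
    # script would need chunking or per-turn scoring (assumption: the
    # excerpt does not say how long texts are handled).
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)
```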
1. Introduction
The field of Computer Vision has rapidly evolved from passive perception (classification, detection) to active generation. Modern Vision-Language Models (VLMs) are capable of processing complex visual inputs and generating detailed textual descriptions. However, a significant gap remains between description and storytelling. While state-of-the-art models can accurately identify "a white bus in a forest," they often struggle to weave that visual cue into an engaging, multi-turn conversation that exhibits personality, humor, and natural flow.
This limitation is partly due to training data (most VLMs are trained on caption-heavy datasets like LAION [1] or COCO [2], which prioritize factual brevity) and partly due to the lack of appropriate evaluation metrics for narrative quality. Standard n-gram metrics (BLEU [3], ROUGE [4]) penalize creativity and linguistic diversity, effectively encouraging models to produce safe, repetitive, and robotic outputs. As Generative AI moves into creative domains, assessing the "quality" of a generated narrative requires new frameworks that account for hallucinations of personality, conversational dynamics, and prosodic structure.
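A quick, self-contained illustration of this failure mode (the sentences are invented for this example, not taken from the paper's data): a verbatim caption scores near-perfect BLEU, while a conversational line conveying the same fact scores close to zero.

```python
# Toy demonstration of why n-gram metrics penalize conversational style.
# Sentences are hypothetical, not from the SPoRC-VIST data.
import sacrebleu

reference = ["A white bus is parked in the forest."]

safe = "A white bus is parked in the forest."  # caption-style, verbatim
lively = "Okay, so why is there a white bus just sitting out in the woods?"

print(sacrebleu.sentence_bleu(safe, reference).score)    # 100.0
print(sacrebleu.sentence_bleu(lively, reference).score)  # very low,
# even though the lively line conveys the same visual fact
```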
In this paper, we address the challenge of visual podcast generation: transforming a sequence of images into a coherent, entertaining podcast script between two distinct hosts. Figure 1 illustrates a typical input from the Visual Storytelling (VIST) dataset [5] (five images, each with a simple one-sentence caption), which we aim to transform into rich multi-turn dialogues.
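The excerpt describes the task contract (five images with one-sentence captions in, a two-host multi-turn script out) but not a concrete data format. The dataclasses below are a hypothetical sketch of that contract; all field names are illustrative, not the paper's actual schema.

```python
# Hypothetical input/output schema for visual podcast generation,
# inferred from the text; field names are illustrative (Python 3.9+).
from dataclasses import dataclass

@dataclass
class VistInput:
    image_paths: list[str]  # five photos from a VIST sequence
    captions: list[str]     # one simple descriptive sentence per image

@dataclass
class DialogueTurn:
    speaker: str            # e.g. "HOST_A" or "HOST_B"
    text: str               # one conversational turn

@dataclass
class PodcastScript:
    turns: list[DialogueTurn]  # multi-turn, two-host narrative grounded in the images
```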
We introduce the SPoRC-VIST Benchmark, a framework that uses the abundance of high-quality text data by pairing it with synthetic visuals for training, while testing on real-world photographic sequences. Our contributions are threefold: (1) We curate a dataset of 4,000 visual-dialogue pairs and fine-tune a parameter-efficient Qwen3-VL-32B model using LoRA to perform style transfer from "captioner" to "podcaster." (2) We propose a new set of style-aware metrics (turn length, switch rate) and an AI-as-a-Judge protocol to evaluate "Hallucination of Personality" and conversational naturalness; both are sketched below. (3) We demonstrate that a smaller, fine-tuned model (32B) can outperform a massive base model (235B) in narrative quality without degrading visual grounding performance, and validate the effectiveness of our synthetic-to-real generalization strategy. Code to reproduce data generation and model training is available at https://github.com/Yunlin-Zeng/visual-podcast-VLM.
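Contribution (2) names two style metrics: average turn length and speaker switch rate. The excerpt does not give formal definitions, so the following is a minimal sketch under one plausible reading (mean words per turn; fraction of adjacent turn pairs where the speaker changes):

```python
# Plausible implementations of the paper's style metrics; the exact
# definitions are an assumption, since the excerpt does not spell them out.

def average_turn_length(turns: list[tuple[str, str]]) -> float:
    """Mean number of words per turn. turns = [(speaker, text), ...]"""
    if not turns:
        return 0.0
    return sum(len(text.split()) for _, text in turns) / len(turns)

def speaker_switch_rate(turns: list[tuple[str, str]]) -> float:
    """Fraction of adjacent turn pairs where the speaker changes."""
    if len(turns) < 2:
        return 0.0
    switches = sum(a != b for (a, _), (b, _) in zip(turns, turns[1:]))
    return switches / (len(turns) - 1)

# Hypothetical example script, for illustration only.
script = [("HOST_A", "Welcome back! Today's photos start with a white bus in a forest."),
          ("HOST_B", "In a forest? Okay, you have to explain that one."),
          ("HOST_A", "Right? So the sequence begins on a hiking trail...")]
print(average_turn_length(script), speaker_switch_rate(script))
```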
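The AI-as-a-Judge results are reported as pairwise win rates across three judge models. The harness below is a hypothetical sketch of such a protocol; the prompt wording, the judge_fn interface, and tie handling are all assumptions, since the paper's actual judging prompts are not included in this excerpt.

```python
# Hypothetical pairwise AI-as-a-Judge harness; prompt wording, judge_fn,
# and tie handling are assumptions, not the paper's actual setup.
from collections import Counter
from typing import Callable

JUDGE_PROMPT = """You are comparing two podcast scripts generated from the same image sequence.
Which script is more conversationally natural and engaging?
Answer with exactly one word: A, B, or TIE.

Script A:
{a}

Script B:
{b}"""

def win_rate(script_pairs: list[tuple[str, str]],
             judge_fn: Callable[[str], str]) -> float:
    """Fraction of decided comparisons won by script A (the fine-tuned model)."""
    votes = Counter()
    for a, b in script_pairs:
        verdict = judge_fn(JUDGE_PROMPT.format(a=a, b=b)).strip().upper()
        votes[verdict if verdict in {"A", "B", "TIE"} else "TIE"] += 1
    decided = votes["A"] + votes["B"]
    return votes["A"] / decided if decided else 0.0
```

In practice one would also swap the A/B presentation order across trials to control for judge position bias; whether the paper does this is not stated in the excerpt.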
2. Related Work
2.1. Visual Storytelling
The task of generating narratives from image sequences was formalized by the VIST dataset [5]. While VIST established sequential visual storytelling, its annotations consist of a sequence of descriptive single sentences per image. Other approaches have focused on generating a single, paragraph-length story for a set of images. Our work diverges from these fields by structuring the narrative as a multi-speaker dialogue, a significantly more complex task that requires modeling conversational flow, personality, and inter-speaker dynamics, which are not the primary focus
…(Full text truncated)…
This content is AI-processed based on ArXiv data.