SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models

📝 Original Info

  • Title: SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models
  • ArXiv ID: 2601.01062
  • Date: 2026-01-03
  • Authors: Yunlin Zeng (Georgia Institute of Technology, yzeng@gatech.edu)

📝 Abstract

Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives, specifically multi-speaker podcast dialogues, remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model's ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).
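As a rough illustration of the two style metrics named in the abstract, the Python sketch below computes an average turn length and a speaker switch rate for a podcast script. The `SPEAKER: text` line format, the use of whitespace-separated words as the length unit, and the definition of switch rate as the fraction of adjacent turns where the speaker changes are all assumptions made for this example; the abstract does not give the paper's exact definitions, and the function names are hypothetical.

```python
import re

# Assumed script format (not taken from the paper): one turn per line,
# written as "SPEAKER: text".
TURN_RE = re.compile(r"^\s*([^:\n]{1,40}):\s*(.+)$")


def parse_turns(script):
    """Return a list of (speaker, text) tuples, one per matching line."""
    turns = []
    for line in script.splitlines():
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1).strip(), match.group(2).strip()))
    return turns


def average_turn_length(turns):
    """Mean number of whitespace-separated words per turn (assumed unit)."""
    if not turns:
        return 0.0
    return sum(len(text.split()) for _, text in turns) / len(turns)


def speaker_switch_rate(turns):
    """Fraction of adjacent turn pairs whose speakers differ (assumed definition)."""
    if len(turns) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(turns, turns[1:]) if a[0] != b[0])
    return switches / (len(turns) - 1)


if __name__ == "__main__":
    # Toy script invented for illustration only.
    demo = (
        "HOST_A: Look at that white bus in the forest.\n"
        "HOST_B: Wow.\n"
        "HOST_B: It looks abandoned to me.\n"
        "HOST_A: Right?"
    )
    demo_turns = parse_turns(demo)
    print(average_turn_length(demo_turns), speaker_switch_rate(demo_turns))
```

Under these assumptions, longer average turns would indicate more developed contributions per speaker, while the switch rate captures how often the floor changes hands; other normalizations (for example, switches per fixed number of words) are equally plausible readings.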

📄 Full Content

SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models

Yunlin Zeng
Georgia Institute of Technology
yzeng@gatech.edu

arXiv:2601.01062v1 [cs.LG] 3 Jan 2026

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives, specifically multi-speaker podcast dialogues, remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model's ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).

1. Introduction

The field of Computer Vision has rapidly evolved from passive perception (classification, detection) to active generation. Modern Vision-Language Models (VLMs) are capable of processing complex visual inputs and generating detailed textual descriptions. However, a significant gap remains between description and storytelling. While state-of-the-art models can accurately identify “a white bus in a forest,” they often struggle to weave that visual cue into an engaging, multi-turn conversation that exhibits personality, humor, and natural flow.

This limitation is partly due to training data (most VLMs are trained on caption-heavy datasets like LAION [1] or COCO [2], which prioritize factual brevity) and partly due to the lack of appropriate evaluation metrics for narrative quality. Standard n-gram metrics (BLEU [3], ROUGE [4]) penalize creativity and linguistic diversity, effectively encouraging models to produce safe, repetitive, and robotic outputs. As Generative AI moves into creative domains, assessing the “quality” of a generated narrative requires new frameworks that account for hallucinations of personality, conversational dynamics, and prosodic structure.

In this paper, we address the challenge of visual podcast generation: transforming a sequence of images into a coherent, entertaining podcast script between two distinct hosts. Figure 1 illustrates a typical input from the Visual Storytelling (VIST) dataset [5] (five images, each with a simple one-sentence caption), which we aim to transform into rich multi-turn dialogues.

We introduce the SPoRC-VIST Benchmark, a framework that uses the abundance of high-quality text data by pairing it with synthetic visuals for training, while testing on real-world photographic sequences. Our contributions are threefold: (1) We curate a dataset of 4,000 visual-dialogue pairs and fine-tune a parameter-efficient Qwen3-VL-32B model using LoRA to perform style transfer from “captioner” to “podcaster.” (2) We propose a new set of style-aware metrics (turn length, switch rate) and an AI-as-a-Judge protocol to evaluate “Hallucination of Personality” and conversational naturalness. (3) We demonstrate that a smaller, fine-tuned model (32B) can outperform a massive base model (235B) in narrative quality without degrading visual grounding performance, and validate the effectiveness of our synthetic-to-real generalization strategy. Code to reproduce data generation and model training is available at https://github.com/Yunlin-Zeng/visual-podcast-VLM.

2. Related Work

2.1. Visual Storytelling

The task of generating narratives from image sequences was formalized by the VIST dataset [5]. While VIST established sequential visual storytelling, its annotations consist of a sequence of descriptive single sentences per image. Other approaches have focused on generating a single, paragraph-length story for a set of images. Our work diverges from these fields by structuring the narrative as a multi-speaker dialogue, a significantly more complex task that requires modeling conversational flow, personality, and inter-speaker dynamics, which are not the primary focus

…(Full text truncated)…
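The AI-as-a-Judge comparison described in the contributions reports its outcome as a win rate. A minimal sketch of how such a rate could be tallied per judge is shown below. The verdict labels (`"finetuned"`, `"base"`, `"tie"`), the placeholder judge names, the choice to count ties as non-wins, and the function name are assumptions for illustration; the available text does not describe how the paper aggregates judge verdicts.

```python
from collections import Counter


def win_rates_by_judge(verdicts, model="finetuned"):
    """Compute, for each judge, the fraction of comparisons won by `model`.

    `verdicts` is an iterable of (judge, winner) pairs, where `winner` is one
    of "finetuned", "base", or "tie" (hypothetical labels). Ties are counted
    as non-wins here, which is an assumption of this sketch.
    """
    wins = Counter()
    totals = Counter()
    for judge, winner in verdicts:
        totals[judge] += 1
        if winner == model:
            wins[judge] += 1
    return {judge: wins[judge] / totals[judge] for judge in totals}


if __name__ == "__main__":
    # Made-up verdicts for illustration only; these are not results from the paper.
    toy_verdicts = [
        ("judge_a", "finetuned"),
        ("judge_a", "base"),
        ("judge_a", "finetuned"),
        ("judge_a", "finetuned"),
        ("judge_b", "finetuned"),
        ("judge_b", "tie"),
    ]
    print(win_rates_by_judge(toy_verdicts))
```

Reporting the rate separately per judge, rather than pooling all verdicts, makes it easier to see whether a single judge model drives the overall preference.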
