VideoSTF: Stress-Testing Output Repetition in Video Large Language Models
Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: https://github.com/yuxincao22/VideoSTF_benchmark.
💡 Research Summary
VideoSTF introduces the first systematic framework for measuring and stress‑testing output repetition in video large language models (VideoLLMs). The authors observe that modern VideoLLMs, despite strong performance on captioning, question answering, and other video‑language tasks, can degenerate into self‑reinforcing loops that repeatedly generate the same phrase or sentence. This failure mode is not captured by existing benchmarks, which focus on task accuracy, factual correctness, or hallucinations.
The framework consists of three complementary n‑gram‑based metrics: Repetition Rate (RR), Repetition Intensity (RI), and Information Entropy (IE). RR is a binary indicator that counts how many generated outputs contain at least one n‑gram (n = 5 by default) appearing more than once, thus measuring the presence of looping behavior. RI quantifies the proportion of duplicated n‑grams within each output using the Rep‑n formulation (unigrams by default), capturing how extensive the duplication is when repetition occurs. IE computes a normalized entropy over the empirical n‑gram distribution (also unigrams by default), with lower values indicating reduced lexical diversity and stronger repetition. Together these metrics distinguish between isolated repeats and sustained degeneration.
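The three metrics can be sketched directly from these definitions. The implementation below is a minimal illustration, not the authors' reference code: the function names are my own, RR is shown as the per-output indicator that the benchmark averages into a rate, RI follows the standard Rep-n duplicate-fraction formulation, and IE is Shannon entropy normalized by its maximum for the observed vocabulary.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def has_repetition(tokens, n=5):
    """RR component: True if any n-gram occurs more than once.
    The benchmark's RR is the fraction of outputs for which this holds."""
    counts = Counter(ngrams(tokens, n))
    return any(c > 1 for c in counts.values())

def repetition_intensity(tokens, n=1):
    """RI (Rep-n style): fraction of n-gram occurrences that are duplicates."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def information_entropy(tokens, n=1):
    """IE: Shannon entropy of the empirical n-gram distribution,
    normalized to [0, 1] by the maximum entropy log2(#distinct n-grams)."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    if total == 0 or len(counts) <= 1:
        return 0.0
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(counts))
```

On a degenerate caption such as "… continues to walk continues to walk …", `has_repetition` fires, RI is high, and IE is well below 1.0; a non-repetitive caption with all-distinct tokens scores IE = 1.0.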
To evaluate the metrics, the authors construct a standardized testbed of 10,000 videos sampled from four public video‑instruction datasets (LLaVA‑Video‑178K, Next‑QA, ActivityNetQA, LLaVA‑Hound). The videos span a wide range of durations (2 s to 180 s) and semantic domains (comedy, lifestyle, sports, etc.), providing a realistic and diverse benchmark.
A key contribution is the Temporal Stressor Library, which defines five controlled temporal transformations that preserve semantic content while altering the temporal order: Add (insert random frames), Delete (remove random frames), Replace (swap frames), Reverse (invert the entire sequence), and Shuffle (randomly permute frames). These transformations enable precise probing of how temporal coherence influences generation stability.
The experimental protocol evaluates ten representative VideoLLMs, including LLaVA‑Video‑7B‑Qwen2, LLaVA‑NeXT‑7B‑DPO, VideoLLaMA2, ShareGPT‑4V, InternVL3.5‑8B, Qwen3‑VL‑8B‑Instruct, and Molmo2‑8B. All models are run in deterministic mode (temperature = 0, do_sample = False) to isolate the effect of the visual input.
Pervasive Testing: Models are tested with four frame‑sampling rates (8, 16, 24, 32 frames). Across all models, RR remains non‑trivial (often > 5 %) and does not vary significantly with the number of sampled frames, indicating that repetition is a pervasive phenomenon. Videos containing recurring or highly similar scenes trigger characteristic looping phrases such as “continues to”.
Temporal Stress Testing: Applying each temporal transformation dramatically increases repetition. In many settings, RR jumps to 30 %–90 % compared with the original videos, while RI rises and IE falls correspondingly, indicating heavier duplication and reduced lexical diversity. The most aggressive effects are observed for Shuffle and Reverse, suggesting that the models rely heavily on sequential temporal cues; when these cues are disrupted, the language decoder falls into a self‑reinforcing loop.
Adversarial Exploitation: The authors treat the temporal transformations as black‑box attacks. An attacker who can modify the sampled frames (but not the model internals) can induce repetition in a video that originally produced a normal caption with only a few dozen queries. This demonstrates that output repetition is not merely a diagnostic artifact but an exploitable security vulnerability, potentially enabling denial‑of‑service attacks that waste computational resources.
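The attack described above reduces to a small query loop: sample a temporal transformation, query the model, score the output with a repetition metric, and stop once the metric crosses a threshold. The sketch below is a hypothetical reconstruction of such a loop, not the paper's attack code; `model_generate`, `metric`, and the threshold are caller-supplied assumptions, and the inline stressors stand in for the full stressor library.

```python
import random

def black_box_attack(model_generate, frames, metric,
                     threshold=0.3, max_queries=50, rng=random):
    """Query-efficient repetition attack. `model_generate` maps a frame
    sequence to output text; `metric` scores repetition (e.g. RI).
    Returns (adversarial_frames, score, queries_used) on success,
    or (None, 0.0, max_queries) if no repetition was induced."""
    stressors = [
        lambda f: list(reversed(f)),      # Reverse
        lambda f: rng.sample(f, len(f)),  # Shuffle
    ]
    for query in range(1, max_queries + 1):
        candidate = rng.choice(stressors)(list(frames))
        score = metric(model_generate(candidate))
        if score >= threshold:
            return candidate, score, query
    return None, 0.0, max_queries
```

Note that the loop needs only output text, no gradients or logits, which is what makes the vulnerability practical against hosted VideoLLM APIs: each successful trigger forces the victim to decode a long, repetitive sequence at full cost.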
The paper’s contributions are fourfold: (1) identification of output repetition as a distinct stability failure in VideoLLMs; (2) a comprehensive framework (metrics, testbed, stressors) for its measurement; (3) extensive empirical evidence that repetition is widespread and highly sensitive to temporal perturbations; and (4) demonstration that simple temporal manipulations constitute a practical black‑box attack surface.
Limitations include the reliance on n‑gram metrics that capture only lexical repetition; visual repetition (e.g., repeated frames) is not directly quantified. The study also focuses on deterministic decoding; the interaction of temperature, sampling strategies, and repetition remains an open question. Future work could extend the metrics to multimodal n‑gram analysis, explore mitigation strategies such as temporal consistency checks or repetition‑aware decoding, and evaluate robustness under streaming or real‑time inference scenarios.
In summary, VideoSTF provides a much‑needed tool for assessing the hidden instability of VideoLLMs, revealing that output repetition is both a performance degradation and a security risk. Incorporating stability‑aware evaluation into the development pipeline will be essential for deploying reliable video‑language systems in real‑world applications.