CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.


💡 Research Summary

The paper addresses a critical gap in video retrieval for long‑form, procedural narratives such as cooking or assembly tutorials. Traditional text‑to‑video retrieval methods treat each clip as an independent embedding and rank candidates solely by semantic similarity to the query text. This “context‑agnostic” paradigm ignores two essential consistency constraints: (1) identity consistency – the same actor, environment, and visual style should persist across consecutive clips, and (2) state consistency – the visual state must evolve according to the causal order of the procedure (e.g., a tomato cannot be sliced after it has already been plated). As a result, existing systems often retrieve clips that are semantically relevant but temporally incoherent, leading to “state errors” and “identity errors”.

To formalize the problem, the authors introduce Consistent Video Retrieval (CVR) and construct a diagnostic benchmark using three procedural datasets: YouCook2, COIN, and CrossTask. For each query step they create a fixed‑size candidate pool consisting of the ground‑truth clip; hard negatives that either share the visual identity but show an incorrect procedural state (state negatives) or match the procedural semantics but come from a different visual identity (identity negatives); plus easy random negatives. This 1‑vs‑9 setting forces a model to satisfy both semantic relevance and the two consistency constraints, and performance is measured by Recall@K within the pool.
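
The pool-based metric is straightforward to compute. The sketch below is illustrative (the similarity values and pool composition are made up, not from the paper): given similarity scores between a predicted embedding and the 10 candidates, Recall@K asks whether the ground-truth clip lands in the top K.

```python
import numpy as np

def recall_at_k(sim, gt_index, k):
    """Recall@K within a single candidate pool: 1 if the ground-truth
    clip is among the top-K ranked candidates, else 0."""
    topk = np.argsort(-sim)[:k]  # indices sorted by descending similarity
    return int(gt_index in topk)

# Hypothetical 1-vs-9 pool: 1 positive + 9 negatives (values invented).
sim = np.array([0.81, 0.62, 0.75, 0.55, 0.90, 0.40, 0.33, 0.70, 0.52, 0.48])
gt = 0  # index of the ground-truth continuation
print(recall_at_k(sim, gt, 1))  # 0 -- a hard negative (index 4) ranks first
print(recall_at_k(sim, gt, 5))  # 1 -- the ground truth is within the top 5
```

Dataset-level Recall@K is then the mean of this indicator over all query steps.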

The core contribution is CAST (Context‑Aware State Transition), a lightweight trainable adapter that sits on top of any frozen vision‑language backbone (e.g., CLIP). CAST predicts the embedding of the next clip, $\hat{v}_t$, by adding a residual transition vector $\Delta$ to the current state embedding $v_{t-1}$: $\hat{v}_t = \mathrm{L2norm}(v_{t-1} + \Delta(v_{t-1}, q_t, H_t))$. The residual formulation introduces an inductive bias: the anchor embedding preserves identity and static background information, while $\Delta$ focuses on the procedural change required by the query.
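
The residual update can be sketched in a few lines of NumPy. This is a minimal illustration under assumed embedding dimensions, not the paper's implementation; the point is that re-normalizing an anchor plus a small residual keeps the prediction close to the previous state.

```python
import numpy as np

def l2norm(x):
    """Project a vector onto the unit sphere."""
    return x / np.linalg.norm(x)

def predict_next_state(v_prev, delta):
    """CAST-style residual update: add the predicted transition vector to
    the previous clip embedding, then re-normalize."""
    return l2norm(v_prev + delta)

rng = np.random.default_rng(0)
v_prev = l2norm(rng.standard_normal(512))       # frozen-backbone clip embedding
delta = 0.2 * l2norm(rng.standard_normal(512))  # small residual from Delta(...)
v_hat = predict_next_state(v_prev, delta)

# The anchor dominates: the prediction stays near the previous state,
# preserving identity, while delta encodes the procedural change.
print(float(v_hat @ v_prev))
```

Because the residual is small relative to the unit-norm anchor, the cosine similarity between $\hat{v}_t$ and $v_{t-1}$ stays high, which is exactly the identity-preserving bias the paragraph describes.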

$\Delta$ is decomposed into two complementary pathways: (1) a conditioned transition branch that concatenates the text embedding $f_t(q_t)$ with the current visual state $v_{t-1}$ and passes them through an MLP, thereby directly modeling how the current scene should evolve under the given instruction; (2) a temporal context attention branch that uses the query embedding as the attention query in a multi‑head cross‑attention over the visual history $H_t$. The attention output is projected by another MLP to produce a context‑aware adjustment. The sum of these two vectors yields the final $\Delta$.
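
A minimal sketch of the two pathways follows, assuming an illustrative embedding dimension of 64, randomly initialized weights (trained in practice), and single-head attention in place of the paper's multi-head form; all names here are placeholders, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # embedding dim (illustrative; the paper's sizes are not given here)

def mlp(x, W1, W2):
    """Two-layer MLP with ReLU."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def cross_attention(query, history, Wq, Wk, Wv):
    """Single-head cross-attention: the text query attends over the
    visual history H_t (a simplification of multi-head attention)."""
    q = Wq @ query                       # (d,)
    K = history @ Wk.T                   # (T, d)
    V = history @ Wv.T                   # (T, d)
    scores = K @ q / np.sqrt(d)          # (T,)
    attn = np.exp(scores - scores.max()) # numerically stable softmax
    attn /= attn.sum()
    return attn @ V                      # (d,)

# Hypothetical frozen-backbone embeddings.
q_t = rng.standard_normal(d)             # text embedding f_t(q_t)
v_prev = rng.standard_normal(d)          # current visual state v_{t-1}
H_t = rng.standard_normal((5, d))        # visual history of 5 past clips

# Randomly initialized adapter weights (learned during training).
W1, W2 = rng.standard_normal((d, 2 * d)), rng.standard_normal((d, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W3, W4 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Pathway 1: conditioned transition -- MLP over [f_t(q_t); v_{t-1}].
delta_cond = mlp(np.concatenate([q_t, v_prev]), W1, W2)
# Pathway 2: temporal context attention, projected by another MLP.
delta_ctx = mlp(cross_attention(q_t, H_t, Wq, Wk, Wv), W3, W4)

delta = delta_cond + delta_ctx           # final residual Delta
print(delta.shape)
```

The two vectors live in the same embedding space, so summing them is well-defined; the conditioned branch supplies the instruction-driven change while the attention branch corrects it with history context.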

Training freezes the backbone and optimizes only the adapter using a type‑aware contrastive loss. For each query, the loss encourages the predicted embedding $\hat{v}_t$ to be close to the ground‑truth continuation $v^{+}_t$ while pushing away state negatives, identity negatives, and easy negatives, all measured with temperature‑scaled cosine similarity. This loss explicitly penalizes violations of both consistency dimensions.
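
An InfoNCE-style sketch of such a loss is shown below. The per-type weighting scheme and temperature value are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def type_aware_infonce(v_hat, v_pos, negs, weights, tau=0.07):
    """Contrastive loss sketch: pull v_hat toward the true continuation and
    push it away from each negative, with per-type weights (e.g., state,
    identity, easy). Weighting and tau are illustrative assumptions."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(v_hat, v_pos) / tau)
    neg = sum(w * np.exp(cos(v_hat, n) / tau) for n, w in zip(negs, weights))
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(1)
v_pos = rng.standard_normal(128)                      # ground-truth continuation
negs = [rng.standard_normal(128) for _ in range(3)]   # state / identity / easy
weights = [2.0, 2.0, 1.0]  # hypothetical: up-weight the hard negative types

good = type_aware_infonce(v_pos + 0.05 * rng.standard_normal(128),
                          v_pos, negs, weights)
bad = type_aware_infonce(negs[0], v_pos, negs, weights)
print(good < bad)  # a prediction near the true continuation incurs lower loss
```

Up-weighting the state and identity negatives is one simple way to make the loss "type-aware": mistakes on the consistency-violating candidates cost more than mistakes on random ones.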

Extensive experiments demonstrate that CAST consistently outperforms zero‑shot baselines and prior fine‑tuned models across all three datasets. On YouCook2 and CrossTask, CAST improves Recall@1/5/10 by 4–7 percentage points, particularly excelling at discriminating state negatives. On COIN, performance remains competitive, indicating robustness to different procedural domains. Moreover, the authors show that CAST’s predicted transition score can be used as a reranking signal for black‑box video generation systems (e.g., Veo). Human evaluations reveal that reranked generations exhibit significantly higher temporal coherence, confirming CAST’s utility beyond retrieval.

In summary, CAST offers a simple yet powerful mechanism to inject temporal state reasoning into video retrieval without requiring explicit state annotations or complex sequence models. Its plug‑and‑play nature makes it applicable to any pre‑trained vision‑language model, and the introduced CVR benchmark provides a rigorous testbed for future work on consistency‑aware video understanding. The approach opens avenues for more coherent storyboarding, instructional video assembly, and even robot action planning where maintaining visual and procedural continuity is essential.

