Less is More: Label-Guided Summarization of Procedural and Instructional Videos
Video summarization helps turn long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from using basic visual features like color, motion, and structural changes to using pre-trained vision-language models that can better understand what’s happening in the video (semantics) and capture temporal flow, resulting in more context-aware video summarization. We propose a three-stage framework, PRISM (Procedural Representation via Integrated Semantic and Multimodal analysis), that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate our method on instructional and activity datasets, using reference summaries for instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% of the semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance in both semantic alignment and precision.
💡 Research Summary
The paper introduces PRISM (Procedural Representation via Integrated Semantic and Multimodal analysis), a three‑stage, zero‑shot video summarization framework designed for procedural and instructional videos. PRISM tackles two persistent challenges in video summarization: the high computational cost of processing thousands of frames in long videos, and the difficulty of preserving semantic coherence and procedural flow while eliminating redundancy and hallucinated content.
Stage 1 – Adaptive Visual Sampling: The raw video is first sampled at a modest rate (e.g., 1 fps) and each frame is encoded with a ResNet‑18 backbone to obtain 512‑dimensional embeddings. A change‑point detection algorithm (Pruned Exact Linear Time, PELT) identifies statistically significant visual transitions, producing an initial set of candidate frames. The video is then segmented between change points; for each segment the median Euclidean distance among embeddings is computed. Segments with median distance below a threshold δ = 0.30 retain a single representative frame, while those above retain two. This adaptive strategy reduces the total number of processed frames to less than 5 % of the original while keeping visually diverse content.
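The segment-level selection rule above can be sketched in a few lines. This is a toy stand-in: the paper obtains segments from PELT change points over ResNet-18 embeddings, whereas here the segments and embeddings are given directly, and the function names (`select_keyframes`, `median_pairwise_distance`) are illustrative, not from the paper.

```python
import math

DELTA = 0.30  # median-distance threshold delta from the paper

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def median_pairwise_distance(embeddings):
    """Median of all pairwise Euclidean distances within one segment."""
    dists = sorted(euclidean(embeddings[i], embeddings[j])
                   for i in range(len(embeddings))
                   for j in range(i + 1, len(embeddings)))
    n = len(dists)
    if n == 0:
        return 0.0
    mid = n // 2
    return dists[mid] if n % 2 else (dists[mid - 1] + dists[mid]) / 2

def select_keyframes(segments):
    """Keep one frame for visually uniform segments, two for diverse ones.

    `segments` is a list of (frame_indices, embeddings) pairs, one per
    change-point segment; the paper derives these with PELT (omitted here).
    """
    kept = []
    for frame_ids, embs in segments:
        if median_pairwise_distance(embs) < DELTA:
            kept.append(frame_ids[len(frame_ids) // 2])  # one representative
        else:
            kept.extend([frame_ids[0], frame_ids[-1]])   # two endpoints
    return kept
```

A uniform segment (tiny embedding drift) thus contributes one frame, while a segment spanning a visible transition contributes two.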
Stage 2 – Label Generation and Semantic Anchoring: Each sampled frame is fed to a vision‑language model (e.g., BLIP, CLIP, BioMedCLIP) to generate a detailed caption. Captions are abstracted into concise procedural labels such as “sprinkling shredded cabbage” or “adding oil to pan.” To filter out vague, irrelevant, or hallucinated labels, a large language model (GPT‑4) validates each label, discarding those that lack procedural significance. Both labels and frames are embedded into a shared multimodal space; cosine similarity is computed and only pairs exceeding a high threshold (≥ 0.9) are kept. This step removes noisy frames (black screens, transition effects) and ensures that each retained frame is strongly aligned with a meaningful procedural label.
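The high-threshold alignment filter at the end of Stage 2 can be sketched as below. The embeddings would come from a shared multimodal space (e.g., CLIP); here they are passed in precomputed, and `filter_aligned_pairs` is an illustrative name, not the paper's API.

```python
import math

SIM_THRESHOLD = 0.9  # cosine-similarity cutoff from the paper

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_aligned_pairs(pairs):
    """Keep only labels whose frame embedding strongly agrees with the
    label embedding; weakly aligned pairs (black screens, transitions,
    hallucinated captions) fall below the threshold and are dropped."""
    return [(label, cosine(frame_emb, label_emb))
            for frame_emb, label_emb, label in pairs
            if cosine(frame_emb, label_emb) >= SIM_THRESHOLD]
```

With real CLIP-style embeddings a caption like “adding oil to pan” scores high against its frame, while a black screen aligns with nothing and is filtered out.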
Stage 3 – Contextual Validation and Summary Construction: Validated label‑frame pairs are ordered temporally. A large language model then synthesizes concise textual summaries for each segment, checking for consistency across labels and merging redundant descriptions. The final output consists of a set of keyframes together with their associated natural‑language annotations, forming a multimodal summary that captures both visual highlights and procedural semantics.
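Stage 3's ordering-and-merging step can be sketched as follows. This is a simplified stand-in: the paper uses an LLM to check consistency and merge redundant descriptions, while here redundancy is approximated as identical consecutive labels.

```python
def build_summary(validated):
    """Order validated (timestamp, frame_id, label) triples temporally
    and collapse runs of identical labels into one entry, yielding the
    keyframe + annotation pairs that form the multimodal summary."""
    ordered = sorted(validated, key=lambda triple: triple[0])
    summary = []
    for _ts, frame_id, label in ordered:
        if summary and summary[-1][1] == label:
            continue  # redundant consecutive description, skip
        summary.append((frame_id, label))
    return summary
```

Consecutive frames that received the same procedural label are thus represented once, keeping the summary concise while preserving temporal order.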
The authors evaluate PRISM on standard keyframe selection benchmarks (TVSum, SumMe) using Kendall’s Tau and Spearman’s Rho, and on dense video captioning datasets (YouCook2, ActivityNet Captions) using BLEU, ROUGE‑L, METEOR, and BERTScore. PRISM achieves up to 33 % relative improvement in METEOR on YouCook2 and 17.9 % on ActivityNet Captions, while retaining 84 % of the original semantic content despite processing fewer than 5 % of frames. The method also outperforms recent vision‑language baselines such as CLIP‑It, Cap2Sum, and LMSKE, demonstrating that a bottom‑up, label‑driven approach can surpass top‑down methods that rely on global video captions or predefined queries.
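For reference, the rank-correlation metric used on TVSum/SumMe can be computed as below. This is a minimal tau-a variant (no tie correction) for illustration only; the benchmarks' standard protocols typically use tie-aware implementations such as `scipy.stats.kendalltau`.

```python
def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length importance-score lists:
    (concordant - discordant) pairs over all n*(n-1)/2 pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1.0 means the predicted frame-importance ranking matches the human reference exactly; -1.0 means it is fully reversed.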
A notable contribution is the fully zero‑shot nature of PRISM: it requires no external annotations, supervised fine‑tuning, or pre‑trained task‑specific queries. All procedural labels are generated on‑the‑fly from the video itself, validated by an LLM, and used as anchors for frame selection. This makes PRISM readily applicable to domains where labeled data are scarce or expensive, such as medical surgery videos. The authors provide preliminary experiments on medical video summarization, suggesting that the framework can preserve critical procedural steps while drastically reducing review time.
In summary, PRISM integrates (1) adaptive change‑point‑driven sampling to cut computational load, (2) vision‑language captioning coupled with LLM validation to produce high‑quality procedural labels, and (3) high‑confidence label‑frame alignment plus LLM‑based textual refinement to generate coherent, concise summaries. The results substantiate the “less is more” hypothesis: using far fewer frames yields summaries that are both semantically rich and procedurally accurate, offering a scalable, domain‑agnostic solution for modern video summarization challenges.