Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

Notice: This research summary and analysis were automatically generated using AI. For complete accuracy, please refer to the original arXiv source.

Open-world video recognition is challenging because traditional networks do not generalize well to complex environmental variations. In contrast, foundation models with rich knowledge have recently demonstrated strong generalization power. However, how to apply such knowledge to open-world video recognition has not been fully explored. To this end, we propose a generic knowledge transfer pipeline that progressively exploits and integrates external multimodal knowledge from foundation models to boost open-world video recognition. We name it PCA, after its three stages: Percept, Chat, and Adapt. First, the Percept stage reduces the video domain gap and obtains external visual knowledge. Second, the Chat stage generates rich linguistic semantics as external textual knowledge. Finally, the Adapt stage blends this external multimodal knowledge into the network by inserting multimodal knowledge adaptation modules. We conduct extensive experiments on three challenging open-world video benchmarks, i.e., TinyVIRAT, ARID, and QV-Pipe. Our approach achieves state-of-the-art performance on all three datasets.


💡 Research Summary

The paper tackles the longstanding problem that conventional video recognition models, trained on curated datasets, fail to generalize to the highly variable conditions encountered in open‑world scenarios such as low resolution, poor illumination, and unusual scene compositions. Recent foundation models—large language models (LLMs), vision‑language models, and video‑foundation models—contain rich, broad‑domain knowledge, yet there is no systematic method to transfer that knowledge to video recognition tasks. To fill this gap, the authors propose a three‑stage pipeline called PCA (Percept‑Chat‑Adapt) that progressively extracts, refines, and injects multimodal external knowledge into any video backbone.

Percept – The first stage reduces the domain gap of the raw video. Low‑level visual enhancement modules (e.g., RealBasicVSR for super‑resolution, gamma‑correction for dark videos, Segment‑Anything for semantic masking) are applied to obtain an “enhanced” video e_V. High‑level video foundation models (UniFormer, CLIP‑ViT, CLIP‑3D, etc.) then process e_V to produce visual features F_V and class confidence scores S. This step ensures that the subsequent stages work on a representation that already mitigates resolution or illumination deficiencies.
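The Percept stage described above can be sketched as a simple pipeline: enhance the raw clip, then pass it through a frozen foundation backbone to obtain features F_V and class confidences S. This is a minimal illustration, not the paper's implementation; `enhance`, `backbone`, and `classifier_w` are hypothetical stand-ins for modules such as RealBasicVSR / gamma correction and UniFormer / CLIP.

```python
import numpy as np

def percept(video, enhance, backbone, classifier_w):
    """Percept stage sketch: reduce the domain gap, then extract
    visual knowledge. All module names here are illustrative."""
    e_v = enhance(video)                     # enhanced video e_V
    f_v = backbone(e_v)                      # visual features F_V, shape (d,)
    logits = classifier_w @ f_v              # per-class logits
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    s = exp / exp.sum()                      # class confidence scores S
    return e_v, f_v, s

# Toy usage: gamma correction as the low-level enhancer for a dark clip,
# mean pooling as a stand-in for a frozen video backbone.
rng = np.random.default_rng(0)
video = rng.uniform(0.0, 0.2, size=(8, 16, 16, 3))       # dark clip in [0, 1]
gamma = lambda v: np.power(v, 1 / 2.2)                    # brightens dark input
pool = lambda v: v.reshape(v.shape[0], -1).mean(axis=0)   # (T,H,W,C) -> (d,)
w = rng.normal(size=(5, 16 * 16 * 3))                     # 5-class head

e_v, f_v, s = percept(video, gamma, pool, w)
assert e_v.mean() > video.mean()   # enhancement brightened the dark clip
assert np.isclose(s.sum(), 1.0)    # S is a valid distribution
```

The key design point is that enhancement happens before feature extraction, so the backbone never sees the degraded input directly.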

Chat – The second stage supplies complementary textual semantics. A confidence‑driven switch determines whether to use a prompt‑based LLM (ChatGPT) or a video‑language model (VideoChat). If the maximum confidence in S exceeds a predefined threshold σ, the predicted label(s) are fed as prompts to the LLM, which returns detailed natural‑language explanations T_p (e.g., “obstruction is caused by hard debris”). If the confidence is low, the system falls back to VideoChat, which generates captions T_c for the original video. This dual‑path design avoids unnecessary LLM calls while still providing rich textual cues when visual evidence is ambiguous.
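The confidence-driven switch can be summarized in a few lines. The sketch below is an assumption-laden illustration: `llm_explain` and `video_caption` are hypothetical stand-ins for the ChatGPT and VideoChat calls, and the threshold σ is passed in as `sigma`.

```python
import numpy as np

def chat(s, labels, sigma, llm_explain, video_caption, video=None):
    """Chat stage sketch: if Percept confidence clears the threshold,
    prompt an LLM with the predicted label; otherwise caption the video."""
    if s.max() >= sigma:
        label = labels[int(s.argmax())]
        return llm_explain(label)     # textual knowledge T_p
    return video_caption(video)       # textual knowledge T_c

# Toy usage with dummy text generators (hypothetical outputs).
labels = ["obstruction", "normal"]
explain = lambda lbl: f"{lbl}: typically caused by hard debris"
caption = lambda vid: "a cluttered pipe interior, unclear condition"

t_high = chat(np.array([0.9, 0.1]), labels, 0.5, explain, caption)
t_low = chat(np.array([0.45, 0.55]), labels, 0.7, explain, caption)
assert t_high.startswith("obstruction")   # high confidence -> LLM prompt path
assert t_low.startswith("a cluttered")    # low confidence -> captioning path
```

Because the LLM path is only taken when S is decisive, ambiguous clips get grounded captions instead of explanations conditioned on a possibly wrong label.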

Adapt – The third stage fuses the extracted visual knowledge F_V and textual knowledge (T_p or T_c) into the video backbone via lightweight, plug‑and‑play adapter modules. Three adapter families are introduced: (1) a visual adapter that injects F_V into the self‑attention layers of a vision transformer, (2) a language adapter that maps the textual embeddings into the backbone's multi‑head attention, and (3) a multimodal adapter that creates cross‑attention links between the visual and textual streams. Only the adapters are trained while the backbone parameters remain frozen, yielding a parameter‑efficient fine‑tuning scheme (adapter parameters < 0.5 % of total model size).
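The adapter idea can be illustrated with a generic residual bottleneck that blends textual knowledge into frozen visual features. This is a sketch of the general technique, not the paper's exact architecture; the shapes, zero initialization, and fusion-by-concatenation are assumptions.

```python
import numpy as np

class MultimodalAdapter:
    """Adapt stage sketch: a trainable bottleneck (down-project, ReLU,
    up-project) added as a residual on top of frozen backbone features.
    The up-projection is zero-initialized, so the adapter is a no-op
    before training and only the small adapter weights are learned."""
    def __init__(self, d, r, rng):
        self.down = rng.normal(0, 0.02, size=(r, 2 * d))  # trainable, tiny
        self.up = np.zeros((d, r))                        # zero-init residual

    def __call__(self, f_v, f_t):
        z = np.concatenate([f_v, f_t])       # fuse visual + textual knowledge
        h = np.maximum(self.down @ z, 0.0)   # bottleneck + ReLU
        return f_v + self.up @ h             # residual; backbone stays frozen

rng = np.random.default_rng(0)
d, r = 32, 4
adapter = MultimodalAdapter(d, r, rng)
f_v = rng.normal(size=d)                     # frozen visual features F_V
f_t = rng.normal(size=d)                     # embedded text (T_p or T_c)

out = adapter(f_v, f_t)
assert np.allclose(out, f_v)   # zero-init: identical to the frozen backbone at start
```

The zero-initialized up-projection is a common trick in adapter-style tuning: training starts from the frozen backbone's behavior and gradually learns how much external knowledge to mix in.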

The authors evaluate PCA on three challenging open‑world video benchmarks: TinyVIRAT (low‑resolution action recognition), ARID (dark‑light action recognition), and QV‑Pipe (industrial pipeline anomaly detection). For each dataset they select strong baseline backbones (CNN, Transformer, or hybrid) and apply the full PCA pipeline. Results show consistent improvements over state‑of‑the‑art methods, with absolute accuracy gains ranging from 3.2 % to 5.8 %. Ablation studies confirm that each stage contributes uniquely: Percept reduces domain mismatch, Chat supplies discriminative textual cues, and Adapt efficiently merges the modalities. Moreover, adapter‑only fine‑tuning requires roughly 30 % less training time than full‑model fine‑tuning.

Key insights include: (i) early visual domain adaptation dramatically amplifies the usefulness of later multimodal knowledge; (ii) a confidence‑controlled switch between prompt‑based explanations and caption generation balances computational cost and semantic richness; (iii) lightweight adapters can serve as universal “knowledge bridges” that allow any video backbone to benefit from foundation‑model priors without extensive retraining.

The paper also discusses limitations: the Percept stage relies on manually chosen enhancement modules that may need dataset‑specific tuning; the threshold σ is a hyper‑parameter that heavily influences performance and could benefit from automatic optimization; and adapter placement and architecture still require careful design for each backbone. Future work is suggested on meta‑learning for automatic selection of visual enhancers, dynamic threshold learning, and neural architecture search for universally optimal adapters.

In summary, PCA offers a practical, modular framework that leverages the wealth of knowledge embedded in large multimodal foundation models to substantially improve open‑world video recognition, establishing a new paradigm for knowledge transfer in video AI.

