LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Reading time: 5 minutes
...

📝 Original Info

  • Title: LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
  • ArXiv ID: 2512.16891
  • Date: 2025-12-18
  • Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu

📝 Abstract

Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.

💡 Deep Analysis

📄 Full Content

Video recommendation is the primary interface through which viewers discover content at scale [6]. Its inherent need for lightweight backbones, low-latency sequential inference, and rapid response makes it a representative downstream setting for studying foundation-model integration. Despite remarkable advances in representation learning and ranking architectures [16,26,32,34,48], production systems still depend heavily on hand-crafted or pre-extracted labels (e.g., categories, tags, reactions) distilled from raw videos. This label-centric design reduces input dimensionality and eases serving, but it also discards most of the information present in pixels, limits semantic coverage to what was annotated in the training set, and makes personalization brittle in cold-start and long-tail regimes. Moreover, pipelines that summarize videos into text before recommendation inherit a language bottleneck: they constrain the system to preselected features and textual abstractions rather than the full visual token space. In particular, such pipelines struggle to recognize nuanced attributes (e.g., visual humor and narrative pacing), to leverage broader factual or commonsense context, and to adapt quickly to fast-evolving trends without expensive retraining.

Multimodal LLMs have recently demonstrated strong transferability from internet-scale pretraining, aligning visual tokens with natural language and encoding broad world knowledge [1,19,21,30]. In video, joint training on raw frames and transcripts unlocks temporally grounded semantics and cross-modal reasoning [2,25]. These models are promptable, so natural language can steer them to extract task-specific attributes and abstractions on demand without modifying the backbone. Transformer layers also specialize by depth: later layers aggregate global context and higher-level semantics, while earlier layers emphasize local patterns. This suggests an opportunity to link world knowledge and pixel-level cues from a video LLM directly to a recommender, without first reducing videos to text summaries, while adaptively selecting the semantic granularity that best serves each ranking decision. Adapting VLLMs to video recommendation is therefore a promising way to move beyond hand-crafted or pre-extracted label-centric pipelines and to bring world-knowledge-aware visual understanding and possible reasoning into recommendation.

Although many recent efforts have built video large language models (VLLMs) with world knowledge and reasoning that benefit tasks such as video content understanding, visual question answering, and video world modeling, the scope of these downstream uses remains narrow. They mainly target settings that can tolerate the current limitations of VLLMs. When a task requires multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response, these limitations make straightforward pipelines like those in Fig. 1(c-e) impractical.

Video recommendation is therefore a representative task for analyzing this problem. It must keep the pretrained world knowledge inside the VLLM, but its outputs are item indices from a database rather than natural language, so they do not follow the language distribution. Current VLLMs typically rely on text outputs due to the language interface. Some tasks, such as robotics planning, fine-tune VLLMs to produce structured actions, but this often narrows the output distribution and can lead to catastrophic forgetting of the original knowledge. In addition, current VLLMs (and many LLMs) are slow at inference time because of the decode-only transformer architecture: every newly generated token must be fed back as input for the next step, which lengthens the reasoning process and is unsuitable for real-time video recommendation. Typical VLLM interfaces also lack native support for multi-video inputs. A single video can already occupy tens of thousands of tokens, and large token budgets further increase latency and computation, making it difficult to design or post-train VLLMs that accept multiple videos at once. However, video recommendation usually needs to consume a sequence of historical videos per user and to encode a large candidate pool. These challenges have so far limited the use of VLLMs in tasks that simultaneously require multi-video inputs, lightweight models, low-latency sequential inference, and rapid response, while still needing world knowledge for visual reasoning, content understanding, and generalization.

To this end, we introduce LinkedOut, a knowledge-aware, modality-extendable video recommendation framework that links world knowledge out of video pixels through a video LLM. The general structure is shown in Figure 2. To address the language-output constraint, we do not fine-tune the VLLM into the recommendation space, which would risk domain shift and catastrophic forgetting. Instead, we directly extract embeddings from intermediate token representations of the VLLM, since world knowledge
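To make this layer-extraction-and-fusion idea concrete, below is a minimal PyTorch sketch of the general recipe described above: tap hidden states from a few intermediate layers of a frozen video LLM and fuse their pooled representations with a small learned gate (a cross-layer mixture-of-experts) into a single item embedding for a downstream recommender. The tapped layer indices, dimensions, and the `extract_layer_states` helper are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of the LinkedOut-style recipe: pull hidden states from several
# intermediate layers of a frozen video LLM and fuse them with a learned gate
# (a cross-layer mixture-of-experts) into a fixed-size embedding for a recommender.
# Layer choices, dimensions, and the helper below are illustrative, not the paper's config.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossLayerFusionMoE(nn.Module):
    """Softly select among per-layer VLLM representations with a learned gate."""

    def __init__(self, hidden_dim: int, num_layers: int, out_dim: int):
        super().__init__()
        # One lightweight "expert" projection per tapped VLLM layer.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(num_layers)]
        )
        # Gate scores each layer's pooled representation; weights sum to 1.
        self.gate = nn.Linear(hidden_dim * num_layers, num_layers)

    def forward(self, layer_states: list[torch.Tensor]) -> torch.Tensor:
        # layer_states: one [batch, seq_len, hidden_dim] tensor per tapped layer.
        pooled = [h.mean(dim=1) for h in layer_states]            # [batch, hidden_dim] each
        gate_logits = self.gate(torch.cat(pooled, dim=-1))        # [batch, num_layers]
        weights = F.softmax(gate_logits, dim=-1)
        expert_out = torch.stack(
            [proj(p) for proj, p in zip(self.experts, pooled)], dim=1
        )                                                         # [batch, num_layers, out_dim]
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)    # [batch, out_dim]


@torch.no_grad()
def extract_layer_states(vllm, frame_tokens: torch.Tensor, tap_layers=(8, 16, 24)):
    """Run the frozen VLLM once per video and keep a few intermediate hidden states.

    `vllm` is assumed to be a transformer that returns all hidden states when
    called with output_hidden_states=True (as HuggingFace-style models do).
    """
    outputs = vllm(inputs_embeds=frame_tokens, output_hidden_states=True)
    return [outputs.hidden_states[i] for i in tap_layers]
```

In this sketch the VLLM runs once per video with no autoregressive decoding, so the recommender only ever sees compact fused embeddings, which is what makes sequential, low-latency ranking over multi-video histories plausible.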

