When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Large Multimodal Models (LMMs) store vast amounts of pretrained knowledge but struggle to stay aligned with real-world updates, and acquiring evolving knowledge often degrades their existing capabilities. Furthermore, most current work focuses on static textual knowledge injection and neglects dynamic multimodal evolving knowledge, leaving the potential of LMMs for multimodal knowledge injection an open question. To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs’ ability to absorb multimodal evolving knowledge. MMEVOKE contains 9,422 samples spanning 159 subtypes. Then, through extensive experiments with MMEVOKE, we reveal challenges in existing knowledge injection methods: poor injection performance, exposed by knowledge injection tests, and capability degradation, exposed by general capability tests. Finally, to tackle these challenges, we explore knowledge augmentation and knowledge retention methods, finding that knowledge-aware augmentation strengthens injection performance, while Data Replay and MoE methods effectively mitigate capability degradation.
💡 Research Summary
This paper addresses a critical challenge facing Large Multimodal Models (LMMs): their inherently static nature makes it difficult for them to acquire and retain knowledge that evolves over time, leading to obsolescence and inaccuracies. While most existing work focuses on injecting static textual knowledge, the problem of dynamic, evolving multimodal knowledge remains largely unexplored. To bridge this gap, the authors make three primary contributions: a novel benchmark, a comprehensive evaluation revealing key challenges, and explorations of potential solutions.
First, the paper introduces MMEVOKE (Multi-Modal Evolving Knowledge Evaluation), a benchmark specifically designed to assess LMMs’ ability to handle evolving multimodal knowledge. The authors develop a reproducible pipeline that automatically collects timely and popular “News” and “Entity” information from authoritative sources like CNN and Wikipedia starting from 2024. This data is processed into a structured format consisting of injection data (image, heuristic query, knowledge summary) for teaching the model and evaluation data (query image, question, ground truth) for testing. The final benchmark comprises 9,422 evolving knowledge samples spanning 159 fine-grained subfields, ensuring diversity and real-world relevance.
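The injection/evaluation split described above can be sketched as two small record types. This is a hypothetical illustration of the sample layout; the field names are assumptions, not MMEVOKE's actual schema.

```python
from dataclasses import dataclass

@dataclass
class InjectionSample:
    """One unit of evolving knowledge used to teach the model."""
    image_path: str         # image depicting the new entity or event
    heuristic_query: str    # query paired with the image during injection
    knowledge_summary: str  # textual summary of the evolving fact

@dataclass
class EvaluationSample:
    """One unit used to test whether the knowledge was absorbed."""
    query_image_path: str   # image shown at evaluation time
    question: str           # question probing the injected knowledge
    ground_truth: str       # expected answer

# Illustrative instance (content invented for the sketch):
sample = InjectionSample(
    image_path="news/2024_event.jpg",
    heuristic_query="What is shown in this image?",
    knowledge_summary="A summary of a 2024 news event.",
)
```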
Second, using MMEVOKE, the paper conducts extensive experiments to evaluate existing knowledge injection paradigms. The “knowledge injection tests” apply various methods—including Supervised Fine-Tuning (Full FT and LoRA), Multimodal Retrieval-Augmented Generation (MM-RAG), commercial AI web search engines (Gemini, Perplexity, GPT-4), and providing sufficient context—to base LMMs like LLaVA-v1.5 and Qwen-VL-Chat. The results reveal significant limitations: (1) Poor knowledge adaptation performance across all methods, indicating that LMMs struggle to learn and apply new multimodal facts. (2) Surprisingly, even when all necessary knowledge is provided in the context (“sufficient context”), model performance remains imperfect, suggesting a fundamental comprehension bottleneck. Furthermore, “general capability tests” conducted after knowledge injection show (3) severe “capability degradation” across seven different skill dimensions (e.g., instruction following, reasoning, coding). (4) This degradation follows a consistent ranking of severity, and (5) a collapse in instruction-following ability leads to cascading failures in other capabilities, highlighting the fragility of LMMs during knowledge updates.
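Among the injection paradigms tested, LoRA is the parameter-efficient one: rather than updating a full weight matrix, it learns a low-rank correction. The NumPy sketch below shows the standard LoRA forward pass in isolation; it follows the usual formulation (scaling by alpha/r, zero-initialized up-projection) and is an illustration, not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size, LoRA rank (r << d)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init
alpha = 4.0                          # LoRA scaling hyperparameter

def lora_forward(x):
    # Frozen path plus scaled low-rank correction: x W^T + (alpha/r) x A^T B^T.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
# With B zero-initialized, the adapted model starts identical to the base model;
# training then updates only A and B (2*r*d parameters instead of d*d).
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero-initialization of `B` is what makes injection safe to start: the model's behavior is unchanged until the adapter is actually trained on the new knowledge.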
Finally, the paper explores preliminary methods to tackle these identified challenges. It distinguishes between generic data augmentation and “knowledge-aware augmentation,” demonstrating that the latter can strengthen knowledge adaptation while also slightly mitigating capability degradation. To address catastrophic forgetting more directly, the authors test several continual learning techniques. They find that direct “knowledge rehearsal” methods such as Data Replay and structured parameter-isolation methods such as MoE (Mixture of Experts) are effective at preserving original capabilities. In contrast, indirect regularization-based methods (e.g., EWC, LwF) yield unstable results and can sometimes worsen degradation.
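The core of Data Replay is simple: each training batch mixes new evolving-knowledge samples with samples replayed from the model's original training distribution, so old capabilities are rehearsed while new facts are learned. The sketch below illustrates this batching idea; the `replay_ratio` value and function name are illustrative assumptions, not the paper's settings.

```python
import random

def make_replay_batch(new_samples, replay_buffer, batch_size=8, replay_ratio=0.25):
    """Mix new-knowledge samples with replayed original-distribution samples."""
    n_replay = int(batch_size * replay_ratio)     # rehearsal slots per batch
    n_new = batch_size - n_replay                 # new-knowledge slots per batch
    batch = random.sample(new_samples, min(n_new, len(new_samples)))
    batch += random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    random.shuffle(batch)                         # interleave old and new
    return batch

# Toy data standing in for evolving-knowledge and original training samples.
new = [f"new_{i}" for i in range(100)]
old = [f"old_{i}" for i in range(100)]
batch = make_replay_batch(new, old)               # 6 new + 2 replayed samples
```

Regularization methods like EWC instead penalize drift on weights deemed important for old tasks; the paper's finding is that this indirect pressure is less reliable than explicitly rehearsing old data as above.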
In conclusion, this work successfully frames the problem of evolving knowledge injection for LMMs, provides a much-needed benchmark (MMEVOKE) for standardized evaluation, empirically uncovers the dual challenges of poor adaptation and severe capability degradation, and points towards promising research directions involving knowledge-aware training and rehearsal-based continual learning techniques. It establishes a foundation for developing LMMs that can remain accurate and capable in a dynamically changing world.