Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.


💡 Research Summary

The paper introduces Test‑time Mixture of World Models (TMoW), a novel extension of the Mixture‑of‑Experts (MoE) paradigm designed for language‑model‑based embodied agents operating in dynamic environments. Traditional MoE systems train a routing function that remains fixed after deployment, which makes adaptation to unseen domains costly, requiring full retraining or expensive knowledge distillation. TMoW addresses this limitation by allowing the routing function to be updated at test time, thereby enabling agents to recombine existing world‑model experts and incorporate newly created ones without retraining the entire system.

Three core mechanisms constitute TMoW:

  1. Multi‑granular prototype‑based routing – Observations and instructions are encoded as hierarchical graphs. A layer‑wise Message Passing Neural Network (MPNN) extracts features ranging from local object attributes in early layers to global scene structures in deeper layers. For each world‑model expert, a prototype vector is computed per layer. At inference, the current input embedding is compared to each prototype using cosine similarity, producing routing scores. Only the top‑K scores are kept, normalized with a softmax and a temperature parameter, and used to weight the corresponding lightweight adapters (e.g., LoRA) that implement each world model. This hierarchical routing lets the system exploit partial similarities across domains, such as shared object affordances, even when overall scene semantics differ.

  2. Test‑time prototype refinement – When the agent encounters an unseen domain, the router refines the prototypes on‑the‑fly. The refinement updates each prototype by interpolating between its current value and a weighted sum of other prototypes, where the interpolation weight α is modulated by the similarity between the current input embedding and the prototype. This process expands the prototype space around frequently observed features, preserving core knowledge while adapting to new domain traits without any gradient‑based back‑propagation on the base model.

  3. Distilled mixture‑based model augmentation – For domains that differ substantially from any seen during training, TMoW can construct a brand‑new world‑model expert from a few demonstration trajectories. The method distills knowledge from the mixture of existing experts, using their combined outputs as soft targets while training a new lightweight adapter on the few‑shot data. The newly distilled expert is then added to the prototype pool, making it immediately available for routing decisions.
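The routing step described in item 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the averaging of per-layer cosine similarities into a single expert score, and the specific parameter values are all assumptions made for clarity.

```python
import numpy as np

def route(layer_embeddings, prototypes, top_k=2, temperature=0.5):
    """Score each expert by the average cosine similarity between the
    input's layer-wise embeddings and that expert's per-layer prototypes,
    then keep only the top-K experts and softmax their scores."""
    scores = []
    for expert_protos in prototypes:          # one prototype per MPNN layer
        sims = [
            float(np.dot(e, p) / (np.linalg.norm(e) * np.linalg.norm(p)))
            for e, p in zip(layer_embeddings, expert_protos)
        ]
        scores.append(np.mean(sims))          # aggregate across granularities
    scores = np.array(scores)
    top = np.argsort(scores)[-top_k:]         # indices of the top-K experts
    logits = scores[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax over the kept scores
    return dict(zip(top.tolist(), weights.tolist()))
```

The returned weights would then scale the outputs of the corresponding lightweight adapters (e.g., LoRA modules) when composing the mixture.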
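Item 2, the test-time refinement, can be illustrated with the gradient-free interpolation below. The exact update rule (clipping similarities at zero, the form of the step size) is a hypothetical reading of the description, not the paper's equation.

```python
import numpy as np

def refine_prototypes(prototypes, x, alpha=0.1):
    """Test-time refinement sketch: nudge each prototype toward a
    similarity-weighted combination of the other prototypes, scaling the
    step by how similar the current input x is to that prototype."""
    P = np.stack(prototypes)

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    refined = []
    for i, p in enumerate(P):
        # the interpolation weight grows with input-prototype similarity,
        # so frequently matched prototypes adapt the most
        step = alpha * max(cos(x, p), 0.0)
        others = [j for j in range(len(P)) if j != i]
        w = np.array([max(cos(p, P[j]), 0.0) for j in others])
        if w.sum() == 0:
            refined.append(p)                 # no related prototypes: keep as-is
            continue
        w /= w.sum()
        target = np.sum(w[:, None] * P[others], axis=0)
        refined.append((1 - step) * p + step * target)
    return refined
```

Because only the prototype vectors move, the base model and adapters stay frozen, which matches the paper's claim that no gradient-based back-propagation is needed at test time.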
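Item 3 combines supervised learning on the few-shot trajectories with soft targets from the existing mixture. A common way to write such an objective is a weighted sum of cross-entropy and KL divergence; the loss below is a generic distillation sketch under that assumption, with `lam` as a hypothetical mixing coefficient.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, mixture_logits, labels, lam=0.5):
    """Sketch of a distillation objective: cross-entropy on the few-shot
    labels plus KL divergence toward the mixture's soft targets."""
    p_student = softmax(student_logits)
    p_teacher = softmax(mixture_logits)       # combined outputs of existing experts
    n = len(labels)
    ce = -np.mean(np.log(p_student[np.arange(n), labels] + 1e-12))
    kl = np.mean(np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)),
        axis=-1))
    return (1 - lam) * ce + lam * kl
```

Minimizing this loss over a new lightweight adapter's parameters would yield the distilled expert, whose prototypes are then added to the pool for routing.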

The authors evaluate TMoW on three widely used embodied‑agent benchmarks—VirtualHome, ALFWorld, and RLBench—as well as on real‑world robotic tasks. In zero‑shot adaptation, TMoW outperforms the strong prior method SayCanPay by 27.21% in task success rate. In few‑shot expansion scenarios, it yields a 25.66% improvement over baselines that lack dynamic routing or augmentation. Visualizations of routing‑score heatmaps show that prototype refinement substantially reshapes the routing distribution for unseen domains, confirming the effectiveness of test‑time adaptation.

Overall, TMoW contributes a practical solution for continual learning in embodied agents: (i) it breaks the rigidity of fixed MoE routers by updating the routing function at inference time, (ii) it leverages hierarchical, multi‑granular prototypes to capture domain semantics at multiple abstraction levels, and (iii) it enables data‑efficient expansion through distilled, few‑shot world‑model creation. These advances collectively allow agents to maintain robust planning and action execution as environments evolve, marking a significant step toward deployable, adaptable embodied AI systems.

