Optimus-3: Dual-Router Aligned Mixture-of-Experts Agent with Dual-Granularity Reasoning-Aware Policy Optimization

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Developing generalist agents capable of solving open-ended tasks in visually rich, dynamic environments remains a core pursuit of embodied AI. While Minecraft has emerged as a compelling benchmark, existing agents often suffer from fragmented cognitive abilities, lacking the synergy between reflexive execution (System 1) and deliberative reasoning (System 2). In this paper, we introduce Optimus-3, a generalist agent that organically integrates these dual capabilities within a unified framework. To achieve this, we address three fundamental challenges. First, to overcome the scarcity of reasoning data, we propose a Knowledge-Enhanced Automated Data Generation Pipeline. It synthesizes high-quality System 2 reasoning traces from raw System 1 interaction trajectories, effectively mitigating hallucinations via injection of domain knowledge. We release the resulting dataset, OptimusM⁴, to the community. Second, to reconcile the dichotomous computational requirements of the dual systems, we design a Dual-Router Aligned MoE Architecture. It employs a Task Router to prevent task interference via parameter decoupling, and a Layer Router to dynamically modulate reasoning depth, creating a computational "Fast Path" for System 1 and a "Deep Path" for System 2. Third, to activate the reasoning capabilities of System 2, we propose the Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) algorithm. It enforces Process-Outcome Co-Supervision via dual-granularity dense rewards, ensuring consistency between the thought process and the answer. Extensive evaluations demonstrate that Optimus-3 surpasses existing state-of-the-art methods on both System 2 tasks (21% on Planning, 66% on Captioning, 76% on Embodied QA, 3.4× on Grounding, and 18% on Reflection) and System 1 tasks (3% on Long-Horizon Action), with a notable 60% success rate on open-ended tasks.


💡 Research Summary

Optimus‑3 tackles the long‑standing challenge of building a generalist embodied agent that can both react quickly (System 1) and reason deeply (System 2) in visually rich, open‑world environments such as Minecraft. The authors identify three core obstacles: (1) a scarcity of domain‑specific reasoning traces, (2) a computational mismatch between low‑latency action execution and high‑cost deliberative reasoning, and (3) the inability of conventional reinforcement learning to supervise the intermediate thought process.

To address (1), they introduce a Knowledge‑Enhanced Automated Data Generation Pipeline. Raw gameplay trajectories are fed to large multimodal language models, but the generated reasoning is filtered and corrected using a Minecraft‑specific knowledge base (crafting recipes, physics rules, entity existence). This “knowledge injection” dramatically reduces hallucinations and yields high‑quality System 2 traces, which are compiled into the publicly released OptimusM⁴ dataset.
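The knowledge-injection step described above can be pictured as a post-hoc filter over generated reasoning traces. The sketch below is illustrative only: the knowledge-base format, step schema, and function names are assumptions for exposition, not the paper's implementation.

```python
# Hedged sketch of a knowledge-injection filter: each generated reasoning
# step is checked against a tiny Minecraft-style knowledge base; steps that
# mention unknown entities or violate a crafting recipe are dropped.
# All names and the data schema here are illustrative assumptions.

KNOWLEDGE_BASE = {
    "items": {"log", "planks", "stick", "crafting_table", "wooden_pickaxe"},
    "recipes": {  # item -> required ingredients
        "planks": {"log"},
        "stick": {"planks"},
        "crafting_table": {"planks"},
        "wooden_pickaxe": {"planks", "stick"},
    },
}

def validate_step(step: dict, kb: dict = KNOWLEDGE_BASE) -> list[str]:
    """Return a list of knowledge violations for one reasoning step."""
    errors = []
    for item in step.get("mentions", []):
        if item not in kb["items"]:
            errors.append(f"hallucinated entity: {item}")
    target = step.get("craft")
    if target is not None:
        required = kb["recipes"].get(target, set())
        missing = required - set(step.get("inventory", []))
        if missing:
            errors.append(f"recipe for {target} missing: {sorted(missing)}")
    return errors

def filter_trace(trace: list[dict]) -> list[dict]:
    """Keep only reasoning steps that pass every knowledge check."""
    return [step for step in trace if not validate_step(step)]

trace = [
    {"mentions": ["log"], "craft": "planks", "inventory": ["log"]},
    {"mentions": ["diamond_hoverboard"]},        # hallucinated entity
    {"craft": "stick", "inventory": []},         # ingredients not in inventory
]
print(len(filter_trace(trace)))  # → 1 (only the first step survives)
```

In the paper's pipeline the corrected traces, rather than simply dropped ones, form the training data; the point here is only how a domain knowledge base can mechanically catch hallucinations.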

For (2), the paper proposes a Dual‑Router Aligned Mixture‑of‑Experts (MoE) architecture. A horizontal Task Router assigns each input to a task‑specific expert (e.g., Planning, Captioning, Action) together with a shared knowledge expert, thereby orthogonalizing parameters and preventing task interference. A vertical Layer Router decides, based on the cognitive demand of the current request, whether to activate the full depth of the network (Deep Path for System 2) or to skip intermediate layers (Fast Path for System 1). Both routing decisions are made once before the forward pass, keeping runtime overhead minimal while allowing the model to allocate compute adaptively.
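The two routing decisions above can be sketched as a single dispatch function executed once per request. Everything below (expert names, layer counts, the rule that only Action takes the Fast Path) is an illustrative assumption, not the paper's code.

```python
# Minimal sketch of dual-router dispatch: a horizontal Task Router picks a
# task-specific expert plus a shared knowledge expert, and a vertical Layer
# Router fixes the network depth before the forward pass begins.
# Expert names, depths, and the routing rule are assumptions for exposition.

TASK_EXPERTS = {"planning", "captioning", "grounding", "reflection", "action"}
FAST_PATH_LAYERS = 8    # System 1: skip intermediate layers
DEEP_PATH_LAYERS = 24   # System 2: activate full depth

def route(task: str) -> dict:
    """One-shot routing decision made before the forward pass."""
    if task not in TASK_EXPERTS:
        raise ValueError(f"unknown task: {task}")
    # Horizontal routing: task-specific expert + shared knowledge expert,
    # keeping per-task parameters decoupled while sharing common knowledge.
    experts = [task, "shared_knowledge"]
    # Vertical routing: reflexive action execution takes the Fast Path;
    # deliberative reasoning tasks take the Deep Path.
    depth = FAST_PATH_LAYERS if task == "action" else DEEP_PATH_LAYERS
    return {"experts": experts, "active_layers": depth}

print(route("action"))    # shallow Fast Path for System 1
print(route("planning"))  # full-depth Deep Path for System 2
```

Because both decisions are resolved up front, the forward pass itself needs no per-layer branching, which is consistent with the paper's claim of minimal runtime overhead.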

To activate System 2 capabilities, the authors devise Dual‑Granularity Reasoning‑Aware Policy Optimization (DGRPO). Unlike standard RL that only supplies sparse outcome rewards, DGRPO introduces Process‑Outcome Co‑Supervision. Two dense reward components are defined: (i) a Dependency‑Aware Synthesis Reward that uses a domain knowledge graph to enforce logical ordering among reasoning steps, and (ii) a Hallucination‑Aware Consistency Reward that penalizes mentions of entities not grounded in the visual observation. These rewards guide the agent to “think‑before‑act,” ensuring that the generated reasoning chain is both logically coherent and visually grounded.
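The two dense reward terms can be sketched as simple scoring functions over a reasoning chain. The graph format, scoring rules, and the equal weighting below are assumptions made for illustration; the paper's actual reward shaping may differ.

```python
# Illustrative sketch of DGRPO's two dense reward components:
# (i) a dependency-aware reward scores whether each reasoning step's
#     prerequisites (from a domain knowledge graph) appear earlier;
# (ii) a hallucination-aware consistency reward scores how many mentioned
#     entities are grounded in the visual observation.
# Graph schema, functions, and the 0.5/0.5 weighting are assumptions.

DEPENDENCY_GRAPH = {  # item -> prerequisites that must be produced first
    "planks": {"log"},
    "stick": {"planks"},
    "wooden_pickaxe": {"planks", "stick"},
}

def dependency_reward(steps: list[str], graph: dict) -> float:
    """Fraction of steps whose prerequisites were produced by earlier steps."""
    produced, ok = set(), 0
    for step in steps:
        if graph.get(step, set()) <= produced:
            ok += 1
        produced.add(step)
    return ok / len(steps)

def consistency_reward(mentioned: set[str], observed: set[str]) -> float:
    """Fraction of mentioned entities actually grounded in the observation."""
    if not mentioned:
        return 1.0
    return len(mentioned & observed) / len(mentioned)

steps = ["log", "planks", "stick", "wooden_pickaxe"]
r_process = dependency_reward(steps, DEPENDENCY_GRAPH)        # 1.0
r_outcome = consistency_reward({"log", "zombie"}, {"log", "tree"})  # 0.5
reward = 0.5 * r_process + 0.5 * r_outcome
print(r_process, r_outcome, reward)  # → 1.0 0.5 0.75
```

Both terms are dense: every reasoning step contributes to the score, so the policy receives gradient signal on the thought process itself rather than only on the final answer.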

Extensive experiments compare Optimus‑3 against state‑of‑the‑art Minecraft agents, GPT‑4o, and Qwen2.5‑VL across a suite of tasks. Optimus‑3 achieves absolute improvements of 21% in Planning, 66% in Captioning, 76% in Embodied QA, a 3.4× boost in Grounding, and 18% in Reflection, while also improving Long‑Horizon Action by 3%. Most strikingly, on open‑ended tasks the agent attains a 60% success rate, a regime in which prior agents fail almost entirely.

In summary, Optimus‑3 demonstrates that a carefully engineered combination of knowledge‑augmented data generation, dual‑router MoE architecture, and dense reasoning‑aware RL can reconcile the divergent demands of fast reflexive control and slow deliberative cognition. The work not only sets a new benchmark for Minecraft agents but also provides a scalable blueprint for building generalist agents capable of integrated perception, planning, and action in any complex, multimodal environment.

