Multitask learning poses significant challenges due to the highly multimodal and diverse nature of robot action distributions. Effectively fitting policies to these complex task distributions is difficult: existing monolithic models often underfit the action distribution and lack the flexibility required for efficient adaptation. We introduce a novel modular diffusion policy framework that factorizes complex action distributions into a composition of specialized diffusion models, each capturing a distinct sub-mode of the behavior space, yielding a more effective overall policy. In addition, this modular structure enables flexible policy adaptation to new tasks by adding or fine-tuning components, which inherently mitigates catastrophic forgetting. Empirically, across both simulated and real-world robotic manipulation settings, we show that our method consistently outperforms strong modular and monolithic baselines.
I. INTRODUCTION

Imitation learning has emerged as a powerful paradigm for acquiring complex robotic manipulation skills [7,8,10,19,38]. However, extending this success to multitask settings remains a significant challenge. As the variety of tasks increases, the underlying action distribution becomes highly multimodal and diverse, often involving distinct control strategies across different objects. Traditional monolithic policies often struggle to generalize across tasks, represent multiple behavior modes, or adapt efficiently to new skills [6,16,38].
To address these limitations, modular policy architectures, most notably Mixture-of-Experts (MoE) models [23,32], have emerged as a promising direction. By decomposing the policy into specialized components, modular methods improve scalability and reuse across tasks [7,31,38,40,42]. Yet, existing MoE-based approaches often suffer from training instability [23], lack a principled probabilistic formulation, and produce expert modules with unclear or overlapping roles [15,40], limiting their interpretability.
We propose Factorized Diffusion Policy (FDP), a simple yet effective modular policy architecture. FDP decomposes the policy into multiple diffusion components (Fig. 1a), each capturing a distinct behavioral mode, which are dynamically composed at inference time via an observation-conditioned router (Fig. 1c). Instead of discrete expert selection as in standard MoE architectures, FDP uses continuous score aggregation, enabling stable training, preventing routing imbalance, and promoting clearer specialization across components. FDP is grounded in compositional diffusion modeling [12,15,26], where aggregating scores corresponds to sampling from the product of distributions, providing a principled probabilistic interpretation and a natural formulation as constraint satisfaction. The modular structure further enables efficient task adaptation: we extend the policy by introducing new diffusion components initialized via upcycling [23] from existing components (Fig. 1b), allowing efficient skill expansion without retraining the entire policy. This factorization improves multitask learning and supports scalable adaptation.
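To make the composition concrete, here is a minimal sketch of the score aggregation described above; the symbols ($K$ components $\epsilon_{\theta_i}$, router weights $w_i$, observation $o$, noised action $a_t$) are notation we introduce for illustration rather than the paper's exact formulation:

$$
\epsilon_\theta(a_t, o, t) \;=\; \sum_{i=1}^{K} w_i(o)\, \epsilon_{\theta_i}(a_t, o, t),
\qquad
p_\theta(a \mid o) \;\propto\; \prod_{i=1}^{K} p_{\theta_i}(a \mid o)^{\,w_i(o)},
$$

i.e., the router's continuous weights combine the component noise predictions at every denoising step, which, under the usual approximation made in compositional diffusion, amounts to sampling from a weighted product of the component distributions, so each component acts as a soft constraint on the sampled action.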
We validate FDP through extensive experiments on the simulation benchmarks MetaWorld [44] and RLBench [20], and further demonstrate its practical benefits in real-world robotic manipulation. Our contributions are summarized as follows: (1) We introduce a modular diffusion policy architecture that composes specialized components via observation-conditioned compositional sampling. (2) We demonstrate that our compositional framework improves multitask performance and enables sub-skill decomposition across diffusion modules. (3) We propose a simple and effective strategy for adapting to new tasks by selectively tuning or augmenting existing components, achieving superior sample efficiency and modular reuse.

II. RELATED WORKS

1) Diffusion Models for Robotics: Diffusion models have emerged as a powerful tool for modeling complex distributions, achieving strong performance in image [17,27,29] and video generation [18,39]. Their stable training and generative flexibility have led to increasing adoption in robotic domains, including video-conditioned policy learning [2,14],
grasp synthesis [36], bimanual manipulation [8], tool use [9], trajectory planning [1,5,21], and closed-loop visuomotor control. Diffusion Policy (DP) [10] demonstrated that diffusion models can be used to learn reactive visuomotor policies from demonstrations, achieving state-of-the-art performance in single-task imitation learning.
2) Multitask Imitation Learning and Adaptation: Traditional approaches to multitask imitation learning often rely on monolithic networks [22,35] or language-conditioned policies [16,30], which limit scalability, reusability, and interpretability. While early research established modular architectures to improve task decomposition [3,11], recent methods such as Sparse Diffusion Policy (SDP) [40] and variational distillation for MoE [45] extend this modular principle by introducing MoE layers into diffusion models, activating sparse expert sets based on observations. While this modular design enables expert reuse and policy expansion, it suffers from instability and load imbalance [23]. Mixture-of-Denoising-Experts (MoDE) [31] instead conditions expert routing on the noise level, distributing learning across denoising stages rather than tasks, which makes its experts less interpretable and harder to transfer across tasks. In contrast, FDP composes diffusion models through continuous score aggregation, avoiding hard expert selection and ensuring that all components are jointly optimized. This promotes stable optimization, clear specialization, and better load balancing. While maintaining the modular extensibility of MoE designs, FDP allows efficient adaptation to new tasks.
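For intuition, the contrast with hard expert selection can be sketched in a few lines of pseudo-PyTorch; the module and argument names below are hypothetical placeholders for illustration, not the released FDP implementation:

```python
import torch
import torch.nn as nn


class FactorizedScoreAggregator(nn.Module):
    """Hypothetical sketch: compose K diffusion components via an
    observation-conditioned router with continuous (soft) weights."""

    def __init__(self, components: nn.ModuleList, obs_dim: int):
        super().__init__()
        self.components = components                        # K noise-prediction networks
        self.router = nn.Linear(obs_dim, len(components))   # observation -> K logits

    def forward(self, noisy_action, obs, timestep):
        # Continuous weights over all components (no top-k / hard selection),
        # so every component receives gradient and contributes to the score.
        weights = torch.softmax(self.router(obs), dim=-1)    # (B, K)
        scores = torch.stack(
            [c(noisy_action, obs, timestep) for c in self.components], dim=1
        )                                                     # (B, K, action_dim)
        # Weighted sum of component scores ~ sampling from a weighted
        # product of the component distributions (compositional diffusion).
        return (weights.unsqueeze(-1) * scores).sum(dim=1)   # (B, action_dim)
```

Because the weights are dense rather than a hard top-k mask, there is no routing collapse to repair, and adding a new component for a new task only requires enlarging the router's output and initializing the new component (e.g., by copying an existing one, in the spirit of upcycling).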