Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization
Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
💡 Research Summary
The paper tackles a fundamental challenge in multi‑objective reinforcement learning (MORL): single preference‑conditioned policies often fail to recover a complete Pareto front. The authors identify two structural causes: (1) destructive gradient interference caused by premature scalarization of the multi‑objective reward vector, and (2) representational collapse where the policy’s behavior becomes insensitive to changes in the preference vector. To address these issues, they propose D³PO (Decomposed, Diversity‑Driven Policy Optimization), a PPO‑based framework that restructures the learning pipeline.
Key components of D³PO are:
- Multi‑Head Critic – A shared backbone processes the state and the preference vector, then branches into d heads, each predicting the unweighted value for one objective. This preserves raw per‑objective learning signals and allows the critic to be conditioned on the policy's preference‑dependent trajectories.
- Per‑Objective PPO Surrogate Losses – For each objective i, a standard PPO clipped surrogate loss is computed independently using the corresponding advantage estimate A⁽ⁱ⁾. By applying PPO's clipping before any weighting, each objective's gradient is stabilized and protected from interference.
- Late‑Stage Weighting (LSW) – After the per‑objective surrogate losses have been stabilized, the user‑provided preference weights ω are multiplied with these losses to form the final actor loss. This postpones scalarization to a "late stage," preventing the destructive cancellation that occurs when weights are applied too early.
- Scaled Diversity Regularizer – To avoid mode collapse, a regularization term encourages the KL divergence between action distributions conditioned on two different preferences to be proportional to the Euclidean distance between those preferences. The term is added to the actor objective with a scaling factor λ. The authors prove (Proposition F.2) that any minimizer of this objective cannot exhibit representational collapse, guaranteeing that distinct preferences map to distinct behaviors.
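The three actor‑side components above can be sketched together in a few lines. The snippet below is an illustrative reconstruction, not the paper's implementation: function names, the squared‑error form used to tie KL divergence to preference distance, and the default values of ε and λ are all assumptions.

```python
# Illustrative sketch of D^3PO's actor loss (names and hyperparameters are
# assumptions): PPO clipping is applied per objective FIRST, the preference
# weights scalarize the already-clipped losses afterwards (late-stage
# weighting), and a diversity term pushes the KL divergence between two
# preference-conditioned policies toward the distance between preferences.

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for one (sample, objective) pair."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)

def actor_loss(ratios, advantages, weights, kl_term, pref_dist, lam=0.1):
    """ratios: importance ratios per sample; advantages: per-sample lists of
    per-objective advantage estimates; weights: preference vector omega;
    kl_term: KL between the policies conditioned on two sampled preferences;
    pref_dist: Euclidean distance between those preferences."""
    d = len(weights)
    # Per-objective surrogate losses, negated for gradient descent.
    per_obj = []
    for i in range(d):
        surr = sum(clipped_surrogate(r, a[i]) for r, a in zip(ratios, advantages))
        per_obj.append(-surr / len(ratios))
    # Late-stage weighting: scalarize only after clipping stabilized each term.
    weighted = sum(w * l for w, l in zip(weights, per_obj))
    # Scaled diversity regularizer (assumed squared-error form): drive the KL
    # toward being proportional to the preference distance.
    diversity = (kl_term - pref_dist) ** 2
    return weighted + lam * diversity
```

Note the ordering: had `weights` been applied inside `clipped_surrogate`, opposing per‑objective advantages could cancel before clipping, which is exactly the early‑scalarization interference the paper argues against.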
The training loop proceeds as follows: (a) collect trajectories using the current preference‑conditioned policy; (b) compute Generalized Advantage Estimates (GAE) for each objective independently; (c) update the multi‑head critic by minimizing mean‑squared error against unweighted returns; (d) compute per‑objective PPO surrogate losses; (e) combine them with LSW and the diversity regularizer to update the actor over several epochs.
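Step (b), running GAE independently per objective, can be sketched as follows. This is a minimal pure‑Python illustration under assumed shapes (reward and value vectors with one entry per objective, plus a bootstrap value for the final state); the function name is not from the paper.

```python
# Illustrative per-objective Generalized Advantage Estimation (assumed
# interface): each objective gets its own GAE recursion over its own
# unweighted rewards and multi-head critic values.

def gae_per_objective(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-T list of per-objective reward vectors;
    values: length-(T+1) list of per-objective value vectors, where the
    last entry bootstraps the final state. Returns, for each objective,
    the advantage sequence A^(i)_0..T-1."""
    d = len(rewards[0])
    T = len(rewards)
    advantages = [[0.0] * T for _ in range(d)]
    for i in range(d):
        gae = 0.0
        for t in reversed(range(T)):
            # TD residual for objective i at step t.
            delta = rewards[t][i] + gamma * values[t + 1][i] - values[t][i]
            gae = delta + gamma * lam * gae
            advantages[i][t] = gae
    return advantages
```

Because each A⁽ⁱ⁾ is computed from unweighted returns, the critic targets in step (c) remain per‑objective mean‑squared errors, and no preference weighting enters until the actor update in step (e).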
Experiments are conducted on standard continuous‑control benchmarks (Hopper, Ant, Humanoid) with 3–5 objectives, comparing D³PO against recent single‑policy methods (Pareto‑Conditioned Networks, PD‑MORL, MOPPO) and multi‑policy baselines (GPI‑based and curriculum‑based MORL). Evaluation metrics include Hypervolume (HV), Sparsity (SP), and Expected Utility (EU). D³PO consistently achieves the highest HV and EU while also attaining the best (lowest) SP, indicating a well‑spread front. Ablation studies show that removing LSW leads to severe front shrinkage, and omitting the diversity term yields behavior that is nearly invariant to preference changes, confirming that both components are necessary.
Strengths of the approach are its on‑policy sample efficiency, the preservation of per‑objective signals, and the ability to deploy a single network for any preference, dramatically reducing memory and routing overhead. Limitations include the reliance on linear weighted‑sum preferences (non‑linear scalarizations are not directly addressed) and sensitivity to the diversity regularization coefficient λ, which may require tuning for new domains.
In summary, D³PO provides a theoretically grounded and empirically robust solution for learning a universal preference‑conditioned policy in MORL. By decomposing advantage computation, delaying preference weighting, and enforcing proportional behavioral diversity, it overcomes the two primary failure modes of prior work and sets a new benchmark for single‑policy multi‑objective reinforcement learning.