A Survey of Multi-Objective Sequential Decision-Making
Sequential decision-making problems with multiple objectives arise naturally in practice and pose unique challenges for research in decision-theoretic planning and learning, which has largely focused on single-objective settings. This article surveys algorithms designed for sequential decision-making problems with multiple objectives. Though there is a growing body of literature on this subject, little of it makes explicit under what circumstances special methods are needed to solve multi-objective problems. Therefore, we identify three distinct scenarios in which converting such a problem to a single-objective one is impossible, infeasible, or undesirable. Furthermore, we propose a taxonomy that classifies multi-objective methods according to the applicable scenario, the nature of the scalarization function (which projects multi-objective values to scalar ones), and the type of policies considered. We show how these factors determine the nature of an optimal solution, which can be a single policy, a convex hull, or a Pareto front. Using this taxonomy, we survey the literature on multi-objective methods for planning and learning. Finally, we discuss key applications of such methods and outline opportunities for future work.
💡 Research Summary
This survey addresses the growing need for decision‑theoretic methods that can handle multiple, often conflicting objectives in sequential decision‑making problems. The authors begin by arguing that most of the existing literature on planning and reinforcement learning assumes a single scalar reward, which is insufficient for many real‑world domains such as robotics, autonomous driving, healthcare, and finance, where trade‑offs among safety, cost, performance, and other criteria are intrinsic. To clarify when a multi‑objective formulation truly requires dedicated techniques, the paper defines three distinct scenarios in which reduction to a single objective fails: (1) Impossible – the objectives are so antagonistic that any scalarization would discard essential information; (2) Infeasible – computational or sample‑complexity constraints make it impractical to solve the scalarized problem exactly; and (3) Undesirable – the decision maker does not wish to commit to a particular scalarization a priori, preferring instead a set of Pareto‑optimal policies.
Building on these scenarios, the authors propose a three‑dimensional taxonomy. The first dimension characterizes the scalarization function: linear weighted sums, non‑linear utility functions, and related solution‑quality criteria such as ε‑dominance and hypervolume‑based measures. The second dimension classifies the type of policy that the algorithm produces: a single deterministic policy, a stochastic policy, or a policy set (i.e., a representation of the Pareto front). The third dimension identifies the structure of the optimal solution: a single policy (when the scalarization is convex and the problem reduces to a standard MDP), a convex hull of value vectors (when the scalarization is convex but the Pareto front is not a single point), or the full Pareto front (when the scalarization is non‑convex or when the decision maker explicitly requests the entire frontier).
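To make the linear case of the first dimension concrete, the sketch below (the value vectors and the uniform weight grid are invented purely for illustration) sweeps the two‑objective weight simplex and records which candidate policies are optimal under some weighted sum; a Pareto‑dominated vector never wins for any weighting, which is why the convex case admits a more compact solution set:

```python
import numpy as np

# Hypothetical value vectors for four candidate policies (two objectives).
# Policy 3 is Pareto-dominated by policy 1 and should never be selected.
V = np.array([[10.0, 2.0],   # good on objective 0
              [7.0, 7.0],    # balanced
              [2.0, 10.0],   # good on objective 1
              [4.0, 4.0]])   # dominated

def linear_scalarize(values, weights):
    """Project each multi-objective value vector onto a scalar via w . V."""
    return values @ weights

# Sweep the weight simplex; the policies that are optimal for *some*
# weight vector form the convex coverage set of this toy problem.
optimal = set()
for w0 in np.linspace(0.0, 1.0, 101):
    w = np.array([w0, 1.0 - w0])
    optimal.add(int(np.argmax(linear_scalarize(V, w))))

print(sorted(optimal))  # → [0, 1, 2]; the dominated policy 3 never appears
```

With a non‑convex scalarization this shortcut is lost, which is precisely when the full Pareto front must be retained.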
The survey then systematically reviews the literature according to this taxonomy, separating planning‑based and learning‑based approaches. In the planning category, classic dynamic‑programming methods are extended to the multi‑objective case: Multi‑Objective Value Iteration (MOVI) and Multi‑Objective Policy Iteration (MOPI) maintain a set of nondominated value vectors for each state and use dominance pruning or ε‑approximation to keep the set tractable. Label‑setting algorithms for multi‑objective shortest‑path problems are also discussed, highlighting how convexity of the scalarization determines whether a single optimal path suffices or whether a full frontier must be enumerated.
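The dominance‑pruning step that such value‑iteration extensions rely on is simple to state: a value vector is kept only if no other candidate is at least as good on every objective and strictly better on at least one. A minimal sketch (the candidate vectors here are invented for illustration):

```python
def dominates(u, v):
    """True if u Pareto-dominates v: at least as good on every objective
    and strictly better on at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def prune(vectors):
    """Keep only the nondominated vectors (the Pareto front of the set)."""
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors if u != v)]

candidates = [(10.0, 2.0), (7.0, 7.0), (4.0, 4.0), (2.0, 10.0)]
print(prune(candidates))  # → [(10.0, 2.0), (7.0, 7.0), (2.0, 10.0)]
```

In MOVI‑style dynamic programming, a set like this is maintained per state and pruned after every backup, which is what keeps the representation tractable as the sets would otherwise grow combinatorially.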
In the learning domain, the authors cover both value‑based and policy‑based multi‑objective reinforcement‑learning (MORL) methods. Vector‑valued Q‑learning updates a set of Q‑vectors per state‑action pair and selects actions based on Pareto dominance. Deep extensions (Multi‑Objective DQN) adapt the network architecture to output a vector of Q‑values and employ experience replay buffers that store multi‑objective transitions. Policy‑gradient approaches such as Multi‑Objective Policy Gradient (MOPG) and Multi‑Objective Actor‑Critic introduce a set of Lagrange multipliers that are either pre‑specified or learned online to balance the objectives during gradient ascent. Evolutionary strategies and Bayesian optimization are presented as meta‑learning tools that can efficiently explore the space of scalarizations, producing a diverse collection of policies that approximate the Pareto front with relatively few environment interactions.
The paper also surveys concrete applications, illustrating how the taxonomy guides algorithm selection. In robotic path planning, energy consumption and travel time are jointly optimized; because a convex (e.g., linear) scalarization is appropriate, a convex‑hull representation often suffices, and ε‑approximate MOVI can be employed. In autonomous driving, safety constraints place the problem in the “Impossible” scenario, prompting the use of Pareto‑front approximations that can be queried at run time based on user preferences. In personalized medicine, treatment efficacy versus side‑effect risk is a classic “Undesirable” case, leading to policy‑set methods that present clinicians with a spectrum of treatment options. Financial portfolio management is highlighted as an example where risk‑adjusted return trade‑offs are handled via scalarizations learned from market data, illustrating the “Infeasible” scenario, in which exact solutions are too costly and approximate methods are required.
Finally, the authors outline several promising research directions. First, automatic scalarization learning—inferring the decision maker’s utility function from interaction data—remains largely open. Second, scalable Pareto‑front representation (e.g., using succinct data structures, incremental pruning, or neural approximators) is needed for high‑dimensional state spaces. Third, transfer and meta‑learning across tasks with similar objective structures could dramatically reduce sample complexity. Fourth, human‑in‑the‑loop interfaces that allow users to navigate the Pareto front intuitively are essential for real‑world deployment.
In summary, this survey provides a coherent framework that clarifies when multi‑objective methods are indispensable, categorizes existing algorithms by scalarization, policy type, and solution structure, and maps them to practical domains. It serves as both a reference for researchers seeking to develop new multi‑objective planning or learning algorithms and a guide for practitioners aiming to select the most appropriate technique for their specific multi‑objective sequential decision‑making problem.