Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing the discrete-time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton–Jacobi–Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with HJB-based learning approaches, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. We evaluate our method on continuous-time variants of standard benchmarks, including the multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results demonstrate that the approach consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.
💡 Research Summary
This paper tackles the long‑standing gap between continuous‑time reinforcement learning (CTRL) and multi‑agent reinforcement learning (MARL). While CTRL replaces the discrete‑time Bellman recursion with differential value functions that satisfy the Hamilton–Jacobi–Bellman (HJB) partial differential equation, most prior work has been limited to single‑agent settings. The authors identify two fundamental obstacles to extending CTRL to multi‑agent domains: (i) solving HJB equations directly suffers from the curse of dimensionality, making it infeasible for high‑dimensional joint state spaces; and (ii) even when learning‑based approaches mitigate the computational burden, approximating a centralized value function accurately remains difficult, and errors in the value gradient destabilize policy updates.
To overcome these challenges, the authors propose a novel framework called CT‑MARL that integrates physics‑informed neural networks (PINNs) with a Value Gradient Iteration (VGI) module. The key components are:
- Problem formulation – The authors define a cooperative continuous-time multi-agent system as a tuple $(\mathcal{X}, \{U_i\}_{i=1}^N, f, r, \{t_k\}, \rho)$ with variable decision times $t_k$. The global value function $V(x)$ is the viscosity solution of the HJB equation.
- PINN-based critic (VIP) – A neural network $V_\theta(x)$ is trained to satisfy the HJB residual $R_\theta(x_t) = -\rho V_\theta(x_t) + \nabla_x V_\theta(x_t)^\top f(x_t, u_t) + r(x_t, u_t)$. The loss combines the PDE residual, boundary/anchor terms, and a Monte-Carlo sampling scheme that scales to high dimensions, thereby avoiding the curse of dimensionality that plagues grid-based solvers.
- Value Gradient Iteration (VGI) – Recognizing that the HJB residual alone does not guarantee accurate gradients, the VGI module propagates gradient estimates along sampled trajectories using a small-step integration of the differential relationship $\nabla_x V_{t+\Delta t} = \nabla_x V_t + \Delta t \big[\rho \nabla_x V_t - \nabla_x r(x_t, u_t) - (\partial_x f(x_t, u_t))^\top \nabla_x V_t\big]$, obtained by differentiating the HJB equation with respect to the state. Iteratively refining the value gradients in this way improves gradient fidelity, which in turn yields more accurate values and stronger policy learning.
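The two critic-side ingredients above can be sketched in a few lines of PyTorch: the HJB residual evaluated with autograd, and one small-step VGI propagation of a value-gradient estimate. The critic architecture, the dynamics `f`, the reward `r`, and the constants `rho` and `dt` are illustrative placeholders, not the paper's actual implementation.

```python
import torch

rho, dt = 0.1, 0.01  # discount rate and VGI integration step (illustrative values)

# Placeholder critic V_theta: R^4 -> R; the paper's architecture is not specified here.
V_theta = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)

def f(x, u):
    # Toy joint dynamics standing in for the true system dynamics.
    return -x + u

def r(x, u):
    # Toy quadratic running reward.
    return -(x.pow(2).sum(-1) + 0.1 * u.pow(2).sum(-1))

def hjb_residual(x, u):
    # R_theta(x_t) = -rho*V(x_t) + grad_x V(x_t)^T f(x_t, u_t) + r(x_t, u_t)
    x = x.detach().requires_grad_(True)
    V = V_theta(x).squeeze(-1)
    gradV = torch.autograd.grad(V.sum(), x, create_graph=True)[0]
    return -rho * V + (gradV * f(x, u)).sum(-1) + r(x, u)

def vgi_step(x, u, gradV):
    # One small-step propagation of a value-gradient estimate along a trajectory,
    # from differentiating the HJB equation in x (costate-style dynamics):
    #   grad V_{t+dt} = grad V_t + dt * (rho*grad V_t - grad_x r - (df/dx)^T grad V_t)
    x = x.detach().requires_grad_(True)
    jvpT = torch.autograd.grad(f(x, u), x, grad_outputs=gradV)[0]  # (df/dx)^T gradV
    grad_r = torch.autograd.grad(r(x, u).sum(), x)[0]
    return gradV + dt * (rho * gradV - grad_r - jvpT)

x, u = torch.randn(32, 4), torch.randn(32, 4)
pde_loss = hjb_residual(x, u).pow(2).mean()  # PINN residual term of the critic loss
```

In practice the squared residual would be minimized jointly with boundary/anchor terms, and the VGI output would serve as a target that keeps the critic's gradient field consistent with the value itself.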