Continuous-time reinforcement learning for optimal switching over multiple regimes

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

This paper studies continuous-time reinforcement learning (RL) for optimal switching problems across multiple regimes. We consider an exploratory formulation under entropy regularization in which the agent randomizes both the timing of switches and the selection of regimes through the generator matrix of an associated continuous-time finite-state Markov chain. We establish the well-posedness of the associated system of Hamilton-Jacobi-Bellman (HJB) equations and provide a characterization of the optimal policy. Policy improvement and the convergence of policy iteration are rigorously established by analyzing the system of equations. We also show that the value function of the exploratory formulation converges to the value function of the classical formulation as the temperature parameter vanishes. Finally, a reinforcement learning algorithm is devised and implemented using policy evaluation based on a martingale characterization. Our numerical examples, implemented with neural networks, illustrate the effectiveness of the proposed RL algorithm.


💡 Research Summary

This paper develops a comprehensive continuous‑time reinforcement learning (RL) framework for optimal switching problems involving multiple regimes. The authors introduce an exploratory formulation in which the decision maker randomizes both the timing of switches and the choice of target regime by controlling the generator matrix of an associated continuous‑time finite‑state Markov chain (CTMC). Entropy regularization is imposed on the generator to encourage exploration, with a temperature parameter β>0 governing the exploration‑exploitation trade‑off.
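The randomization device described above is the generator of a continuous-time Markov chain over regimes. As a minimal illustration of how such a chain produces randomized switching times and targets, the sketch below simulates a CTMC with a constant generator matrix; in the paper the generator is state- and time-dependent and learned, so all names here are illustrative assumptions.

```python
import numpy as np

def simulate_regimes(Q, i0, T, rng):
    """Simulate the regime path of a CTMC with constant generator Q on [0, T].

    Q: (m, m) generator matrix; off-diagonal entries are nonnegative
       switching rates and each row sums to zero.
    Returns the jump times and the visited regimes. Illustrative sketch
    only; the paper's generator depends on time and state.
    """
    t, i = 0.0, i0
    times, regimes = [0.0], [i0]
    while True:
        rate = -Q[i, i]  # total switching intensity out of regime i
        if rate <= 0:
            break  # absorbing regime: no further switches
        t += rng.exponential(1.0 / rate)  # exponential holding time
        if t >= T:
            break
        p = Q[i].copy()
        p[i] = 0.0
        p /= rate  # jump distribution over target regimes
        i = rng.choice(len(p), p=p)
        times.append(t)
        regimes.append(i)
    return times, regimes
```

Raising the off-diagonal rates of `Q` makes switching more frequent, which is how the entropy regularization (via the temperature β) encourages exploration over regimes.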

The exploratory control problem is transformed into a system of coupled Hamilton‑Jacobi‑Bellman (HJB) partial differential equations (PDEs). For each regime i∈{1,…,m}, the value function V_i(t,x) satisfies a nonlinear variational inequality that simultaneously involves a minimum operator (reflecting the continuation region) and a maximum operator (reflecting the optimal switching decision). The authors prove, via Schauder estimates and a truncation argument, the existence of a bounded classical solution to this system (Lemma 3.2) and establish uniqueness through a comparison principle (Proposition 3.3). A verification theorem shows that this solution coincides with the exploratory value function.

A policy iteration (PI) scheme is then analyzed. Given a policy π_k that specifies a generator Q^{π_k}, the corresponding value function V^{π_k} is obtained by solving a linear PDE (policy evaluation). The policy improvement step updates the generator by an explicit entropy‑regularized soft‑max rule that depends only on the current value function, thus avoiding the need for its derivatives. Proposition 4.1 proves monotonic improvement of the value, and Theorem 4.2 establishes linear convergence of the PI algorithm with an explicit rate γ<1. This constitutes the first rigorous convergence‑rate result for multi‑regime continuous‑time switching.
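The soft-max improvement rule described above can be sketched in a few lines. The normalization convention and function names below are assumptions for illustration; the summary only specifies that the update is a soft-max over (V_j − g_{ij})/β depending on the current value function alone.

```python
import numpy as np

def improve_policy(V, g, beta, i):
    """Entropy-regularized soft-max policy improvement for regime i.

    V: current value estimates V_j at a fixed (t, x), shape (m,)
    g: switching-cost matrix, g[i][j] = cost of switching i -> j
    beta: temperature parameter (larger beta = more exploration)
    Returns a probability vector over target regimes j != i.
    Hypothetical sketch; the paper works with generator entries rather
    than normalized probabilities.
    """
    m = len(V)
    logits = np.array([(V[j] - g[i][j]) / beta if j != i else -np.inf
                       for j in range(m)])
    logits -= logits[np.isfinite(logits)].max()  # numerical stability
    w = np.exp(logits)  # exp(-inf) = 0 rules out staying put
    return w / w.sum()
```

Because the rule uses only the value function itself (no spatial derivatives), each improvement step is cheap once policy evaluation has been carried out.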

The paper also investigates the limit β→0. Using stability analysis of viscosity solutions, Lemma 4.3 and Theorem 4.4 demonstrate that the exploratory value functions V^β converge pointwise to the classical optimal switching value V, and the PDE system reduces to the standard variational‑inequality formulation (2.7).
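For reference, in standard optimal-switching notation the classical system recovered in this limit typically reads as follows (a schematic sketch; the paper's equation (2.7) may use different sign or notation conventions):

```latex
\min\Big\{\, -\partial_t V_i - \mathcal{L}_i V_i - f_i \,,\;
             V_i - \max_{j \neq i}\big( V_j - g_{ij} \big) \Big\} = 0,
\qquad i \in \{1, \dots, m\},
```

where $\mathcal{L}_i$ is the infinitesimal generator of the diffusion in regime $i$, $f_i$ the running reward, and $g_{ij}$ the cost of switching from regime $i$ to regime $j$. The first argument of the min is active in the continuation region, the second on the switching region.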

On the algorithmic side, the authors devise a model‑free RL algorithm based on martingale orthogonality. By simulating sample paths of the CTMC (jump times τ_k and post‑jump regimes), they construct unbiased estimators of the Bellman residual and use stochastic approximation to update neural‑network parameters θ that simultaneously represent the value functions V_i(t,x;θ) and the generator entries Q_{ij}(t,x;θ). The policy improvement step is implemented as a soft‑max over (V_j−g_{ij})/β. Theorem 5.4 provides an error bound for the stochastic approximation, showing O(1/√N) convergence with N samples.
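The martingale idea behind the policy-evaluation step is that, under a fixed policy, the value process plus accumulated running reward should be a martingale, so its sampled increments should have zero mean. A minimal sketch under that assumption is a semi-gradient update driven by the discretized martingale residual; the feature map, learning rate, and data layout below are all illustrative (the paper uses neural networks and a full martingale-orthogonality condition).

```python
import numpy as np

def features(t, x):
    # simple polynomial features standing in for a neural network
    return np.array([1.0, t, x, x * x])

def td_policy_evaluation(theta, paths, dt, running_reward, lr=0.05):
    """One stochastic-approximation sweep of policy evaluation.

    For V(t, x) = theta . features(t, x), the martingale property says
    V(t1, x1) - V(t0, x0) + f(t0, x0) * dt should have zero mean along
    sampled paths; we push theta in the direction that shrinks this
    residual, tested against features(t0, x0). Illustrative sketch.
    """
    for path in paths:
        for (t0, x0), (t1, x1) in zip(path[:-1], path[1:]):
            v0 = theta @ features(t0, x0)
            v1 = theta @ features(t1, x1)
            resid = v1 - v0 + running_reward(t0, x0) * dt  # martingale residual
            theta += lr * resid * features(t0, x0)  # semi-gradient TD update
    return theta
```

With zero running reward and a constant path the residuals vanish and the parameters are unchanged, which is the degenerate sanity check; with a nonzero reward the update moves the value estimate away from zero.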

Numerical experiments are presented in two settings: (1) a one‑dimensional diffusion with three regimes, and (2) a two‑dimensional diffusion with four regimes modeling an energy‑storage problem. In both cases, the PI algorithm converges within 10–15 iterations, the learned switching boundaries match those obtained from the classical solution, and the neural‑network approximations achieve high accuracy (L^∞ error < 0.02). The exploratory policies successfully discover optimal switching structures even when initialized from random policies, and the temperature schedule (β decreasing from 0.5 to 0.1) balances exploration and exploitation.

Overall, the paper makes four major contributions: (i) a novel exploratory continuous‑time formulation for multi‑regime optimal switching, (ii) rigorous analytical results on existence, uniqueness, and policy‑iteration convergence for the associated HJB system, (iii) a proof of convergence of the exploratory solution to the classical solution as the entropy regularization vanishes, and (iv) a practical martingale‑based RL algorithm with neural‑network function approximation and provable error bounds. These results significantly advance the theory and practice of continuous‑time RL for hybrid control problems and open avenues for extensions to high‑dimensional, partially observed, and risk‑sensitive switching environments.

