We present a framework to \emph{certify} Hamilton--Jacobi (HJ) reachability learned by reinforcement learning (RL). Building on a discounted initial-time \emph{travel-cost} formulation that makes small-step RL value iteration provably equivalent to a forward HJ equation with damping, we convert certified learning errors into calibrated inner/outer enclosures of the strict backward reachable tube. The core device is an additive-offset identity: if $W_\lambda$ solves the discounted travel-cost Hamilton--Jacobi--Bellman (HJB) equation, then $W_\varepsilon := W_\lambda + \varepsilon$ solves the same PDE with a constant offset $\lambda\varepsilon$. Consequently, a uniform value error corresponds \emph{exactly} to a constant HJB offset. We establish this uniform value error via two routes: (A) a Bellman operator-residual bound, and (B) an HJB PDE-slack bound. Our framework preserves HJ-level safety semantics and is compatible with deep RL. We demonstrate the approach on a double-integrator system by formally certifying, via satisfiability modulo theories (SMT), that a value function learned through reinforcement learning induces provably correct inner and outer backward-reachable-set enclosures over a compact region of interest.
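To make the offset identity concrete, here is a schematic sketch in simplified notation (the precise forward HJB with damping is stated in Section III). Suppose $W_\lambda$ satisfies a discounted HJB of the generic form
\[
\lambda\, W_\lambda(t,x) + \partial_t W_\lambda(t,x) + H\big(t, x, \nabla_x W_\lambda(t,x)\big) = 0 .
\]
For the shifted function $W_\varepsilon := W_\lambda + \varepsilon$ the derivatives of the constant $\varepsilon$ vanish, so
\[
\lambda\, W_\varepsilon + \partial_t W_\varepsilon + H\big(t, x, \nabla_x W_\varepsilon\big)
= \lambda\big(W_\lambda + \varepsilon\big) + \partial_t W_\lambda + H\big(t, x, \nabla_x W_\lambda\big)
= \lambda\, \varepsilon ,
\]
i.e., $W_\varepsilon$ solves the same PDE with the constant right-hand side $\lambda\varepsilon$.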
Safety verification for autonomous systems hinges on computing (or certifying) the set of initial states from which a safe/unsafe region can (or cannot) be reached within a time horizon. Hamilton--Jacobi (HJ) reachability encodes this guarantee via the sign of a value function that solves an HJ partial differential equation (PDE) or variational inequality (VI) in the viscosity sense [3], [10], [12]. Classical grid-based solvers suffer from the curse of dimensionality [6], and even high-order schemes remain costly at scale. Approximation strategies include decomposition [5], algorithms mitigating high-dimensional complexity [6], and operator-theoretic approaches (Hopf/Koopman) [15].
Learning-based methods pursue scalability along two directions. (i) Self-supervised reachability learning: DeepReach trains neural networks to satisfy the terminal/reach-cost HJ residual, but it requires an explicit dynamics model [4]. Certified Approximate Reachability (CARe) then converts bounded HJ losses into $\varepsilon$-accurate inner/outer enclosures using $\delta$-complete satisfiability modulo theories (SMT) and counterexample-guided inductive synthesis (CEGIS) [14]. (ii) Reinforcement learning (RL): RL offers the possibility of learning from experience without necessarily requiring a model of the dynamics. However, standard terminal/reach-cost formulations are not naturally compatible with small-step discounted Bellman updates and typically do not recover exact strict reach/avoid sets. Some approaches have attempted to bridge this gap, but their formulations may alter safety semantics [2], [7], [8], [17].
A recently proposed discounted travel-cost formulation resolves this incompatibility [13]: with cost zero off-target and strictly negative on target, the discounted Bellman iteration is provably equivalent to a forward HJB with damping, converging to its viscosity solution and recovering strict reach/avoid sets by sign, while remaining amenable to RL training. Because the learned object in this RL setting is the travel-cost value (rather than the terminal/reach-cost value used in DeepReach [4]), CARe cannot be applied verbatim [14]. This paper fills that gap by certifying RL-learned HJ values based on the formulation in [13].
The paper’s contributions are threefold: (i) a characterization of the discounted HJB equations corresponding to an additively shifted value function (Section III), (ii) certified value-error bounds derived from Bellman and PDE residuals (Section IV), and (iii) SMT-based certification pipelines enabling provable reachability enclosures (Section V).
The structure of the article is as follows. Section II formulates the time-invariant control problem studied in this paper and states the standing assumptions used throughout. Section III introduces the additive-offset identity, showing how a uniform value-function error induces a constant offset in the associated forward HJB equation. Section IV then considers the discrete-time Bellman counterpart of the forward HJB: it defines the one-step Bellman operator, leverages contraction and uniqueness properties to convert a certified Bellman-residual bound into a uniform value-function error bound, and connects this bound to reachability enclosures. Section V presents two SMT-based certification routes for establishing the value-error bound $\varepsilon_{\mathrm{val}}$ over a compact region of interest. Section VI applies the proposed certification framework to a canonical benchmark problem, the double integrator. Finally, Section VII summarizes the contributions of the paper and outlines directions for future research.
This section formulates the time-invariant optimal control problem studied in this paper. We define the system dynamics, admissible controls, running cost, and associated Hamiltonian, and state the standing regularity assumptions used throughout. We then introduce the discounted travel-cost value function, its Hamilton--Jacobi--Bellman (HJB) characterization, and an equivalent forward/initial-time formulation used in subsequent sections.
System: We consider the deterministic, time-invariant control system
\[
\dot{x}(s) = f\big(x(s), u(s)\big),
\]
where $x(\cdot) \in \mathbb{R}^n$ and $u(\cdot)$ is a measurable signal with values in a compact set $U \subset \mathbb{R}^m$. Admissible controls on $[t, T]$ are
\[
\mathcal{U}_{[t,T]} := \big\{ u : [t, T] \to U \;\big|\; u \text{ measurable} \big\}.
\]
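As a minimal illustrative sketch (our own example, anticipating the double-integrator benchmark of Section VI; the bound \texttt{U\_MAX} and the code below are assumptions, not the paper's implementation), the dynamics and a simple integration step can be written as:
\begin{verbatim}
import numpy as np

# Minimal sketch (assumed example): double-integrator state x = (position, velocity)
# with scalar control u restricted to the compact set U = [-U_MAX, U_MAX].
U_MAX = 1.0  # hypothetical control bound

def f(x: np.ndarray, u: float) -> np.ndarray:
    """Time-invariant dynamics x_dot = f(x, u) for the double integrator."""
    u = float(np.clip(u, -U_MAX, U_MAX))  # keep the control in the compact set U
    return np.array([x[1], u])            # (pos_dot, vel_dot) = (velocity, acceleration)

# One explicit-Euler step of length dt under a constant admissible control.
x0, u0, dt = np.array([0.5, -0.2]), 0.8, 0.01
x1 = x0 + dt * f(x0, u0)
\end{verbatim}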
Notation and Hamiltonians: The minimizing and maximizing Hamiltonians associated with the dynamics $f$ and running cost $h$ are
\[
H^-(t, x, p) := \min_{u \in U} \big\{ p \cdot f(x, u) + h(t, x, u) \big\},
\qquad
H^+(t, x, p) := \max_{u \in U} \big\{ p \cdot f(x, u) + h(t, x, u) \big\},
\]
where $p := \nabla_x V(t, x)$ denotes the spatial gradient of a candidate value function $V$.
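The following minimal numerical sketch (our own illustration; the dynamics, running cost, and sampled control grid are assumptions rather than the paper's definitions) approximates the minimizing Hamiltonian by enumerating a finite sample of the compact control set:
\begin{verbatim}
import numpy as np

# Assumed example: approximate H^-(t, x, p) = min_{u in U} { p . f(x, u) + h(t, x, u) }
# by sampling the compact control set U = [-1, 1] on a finite grid.

def f(x, u):
    return np.array([x[1], u])  # double-integrator placeholder dynamics

def h(t, x, u):
    # Lipschitz travel-cost-style placeholder: strictly negative near the origin, zero outside
    return -max(0.0, 0.25 - float(np.linalg.norm(x)))

def hamiltonian_min(t, x, p, num_samples=101):
    us = np.linspace(-1.0, 1.0, num_samples)  # finite sample of U
    return min(float(p @ f(x, u) + h(t, x, u)) for u in us)

H_val = hamiltonian_min(0.0, np.array([0.5, -0.2]), np.array([1.0, 0.3]))
\end{verbatim}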
Assumption 1 (Compact input set). The control set $U \subset \mathbb{R}^m$ is nonempty and compact.
Assumption 2 (Lipschitz dynamics in state). There exists $L_f > 0$ such that
\[
\| f(x, u) - f(y, u) \| \le L_f \, \| x - y \| \qquad \text{for all } x, y \in \mathbb{R}^n \text{ and } u \in U .
\]
Assumption 3 (Uniform growth bound). There exists $C_f > 0$ such that
\[
\| f(x, u) \| \le C_f \, \big( 1 + \| x \| \big) \qquad \text{for all } x \in \mathbb{R}^n \text{ and } u \in U .
\]
Assumption 4 (Continuity in the control). For each fixed $x \in \mathbb{R}^n$, the mapping $u \mapsto f(x, u)$ is continuous on $U$.
Assumption 5 (Time-dependent running cost regularity). The running cost $h : [0, T] \times \mathbb{R}^n \times U \to \mathbb{R}$ is measurable and satisfies: (i) (Lipschitz in $x$, uniform in $t, u$) there exists $L_h > 0$ such that
\[
| h(t, x, u) - h(t, y, u) | \le L_h \, \| x - y \| \qquad \text{for all } t \in [0, T],\ x, y \in \mathbb{R}^n,\ u \in U .
\]
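As a concrete illustrative example (our own construction, not taken from [13]) of a running cost compatible with Assumption 5 and with the travel-cost sign convention recalled in the introduction, let $g : \mathbb{R}^n \to \mathbb{R}$ be an $L_g$-Lipschitz function whose sublevel set $\{ g \le 0 \}$ is the target. Then
\[
h(t, x, u) := -\min\big\{ 1, \, \max\{ 0, \, -g(x) \} \big\}
\]
is $L_g$-Lipschitz in $x$ uniformly in $(t, u)$, vanishes wherever $g(x) \ge 0$ (i.e., off the target), and is strictly negative in the target interior.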
Discount weight: Fix a discount parameter $\lambda \in \mathbb{R}$ and define the discount weight
\[
e^{-\lambda (s - t)}, \qquad s \in [t, T] .
\]
Remark 1 (Discount and contraction). For the reinforcement-learning (RL) interpretation, we assume throughout that $\lambda > 0$. Then for all $s \in [t, T]$ the discount weight satisfies $0 < e^{-\lambda (s - t)} \le 1$, which underlies the contraction property of the Bellman iteration exploited in Section IV.
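For intuition, here is a schematic sketch in our own simplified notation (the precise one-step Bellman operator is defined in Section IV). Discretizing time with a step $\Delta t > 0$ gives the discrete discount factor $\gamma := e^{-\lambda \Delta t} \in (0, 1)$ for $\lambda > 0$, and any operator of the generic form
\[
(\mathcal{T} V)(x) := \min_{u \in U} \big\{ \ell(x, u) + \gamma \, V\big( x^+(x, u) \big) \big\}
\]
(with a per-step cost $\ell$ and a one-step successor map $x^+$, both placeholders here) satisfies
\[
\| \mathcal{T} V_1 - \mathcal{T} V_2 \|_\infty \le \gamma \, \| V_1 - V_2 \|_\infty
\]
for bounded $V_1, V_2$, which is the mechanism by which a certified Bellman residual is converted into a uniform value-error bound in Section IV.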