Safety-Critical Reinforcement Learning with Viability-Based Action Shielding for Hypersonic Longitudinal Flight
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This paper presents a safety-critical reinforcement learning framework for nonlinear dynamical systems with continuous state and input spaces operating under explicit physical constraints. Hard safety constraints are enforced independently of the reward through action shielding and reachability-based admissible action sets, ensuring that unsafe behaviors are never intentionally selected during learning or execution. To capture nominal operation and recovery behavior within a single control architecture, the state space is partitioned into safe and unsafe regions based on membership in a safety box, and a mode-dependent reward is used to promote accurate tracking inside the safe region and recovery toward it when operating outside. To enable online tabular learning on continuous dynamics, a finite-state abstraction is constructed via state aggregation, and action selection and value updates are consistently restricted to admissible actions. The framework is demonstrated on a longitudinal point-mass hypersonic vehicle model with aerodynamic and propulsion couplings, using angle of attack and throttle as control inputs.


💡 Research Summary

This paper introduces a safety‑critical reinforcement learning (RL) framework tailored for nonlinear dynamical systems with continuous state and input spaces, exemplified by a longitudinal point‑mass hypersonic vehicle. The central challenge addressed is the need to guarantee hard physical constraints (e.g., altitude, speed, thermal load, structural load) during both learning and execution, without relying solely on reward shaping or episode termination.

The authors first partition the state space into a “safety box” (the nominal operating region) and an unsafe complement. A mode‑dependent reward is defined: inside the safety box, the reward penalizes tracking error of altitude, velocity, and flight‑path angle; outside, a recovery reward encourages actions that drive the state back into the safety box. This hybrid reward does not alter the system dynamics or admissible action set, allowing a single constrained Markov decision process (MDP) to capture both nominal and contingency behavior.
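The mode switch described above can be sketched as a small reward function. All numeric values below (box bounds, reference state, weights, the recovery shaping) are illustrative placeholders, not taken from the paper:

```python
import numpy as np

# Hypothetical safety-box bounds, reference, and tracking weights for the
# (h [m], V [m/s], gamma [rad]) subspace -- illustrative values only.
BOX_LO = np.array([24000.0, 1700.0, -0.05])
BOX_HI = np.array([26000.0, 1900.0,  0.05])
REF    = np.array([25000.0, 1800.0,  0.0])
W      = np.array([1e-4, 1e-2, 10.0])   # per-component penalty weights

def in_safety_box(x):
    """Membership test for the safety box (nominal operating region)."""
    return bool(np.all(x >= BOX_LO) and np.all(x <= BOX_HI))

def mode_dependent_reward(x):
    """Tracking reward inside the box; recovery reward outside.

    Outside the box, one simple shaping choice is to penalize the
    componentwise distance to the box, so actions that shrink that
    distance are preferred. The dynamics and admissible action set are
    untouched -- only the reward switches with the mode.
    """
    if in_safety_box(x):
        return -float(W @ np.abs(x - REF))          # tracking mode
    dist = np.maximum(BOX_LO - x, 0.0) + np.maximum(x - BOX_HI, 0.0)
    return -1.0 - float(W @ dist)                   # recovery mode
```

The constant `-1.0` offset keeps every out-of-box reward strictly worse than the in-box optimum, so recovery is always incentivized.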

To enforce safety independently of the reward, the paper proposes a viability‑based action shielding mechanism. An offline reachability analysis is performed on a discretized abstraction of the continuous dynamics. The continuous state space is binned into a finite set of abstract states; for each abstract state, a fixed‑point computation determines the set of actions that guarantee forward‑invariance of the feasible abstract state set, i.e., actions that keep the system within the safety envelope for all future steps. This yields a state‑dependent admissible action set that eliminates not only instantaneously unsafe actions but also those that would inevitably lead to a non‑viable region. The admissible set is precomputed and stored, enabling real‑time masking of unsafe actions during both learning and deployment.
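The fixed-point computation on the finite abstraction can be sketched as follows, assuming (for simplicity) a deterministic abstract transition map `T[s, a]`; a set-valued successor map generalizes this by requiring all successors to remain viable:

```python
import numpy as np

def viability_mask(T, feasible):
    """Fixed-point viability computation on a finite abstraction.

    T[s, a]     -- successor abstract state for state s under action a.
    feasible[s] -- True if abstract state s lies inside the safety envelope.

    Returns admissible[s, a]: True iff taking a from s keeps the system in
    the viable (forward-invariant) subset of the feasible set forever.
    """
    viable = feasible.copy()
    while True:
        succ_ok = viable[T]                        # (n_s, n_a): successor viable?
        # a state stays viable only if it is feasible and retains at least
        # one action whose successor is viable
        new_viable = feasible & succ_ok.any(axis=1)
        if np.array_equal(new_viable, viable):
            # mask out all actions from non-viable states
            return succ_ok & new_viable[:, None]
        viable = new_viable
```

Note that the mask removes more than instantaneously unsafe actions: an action into a feasible state whose every continuation eventually leaves the envelope is also pruned, because that state drops out of the viable set during the iteration.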

Learning proceeds with a tabular Q‑learning algorithm that respects the admissible mask at every step. Both action selection (ε‑greedy) and Bellman updates are restricted to actions belonging to the admissible set for the current abstract state. To improve control smoothness, a neighborhood‑based local action selection scheme is added: among admissible actions, those closest to the previously applied command are preferred, reducing abrupt command changes while preserving safety guarantees.
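A minimal sketch of the mask-consistent selection and update steps is shown below. The neighborhood `radius` and its tie-breaking are assumptions for illustration, not the paper's exact smoothing scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_epsilon_greedy(Q, s, admissible, prev_a=None, eps=0.1, radius=1):
    """Epsilon-greedy selection restricted to admissible actions.

    Among admissible actions, those within `radius` of the previously
    applied command index are preferred, reducing abrupt command changes
    (illustrative neighborhood scheme, assumed here).
    """
    acts = np.flatnonzero(admissible[s])
    if prev_a is not None:
        near = acts[np.abs(acts - prev_a) <= radius]
        if near.size > 0:
            acts = near
    if rng.random() < eps:
        return int(rng.choice(acts))               # explore within the mask
    return int(acts[np.argmax(Q[s, acts])])        # exploit within the mask

def masked_q_update(Q, s, a, r, s2, admissible, alpha=0.1, gamma=0.99):
    """Bellman update whose bootstrap max ranges only over admissible actions."""
    acts2 = np.flatnonzero(admissible[s2])
    target = r + gamma * np.max(Q[s2, acts2])
    Q[s, a] += alpha * (target - Q[s, a])
```

Because both exploration and the bootstrap target respect the same mask, the Q-table never assigns credit through actions that the shield would forbid at deployment.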

Because recovery from unsafe states may require long horizons, the authors introduce a constraint‑aware episode chaining mechanism. The terminal state of one episode becomes the initial state of the next, allowing the agent to experience extended recovery trajectories. Chaining is disabled if a hard‑constraint violation occurs, preventing the propagation of infeasible initial conditions.
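The chaining logic reduces to a short loop; the function names and the `(next_state, violated)` step signature below are assumptions for illustration:

```python
def chain_episodes(reset, step, policy, n_episodes, horizon):
    """Constraint-aware episode chaining (illustrative sketch).

    The terminal state of one episode seeds the next, so the agent can
    experience long recovery trajectories. If a hard-constraint violation
    occurs, chaining is disabled and the next episode starts from a fresh
    reset, so infeasible initial conditions never propagate.
    """
    x = reset()
    starts = []                       # record each episode's initial state
    for _ in range(n_episodes):
        starts.append(x)
        violated = False
        for _ in range(horizon):
            x, v = step(x, policy(x))
            violated = violated or v
        if violated:
            x = reset()               # break the chain after a violation
    return starts
```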

The framework is evaluated on a high‑fidelity hypersonic longitudinal model. The state vector consists of altitude (h), velocity (V), flight‑path angle (γ), and mass (m). Control inputs are angle of attack (α) and throttle (δ). The dynamics incorporate altitude‑dependent gravity, a piecewise standard‑atmosphere model for density and speed of sound, Mach‑dependent lift and drag coefficients, and altitude‑Mach maps for maximum thrust and specific impulse. Thermal constraints are modeled with a simple heating proxy proportional to ρ V³, and structural load limits are expressed as bounds on the normal load factor.
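The two path-constraint checks can be sketched directly from the model description. The exponential atmosphere, heating coefficient, and limit values below are placeholders (the paper uses a piecewise standard atmosphere, and its actual limits are not reproduced here):

```python
import numpy as np

def density(h):
    """Exponential-atmosphere stand-in for the paper's piecewise standard
    atmosphere (1.225 kg/m^3 at sea level, ~7.2 km scale height)."""
    return 1.225 * np.exp(-h / 7200.0)

def constraints_ok(h, V, n_load, q_max=5e6, n_max=3.0, k_heat=1.0):
    """Check the heating proxy and structural load constraints.

    Heating is modeled as k * rho * V^3 (proportional to rho * V^3, as in
    the paper); k_heat, q_max, and n_max are placeholder values.
    """
    q_dot = k_heat * density(h) * V**3
    return bool(q_dot <= q_max) and bool(abs(n_load) <= n_max)
```

Checks like these define the feasible abstract states used by the offline reachability analysis: any binned state violating them is marked infeasible before the viability fixed point is computed.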

Simulation scenarios include (i) initial conditions inside the safety box, (ii) initial conditions outside the box, and (iii) sudden disturbances that push the vehicle into unsafe regions. Results show that the admissible‑action mask enforces hard constraints with zero violations across all trials. Within the safety box, the learned policy tracks the desired altitude, speed, and flight‑path angle with average errors below 2%. When the vehicle starts outside the box, the policy selects only admissible actions that steer the state back into the safety region, achieving recovery in 5–7 seconds of simulated time. The presence of the shield modestly slows convergence (≈10–15% slower than an unconstrained tabular learner) but does not impede learning; the agent still discovers a near‑optimal policy that respects all constraints.

In summary, the paper demonstrates that a combination of offline reachability analysis, state‑dependent admissible action masking, and mask‑consistent tabular RL yields a practical, provably safe learning architecture for high‑speed aerospace systems. The approach avoids the need for complex dual‑optimization or barrier‑function gradients, making it attractive for deployment on real hardware where computational resources and safety certification are critical. Future work is suggested in extending the method to higher‑dimensional multi‑axis control, integrating deep function approximators while preserving hard‑constraint guarantees, and conducting hardware‑in‑the‑loop experiments on actual hypersonic testbeds.
