Adaptive Shielding for Safe Reinforcement Learning under Hidden-Parameter Dynamics Shifts


Unseen shifts in environment dynamics, driven by hidden parameters such as friction or gravity, pose a challenge for maintaining safety. We address this challenge with Adaptive Shielding, a framework for safe reinforcement learning in constrained hidden-parameter Markov decision processes. A function encoder infers a low-dimensional representation of the underlying dynamics online from transition data, allowing the shield to adapt. To ensure safety during this process, we use a two-layer strategy. First, we introduce safety-regularized optimization that proactively trains the policy away from high-cost regions. Second, the adaptive shield reactively uses the inferred dynamics to forecast safety risks and applies uncertainty-aware bounds from conformal prediction to filter out unsafe actions. We prove that the shield's prediction errors translate into bounds on the average cost rate. Empirically, across Safe-Gym benchmarks with varying hidden parameters, our approach outperforms baselines on the return-safety trade-off and generalizes reliably to unseen dynamics, while incurring only modest execution-time overhead. Code is available at https://github.com/safe-autonomy-lab/AdaptiveShieldingFE.


💡 Research Summary

The paper tackles the problem of maintaining safety in reinforcement learning (RL) when the underlying environment dynamics shift due to hidden parameters such as friction, mass distribution, or gravity. These hidden parameters are not observed directly, but they affect the transition function, leading to a class of problems the authors formalize as Constrained Hidden‑parameter Markov Decision Processes (CHiP‑MDPs). A CHiP‑MDP extends the standard constrained MDP by introducing a latent parameter space Φ with a prior distribution PΦ, and the transition dynamics T(s,a,φ) depend on the hidden φ. The objective is to maximize expected discounted reward while keeping the long‑run average cost (a safety metric) below a user‑specified threshold δ.
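As a concrete illustration of the CHiP-MDP tuple described above, the sketch below models a 1-D point mass whose friction coefficient φ is hidden. All names, the dynamics, and the cost function are hypothetical stand-ins chosen for clarity, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

# Hypothetical encoding of the CHiP-MDP tuple (S, A, T, R, C, Phi, P_Phi, delta).
@dataclass
class CHiPMDP:
    sample_phi: Callable[[], np.ndarray]    # draws phi from the prior P_Phi
    transition: Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray]  # T(s, a, phi)
    reward: Callable[[np.ndarray, np.ndarray], float]
    cost: Callable[[np.ndarray, np.ndarray], float]
    delta: float                            # threshold on the long-run average cost

# Toy instance: a 1-D point mass whose friction coefficient phi is hidden.
env = CHiPMDP(
    sample_phi=lambda: np.random.uniform(0.1, 1.0, size=1),
    transition=lambda s, a, phi: s + a - phi * s,   # dynamics depend on hidden phi
    reward=lambda s, a: -float(abs(s[0])),
    cost=lambda s, a: float(abs(s[0]) > 1.0),       # unit cost when leaving the safe set
    delta=0.05,
)

phi = env.sample_phi()                  # hidden from the agent; shapes the transitions
s = np.array([0.5])
s_next = env.transition(s, np.array([0.1]), phi)
```

The agent never observes `phi` directly; it only sees the resulting transitions, which is exactly the information the function encoder later exploits.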

To address this, the authors propose Adaptive Shielding, a two‑layer safety framework that operates both during training and at execution time.

  1. Safety‑Regularized Optimization (SRO) – During policy learning, they augment the usual RL objective with a safety regularizer. For each state‑action pair they compute a cost‑sensitivity score Q^π_safe(s, a, b_φ), which integrates the expected cumulative cost Q^π_C over a small neighbourhood of the action, weighted by the policy's local action density. This score is negative, with values near zero indicating low risk and values near −1 indicating high risk. The augmented action‑value is defined as Q^π_aug = Q^π_R + α·Q^π_safe, where α controls the trade‑off between reward and safety. The gradient of the safety term is a standard policy gradient (∇_θ log π_θ · Q^π_safe), encouraging the policy to shift probability mass away from locally risky actions. The authors provide a proposition showing that maximizing this augmented objective is consistent with the goal of zero‑violation policies.
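A minimal numeric sketch of the SRO update, assuming a scalar Gaussian policy π_θ(a|s) = N(θ·s, σ²). The policy form and the `q_safe` values are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the augmented objective Q_aug = Q_R + alpha * Q_safe and the
# REINFORCE-style gradient of the safety term, for a 1-D Gaussian policy.
def augmented_q(q_r, q_safe, alpha):
    """Q_safe is negative (near -1 = risky); alpha trades reward vs. safety."""
    return q_r + alpha * q_safe

def sro_gradient(theta, s, a, q_safe, sigma=0.1):
    """grad_theta log pi_theta(a|s) * Q_safe for pi_theta(a|s) = N(theta*s, sigma^2)."""
    grad_log_pi = (a - theta * s) * s / sigma**2
    return grad_log_pi * q_safe

# A risky action above the policy mean (q_safe near -1) yields a negative
# gradient, so an ascent step moves theta away from producing that action.
g = sro_gradient(theta=1.0, s=1.0, a=1.5, q_safe=-0.9)
```

The sign behaviour is the point: probability mass is pushed away from locally risky actions, exactly as the policy-gradient term above describes.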

  2. Online Hidden‑parameter Adaptation via Function Encoders (FE) – The second component supplies a compact, online‑learned model of the dynamics. A function encoder represents the transition function as a linear combination of neural‑network basis functions {g_i}_{i=1..k}: T_φ(s, a) ≈ Σ_i b_i g_i(s, a). The coefficients b_φ are inferred in real time by solving a least‑squares problem on the most recent transition tuples (s, a, s′). Because the representation is low‑dimensional, it can be updated quickly without retraining the entire dynamics model, and the same b_φ is fed both to the policy (as context) and to the safety shield.
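The inference step above reduces to ordinary least squares, which the following sketch makes concrete. Random tanh features stand in for the paper's learned neural bases; the shapes and names are assumptions for illustration only.

```python
import numpy as np

# Function-encoder sketch: the transition model is a linear combination of k
# fixed basis functions, and the coefficients b_phi are recovered in closed
# form by least squares from recent transition tuples (s, a, s').
rng = np.random.default_rng(0)
k, d_in, d_out = 8, 3, 2                 # #bases, dim(s,a), dim(s')
W = rng.normal(size=(k, d_in, d_out))    # frozen (stand-in) basis weights

def basis(sa):
    """Evaluate all k basis functions g_i at (s, a); returns shape (k, d_out)."""
    return np.tanh(sa @ W)               # batched matmul over the k bases

def infer_coefficients(transitions):
    """Solve min_b sum_t || G(s_t, a_t)^T b - s'_t ||^2 in closed form."""
    A = np.vstack([basis(sa).T for sa, _ in transitions])      # (n*d_out, k)
    y = np.concatenate([s_next for _, s_next in transitions])  # (n*d_out,)
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

def predict(sa, b):
    """Predicted next state: sum_i b_i * g_i(s, a)."""
    return basis(sa).T @ b

# Simulate a hidden-parameter environment that truly lies in the basis span,
# then check that least squares recovers its coefficient vector.
b_true = rng.normal(size=k)
transitions = [(sa, predict(sa, b_true))
               for sa in rng.normal(size=(20, d_in))]
b_hat = infer_coefficients(transitions)
```

Because only the k-dimensional vector `b_hat` is refit online, adaptation is a single small linear solve rather than a retraining run, which is what keeps the shield's overhead modest.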

  3. Adaptive Shield with Uncertainty Awareness – At execution time, the policy proposes a set of candidate actions. For each candidate, the FE predicts the next state, and a prediction error ε_t is recorded. The authors apply conformal prediction (CP) to construct a high‑confidence region Γ_t that covers the true next state with probability 1 − δ. This region serves as a safety margin: any candidate whose prediction region Γ_t intersects the unsafe set is rejected. The CP threshold is made adaptive (Adaptive CP) to handle non‑stationary data streams. The shield therefore filters out unsafe actions in a principled, distribution‑free way, while allowing safe actions to be executed.

Theoretical Guarantees – The paper proves that the average cost rate ξ^π is bounded linearly by the FE's prediction error combined with the CP margin: ξ^π ≤ c1·ε_t + c2, where c1 and c2 depend on Lipschitz constants of the basis functions and the CP calibration. This links model‑prediction quality directly to safety performance, providing a formal justification for the shield.

Empirical Evaluation – Experiments are conducted on the Safe‑Gym suite (six environments such as CartPole‑Safe, Drone‑Hover, Walker2d‑Safe) with systematic variations of hidden parameters (e.g., friction coefficients from 0.1 to 1.0, gravity scaling, payload mass). Baselines include PPO‑CPO, Lagrangian RL, standard shielding (Alshiekh et al.), and recent context‑aware methods. Metrics reported are average return, average cost rate, number of safety violations, and runtime overhead. Adaptive Shielding achieves a 30‑70 % reduction in average cost rate while improving returns by 5‑12 % compared to the best baseline. Notably, in out‑of‑distribution settings (parameter values never seen during training) the method maintains cost rates below the safety threshold, demonstrating robust generalization. The additional computation (FE update + CP evaluation) incurs roughly a 1.8× slowdown on a single CPU core, which the authors argue is acceptable for many real‑time control tasks. Ablation studies confirm that both SRO and the adaptive shield are necessary: removing SRO raises cost rates to ~0.15, removing the shield raises them to ~0.12, while the full system stays below 0.05.

Strengths and Limitations – The main contributions are: (i) a novel safety‑regularized objective that steers policy learning toward low‑cost regions without sacrificing reward, (ii) a lightweight function‑encoder that enables rapid online adaptation to hidden dynamics, (iii) a conformal‑prediction‑based shield that provides distribution‑free safety guarantees, and (iv) a theoretical link between model error and safety performance. Limitations include the reliance on a sufficiently expressive set of basis functions (the FE may struggle with highly nonlinear dynamics if k is too small), potential conservatism of the CP margin (over‑filtering can hinder exploration), and the current focus on single‑agent continuous control.

Future Directions – The authors suggest extending the framework to multi‑agent settings where shields could be shared, developing adaptive basis‑function selection or compression to reduce FE overhead, and integrating richer uncertainty models (e.g., Bayesian neural networks) to further tighten CP intervals.

In summary, Adaptive Shielding offers a principled, empirically validated solution for safe RL under hidden‑parameter shifts, marrying proactive policy regularization with reactive, uncertainty‑aware action filtering, and establishing a clear theoretical connection between model fidelity and safety outcomes.

