Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.
Reinforcement learning (RL) has achieved remarkable success across diverse domains such as robotics [7], manufacturing [35], recommender systems [1], healthcare [45], and even reasoning with large language models [47]. However, when applied to cyber-physical systems (CPS), such as autonomous driving [16], smart grids [26], and building energy management [46], conventional RL faces critical limitations. These systems tightly couple computation and physical processes, where unsafe actions can directly cause physical harm, equipment failure, or service disruption. Ensuring safety is therefore not only desirable but mandatory for real-world deployment. This requirement is further amplified by the inherent vulnerabilities of deep RL agents, which often lack natural robustness to environmental perturbations [24] and remain susceptible to adversarial threats [22].
In CPS applications, safety often involves multiple, hierarchical constraints rather than a single cost signal. For instance, in autonomous driving, an agent must first avoid collisions (primary safety), then obey traffic regulations (secondary safety), and finally optimize fuel efficiency or passenger comfort (performance). Violating this hierarchy, e.g., prioritizing comfort over collision avoidance, is unacceptable. This multi-level safety dependency motivates a lexicographic structure, where safety objectives are optimized sequentially according to their criticality before performance is considered. Nevertheless, existing safe RL approaches rarely capture such safety hierarchies, treating safety and performance as jointly optimized under a single constraint.
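To make this hierarchy precise, a standard lexicographic formulation (the notation here is illustrative; $V_{c_i}^{\pi}$ denotes the expected cumulative $i$-th cost of policy $\pi$ and $V_r^{\pi}$ its expected return) restricts optimization at each level to policies that are near-optimal for all higher-priority costs:
\[
\Pi_0 = \Pi, \qquad
\Pi_i = \Bigl\{\pi \in \Pi_{i-1} \,:\, V_{c_i}^{\pi} \le \min_{\pi' \in \Pi_{i-1}} V_{c_i}^{\pi'} + \varepsilon_i\Bigr\} \ \ (i = 1,\dots,m), \qquad
\pi^{\star} \in \operatorname*{arg\,max}_{\pi \in \Pi_m} V_r^{\pi},
\]
where the slacks $\varepsilon_i \ge 0$ control how strictly each safety level is enforced; setting $\varepsilon_i = 0$ recovers the strict hierarchy.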
In practice, direct online interaction for learning safe behavior in CPS is costly and risky, as unsafe exploration can lead to physical damage or system instability. This has motivated the study of offline safe RL [4,41], where policies are trained from pre-collected datasets without further environment interaction. However, this setting introduces several challenges. Offline datasets often contain mixed or unsafe trajectories, complicating the identification of safe behaviors [18]. Furthermore, estimation errors in long-term cost and value functions may yield infeasible or overly conservative policies. While dual-variable or constrained formulations [13,17,51] attempt to balance safety and performance, they often suffer from optimization instability and lack interpretability. Although theoretical sample-complexity bounds have been established separately for safe RL [11] and offline RL [12], analogous guarantees for offline safe RL, particularly under hierarchical safety objectives, remain underexplored. These limitations motivate the central question of this work: How can we ensure hierarchical safety guarantees in offline reinforcement learning for cyber-physical systems, while still achieving near-optimal task performance?

Recently, a few studies [37,48] have explored lexicographic ordering to model hierarchical objectives. However, existing methods primarily target online settings, where safety and performance are optimized through continual environment exploration. Such approaches lack theoretical sample-complexity guarantees and are typically limited to single-cost evaluations, making them difficult to deploy in safety-critical CPS domains that demand strict offline learning and multiple safety hierarchies.

Contributions. To address these gaps, we leverage the lexicographic order, which is of independent interest in the recent multiobjective RL literature [37,43]. We introduce LexiSafe, illustrated in Figure 1, which addresses the fundamental tension between safety and performance in offline RL through a lexicographic framework with multi-phase training. Unlike prior methods that relax constraints or sequentially train separate safety/performance models, LexiSafe unifies safety and performance by treating safety as a nonnegotiable priority (one or multiple lexicographic safety objectives) and performance as a secondary goal, ensuring policy updates never violate learned safety boundaries. In particular, the multi-phase optimization minimizes each cost in turn, enforcing the hierarchical safety priorities before reward maximization (a toy sketch of this phase ordering is given below). We theoretically ground this mechanism with the first sample-complexity bounds for lexicographic safe RL. Empirically, LexiSafe outperforms constrained baselines on the DSRL benchmark across robotic manipulation and autonomous driving tasks, strictly enforcing safety while converging faster. The main contributions are summarized as follows: (1) We propose LexiSafe (both LexiSafe-SC and LexiSafe-MC, where SC and MC denote single-cost and multi-cost), a novel framework that hierarchically separates safety constraints from performance optimization, ensuring safety violations are eliminated after initial convergence; (2) We formally establish constraint-violation and performance-suboptimality bounds for both LexiSafe-SC and LexiSafe-MC, which together yield sample-complexity guarantees for lexicographic offline safe RL; (3) We empirically validate LexiSafe on the DSRL benchmark, demonstrating fewer safety violations and improved task performance over constrained offline baselines.
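To make the multi-phase ordering concrete, the following minimal, self-contained sketch (a hypothetical setup for illustration, not the paper's implementation) applies lexicographic filtering to offline estimates of two safety costs and a return: each safety phase shrinks the candidate set to near-cost-optimal policies before the final performance phase selects among the survivors.

```python
import numpy as np

# Toy lexicographic selection over candidate policies (hypothetical
# setup for illustration; not the paper's reference implementation).
rng = np.random.default_rng(0)
n = 100
est_cost1 = rng.uniform(size=n)    # offline estimate: primary safety cost
est_cost2 = rng.uniform(size=n)    # offline estimate: secondary safety cost
est_return = rng.uniform(size=n)   # offline estimate: task return

eps = 0.05                         # slack per safety level
feasible = np.arange(n)
# Safety phases: shrink the candidate set level by level, in priority order.
for cost in (est_cost1, est_cost2):
    best = cost[feasible].min()
    feasible = feasible[cost[feasible] <= best + eps]
# Performance phase: maximize return among the surviving candidates only.
chosen = feasible[np.argmax(est_return[feasible])]
print(f"policy {chosen}: cost1={est_cost1[chosen]:.3f}, "
      f"cost2={est_cost2[chosen]:.3f}, return={est_return[chosen]:.3f}")
```

The slack eps plays the same role as the tolerances in the formulation above: a larger value loosens a safety level in exchange for a richer feasible set at the next level, while eps = 0 enforces the strict hierarchy.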