Interacting safely with cyclists using Hamilton-Jacobi reachability and reinforcement learning


In this paper, we present a framework for enabling autonomous vehicles to interact with cyclists in a manner that balances safety and optimality. The approach integrates Hamilton-Jacobi reachability analysis with deep Q-learning to jointly address safety guarantees and time-efficient navigation. A value function is computed as the solution to a time-dependent Hamilton-Jacobi-Bellman inequality, providing a quantitative measure of safety for each system state. This safety metric is incorporated as a structured reward signal within a reinforcement learning framework. The method further models the cyclist’s latent response to the vehicle, allowing disturbance inputs to reflect human comfort and behavioral adaptation. The proposed framework is evaluated through simulation and comparison with human driving behavior and an existing state-of-the-art method.


💡 Research Summary

The paper proposes a novel framework that combines Hamilton‑Jacobi (HJ) reachability analysis with a deep Q‑network (DQN) to enable autonomous vehicles (AVs) to interact safely and efficiently with cyclists. Existing approaches either rely solely on conservative safety guarantees, which lead to overly cautious behavior, or on reinforcement learning (RL) that lacks formal safety assurances. By integrating the two, the authors aim to achieve a balance between safety (collision avoidance) and optimality (minimizing travel time).

Problem formulation
The interaction is modeled as a two‑agent zero‑sum differential game: the AV provides the control input $u$ while the cyclist is treated as a disturbance input $d$. The state vector $s = (\Delta x, \Delta v, \Delta y)$ captures the longitudinal distance, relative speed, and lateral offset of the cyclist with respect to the vehicle. The objective is to maximize a value functional that penalizes proximity to a predefined collision set $C_0$ (a 1 m radius circle) while encouraging rapid goal attainment.
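To make the game formulation concrete, the relative state can be evolved with a simple kinematic model. This is an illustrative sketch under assumed dynamics (the paper does not spell out its exact equations of motion); the control and disturbance components chosen here are assumptions.

```python
import numpy as np

def relative_dynamics(s, u, d):
    """Time derivative of the relative state s = (dx, dv, dy).

    Assumed simple kinematics, for illustration only:
      u = (a_av, vy_av): AV longitudinal acceleration and lateral speed (control)
      d = (a_cyc, vy_cyc): cyclist acceleration and lateral speed (disturbance)
    """
    dx, dv, dy = s
    a_av, vy_av = u
    a_cyc, vy_cyc = d
    return np.array([
        dv,              # longitudinal gap changes with relative speed
        a_cyc - a_av,    # relative speed changes with relative acceleration
        vy_cyc - vy_av,  # lateral offset changes with relative lateral speed
    ])
```

In the zero-sum game, the AV picks `u` to keep the state away from $C_0$ while the disturbance `d` ranges over the cyclist's possible behaviors.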

HJ safety analysis
A time‑dependent Hamilton‑Jacobi‑Bellman (HJB) partial differential equation is solved over a three‑dimensional grid to obtain the backward reachable set (BRS) $B$ and a scalar safety value function $v(s,t)$. Positive $v$ indicates that a collision can be avoided with appropriate control; non‑positive $v$ signals inevitable collision. The BRS (states that may lead to collision) and the safe complement $R = (B \cup C_0)^c$, i.e. the states outside both the BRS and the collision set, provide a mathematically rigorous safety envelope. However, staying exclusively within $R$ would make the AV excessively conservative, preventing timely goal achievement.
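The sign convention above can be sketched with a toy stand-in for the value function. Here the signed distance to the 1 m collision disc $C_0$ plays the role of $v$; the real $v(s,t)$ comes from the grid-based HJB solve and also accounts for dynamics, which this sketch does not.

```python
import numpy as np

def safety_value(s, radius=1.0):
    """Toy surrogate for the HJ value function: signed distance of the
    relative position (dx, dy) to the 1 m collision disc C0.
    (The actual v(s, t) is computed by solving the HJB PDE on a grid.)"""
    dx, _, dy = s
    return np.hypot(dx, dy) - radius

def is_safe(s):
    # Positive value: collision avoidable; non-positive: inside the unsafe set.
    return safety_value(s) > 0.0
```

A controller that acted only on `is_safe` would reproduce the conservatism noted above; the point of the paper is to use the value as a graded signal instead.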

Modeling cyclist comfort as disturbance
To capture human‑centric aspects, the authors introduce a latent “comfort” variable that modulates the disturbance set $D$. Using the Safety Pilot Michigan naturalistic driving dataset (≈34 M miles, 40,539 cyclist events), each state is labeled safe or dangerous based on the HJ value. An auto‑encoder is trained on safe states to learn a compact representation; its reconstruction error on mixed (safe + dangerous) data serves as an anomaly detector. A high reconstruction error flags a dangerous state, which is then mapped to a disturbance adjustment $w(\Delta x, \Delta v, \Delta y, \Delta a)$. This mapping shrinks the disturbance set in situations where the cyclist is likely to feel uncomfortable, thereby embedding human comfort into the safety analysis.
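The train-on-safe / flag-by-reconstruction-error pattern can be sketched as follows. The paper uses a neural auto-encoder; to keep the sketch self-contained, a linear auto-encoder (equivalent to PCA) stands in for it, and the threshold value is an assumption.

```python
import numpy as np

def fit_linear_autoencoder(X_safe, k=2):
    """Fit a linear auto-encoder (PCA) on safe states only.
    W holds the top-k principal directions; mu is the data mean."""
    mu = X_safe.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_safe - mu, full_matrices=False)
    return mu, Vt[:k]

def reconstruction_error(x, mu, W):
    z = W @ (x - mu)          # encode into the k-dim latent space
    x_hat = mu + W.T @ z      # decode back to the state space
    return np.linalg.norm(x - x_hat)

def is_dangerous(x, mu, W, threshold):
    # High error: the state lies off the manifold of safe training data.
    return reconstruction_error(x, mu, W) > threshold
```

States flagged this way would then feed the mapping $w$ that adjusts the disturbance set.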

Reinforcement learning integration
The safety value $v(s,t)$ is transformed into a structured reward that is added to the conventional RL reward (e.g., progress toward goal, control effort). A DQN with experience replay is trained on a discretized Markov decision process where the state consists of the three relative measurements. The loss is the mean‑squared error between predicted Q‑values and target Q‑values computed using the Bellman update with discount factor $\gamma$. The combined reward encourages the policy to remain in safe regions (high HJ value) while also seeking time‑optimal trajectories.
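The reward shaping and the DQN loss described above can be written compactly. This is a minimal sketch: the weighting `alpha` is an assumed hyperparameter (the paper's exact reward composition is not reproduced here), and the batched Bellman target is the standard DQN form.

```python
import numpy as np

def shaped_reward(r_task, hj_value, alpha=1.0):
    """Task reward (progress, control effort) plus the HJ safety value,
    weighted by an assumed coefficient alpha."""
    return r_task + alpha * hj_value

def dqn_targets(q_next, rewards, dones, gamma=0.99):
    """Bellman targets r + gamma * max_a' Q(s', a'), with the bootstrap
    term zeroed at terminal transitions."""
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

def mse_loss(q_pred, targets):
    # Mean-squared error between predicted and target Q-values.
    return np.mean((q_pred - targets) ** 2)
```

Minimizing `mse_loss` over replayed transitions drives the Q-network toward the shaped-reward optimum, so safe, time-efficient states accumulate higher value.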

Experimental setup

  1. Data extraction – Cyclist events are identified from Mobileye sensor logs; filtering removes events with longitudinal range > 50 m, duration < 1 s, or cyclists on the right side of the road.
  2. HJ computation – Ian Mitchell’s Level Set Toolbox is used to compute the BRS once, based on the relative state formulation and a 1 m collision radius.
  3. Disturbance modeling – Synthetic states are generated to augment the scarce dangerous samples; the auto‑encoder is trained on safe data and evaluated on the full set to obtain the mapping (w).
  4. DQN training – The network receives the relative state and the HJ‑derived safety reward; training proceeds until convergence.
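The event-filtering rules in step 1 can be sketched as a predicate. The field names (`range_m`, `duration_s`, `side`) are hypothetical; only the thresholds come from the summary above.

```python
def keep_event(event):
    """Keep a cyclist event only if it passes the stated filters:
    longitudinal range <= 50 m, duration >= 1 s, and the cyclist
    is not on the right side of the road."""
    return (event["range_m"] <= 50.0
            and event["duration_s"] >= 1.0
            and event["side"] != "right")
```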

Results
Compared against a pure HJ‑based controller and a baseline DQN without safety shaping, the proposed method reduces average cyclist‑passing time by roughly 15 % while maintaining collision rates comparable to the pure HJ controller. Trajectories exhibit lateral distances and speeds similar to those observed in human driver data, indicating a more natural interaction.

Strengths

  • Formal safety guarantees from HJ analysis combined with the adaptability of RL.
  • Introduction of a cyclist‑comfort model that personalizes the disturbance set.
  • Use of a large naturalistic dataset for realistic scenario generation.

Limitations

  • HJ computation is limited to a 3‑D state space; scaling to higher‑dimensional multi‑agent scenarios (multiple vehicles, pedestrians) remains challenging.
  • Auto‑encoder based rare‑event detection suffers from class imbalance; mis‑classification could lead to unsafe disturbance modeling.
  • Validation is confined to simulation; real‑world deployment, sensor noise, and communication delays are not addressed.

Future work suggested includes: (i) employing approximate HJ methods or decomposition techniques to handle higher‑dimensional systems, (ii) improving rare‑event sampling or using cost‑sensitive learning for the disturbance model, and (iii) conducting on‑vehicle experiments to assess robustness under real traffic conditions.

In summary, the paper presents a compelling hybrid safety‑optimal control framework that bridges rigorous reachability analysis with data‑driven reinforcement learning, offering a promising direction for safe and human‑aware autonomous navigation around cyclists.

