Training and Simulation of Quadrupedal Robot in Adaptive Stair Climbing for Indoor Firefighting: An End-to-End Reinforcement Learning Approach
Quadruped robots are used for primary searches during the early stages of indoor fires. A typical primary search involves quickly and thoroughly searching for victims under hazardous conditions and monitoring flammable materials. However, situational awareness in complex indoor environments and rapid climbing across different staircase types remain the main challenges for robot-assisted primary searches. In this project, we designed a two-stage end-to-end deep reinforcement learning (RL) approach that jointly optimizes navigation and locomotion. In the first stage, a quadruped robot, the Unitree Go2, was trained to climb stairs in Isaac Lab’s pyramid-stair terrain. In the second stage, the robot was trained on various realistic indoor staircases in Isaac Lab, with the policy learned in the first stage transferred as initialization. These indoor staircases are straight, L-shaped, and spiral, supporting climbing tasks in complex environments. This project explores how to balance navigation and locomotion and how end-to-end RL methods can enable quadrupeds to adapt to different stair shapes. Our main contributions are: (1) a two-stage end-to-end RL framework that transfers stair-climbing skills from abstract pyramid terrain to realistic indoor stair topologies; (2) a centerline-based navigation formulation that enables unified learning of navigation and locomotion without hierarchical planning; (3) a demonstration of policy generalization across diverse staircases using only local height-map perception; and (4) an empirical analysis of success, efficiency, and failure modes under increasing stair difficulty.
💡 Research Summary
This paper addresses a critical gap in robotic firefighting: enabling quadrupedal robots to rapidly and reliably ascend a wide variety of indoor staircases during the primary search phase of a fire incident. The authors propose a two‑stage, end‑to‑end deep reinforcement learning (RL) framework that jointly learns navigation and locomotion for the Unitree Go2 platform using NVIDIA’s Isaac Lab simulation environment.
Stage 1 – Abstract Pyramid‑Stair Training
The first stage uses a synthetic “pyramid‑stair” terrain provided by Isaac Lab. The terrain consists of a flat central platform surrounded by a set of stairs that taper outward, allowing a curriculum that gradually increases step height (0 cm → 12 cm) and reduces stair width (2.0 m → 1.4 m). The robot receives proprioceptive data (joint positions/velocities, body linear and angular velocities, gravity vector) together with a 21 × 21 local height‑map (0.2 m resolution) covering the area around its four feet. A shallow convolutional neural network (CNN) encodes the height‑map into a 128‑dimensional feature vector, which is concatenated with the proprioceptive vector and fed to a three‑layer multilayer perceptron (MLP) (128‑128‑64). The MLP outputs 12 joint‑position set‑points that are turned into torques by on‑board PD controllers. Training uses Proximal Policy Optimization (PPO) with a reward that encourages the robot to approach a goal pose (coarse and fine navigation terms) while regularizing joint limits, power consumption, and collision penalties. This stage yields a policy that can climb simple stairs and provides a strong initialization for the second stage.
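The described actor (shallow CNN height-map encoder, 128-dimensional feature, three-layer MLP trunk, 12 joint-position outputs) can be sketched as follows. This is a minimal PyTorch sketch, not the authors' code: the convolution layer sizes, activation choice (ELU), and the assumed proprioceptive dimension of 33 (12 joint positions + 12 joint velocities + 3 linear velocity + 3 angular velocity + 3 gravity vector) are illustrative assumptions consistent with the dimensions quoted in the text.

```python
import torch
import torch.nn as nn


class StairPolicy(nn.Module):
    """Sketch of the described actor: height-map CNN encoder + MLP trunk."""

    def __init__(self, proprio_dim: int = 33, num_joints: int = 12):
        super().__init__()
        # Shallow CNN: 21x21 local height-map -> 128-d feature vector.
        # Spatial sizes with kernel 3, stride 2, no padding: 21 -> 10 -> 4.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ELU(),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128), nn.ELU(),
        )
        # Three-layer MLP (128-128-64) over [height-map feature, proprioception],
        # outputting 12 joint-position set-points for the on-board PD controllers.
        self.trunk = nn.Sequential(
            nn.Linear(128 + proprio_dim, 128), nn.ELU(),
            nn.Linear(128, 128), nn.ELU(),
            nn.Linear(128, 64), nn.ELU(),
            nn.Linear(64, num_joints),
        )

    def forward(self, heightmap: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        z = self.encoder(heightmap)                         # (B, 128)
        return self.trunk(torch.cat([z, proprio], dim=-1))  # (B, 12)
```

In training, the PD controllers convert these set-points to torques, so the policy operates at the joint-position level rather than commanding torques directly.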
Stage 2 – Realistic Indoor Staircases
In the second stage the authors replace the abstract terrain with three realistic indoor stair configurations: straight, L‑shaped, and spiral stairs. Each configuration is generated procedurally in Isaac Lab, allowing the same observation space (local height‑map + proprioception) to be reused. The key novelty is a centerline‑based navigation formulation: the robot is rewarded for staying close to the geometric center of the stair run (centering reward) and for making progress toward the goal along that centerline (path reward). Both rewards are expressed as smooth tanh‑shaped functions of the distance to the centerline (σ_center) and the distance along the centerline to the goal (σ_path). A heading‑tracking penalty discourages yaw deviation from the desired orientation. Regularization terms identical to Stage 1 are retained, plus additional penalties for excessive joint velocity/acceleration and for contacts of the body, head, hips, and thighs with the environment.
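The centering and path terms described above can be sketched as simple tanh-shaped functions of distance. The specific form `1 - tanh(d / σ)` and the default values of `sigma_center` and `sigma_path` are assumptions for illustration; the text only states that the rewards are smooth tanh-shaped functions of the distance to the centerline and the distance along it to the goal.

```python
import math


def centering_reward(d_center: float, sigma_center: float = 0.25) -> float:
    """Reward for staying near the stair centerline.

    Assumed shaping: equals 1 when the robot sits exactly on the
    centerline and decays smoothly as lateral distance grows.
    """
    return 1.0 - math.tanh(abs(d_center) / sigma_center)


def path_reward(d_goal: float, sigma_path: float = 2.0) -> float:
    """Reward for progress along the centerline toward the goal.

    Same assumed shaping, applied to the remaining distance to the goal
    measured along the centerline.
    """
    return 1.0 - math.tanh(abs(d_goal) / sigma_path)


def heading_penalty(yaw_error: float, weight: float = 0.5) -> float:
    """Penalty for yaw deviation from the desired orientation
    (quadratic form assumed)."""
    return -weight * yaw_error ** 2
```

Both rewards are maximal on the centerline/goal and fall off smoothly, which avoids the sparse-reward problem of a binary "on path" signal.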
Curriculum and Transfer
Both stages employ a ten‑level curriculum that progressively raises difficulty. In Stage 2, step height varies from 2 cm to 12 cm, stair width from 2.0 m to 1.4 m, and for L‑shaped stairs the length after the turn grows from 0 m to 3 m. This curriculum accelerates learning because the robot first masters short, low‑height stairs before confronting longer, steeper runs. The policy learned in Stage 1 is transferred as the initial weights for Stage 2, dramatically reducing the number of training iterations required to achieve high performance on the more complex geometries.
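A minimal sketch of the ten-level Stage 2 curriculum, using the parameter ranges quoted above. The linear interpolation schedule is an assumption; the text specifies only the endpoints and the number of levels.

```python
def curriculum_params(level: int, num_levels: int = 10) -> dict:
    """Interpolate Stage 2 stair parameters across the ten-level curriculum.

    Endpoints follow the values in the text (step height 2 cm -> 12 cm,
    stair width 2.0 m -> 1.4 m, post-turn length 0 m -> 3 m for L-shaped
    stairs); the linear schedule itself is assumed.
    """
    t = level / (num_levels - 1)  # 0.0 at the easiest level, 1.0 at the hardest
    return {
        "step_height_m": 0.02 + t * (0.12 - 0.02),  # 2 cm -> 12 cm
        "stair_width_m": 2.0 - t * (2.0 - 1.4),     # 2.0 m -> 1.4 m
        "l_turn_length_m": t * 3.0,                 # 0 m -> 3 m (L-shaped only)
    }
```

At level 0 the robot faces short, shallow, wide stairs; by level 9 it confronts the steepest, narrowest configurations, matching the curriculum's easy-to-hard progression.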
Experimental Results
The authors evaluate the final policy on all three stair types in simulation. Success rates exceed 90 % for straight, L‑shaped, and spiral stairs. Average ascent times range from 3.2 s (straight) to 4.1 s (spiral), and energy consumption remains comparable across geometries, indicating that the policy adapts its gait without sacrificing efficiency. Failure cases are primarily associated with step heights above 12 cm or stair widths below 1.2 m, where the centering reward is insufficient to prevent the robot from drifting toward the edge and colliding. The authors also test the policy on unseen stair configurations (e.g., irregular curves) and observe reasonable generalization, attributed to the reliance on local height‑map perception rather than a global map.
Key Contributions
- Two‑Stage Transfer Learning Framework – Demonstrates that a policy trained on a simple abstract terrain can be efficiently transferred to realistic, varied stair topologies.
- Centerline‑Based Navigation without Hierarchical Planning – Unifies navigation and locomotion in a single RL policy, eliminating the need for separate planners or waypoint generators.
- Local Height‑Map Perception – Shows that a 21 × 21 grid of terrain heights is sufficient for robust stair climbing, a valuable property for smoke‑filled fire scenes where vision may be degraded.
- Comprehensive Reward and Regularization Design – Balances task achievement (centering, progress, heading) with safety (joint limits, power, collision) to produce stable, agile gaits.
- Empirical Analysis of Success, Efficiency, and Failure Modes – Provides quantitative metrics across stair shapes and identifies the primary failure mechanisms, guiding future improvements.
Implications and Future Work
The presented approach offers a practical pathway to deploy quadrupedal robots for indoor fire‑fighting missions where rapid multi‑floor access is essential. By relying only on local height‑map data, the system is resilient to visual occlusion from smoke and can be paired with other modalities (thermal, LiDAR) for richer situational awareness. Future research directions include: (i) real‑world validation on a physical Unitree Go2 using Isaac Lab’s Sim‑to‑Real pipeline, (ii) integration of multi‑modal sensors (thermal cameras, gas detectors) into the observation space, (iii) extension to stair descent and bidirectional floor traversal, and (iv) incorporation of higher‑level mission planning (e.g., victim detection, dynamic obstacle avoidance) while preserving the end‑to‑end learning paradigm.
In summary, the paper delivers a well‑engineered, experimentally validated RL solution that bridges the gap between abstract locomotion training and the demanding, heterogeneous stair environments encountered in indoor firefighting scenarios.