1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Notice: This research summary and analysis were generated automatically using AI. For accuracy, please refer to the original arXiv source.

Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2–5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance on the self-supervised contrastive RL algorithm by $2\times$–$50\times$, outperforming other goal-conditioned baselines. Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned. The project webpage and code can be found here: https://wang-kevin3290.github.io/scaling-crl/.


💡 Research Summary

This paper investigates the impact of scaling network depth on self‑supervised goal‑conditioned reinforcement learning (RL). While recent breakthroughs in natural language processing and computer vision have shown that increasing model size—particularly depth—can unlock emergent capabilities, RL research has largely remained confined to shallow multilayer perceptrons (MLPs) of 2–5 layers. The authors ask whether similar performance jumps can be achieved in RL by dramatically deepening the networks, even in the absence of external rewards or demonstrations.

The study builds on Contrastive Reinforcement Learning (CRL), a simple self‑supervised algorithm that learns a goal‑conditioned policy by maximizing the similarity between state‑action embeddings and future‑goal embeddings with an InfoNCE loss. The critic scores a pair by the L2 distance between a state‑action encoder φ(s,a) and a goal encoder ψ(g), effectively classifying whether a given (s,a,g) triple comes from the same trajectory; the policy πθ(a|s,g) is then trained to choose actions that minimize this distance.
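To make the objective concrete, here is a minimal NumPy sketch of an InfoNCE loss over a batch, using the negative L2 distance between the two encoders' outputs as the critic logit, as described above. Goals from other trajectories in the batch serve as negatives; this is an illustrative sketch, not the JaxGCRL implementation.

```python
import numpy as np

def infonce_loss(phi_sa, psi_g):
    """InfoNCE over a batch: phi_sa[i] should match psi_g[i] (same
    trajectory); every other goal in the batch acts as a negative."""
    # Pairwise L2 distances between all (s,a) embeddings and all goal embeddings.
    dists = np.linalg.norm(phi_sa[:, None, :] - psi_g[None, :, :], axis=-1)
    logits = -dists  # closer pairs get higher scores
    # Row-wise log-softmax; the diagonal entries are the positives.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = phi_sa.shape[0]
    return -log_probs[np.arange(n), np.arange(n)].mean()
```

When the paired embeddings nearly coincide and negatives are far apart, the loss approaches zero; mismatched pairs drive it up.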

To enable very deep networks, the authors adopt modern deep‑learning architectural components: residual connections (as in ResNets), layer normalization, and the Swish activation. Each residual block consists of four dense layers, each followed by layer norm and Swish, with a shortcut connection added after the final activation. Depth is counted as the total number of dense layers across all blocks, so a network with N residual blocks has 4N layers. The same depth is used for the actor and both critic encoders.
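The block structure can be sketched in plain NumPy as a forward pass. This is a schematic reading of the description above (dense → layer norm → Swish, four times, then the shortcut); the actual implementation uses trained Flax modules and may order the operations differently.

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, weights):
    """One block: 4 dense layers (each followed by layer norm and Swish),
    with the block input added back after the final activation."""
    h = x
    for W, b in weights:  # 4 (W, b) pairs, each hidden_dim -> hidden_dim
        h = swish(layer_norm(h @ W + b))
    return h + x  # residual shortcut

def deep_encoder(x, blocks):
    """Depth = 4 * len(blocks) dense layers in total."""
    for block_weights in blocks:
        x = residual_block(x, block_weights)
    return x
```

Because the shortcut carries the input through unchanged, stacking many such blocks keeps a direct gradient path from output to input, which is what allows depths in the hundreds of layers.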

Experiments are conducted on a suite of simulated tasks using the JaxGCRL codebase, which leverages Brax and MJX for fast physics simulation. The benchmark includes locomotion (Ant, Humanoid), maze navigation (Ant U4‑Maze, Ant U5‑Maze, Humanoid U‑Maze, Humanoid Big Maze), and manipulation (Arm Push, Arm Binpick). All tasks use a sparse reward that is 1 only when the agent is within a small radius of the goal; performance is measured as the average number of time steps (out of 1000) the agent stays near the goal, averaged over the last five epochs.
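The evaluation metric described above can be computed from a rollout as in this sketch; the goal radius value is illustrative, not taken from the paper.

```python
import numpy as np

def time_near_goal(positions, goal, radius=0.5):
    """Count how many of an episode's time steps the agent spends within
    `radius` of the goal -- i.e., the episode return under the sparse 0/1
    reward, out of a 1000-step episode."""
    dists = np.linalg.norm(positions - np.asarray(goal), axis=-1)
    return int((dists < radius).sum())
```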

Key findings:

  1. Depth‑Driven Performance Gains – Compared with the standard 4‑layer baseline, deeper networks (8, 16, 32, 64 layers) achieve 2–5× improvements on manipulation tasks, >20× on long‑horizon mazes, and up to 50× on humanoid tasks. Scaling to 256, 512, and even 1024 layers continues to improve performance, though gains taper after a certain point.

  2. Critical Depths and Emergent Behaviors – Performance does not increase smoothly; instead, there are sharp jumps at specific “critical depths” that vary by environment (e.g., 8 layers for Ant Big Maze, 64 layers for Humanoid U‑Maze). Visualizing the policies reveals qualitatively new skills: shallow networks merely fling themselves toward the goal, medium‑depth agents learn upright walking, and deeper agents develop acrobatic vaults or exploit leverage to overcome obstacles. These behaviors emerge without any explicit curriculum or reward shaping.

  3. Depth vs. Width – Ablation studies show that increasing width (hidden dimension) yields modest gains (≈2×), whereas depth scaling provides substantially larger improvements. This suggests that RL benefits more from hierarchical representation depth than from sheer parameter count.

  4. Stability via Residuals – Networks without residual connections become unstable beyond ~32 layers, suffering from gradient vanishing/explosion. Adding residual shortcuts, together with layer‑norm and Swish, stabilizes training even for 1024‑layer models.

  5. Batch Size and Data Efficiency – Larger batch sizes (e.g., 4096) improve sample efficiency for deep models, likely because they provide more diverse negative samples for the InfoNCE loss, which is crucial when the embedding space becomes high‑dimensional.

  6. Comparison to Baselines – Scaled CRL outperforms state‑of‑the‑art goal‑conditioned methods such as SAC+HER, TD3+HER, GCBC, and GCSL on 8 out of 10 tasks. Only SAC on Humanoid Maze shows earlier sample efficiency, but scaled CRL eventually matches its performance.
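Finding 4 can be illustrated with a toy scalar calculation: in a plain stack, the end-to-end gradient is a product of per-layer Jacobian factors, while a residual shortcut shifts each factor by 1. This is a schematic numerical illustration, not an analysis from the paper.

```python
import numpy as np

depth = 1024
w = np.full(depth, 0.002)  # per-layer Jacobian factor (strongly contractive)

# Plain stack: the gradient is the product of per-layer factors and
# underflows to zero at this depth.
plain_grad = np.prod(w)

# Residual stack: the shortcut makes each layer's Jacobian (1 + w_i),
# keeping the end-to-end gradient at a usable magnitude.
residual_grad = np.prod(1.0 + w)
```

With contractive layers, the plain product vanishes numerically while the residual product stays of order one, which is the intuition behind why the shortcut connections stabilize 1024-layer training.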

The paper concludes that network depth is a powerful lever for self‑supervised RL, capable of unlocking new capabilities even in sparse‑reward, demonstration‑free settings. Limitations include reliance on simulated environments and the need to bridge the sim‑to‑real gap; future work may explore ultra‑deep transformers, multimodal inputs, and real‑robot deployments. Overall, the work provides compelling evidence that “bigger is better” also holds for RL when depth is scaled responsibly with architectural safeguards.

