When Are RL Hyperparameters Benign? A Study in Offline Goal-Conditioned RL
Hyperparameter sensitivity in deep reinforcement learning (RL) is often accepted as unavoidable. However, it remains unclear whether it is intrinsic to the RL problem or exacerbated by specific training mechanisms. We investigate this question in offline goal-conditioned RL, where data distributions are fixed and non-stationarity can be explicitly controlled via scheduled shifts in data quality. We study varying data qualities under both stationary and non-stationary regimes, covering two representative algorithms: HIQL (bootstrapped TD-learning) and QRL (quasimetric representation learning). Overall, we observe substantially greater robustness to changes in hyperparameter configurations than commonly reported for online RL, even under controlled non-stationarity. Once a modest fraction of expert data is present ($\approx$ 20%), QRL maintains broad, stable near-optimal regions, while HIQL exhibits sharp optima that drift significantly across training phases. To explain this divergence, we introduce an inter-goal gradient alignment diagnostic and find that bootstrapped objectives exhibit stronger destructive gradient interference, which coincides directly with hyperparameter sensitivity. These results suggest that high sensitivity to hyperparameter configurations during training is not inevitable in RL but is amplified by the dynamics of bootstrapping, offering a pathway toward more robust algorithmic objective design.
💡 Research Summary
This paper tackles a fundamental question in deep reinforcement learning (RL): is the notorious sensitivity to hyper‑parameter settings an inherent property of the RL problem, or does it stem from particular training mechanisms? To isolate the cause, the authors turn to offline goal‑conditioned RL (GCRL), a setting where the data distribution is fixed and the reward signal is a simple binary indicator of reaching a goal state. By removing online exploration, they can study the optimizer’s reaction to data quality alone.
Two representative algorithms are examined. Hierarchical Implicit Q‑Learning (HIQL) implements a bootstrapped TD‑learning objective: it learns a goal‑conditioned value function V(s,g) via expectile regression with targets y = r + γ V(s′,g). Because the target depends on the same network’s predictions, updates are mutually coupled across states and goals. Quasimetric RL (QRL) follows a non‑bootstrapped representation‑learning paradigm: it learns a latent quasimetric dϕ(s,g) that satisfies the triangle inequality and maximizes global distances while respecting local transition constraints. This avoids recursive target dependencies.
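The bootstrapped expectile-regression update behind HIQL's value learning can be sketched as follows. This is a minimal NumPy illustration on toy values, not the authors' implementation; the function name, the expectile level τ = 0.7, and the toy numbers are assumptions.

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss used in expectile regression.

    diff = y - V(s, g). With tau > 0.5, positive errors (target above
    prediction) are weighted more heavily, pushing V toward an upper
    expectile of the target distribution.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

# Toy bootstrapped targets y = r + gamma * V(s', g). The target reuses
# the same network's predictions, which is what couples updates across
# states and goals.
gamma = 0.99
r = np.array([0.0, 0.0, 1.0])        # sparse goal-reaching reward
v_next = np.array([0.5, 0.8, 0.0])   # V(s', g), from the learned network
v_pred = np.array([0.4, 0.9, 0.7])   # V(s, g)

y = r + gamma * v_next
loss = expectile_loss(y - v_pred, tau=0.7).mean()
```

Because `y` depends on the network's own outputs, every gradient step moves the targets as well, in contrast to QRL's non-bootstrapped distance objective.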
Data quality is controlled by mixing expert trajectories (high‑quality, goal‑directed) with exploratory trajectories (low‑quality). The authors evaluate both stationary mixtures (fixed expert/explore ratios) and a scheduled regime that gradually replaces exploratory data with expert data across four training phases (100 % → 80 % → 40 % → 0 % explore). This introduces a deliberate non‑stationarity in trajectory quality while keeping state coverage constant.
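The scheduled regime above can be sketched as a simple phase-indexed lookup; a minimal sketch, assuming the four phases are numbered 1–4 in the order listed (the function name is hypothetical).

```python
def explore_fraction(phase):
    """Scheduled data-quality shift: fraction of exploratory
    (low-quality) trajectories in each training phase, following the
    100% -> 80% -> 40% -> 0% schedule described in the text. The
    remainder of each phase's data consists of expert trajectories.
    """
    schedule = {1: 1.0, 2: 0.8, 3: 0.4, 4: 0.0}
    return schedule[phase]
```

Keeping state coverage constant while only the expert/explore ratio changes is what lets the study attribute landscape drift to trajectory quality rather than to which states are visited.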
A comprehensive hyper‑parameter sweep is performed over learning rate, discount factor, target‑update frequency, and other algorithm‑specific knobs. Performance is measured by binary success rate and a normalized distance‑return metric R_dist that captures progress toward the goal even when success is not achieved. To characterize the resulting “hyper‑parameter landscapes,” three quantitative descriptors are introduced:
- ε‑optimality mass (ρ_ε) – the fraction of configurations achieving at least ε of the best observed performance, indicating how large the near‑optimal region is.
- Phase‑to‑phase drift (Δ) – the mean absolute change in relative performance of each configuration between consecutive training phases, quantifying landscape instability.
- Early‑selection regret (r) – the performance loss incurred when a configuration chosen in an early phase is evaluated later, measuring transferability of early choices.
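The three descriptors can be computed from a sweep's performance matrix (configurations × phases). The sketch below uses a toy two-configuration, two-phase example; the per-phase normalization in the drift measure and the ε default are assumptions, and the paper's exact definitions may differ in detail.

```python
import numpy as np

def epsilon_optimality_mass(perf, eps=0.9):
    """rho_eps: fraction of configurations reaching at least eps of the
    best observed performance in one phase. perf: (n_configs,) array."""
    return float(np.mean(perf >= eps * perf.max()))

def phase_to_phase_drift(perf_matrix):
    """Delta: mean absolute change in each configuration's relative
    (per-phase-normalized) performance between consecutive phases.
    perf_matrix: (n_configs, n_phases) array."""
    rel = perf_matrix / perf_matrix.max(axis=0, keepdims=True)
    return float(np.mean(np.abs(np.diff(rel, axis=1))))

def early_selection_regret(perf_matrix, early_phase=0, late_phase=-1):
    """r: performance lost by committing to the configuration that was
    best in an early phase and evaluating it in a later phase."""
    best_early = int(np.argmax(perf_matrix[:, early_phase]))
    late = perf_matrix[:, late_phase]
    return float(late.max() - late[best_early])

# Toy sweep: rows = configurations, columns = training phases.
perf = np.array([[1.0, 0.2],
                 [0.5, 1.0]])
```

On this toy matrix the early winner (row 0) collapses in the later phase, so the regret is large — the HIQL-like failure mode the descriptors are designed to expose.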
Additionally, fANOVA (via Deep‑CAVE) is applied to each phase to obtain importance scores for each hyper‑parameter, and cosine distances between importance vectors across phases reveal how the set of influential knobs evolves.
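Given per-phase importance vectors (e.g. as produced by fANOVA), the cross-phase comparison reduces to a cosine distance. A small sketch, assuming the importance scores are non-negative and share the same hyper-parameter ordering across phases; the function name is hypothetical.

```python
import numpy as np

def importance_cosine_distance(imp_a, imp_b):
    """Cosine distance between per-hyperparameter importance vectors
    from two training phases. 0 means the same knobs matter to the same
    relative degree; values near 1 mean the influential set changed."""
    a = np.asarray(imp_a, dtype=float)
    b = np.asarray(imp_b, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos
```

A stable algorithm should yield near-zero distances across phases, while drifting landscapes show up as large phase-to-phase distances.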
The empirical findings are striking. When at least ~20 % expert data is present, QRL exhibits a large ε‑optimality mass, low drift, and minimal early‑selection regret—its near‑optimal region is broad and stable across phases, even under the scheduled quality shift. In contrast, HIQL’s optimal region remains narrow; its best hyper‑parameter settings shift dramatically between phases, leading to high drift and substantial regret when early choices are retained. The sensitivity of HIQL is especially pronounced during the non‑stationary schedule, where the optimal learning rate and discount factor move considerably.
To explain this divergence, the authors introduce an “inter‑goal gradient alignment” diagnostic. For a minibatch and a set of relabeled goals, they compute the normalized gradient of the critic loss with respect to network parameters and evaluate pairwise cosine similarities κ(g, g′). A distribution skewed toward negative κ indicates destructive interference between gradients from different goals. HIQL shows a wide κ distribution with a heavy negative tail, confirming strong gradient conflict. QRL’s κ distribution is tightly clustered near +1, indicating that gradients from different goals are largely aligned. This suggests that the bootstrapped TD target creates inter‑goal coupling that amplifies gradient interference, especially when data quality changes, thereby making the learning dynamics more fragile to hyper‑parameter variations.
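The diagnostic itself reduces to pairwise cosine similarities between normalized per-goal gradient vectors. A minimal NumPy sketch; how each per-goal gradient is obtained from the critic loss is abstracted away here, and the function name is an assumption.

```python
import numpy as np

def inter_goal_alignment(grads):
    """Pairwise cosine similarities kappa(g, g') between per-goal
    critic-loss gradients.

    grads: (n_goals, n_params) array, one flattened gradient vector per
    relabeled goal. Returns the upper-triangular similarities; a
    distribution with a heavy negative tail signals destructive
    interference between goals.
    """
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    sim = unit @ unit.T                      # all pairwise cosines
    iu = np.triu_indices(len(grads), k=1)    # distinct pairs g != g'
    return sim[iu]
```

In this framing, QRL-like training produces κ values clustered near +1, whereas opposing per-goal gradients (κ near −1) correspond to the conflict attributed to bootstrapped targets.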
The paper concludes that hyper‑parameter sensitivity is not an unavoidable characteristic of RL; rather, it is amplified by bootstrapping mechanisms that induce destructive gradient interference under shifting data distributions. This insight opens a pathway toward more robust algorithm design: either avoid bootstrapped targets, incorporate gradient‑alignment regularizers, or devise objectives that decouple updates across goals. Such directions could yield RL methods that are both data‑efficient and far less dependent on meticulous hyper‑parameter tuning.