ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control


We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, εₜ-greedy, which generates exploratory options for exploring less-visited states. We prove that search using εₜ-greedy has polynomial sample complexity under mild MDP assumptions. To use the information provided by rewarded transitions more efficiently, we develop a new dual experience replay buffer framework, GDRB, and implement longest n-step returns. The resulting algorithm, ETGL-DDPG, integrates all three techniques — εₜ-greedy, GDRB, and longest n-step returns — into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.


💡 Research Summary

This paper addresses the well‑known difficulty of applying Deep Deterministic Policy Gradient (DDPG) to sparse‑reward continuous‑control problems. In such settings, the agent receives a non‑zero reward only when it reaches a goal state, which makes random exploration inefficient, hampers the propagation of reward signals, and leads to poor sample efficiency. To overcome these three intertwined issues, the authors propose ETGL‑DDPG, a method that integrates three complementary components: (1) εₜ‑greedy exploration, (2) Goal‑conditioned Dual Replay Buffer (GDRB), and (3) Longest n‑step return.

εₜ‑greedy exploration augments the classic ε‑greedy scheme with a lightweight tree search. States are discretized using locality‑sensitive hashing (SimHash), and a visit count is maintained for each hash bucket. When the exploration branch is chosen (with probability ε), the algorithm builds a search tree of bounded size N by repeatedly sampling transitions from the replay buffer, which serve as an approximation of the unknown dynamics. The tree expands until it reaches a node belonging to an unvisited bucket (visit count zero); the action sequence from the root to that node is returned as an option. The authors prove that, provided N ≤ log(|S||A|)·log log(|S||A|), the probability of sampling any option is at least Θ(1/|S||A|), which implies polynomial sample complexity (PAC‑MDP) for εₜ‑greedy. This yields directed, multi‑step exploration without requiring a model of the environment.
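The SimHash discretization with per-bucket visit counts can be sketched as below. This is a minimal illustration, not the paper's implementation; the projection width `k`, the state dimension, and the class name are assumptions.

```python
# Sketch: SimHash-style state discretization with visit counts, the mechanism
# εₜ-greedy uses to recognize less-visited states. All hyperparameters here
# (k, seed) are illustrative choices, not values from the paper.
import numpy as np
from collections import defaultdict

class SimHashCounter:
    def __init__(self, state_dim, k=16, seed=0):
        rng = np.random.default_rng(seed)
        # Random Gaussian projection; the sign pattern of A @ s defines the bucket.
        self.A = rng.standard_normal((k, state_dim))
        self.counts = defaultdict(int)

    def bucket(self, state):
        # phi(s) = sign(A s), packed into a hashable tuple of k bits.
        return tuple((self.A @ np.asarray(state, dtype=float) > 0).astype(int))

    def visit(self, state):
        # Increment the visit count of the bucket containing this state.
        self.counts[self.bucket(state)] += 1

    def count(self, state):
        # A count of zero marks the bucket as unvisited (an exploration target).
        return self.counts[self.bucket(state)]
```

Nearby states tend to share a sign pattern and therefore a bucket, which is what makes the count a cheap proxy for state novelty in continuous spaces.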

Goal‑conditioned Dual Replay Buffer (GDRB) separates experiences into two buffers: a generic buffer Dβ that stores all transitions, and a success buffer De that stores only trajectories that reach the goal. The two buffers differ in size, retention policy, and sampling probability. During training, a mini‑batch is composed by drawing a proportion α from De and (1‑α) from Dβ, where α can be annealed. This design ensures that rewarded transitions are sampled far more frequently than in a single uniform buffer, while still preserving coverage of non‑rewarded experience.
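A minimal sketch of the dual-buffer idea follows. The capacities, the eviction rule, and the fixed α used here are illustrative assumptions; the paper's exact retention policy and annealing schedule may differ.

```python
# Sketch of GDRB-style sampling: all transitions go to a generic buffer
# (d_beta), goal-reaching episodes are additionally kept in a success buffer
# (d_e), and mini-batches mix a fraction alpha from the success buffer.
import random

class DualReplayBuffer:
    def __init__(self, capacity_all=100_000, capacity_success=10_000):
        self.d_beta = []   # every transition
        self.d_e = []      # transitions from episodes that reached the goal
        self.cap_all = capacity_all
        self.cap_success = capacity_success

    def add_episode(self, transitions, reached_goal):
        self.d_beta.extend(transitions)
        del self.d_beta[:-self.cap_all]          # FIFO eviction (assumed)
        if reached_goal:
            self.d_e.extend(transitions)
            del self.d_e[:-self.cap_success]

    def sample(self, batch_size, alpha=0.5):
        # Draw an alpha fraction from the success buffer when it is non-empty,
        # and fill the rest of the batch from the generic buffer.
        n_e = min(int(alpha * batch_size), len(self.d_e))
        batch = random.sample(self.d_e, n_e) if n_e else []
        batch += random.choices(self.d_beta, k=batch_size - n_e)
        return batch
```

Compared with a single uniform buffer, the success buffer keeps rewarded transitions from being drowned out as the generic buffer grows.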

Longest n‑step return replaces the standard one‑step TD target in DDPG with the longest possible n‑step return for each transition. For successful episodes the target is the full discounted return; for unsuccessful episodes it is the cumulative reward up to the time limit. By propagating the sparse reward backward across many steps at once, the critic learns faster and the actor receives more informative gradients.
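The target computation can be sketched as a backward pass over an episode's rewards. This sketch uses pure discounted returns to the end of the episode; whether a bootstrap term is added at a time-limit truncation is a detail deferred to the paper.

```python
# Sketch of the longest n-step target: each transition's target is the
# discounted return from that step to the end of its episode, replacing the
# one-step TD target. gamma is an illustrative discount factor.
def longest_nstep_targets(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every step t."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets
```

With a sparse terminal reward, every step of a successful episode receives a non-zero target in a single pass, rather than waiting for the reward to propagate one step per update.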

The full ETGL‑DDPG algorithm proceeds as follows: at each time step the agent selects an action (or a multi‑step option) using εₜ‑greedy, executes it, updates the visit count of the corresponding hash bucket, and routes the transition to Dβ and, if the episode reaches the goal, also to De. Periodically, a batch is sampled according to the dual‑buffer scheme, the longest n‑step targets are computed, and the critic and actor are updated with the usual DDPG deterministic policy gradient and soft target updates.
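The interaction part of that loop can be sketched at a high level as follows. Every name here (`env`, `agent`, `counter`, their methods) is a stand-in for a component the paper defines; this shows the control flow only, not a full trainer.

```python
# High-level sketch of one ETGL-DDPG interaction cycle: choose an εₜ-greedy
# option or a greedy action, execute it, and collect transitions for routing
# to the dual replay buffer. All collaborator objects are hypothetical stubs.
def etgl_ddpg_step(env, state, agent, counter, epsilon, rng):
    if rng.random() < epsilon:
        # Directed exploration: a multi-step option toward an unvisited bucket.
        actions = agent.search_option(state)
    else:
        # Exploitation: a single greedy action from the DDPG actor.
        actions = [agent.act(state)]
    transitions = []
    for a in actions:
        next_state, reward, done = env.step(a)
        counter.visit(next_state)   # update SimHash visit counts
        transitions.append((state, a, reward, next_state, done))
        state = next_state
        if done:
            break
    # The caller routes these transitions to D_beta (and to D_e on success).
    return state, transitions
```

The returned transitions are then added to the dual buffer, after which the periodic critic/actor update described above proceeds unchanged from DDPG apart from its targets.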

Experimental evaluation is conducted on a suite of 2‑D and 3‑D continuous‑control benchmarks (e.g., FetchReach, HandManipulate, various MuJoCo tasks) where rewards are deliberately sparse. ETGL‑DDPG is compared against vanilla DDPG, DDPG + HER, RND‑augmented DDPG, SAC + HER, and several goal‑conditioned PPO variants. Across all environments, ETGL‑DDPG achieves higher success rates (15–30 percentage points improvement) and requires 2–3× fewer environment steps to reach a given performance threshold. Ablation studies show that each component contributes positively: removing εₜ‑greedy slows early discovery of successful trajectories, removing GDRB reduces the frequency of rewarded samples, and removing the longest n‑step return degrades value propagation, each causing a measurable drop in final performance.

Theoretical analysis confirms that εₜ‑greedy satisfies PAC‑MDP bounds, while the dual‑buffer architecture improves the effective sample reuse of rewarded transitions, and the longest n‑step return reduces TD error variance, accelerating critic convergence.

In conclusion, ETGL‑DDPG offers a practical, theoretically grounded solution for sparse‑reward continuous control. By jointly improving directed exploration, experience prioritization, and reward propagation, it substantially outperforms existing off‑policy methods. Future work may explore asynchronous tree expansion, priority‑based sampling within GDRB, and real‑robot deployments to assess sim‑to‑real transfer.

