Match or Replay: Self Imitating Proximal Policy Optimization
Authors: Gaurav Chaudhary, Laxmidhar Behera, Washim Uddin Mondal
Gaurav Chaudhary (gauravch@iitk.ac.in), Laxmidhar Behera (lbehera@iitk.ac.in), Washim Uddin Mondal (wmondal@iitk.ac.in)
Department of Electrical Engineering, Indian Institute of Technology Kanpur

Abstract

Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, thereby reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation by using an optimal transport distance in dense reward environments to prioritize state visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments demonstrate substantial improvements in learning efficiency, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards. Our approach achieves faster convergence and significantly higher success rates compared to state-of-the-art self-imitating RL baselines. These findings underscore the potential of self-imitation as a robust strategy for enhancing exploration in RL, with applicability to more complex tasks.

1 Introduction

Deep Reinforcement Learning (DRL) (Li, 2017) has achieved remarkable success in solving complex problems across a variety of domains, including robotic manipulation (Han et al., 2023), flight control (Kaufmann et al., 2018), epidemic control (Ling et al., 2024), intelligent perception systems (Chaudhary et al., 2023), and real-time strategy game-play (Andersen et al., 2018). However, despite these advancements, DRL algorithms still face significant challenges in efficient learning, leading to poor sample efficiency (Baker et al., 2019). A major contributing factor is reliance on unguided exploration to discover near-optimal policies, leading to slow convergence, particularly in environments with sparse rewards. Guided exploration using expert demonstrations has been proposed as a potential solution. Approaches such as those in (Salimans & Chen, 2018; Ecoffet et al., 2019; Xu et al., 2023; Duan et al., 2017; Zhou et al., 2019; Haldar et al., 2023) have used expert data to guide the agent's learning process. However, these methods often face challenges, such as difficulty obtaining expert demonstrations, the risk of bias, and the potential for convergence to suboptimal policies when demonstrations are insufficiently informative.

Our work addresses these challenges by exploiting the agent's past successful state-action transitions to guide exploration. To this end, we propose a novel Self-Imitation Learning (SIL) (Oh et al., 2018) approach that leverages the agent's past high-reward transitions to guide current policy updates, effectively "bootstrapping" its learning process.
This strategy enhances exploration and reduces the risk of divergence from effective behaviors (Lin, 1992). By leveraging the most rewarding past trajectories, our approach prevents the agent from deviating too far from previously learned successful behaviors, thereby improving exploration and learning efficiency. Self-imitation learning has been applied to diverse complex tasks such as robotics (Luo & Schomaker, 2023; Luo et al., 2021), text-based games (Shi et al., 2023), procedurally generated environments (Lin et al., 2023), interactive navigation (Kim et al., 2023), and large language models (Xiao et al., 2024). Despite these advancements, a unified on-policy RL approach that effectively integrates self-imitation across both low-dimensional (structured) and high-dimensional (pixel-based) observations, while accommodating both dense and sparse reward settings, remains absent. Prior work, such as (Oh et al., 2018), has explored self-imitation in reinforcement learning; however, it primarily focuses on off-policy algorithms or incorporates off-policy elements when adapting self-imitation to on-policy methods like PPO (Schulman et al., 2017). In contrast, we introduce Self-Imitating Proximal Policy Optimization (SIPP), a framework explicitly designed for on-policy RL that seamlessly integrates self-imitation into PPO's update mechanism without relying on replay buffers or off-policy corrections. By doing so, SIPP preserves PPO's stability and theoretical guarantees while significantly enhancing exploration and sample efficiency.

For dense reward environments, we propose the MATCH strategy. Even with high-dimensional state and action spaces and dense rewards, RL agents struggle with sample efficiency, leading to slow convergence or suboptimal policies. As implemented in SIPP, self-imitation addresses this by guiding the agent toward high-value regions of the state space, thereby reducing divergence from effective policies and accelerating learning. Specifically, the MATCH strategy uses optimal transport (Peyré et al., 2019), particularly the Sinkhorn algorithm (Cuturi, 2013), to measure the similarity between the current policy's state distribution and the most rewarding episodic rollout from past experience. By prioritizing state-action transitions that closely match these rewarding distributions, the MATCH strategy ensures that exploration focuses on the self-encountered regions of the state space with high expected rewards.

For sparse and binary reward environments, we introduce the REPLAY strategy. Sparse and binary reward environments often present significant challenges for learning agents due to the limited availability of positive feedback. Our proposed REPLAY strategy, a variant of the MATCH strategy tailored to such sparse-reward scenarios, maintains an imitation buffer that stores previously encountered successful trajectories. However, while MATCH prioritizes individual state-action pairs, REPLAY introduces a trajectory-level replay mechanism. By replaying entire successful trajectories rather than isolated experiences, REPLAY improves learning in sparse-reward settings by repeatedly exposing the agent to high-return trajectories, which effectively reinforces key state-action dependencies over time.
Additionally, as demonstrated in the Animal-AI Olympics experiments, REPLAY can handle partial observability, making it more versatile than existing methods. To summarize, our key contributions are as follows:

• Self-imitating on-policy algorithm: We propose Self-Imitating Proximal Policy Optimization (SIPP), a novel self-imitation learning algorithm that enhances exploration and sample efficiency in dense and sparse reward settings.

• Optimal transport-based prioritization: We introduce the MATCH strategy, which uses optimal transport (Peyré et al., 2019) and the Sinkhorn algorithm (Cuturi, 2013) to prioritize state-action transitions that closely match the state distribution of the most rewarding past episodic rollout, thereby improving learning efficiency in dense-reward environments.

• Sparse reward scenario: We develop a REPLAY strategy that stores and reuses successful trajectories to reinforce long-term dependencies and improve learning from delayed rewards, resulting in enhanced sample efficiency.

• Diverse empirical validation: We validate SIPP through experiments across a wide range of environments, including complex MuJoCo (Towers et al., 2023) tasks, multi-goal PointMaze navigation (de Lazcano et al., 2023), and the partially observable 3D Animal-AI Olympics (Crosby et al., 2019), demonstrating significant improvements in learning efficiency and performance over state-of-the-art methods (Oh et al., 2018; Gangwani et al., 2018).

2 Related Work

Many attempts have addressed the sample efficiency and exploration problem in reinforcement learning. This long line of work can be broadly divided into guided and unguided exploration.

Guided exploration paradigms aim to leverage expert trajectories (Chaudhary & Behera, 2025; Chaudhary et al., 2025) to address RL agents' sample-efficiency and exploration problems. Recently, in this direction, (Sontakke et al., 2024) presented an approach that uses a single demonstration and knowledge distilled from Video-and-Language Models (VLMs) to train a robotics policy. They use VLMs to generate rewards by comparing expert trajectories and policy rollouts. Another single-demonstration-guided approach was presented by (Libardi et al., 2021) for solving three-dimensional stochastic exploration. They exploit expert trajectories and value-estimate-prioritized trajectories to learn optimal policies under uncertainty. Similarly, (Salimans & Chen, 2018) trained a robust policy using a single demonstration by replaying the demonstration for n steps, after which the agent learns in a self-supervised manner. To make the agent robust to randomness, they monotonically decrease the replay steps n. (Uchendu et al., 2023) presents an expert-guided learning scheme. They employ two policies to solve tasks: the guide policy and the exploration policy. The guide policy introduces a curriculum of initial states for the exploration policy, significantly easing the exploration challenge and facilitating rapid learning. As the exploration policy becomes more proficient, the reliance on the guide policy diminishes, allowing the RL policy to develop independently and continue improving autonomously. This progressive reduction in the influence of the guide policy enables the agent to transition to a fully autonomous exploration phase, thereby enhancing its long-term performance and adaptability.
(Xu et al., 2023) uses expert demonstrations to improve exploration when learning from demonstrations in sparse reward settings. They assign an exploration score to each demonstration, generate an episode, and train a policy to imitate exploration behaviors. (Nair et al., 2018) designs an auxiliary objective based on demonstrations to address hard exploration problems and gradually weans the demonstration guidance once the policy performs better than the demonstration. (Huang et al., 2023) used a two-component approach: a novel actor-critic-based policy-learning module that efficiently uses demonstration data to guide RL exploration, and a non-parametric module that employs nearest-neighbor matching and locally weighted regression for robust guidance propagation at states distant from the demonstrated ones.

Unguided exploration approaches use self-experience, count-based methods, or prioritized experience replay buffers to guide the policy in hard exploration problems. In this literature, we focus only on approaches within the Self-Imitation Learning (SIL) paradigm, as coined by (Oh et al., 2018). (Oh et al., 2018) presented an approach for self-imitation learning for off-policy algorithms. They store experiences in a replay buffer and learn to imitate state-action pairs in the replay buffer only when the return in the past episode is greater than the agent's value estimate. They also extended their approach to the on-policy algorithm. However, the proposed algorithm lacks a strong theoretical connection to on-policy algorithms. (Gangwani et al., 2018) introduces the Stein Variational Policy Gradient (SVPG), a self-imitating algorithm designed for on-policy reinforcement learning. In this approach, policy optimization is framed as a divergence minimization problem, with the objective of minimizing the difference between the visitation distribution of the current policy and the distribution induced by experience replay trajectories with high returns. The method incorporates an auxiliary objective that regularizes this divergence, allowing for improved exploration and more effective policy updates. However, their experiments are limited to episodic, delayed, or noisy reward settings, which may restrict the generalizability of their results to more complex environments.

(Chen & Lin, 2020) presents a SIL technique for off-policy algorithms. In their approach, they provide a constant reward at each step in addition to an episodic environment reward. Further, they maintain two replay buffers, one with the K highest episodic reward trajectories and the other with all agent-generated trajectories, and sample from these two replay buffers to train the policy. They limit their work to delayed episodic rewards. (Tang, 2020) presents a self-imitation learning approach for off-policy learning by extending traditional Q-learning with a generalized n-step lower bound. They adopt SIL by leveraging trajectories where the behavior policy performs better than the current policy. (Ferret et al., 2020) proposes a self-imitating variant of DQN for dense reward environments. In this approach, they propose adopting self-imitation via a modified reward function: they augment the true reward with a weighted advantage term, the difference between a true discounted reward and an expected future return.
(Kang & Chen, 2020) introduces the Explore-then-Exploit (EE) framework, which integrates Random Network Distillation (RND) (Burda et al., 2018) and Generative Adversarial Self-Imitation Learning (GASIL) (Guo et al., 2018). The framework addresses the exploration-exploitation trade-off by leveraging RND to facilitate exploration and prevent the policy from stagnating in local optima. At the same time, GASIL accelerates policy convergence by leveraging past successful trajectories. Rather than directly combining these methods, which could confuse the agent, the authors propose an interleaving approach in which the agent alternates between exploration and imitation based on specific criteria. Recently, (Li et al., 2023) extended the SIL approach to Goal-Conditioned Reinforcement Learning. They achieve this by designing a target action-value function that effectively combines the training mechanisms of the self-initiated policy and the actor policy. The SILP (Luo et al., 2021) method uses a planning mechanism for robotic manipulation that identifies effective policies from prior experience, enabling the agent to imitate high-quality actions even when explicit demonstrations are unavailable. By incorporating planning into the SIL framework, the agent can efficiently explore and exploit past successful behaviors. The approach improves the exploration-exploitation balance and enhances learning stability.

The proposed approach aligns with unguided exploration with a focus on on-policy learning. It uses past experiences to bootstrap policy learning, making a strong connection with the self-imitation learning paradigm.

3 Preliminaries

We consider a Markov Decision Process (MDP) symbolized as the tuple M = ⟨S, A, T, R, γ, ρ⟩, where S is the collection of environment states, A is the action space, T : S × A → ∆(S) indicates the state transition function, where ∆(·) defines the probability simplex over its argument set, R : S × A × S → ℝ is the reward function, γ ∈ (0, 1) denotes the discount factor, and ρ ∈ ∆(S) is the initial state distribution. At each time step t, the agent observes the state s_t and executes an action a_t ∈ A. As a consequence, the state of the environment changes to s_{t+1} following the transition law T, and the agent receives a reward r_t = R(s_t, a_t, s_{t+1}). A (stationary) policy is defined to be a map π : S → ∆(A). The reinforcement learning agent is trained to maximize the expected long-term discounted reward defined below over all π ∈ Π, where Π is the collection of all policies (Mondal & Aggarwal, 2024):

$$ J_\rho^\pi = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 \sim \rho \right] \qquad (1) $$

where $\mathbb{E}_\pi$ denotes the expectation over all π-induced trajectories {(s_t, a_t)}_{t=0}^∞ emanating from the initial distribution ρ. For large state spaces, the policies are represented by a parameter θ ∈ ℝ^d, where the dimension d is chosen such that d ≪ |S||A|. For neural-network-based policies, θ is the weight parameter. In this framework, the agent's goal is to maximize J_ρ^{π_θ} ≜ J(θ) over θ ∈ ℝ^d. We have dropped the dependence of J(θ) on ρ to simplify notation. We achieve this using a PPO-style gradient-based learning algorithm with a few changes, driven by the self-imitation objective, as explained in the forthcoming section.
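As a concrete illustration of the objective in equation 1, the following minimal NumPy sketch estimates the discounted return of a single sampled trajectory; the reward values and discount factor are placeholders, not values used in our experiments.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte Carlo estimate of sum_t gamma^t r_t for one trajectory (equation 1)."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# Example: a short trajectory of rewards collected under some policy pi with s_0 ~ rho.
print(discounted_return([0.0, 0.0, 1.0, 0.5], gamma=0.9))  # 0.9^2 * 1.0 + 0.9^3 * 0.5
```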
4 Method

In this work, we propose an approach that guides an agent's exploration by combining its current behavior with past successful trajectories to enable on-policy RL. Unlike prior self-imitation methods that rely on off-policy data or modify the reward function, SIPP operates entirely within the on-policy framework. By modifying the rollout buffer sampling strategy (MATCH) or selectively replaying successful trajectories (REPLAY), SIPP reuses successful trajectories in a controlled fashion and integrates them into PPO's on-policy training loop, without additional target networks or density-ratio corrections that are typical in off-policy RL. While this introduces some bias due to sample reuse, as discussed in works such as GePPO (Queeney et al., 2021), empirical results indicate that the training remains stable and effective. This seamless integration with PPO ensures that our approach is both stable and efficient. The key highlights are as follows:

• Our approach does not alter the base RL policy (PPO) nor introduce new separate models requiring training, unlike prior works (Gangwani et al., 2018; Kang & Chen, 2020).

• Our approach does not modify the true reward, preventing any bias in learning, unlike (Chen & Lin, 2020).

• We address exploration by self-imitation for dense, sparse, and binary rewards, encompassing both low- and high-dimensional (pixel-based) observations, unlike (Oh et al., 2018; Gangwani et al., 2018), which addressed only delayed and noisy rewards for low-dimensional observations in an on-policy setting.

4.1 MATCH: A Self-Imitating Proximal Policy Optimization

The MATCH strategy is designed to enhance exploration in dense reward environments by encouraging the agent to revisit transitions similar to those in the most rewarding past trajectory. Inspired by self-imitation learning, this approach assigns higher priority to state-action pairs that closely align with previously successful behaviors. To formally define the idea of similarity, we use Optimal Transport (OT) as a principled method to compare empirical state visitation distributions.

Optimal Transport (OT) (Cuturi, 2013; Peyré & Cuturi, 2020; Luo et al., 2023) is a geometry-aware framework for comparing probability distributions. Suppose we are given two empirical distributions represented as

$$ \mu = \frac{1}{T}\sum_{t=1}^{T} \delta_{x_t}, \qquad \nu = \frac{1}{T'}\sum_{t'=1}^{T'} \delta_{y_{t'}}, $$

where $x_t$ and $y_{t'}$ denote sample points in a metric space (e.g., state embeddings), and $\delta_{x_t}$ is the Dirac measure centered at $x_t$, assigning unit mass at that point. The squared Wasserstein distance between μ and ν is given by

$$ W_2(\mu, \nu) = \min_{\zeta \in \Gamma} \sum_{t=1}^{T} \sum_{t'=1}^{T'} c(x_t, y_{t'})\, \zeta_{tt'}, \qquad (2) $$

where $\Gamma = \{\zeta \in \mathbb{R}_{+}^{T \times T'} : \zeta \mathbf{1}_{T'} = \tfrac{1}{T}\mathbf{1}_{T},\ \zeta^{\top}\mathbf{1}_{T} = \tfrac{1}{T'}\mathbf{1}_{T'}\}$ is the set of doubly stochastic coupling matrices, and $c(x_t, y_{t'})$ is the cost of transporting unit mass from $x_t$ to $y_{t'}$. We invoke the cosine distance as the cost. To solve the above optimization efficiently, we apply the Sinkhorn algorithm (Cuturi, 2013), which incurs a computational complexity of O(TT').
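To make the optimal transport computation concrete, the sketch below implements a basic entropy-regularized Sinkhorn iteration with a cosine cost between two sets of state embeddings. It is a minimal NumPy illustration of equation 2 under assumed settings (uniform marginals, a fixed regularization strength, and a fixed iteration count), not the exact implementation used in our experiments; the resulting coupling ζ is what the OT distance of equation 4 and the similarity score of equation 5 below are built on.

```python
import numpy as np

def cosine_cost(X, Y):
    """Pairwise cosine distance c(x_t, y_t') between rows of X (T x d) and Y (T' x d)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-8)
    return 1.0 - Xn @ Yn.T

def sinkhorn_coupling(C, reg=0.05, n_iters=200):
    """Entropy-regularized approximation of the optimal coupling zeta in equation 2."""
    T, Tp = C.shape
    a = np.full(T, 1.0 / T)        # uniform mass on the current trajectory's states
    b = np.full(Tp, 1.0 / Tp)      # uniform mass on the best past trajectory's states
    K = np.exp(-C / reg)
    u = np.ones(T)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # coupling matrix of shape (T, T')

# Usage: states of the current rollout vs. the best past rollout (random placeholders).
cur_states, best_states = np.random.randn(50, 8), np.random.randn(40, 8)
C = cosine_cost(cur_states, best_states)
zeta = sinkhorn_coupling(C)
transport_cost = float(np.sum(C * zeta))   # entropic approximation of equation 2
```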
The MATCH algorithm has a nested loop structure. At the k-th instance of the outer loop, it produces a T-length trajectory {(s^k_t, a^k_t)}_{t=1}^T under the current policy π_{θ_k} (depending on the task, T is either deterministic or random) and stores it in a data buffer D. If the trajectory stored in D is the highest-rewarding one seen so far, it is also stored in the imitation buffer B_I, replacing any earlier-stored trajectories in it. Let {s^k_1, ..., s^k_T} and {s^e_1, ..., s^e_{T'}} be the states of the trajectories stored in D and B_I, respectively. Their empirical distributions are given as

$$ \hat{q}_k = \frac{1}{T}\sum_{t=1}^{T} \delta_{s^k_t}, \qquad \hat{q}_e = \frac{1}{T'}\sum_{t'=1}^{T'} \delta_{s^e_{t'}}. \qquad (3) $$

Algorithm 1 MATCH: Self-Imitating Proximal Policy
1: Input: IET ξ, initial state distribution ρ, batch size B, inner loop length H, learning rate η
2: Initialize policy parameter θ_1 ← θ_0
3: Initialize imitation buffer B_I ← {}
4: Initialize data buffer D ← {}
(Outer loop)
5: for k ∈ {1, 2, ...} do
6:   Collect a π_{θ_k}-induced trajectory {(s^k_t, a^k_t)}_{t=1}^T in D
7:   Update B_I by storing the highest-rewarding trajectory seen so far
8:   Obtain advantage estimates Â_1, ..., Â_T corresponding to π_{θ_k} and the state-action pairs of the trajectory in D
9:   θ_{k,0} ← θ_k, θ_{k,−1} ← θ_{k−1}
(Inner loop)
10:   for h ∈ {0, ..., H−1} do
11:     Sample a B-sized batch of state-actions from D, either uniformly or weighted by equation 5, controlled by ξ
12:     Update θ_{k,h} following equation 7
13:   end for
14:   θ_{k+1} ← θ_{k,H}
15:   Empty data buffer D ← {}
16: end for

Their corresponding OT distance is given as

$$ W_2(\hat{q}_k, \hat{q}_e) = \min_{\zeta \in \Gamma} \sum_{t=1}^{T} \sum_{t'=1}^{T'} c(s^k_t, s^e_{t'})\, \zeta_{tt'}. \qquad (4) $$

Let ζ* be the solution to the above optimization. We define an OT-based similarity score for each state s^k_t of the trajectory in D as follows:

$$ d_{\mathrm{OT}}(s^k_t) = -\sum_{t'=1}^{T'} c(s^k_t, s^e_{t'})\, \zeta^{*}_{tt'}. \qquad (5) $$

The inner loop starts with the initialization θ_{k,0} ← θ_k. At the h-th instant of the inner loop, the agent chooses a batch of state-actions {(s^k_j, a^k_j)}_{j∈J} from D either with uniform probability (exploration) or via a priority-based strategy determined by the similarity score in equation 5 (imitation). The choice between exploration and imitation is decided by a hyperparameter ξ, called the Imitation-Exploration Trade-off (IET) coefficient. For the chosen batch of state-action pairs, we can now define the PPO-based surrogate loss function as follows (Schulman et al., 2017):

$$ L^{\mathrm{PPO}}(\theta_{k,h}) = \mathbb{E}_j\!\left[ \min\!\left( r^{kh}_j A_j,\ \mathrm{clip}\!\left(r^{kh}_j,\ 1-\epsilon,\ 1+\epsilon\right) A_j \right) \right], \qquad (6) $$

where $\mathbb{E}_j$ denotes the empirical average over j ∈ J, ϵ is a clipping hyperparameter, A_j is the estimate of the advantage function (Schulman et al., 2015) corresponding to the policy π_{θ_{k,h}} and the pair (s_j, a_j), and the ratio r^{kh}_j is given as

$$ r^{kh}_j = \frac{\pi_{\theta_{k,h}}(a_j \mid s_j)}{\pi_{\theta_{k,h-1}}(a_j \mid s_j)}, $$

where we use the convention that, for h = 0, θ_{k,h−1} ← θ_{k−1}. The policy parameter is now updated using gradient descent:

$$ \theta_{k,h+1} \leftarrow \theta_{k,h} - \eta \nabla_{\theta} L^{\mathrm{PPO}}(\theta_{k,h}), \qquad (7) $$
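The sketch below illustrates one inner-loop step of the MATCH strategy, assuming a policy object that exposes log_prob(states, actions), a standard optimizer, and rollout tensors already stored in D. Two details are illustrative assumptions rather than prescriptions of the method: the similarity scores of equation 5 are converted to sampling probabilities with a softmax, and the IET coefficient is read as the probability of taking the imitation (prioritized) branch, following the convention used in Sections 4.2 and 5.1.

```python
import torch

def match_inner_step(policy, optimizer, data, ot_scores, xi=0.3,
                     batch_size=64, clip_eps=0.2):
    """One inner-loop update of Algorithm 1: sample a batch uniformly (exploration)
    or weighted by the OT similarity score of equation 5 (imitation), then apply
    the clipped surrogate loss of equation 6."""
    states, actions, advantages, old_log_probs = data   # tensors over the rollout in D
    n = states.shape[0]

    if torch.rand(1).item() < xi:   # imitation: prioritize states close to the best rollout
        probs = torch.softmax(torch.as_tensor(ot_scores, dtype=torch.float32), dim=0)
        idx = torch.multinomial(probs, batch_size, replacement=True)
    else:                           # exploration: uniform sampling
        idx = torch.randint(0, n, (batch_size,))

    adv = advantages[idx]
    ratio = torch.exp(policy.log_prob(states[idx], actions[idx]) - old_log_probs[idx])
    loss = -torch.min(ratio * adv,
                      torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```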
where η is the learning rate. Finally, we assign θ_{k+1} ← θ_{k,H}, where H is the inner loop length, and start the (k+1)-th instance of the outer loop. Algorithm 1 summarizes the entire process.

Note that our proposed approach does not rely on any explicit expert policy. Instead, it utilizes the distribution of the most rewarding trajectory generated by the current behavior policy (stored in the imitation buffer) as the surrogate expert policy. This removes the dependency on external expert trajectories and leverages the agent's high-performing experiences. Our algorithm ensures that the imitation buffer evolves continuously as the agent discovers better-performing trajectories, thus adapting the surrogate expert distribution over time.

4.2 REPLAY: Self-Imitating Proximal Policy

This section addresses the exploration challenge in sparse-reward settings using self-imitation. Sparse-reward environments often pose significant challenges for learning agents due to the limited availability of positive feedback. To mitigate this challenge, we propose the REPLAY strategy, adapted from (Libardi et al., 2021), a variant of the MATCH strategy tailored specifically for sparse-reward scenarios. Unlike MATCH, which uses past trajectories to generate preferences over current trajectories, the REPLAY strategy focuses on directly replaying successful past behaviors. The structure of REPLAY is very similar to that of MATCH, except for the modifications mentioned below.

REPLAY also maintains an imitation buffer, B_I, and a data buffer, D. However, unlike MATCH, B_I now dynamically stores multiple trajectories; the capacity of B_I is a given hyperparameter. REPLAY runs in multiple epochs. Let the policy parameter at the k-th epoch be θ_k. The agent either selects a trajectory from B_I at random (imitation) or generates a trajectory induced by the current policy π_{θ_k} (exploration), and stores it into the data buffer D. The probability of choosing either of these events is determined by the IET parameter ξ. This sampling mechanism ensures a balance between imitation and exploration. A higher value of IET emphasizes exploitation by prioritizing trajectory sampling from B_I, while a lower value encourages exploration by focusing on the trajectories generated from the agent's most recent interactions. Next, a batch of state-action pairs {(s^k_j, a^k_j)}_{j∈J} of size B is uniformly selected from the trajectory in D.

Algorithm 2 REPLAY: Self-Imitating Proximal Policy
1: Input: IET ξ, initial state distribution ρ, batch size B, learning rate η, imitation buffer length L
2: Initialize policy parameter θ_1 ← θ_0
3: Initialize imitation buffer B_I ← {}
4: Initialize data buffer D ← {}
5: for k ∈ {1, 2, ...} do
6:   Sample τ ∼ Bernoulli(ξ)
7:   if τ = 0 then
8:     Sample a trajectory from B_I and store it in D
9:   else if τ = 1 then
10:     Store a π_{θ_k}-induced trajectory in D
11:   end if
12:   Obtain advantage estimates Â_1, ..., Â_T corresponding to π_{θ_k} and the state-action pairs of the trajectory in D
13:   Sample a B-sized batch of state-actions {(s^k_j, a^k_j)}_{j∈J} from the data buffer D with uniform probability
14:   Update θ_k following equation 8
15:   Update B_I by storing the L highest-rewarding trajectories seen so far
16:   Empty data buffer D ← {}
17: end for
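A minimal sketch of the trajectory-level replay mechanism is given below. It follows the convention of Section 5.1, where a stored trajectory is replayed with probability equal to the IET coefficient; the trajectory container and the helper run_policy (which rolls out the current policy and returns the trajectory and its episodic return) are illustrative assumptions rather than parts of the released implementation.

```python
import random

class TopLImitationBuffer:
    """Imitation buffer B_I that keeps the L highest-return trajectories seen so far."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []                       # list of (return, trajectory) pairs

    def add(self, trajectory, episode_return):
        self.items.append((episode_return, trajectory))
        self.items.sort(key=lambda x: x[0], reverse=True)
        self.items = self.items[: self.capacity]

    def sample(self):
        return random.choice(self.items)[1]

def fill_data_buffer(policy, env, imitation_buffer, xi=0.3):
    """One epoch of data collection in Algorithm 2: replay a stored success with
    probability xi (imitation), otherwise roll out the current policy (exploration)."""
    if imitation_buffer.items and random.random() < xi:
        return imitation_buffer.sample()
    trajectory, ep_return = run_policy(policy, env)   # assumed helper, not shown here
    imitation_buffer.add(trajectory, ep_return)
    return trajectory
```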
We can now define the surrogate PPO objective L^PPO(θ_k) in a way analogous to equation 6. The policy parameter θ_k is then updated via gradient descent:

$$ \theta_{k+1} \leftarrow \theta_{k} - \eta \nabla_{\theta} L^{\mathrm{PPO}}(\theta_{k}), \qquad (8) $$

where η is the learning rate. Finally, the data buffer is emptied, the imitation buffer B_I is updated by storing the top L rewarding trajectories seen so far, where L is the capacity of B_I, and the agent is prepared for the (k+1)-th update epoch. Observe that the update procedure of B_I resembles a first-in first-out (FIFO) queuing mechanism. Algorithm 2 summarizes the entire process.

In our algorithm, the process of "replay" refers to including the trajectories stored in the imitation buffer B_I into the data buffer D and treating them as potential behaviors generated by the current policy. This ensures that the agent encounters successful trajectories at each epoch to guide policy learning, even in sparse-reward environments where most interactions yield little to no reward. This selective reuse of high-rewarding trajectories improves sample efficiency and reduces the risk of the agent getting stuck in unproductive exploration loops. Our REPLAY strategy thus offers a robust framework for learning in sparse-reward environments.

5 Experiments

In this section, we aim to answer the following questions:

• Does bootstrapping policy learning with a few of its past experiences enhance sample efficiency and hard exploration across diverse tasks?

• Can a single past successful behavior be sufficient to guide policy learning in complex sequential continuous control tasks?

• Is replaying past successful trajectories sufficient for policy learning in multi-goal and partially observable, sparse reward settings?

5.1 Implementation Details

For dense reward environments in MuJoCo (Towers et al., 2023), we implement the Match strategy. The network architecture utilizes a multi-layer perceptron (MLP) with two hidden layers containing 64 units and tanh as the activation function. The PPO policy is updated over 10 epochs per training iteration. Training batches are sampled uniformly or prioritized based on the optimal transport distance between the current trajectories and the best past episodic rollouts, controlled by the Imitation-Exploration Trade-off coefficient ϵ at each epoch. The imitation buffer is initialized with size 1. Further details about the hyperparameters and implementation can be found in the supplementary material, Table 2.
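The MLP actor described above (two hidden layers of 64 tanh units) corresponds to a standard PPO policy head; a minimal PyTorch sketch is shown below. The diagonal Gaussian output head with a state-independent log-standard-deviation is an assumption for the continuous-control case, not a detail specified in the main text.

```python
import torch
import torch.nn as nn

class MLPGaussianPolicy(nn.Module):
    """Two hidden layers of 64 tanh units, as used for the MuJoCo experiments."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        self.mu = nn.Linear(64, act_dim)                   # action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs):
        h = self.backbone(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

    def log_prob(self, obs, act):
        return self.forward(obs).log_prob(act).sum(-1)
```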
For sparse-reward settings, such as the multi-goal PointMaze navigation tasks (de Lazcano et al., 2023), we adopt the Replay strategy. This setup uses the same MLP-based architecture as the MuJoCo environments. Similarly, for the Animal-AI Olympics (Crosby et al., 2019), a partially observable 3D environment with binary rewards, we apply the Replay strategy but with a different architecture: a three-layer convolutional neural network (CNN). The input comprises the last four stacked frames (84 × 84 RGB pixels). In both the PointMaze and Animal-AI tasks, the rollout buffer is populated with trajectories sampled either from the current policy with probability 1 − ϵ or from the imitation buffer B_I with probability ϵ. The imitation buffer is initialized with size 10. Comprehensive details of the hyperparameters for all environments are provided in the supplementary material, Table 3.

5.2 Choice of Baselines

Self-imitation learning (SIL) has primarily been explored to enhance exploration in off-policy reinforcement learning (RL) algorithms, with limited focus on on-policy RL algorithms. Further, recent works (Luo & Schomaker, 2023; Xiao et al., 2024; Shi et al., 2023) predominantly focus on problem-specific adoption of SIL rather than advancing SIL from a broader algorithmic perspective. Notably, there remains limited analysis of SIL's potential for on-policy RL algorithms across diverse problem settings, leaving a significant gap in understanding its general applicability. This limits the choice of baselines for our problem.

PPO: PPO is the vanilla proximal policy optimization algorithm (Schulman et al., 2017), which does not impose any self-imitation learning paradigm.

SIL-PPO: SIL (Oh et al., 2018) is an off-policy RL algorithm that imitates past state-action pairs whose returns are higher than the agent's value estimates. The approach was also extended to the on-policy PPO (Schulman et al., 2017) algorithm with a focus on dense or delayed rewards. Further, as highlighted by (Oh et al., 2018), SIL lacks a theoretical connection with on-policy algorithms.

SVPG-PPO: SVPG (Gangwani et al., 2018) is a self-imitating on-policy algorithm that uses Stein variational gradient descent to minimize the divergence between the current policy's visitation distribution and that of past high-return trajectories. Unlike SIPP, SVPG introduces an auxiliary objective that regularizes this divergence, potentially complicating the learning process.

PER-PPO: The Prioritized Experience Replay (PER) (Schaul et al., 2015) technique uses TD-error-based transition prioritization. We extended this method to PPO, prioritizing samples in the rollout buffer based on TD error. We use a strategy similar to our method to balance exploration and exploitation.

5.3 Performance of Match on Continuous Control Tasks

In this section, we investigate the effect of self-imitation on continuous control tasks with dense rewards. We evaluate the performance of our SIPP-Match strategy across 10 MuJoCo (Towers et al., 2023) tasks, using the selected baselines.
Compared with all the baselines, the performance of the Match strategy on continuous control tasks is shown in Figure 1. The proposed Match algorithm outperforms PPO (Schulman et al., 2017) and SIL-PPO across all tasks, with SVPG-PPO lagging in most tasks, except for competitive performance on Walker2d-v4 and Humanoid-v4. PER-PPO (Schaul et al., 2015), which prioritizes transitions by TD error computed from value function estimates, lags across all tasks.

Figure 1 (panels: MountainCarContinuous-v0, Ant-v4, HalfCheetah-v4, Hopper-v4, InvertedDoublePendulum-v4, InvertedPendulum-v4, HumanoidStandup-v4, Walker2d-v4): Performance on 8 MuJoCo (Towers et al., 2023) continuous control tasks (refer to Figure 7 for results on all tasks). The plots show the learning curves, with episodic rewards on the y-axis, evaluated under the current policy. The reported results are the mean across seven seeds, with shaded regions indicating the standard deviation. The proposed algorithms outperform all baselines across all tasks, achieving competitive or better performance.

In MuJoCo benchmark environments, the agent benefits from continuous feedback via a smooth and dense reward structure, facilitating faster exploration and learning. Despite this, our experiments demonstrate that the optimal transport distance, prioritizing self-bootstrapping, can further enhance exploration for the proximal policy. By prioritizing the most informative experiences, our method ensures that the agent focuses on high-value learning opportunities, accelerating convergence and improving policy robustness. Additionally, the proposed approach stores only the states visited by the most rewarding past episodic rollout, making it simpler than prior methods. Unlike our approach, (Oh et al., 2018) compares returns of past experiences with the agent's value estimates to select experiences for self-imitation, which can be noisy and introduce bias into policy learning (Libardi et al., 2021; Raileanu & Fergus, 2021). Furthermore, (Gangwani et al., 2018) uses Stein variational gradient descent as a regularizer to minimize the divergence between the state-action visitation distributions of the current policy and past rewarding experiences. However, their approach introduces bias into policy learning, which they address by simultaneously learning multiple diverse policies.

In summary, the proposed Match algorithm integrates seamlessly with PPO without introducing additional learning parameters and requires only one trajectory to guide self-imitation learning. It introduces a single hyperparameter, the IET coefficient (ϵ), to control whether training batches are uniformly sampled or prioritized using the optimal transport distance to the best past episodic rollout. This single hyperparameter provides a simple mechanism to balance exploration and exploitation. The approach offers a practical and efficient solution to enhance reinforcement learning performance in complex environments.

5.4 Performance of Replay in Sparse Reward Tasks

In this section, we empirically evaluate self-imitation performance in sparse and binary reward settings. We believe that self-imitation can play a crucial role in such reward settings, as achieving even occasional success can be extremely difficult for an agent under sparse rewards.
Previous work was limited to MuJoCo environments with dense or delayed rewards. Motivated by this, we evaluate the performance of SIPP-Replay on a diverse set of tasks, including the multi-goal Gymnasium-Robotics PointMaze navigation sparse-reward environments (de Lazcano et al., 2023) and the partially observable 3D Animal-AI Olympics binary-reward environments (Crosby et al., 2019).

5.4.1 Task Definitions

The PointMaze environment is a 2-dimensional maze. We use two variants of the PointMaze environment: first, with a fixed agent position and a varying goal position, i.e., the goal position is reinitialized at each episode; second, with both the goal and agent positions reinitialized after every reset.

The Animal-AI Olympics (Crosby et al., 2019) is a partially observable 3D environment in which an agent can navigate freely within an arena. We designed 5 experiments in total. Each experiment has a different level of complexity based on the type of obstacles present in the arena.

Figure 2 (panels: (a) Goal, (b) Goal-behind wall, (c) Goal-tunnel, (d) Goal-occluded tunnel, (e) Goal-on wall): All tasks feature one goal and one agent. The agent's and goal's positions are randomly selected at the start of each episode from a predefined set of fixed initial positions. Each episode initializes the environment by sampling these positions, ensuring variability while maintaining a structured distribution. There is only one source of reward per environment, i.e., a binary reward is provided for reaching the goal. The agent observes the arena through a first-person view with partial visibility, reflecting the limitations of a partially observable environment.

The descriptions of the playgrounds are as follows:

• Goal: In this arena, the agent has to reach the goal position. The agent and goal can be anywhere in the arena. There are no obstacles in the arena.

• Goal-behind wall: The goal is hidden behind a wall in this arena. The agent and goal positions are different in each configuration. The agent needs to learn to find the goal, which is hidden behind the wall.

• Goal-tunnel: This arena has a transparent tunnel open at both ends. The agent cannot penetrate the tunnel walls and must enter the tunnel to reach the goal.

• Goal-occluded tunnel: This arena is identical to the previous one, except that the tunnel entrances are occluded with movable boxes.
The agent must learn to move the boxes to find the goal inside the tunnel.

• Goal-on wall: In this arena, we place the goal on an L-shaped wall. The agent must learn to find a ramp to climb up the wall, and to avoid falling off the wall, in order to reach the goal.

5.4.2 Empirical Analysis

The performance of SIPP-Replay is shown in Figures 3 and 4.

Figure 3 (panels: PointMaze_Open_Diverse_GR-v3, PointMaze_Medium_Diverse_G-v3, PointMaze_Medium_Diverse_GR-v3, PointMaze_Large_Diverse_G-v3): Performance on 4 PointMaze (de Lazcano et al., 2023) multi-goal sparse reward tasks (refer to Figure 8 for results on all tasks). The plots show the learning curves, with episodic rewards on the y-axis, evaluated under the current policy. The reported results are the mean across seven different seeds. The proposed algorithms outperform all the baselines by a significant margin.

Figure 4 (panels: Goal, Goal-behind wall, Goal-tunnel, Goal-occluded tunnel): Performance on 4 Animal-AI Olympics (Crosby et al., 2019) binary reward tasks (refer to Figure 9 for results on all tasks). The plots show the learning curves, with episodic rewards (success rate) on the y-axis, evaluated under the current policy. The reported results are the mean across 5 seeds, with shaded regions highlighting the standard deviation. The proposed algorithms outperform PPO by a significant margin.

The choice of baselines for the PointMaze environment is consistent with the MuJoCo tasks, as both involve fully observable MDPs. However, for the Animal-AI Olympics task, the choice of baselines is restricted to PPO. This limitation arises because the official implementations of the SIL and SVPG baselines are tailored to fully observable environments like MuJoCo and do not support the Animal-AI Olympics. Further, baselines such as SIL and SVPG rely on explicit divergence estimation or advantage comparisons over fully observed states, making them less suitable for partially observable environments. In contrast, SIPP's replay strategy treats stored trajectories as expert demonstrations and integrates them directly into PPO's training without requiring density estimation or changes to the underlying architecture. This allows SIPP to operate naturally in POMDP settings, where modeling state visitation distributions is non-trivial or infeasible.

To ensure fairness, we adapted the SIL baseline (Oh et al., 2018) by using observation histories as proxies for states in the Animal-AI Olympics environment. However, even with this adaptation, SIL performed worse than vanilla PPO. This degradation stems from SIL's reliance on either dense reward signals or access to true states to guide its imitation strategy, conditions that are not satisfied in our partially observable setting.
This comparison highlights the broader applicability of our method: SIPP does not assume full observability or reward density and is, to our knowledge, the first self-imitation approach successfully deployed in such a complex, diverse POMDP environment.

In our experiments, the Imitation-Exploration Trade-off coefficient IET (ϵ) was set to 0.3 for all PointMaze tasks except PointMaze_Medium_Diverse_G, where ϵ = 0.1 was used based on preliminary experiments. The performance of our proposed algorithm exceeds all the baselines on the maze navigation tasks, as shown in Figure 3. We believe that the long-term dependency problem plays a crucial role under sparse reward conditions, and we tackle this by replaying past trajectories. Unlike baseline methods, which prioritize state-action pairs, SIPP focuses on episodic trajectory-level prioritization rather than state visitation distributions. This helps the agent understand which actions contribute to future rewards. When incorporating successful trajectories into the replay buffer, we treat them as possible behaviors of the agent in the current environment.

The Replay strategy yields even stronger performance (Figure 4) in partially observable environments due to its inherent ability to adapt to partial observability. The results show that PPO agents encounter some success but fail to learn due to policy instability. Policy instability refers to the divergence of PPO's policy from successful behaviors caused by frequent updates with new data, which overwrite past successes. The proposed Replay strategy, however, stores these behaviors and repeatedly replays them. This reinforces successful behaviors during policy learning, and the agent eventually learns to mimic them.

To summarize, the results show that self-imitation can help agents learn in both dense and sparse reward settings. In a dense reward setting, prioritizing state visitations that match past successful states is sufficient, as a dense reward structure can guide the agent to learn the long-term consequences of actions taken in those states. However, a more exploitative strategy, such as replaying past successful episodic trajectories, is required in sparse reward settings.

5.5 Tuning Self-Imitation vs. Exploration

The proposed strategy exploits the agent's past behaviors. However, it is crucial for the agent to learn an optimal policy. This balance between exploitation and exploration in SIPP is achieved through the Imitation-Exploration Trade-off (IET) coefficient ϵ. In SIPP-Match, this parameter indicates the probability of sampling training batches uniformly or with a priority proportional to the optimal transport distance to the most successful past trajectory. In SIPP-Replay, this parameter controls the trajectory replay probability from the imitation buffer B_I.

The effect of IET for the SIPP-Match strategy is shown in Figure 5.

Figure 5 (panels: MountainCarContinuous-v0, Ant-v4, HalfCheetah-v4, Hopper-v4): Ablation study on 4 MuJoCo (Towers et al., 2023) continuous control tasks (refer to Figure 10 for complete results). The parameter ϵ controls the balance between exploration and exploitation. The plots show the learning curves, with episodic rewards on the y-axis, evaluated under the current policy with different ϵ. The reported results are the mean across 5 seeds, with shaded regions highlighting the standard deviation.

The ablation study shows the maximum performance improvement for ϵ = 0.1, 0.2, or 0.3 across all tasks. This highlights the importance of exploration, as greedy imitation results in sub-optimal performance. However, for simpler tasks such as MountainCarContinuous, the performance is mostly similar, as these simpler environments require less exploration, and even an aggressive imitation strategy results in similar performance.
A similar analysis was performed to find the balance between exploration and exploitation for the PointMaze navigation environments. The results of this ablation study are shown in Figure 6.

Figure 6 (panels: PointMaze_Open_Diverse_GR-v3, PointMaze_Medium_Diverse_G-v3, PointMaze_Medium_Diverse_GR-v3, PointMaze_Large_Diverse_G-v3): Ablation study on 4 PointMaze (de Lazcano et al., 2023) multi-goal sparse reward tasks (refer to Figure 11 for complete results). The parameter ϵ controls the replay frequency to balance exploration versus exploitation. The plots show the learning curves, with episodic rewards on the y-axis, evaluated under the current policy with different ϵ. The reported results are across 5 different seeds.

We did not perform a similar analysis for the Animal-AI Olympics environment; its IET coefficient was kept at ϵ = 0.3 based on the insights from the above ablation studies.

In conclusion, our ablation study on ϵ highlights its influence on SIPP's performance. In dense reward environments like MuJoCo, smaller values ϵ = 0.1, 0.2 perform well, as frequent rewards naturally guide exploration. Conversely, in sparse-reward settings like PointMaze, a higher value of ϵ = 0.3 improves outcomes by emphasizing imitation of rare successful trajectories. These results indicate that ϵ should be adjusted based on the task's reward structure and exploration demands.

6 Conclusion

This paper proposes a self-imitating proximal policy framework to address exploration and sample-efficiency challenges in dense and sparse-reward environments. Through extensive experimentation, we demonstrated that bootstrapping policy learning from past rewarding experiences effectively reduces policy divergence, leading to enhanced exploration and stability. The simplicity and efficacy of the proposed algorithm highlight its versatility across different problem settings. Furthermore, we showed that self-imitation and exploration are inherently complementary, enabling agents to leverage prior successes for guided learning, which can be crucial in hard exploration tasks.

References

Per-Arne Andersen, Morten Goodwin, and Ole-Christoffer Granmo. Deep RTS: A game environment for deep reinforcement learning in real-time strategy games. In 2018 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. IEEE, 2018.

Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

Gaurav Chaudhary and Laxmidhar Behera. From novelty to imitation: Self-distilled rewards for offline reinforcement learning. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=F5K94JI2Jb.

Gaurav Chaudhary, Laxmidhar Behera, and Tushar Sandhan. Active perception system for enhanced visual signal recovery using deep reinforcement learning. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023. doi: 10.1109/ICASSP49357.2023.10097084.

Gaurav Chaudhary, Washim Uddin Mondal, and Laxmidhar Behera. MOORL: A framework for integrating offline-online reinforcement learning. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=PHsfZnF2FC.

Zhixin Chen and Mengxiang Lin. Self-imitation learning in sparse reward settings. arXiv preprint arXiv:2010.06962, 2020.

Matthew Crosby, Benjamin Beyret, and Marta Halina. The Animal-AI Olympics. Nature Machine Intelligence, 1(5):257–257, 2019.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.
Rodrigo de Lazcano, Kallinteris Andreas, Jun Jet Tai, Seungjae Ryan Lee, and Jordan Terry. Gymnasium Robotics, 2023. URL http://github.com/Farama-Foundation/Gymnasium-Robotics.

Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. Advances in Neural Information Processing Systems, 30, 2017.

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-Explore: A new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.

Johan Ferret, Olivier Pietquin, and Matthieu Geist. Self-imitation advantage learning. arXiv preprint arXiv:2012.11989, 2020.

Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309, 2018.

Yijie Guo, Junhyuk Oh, Satinder Singh, and Honglak Lee. Generative adversarial self-imitation learning. arXiv preprint arXiv:1812.00950, 2018.

Siddhant Haldar, Vaibhav Mathur, Denis Yarats, and Lerrel Pinto. Watch and match: Supercharging imitation with regularized optimal transport. In Conference on Robot Learning, pp. 32–43. PMLR, 2023.

Dong Han, Beni Mulyana, Vladimir Stankovic, and Samuel Cheng. A survey on deep reinforcement learning algorithms for robotic manipulation. Sensors, 23(7):3762, 2023.

Tao Huang, Kai Chen, Bin Li, Yun-Hui Liu, and Qi Dou. Guided reinforcement learning with efficient exploration for task automation of surgical robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 4640–4647. IEEE, 2023.

Chun-Yao Kang and Ming-Syan Chen. Balancing exploration and exploitation in self-imitation learning. In Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11–14, 2020, Proceedings, Part II, pp. 274–285. Springer, 2020.

Elia Kaufmann, Antonio Loquercio, Rene Ranftl, Alexey Dosovitskiy, Vladlen Koltun, and Davide Scaramuzza. Deep drone racing: Learning agile flight in dynamic environments. In Conference on Robot Learning, pp. 133–145. PMLR, 2018.

Kibeom Kim, Kisung Shin, Min Whoo Lee, Moonhoen Lee, Min Whoo Lee, and Byoung-Tak Zhang. Visual hindsight self-imitation learning for interactive navigation. IEEE Access, 12:83796–83809, 2023. URL https://api.semanticscholar.org/CorpusID:265696304.

Yao Li, YuHui Wang, and XiaoYang Tan. Self-imitation guided goal-conditioned reinforcement learning. Pattern Recognition, 144:109845, 2023.

Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

Gabriele Libardi, Gianni De Fabritiis, and Sebastian Dittert. Guided exploration with proximal policy optimization using a single demonstration. In International Conference on Machine Learning, pp. 6611–6620. PMLR, 2021.

Hao Lin, Yue He, Fanzhang Li, Quan Liu, Bangjun Wang, and Fei Zhu. Taking complementary advantages: Improving exploration via double self-imitation learning in procedurally-generated environments. Expert Systems with Applications, 238:122145, 2023. URL https://api.semanticscholar.org/CorpusID:264321073.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321, 1992.

Lu Ling, Washim Uddin Mondal, and Satish V Ukkusuri. Cooperating graph neural networks with deep reinforcement learning for vaccine prioritization. IEEE Journal of Biomedical and Health Informatics, 2024.
Sha Luo, Hamidreza Kasaei, and Lambert Schomaker. Self-imitation learning by planning. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 4823–4829. IEEE, 2021.

Shan Luo and Lambert Schomaker. Reinforcement learning in robotic motion planning by combined experience-based planning and self-imitation learning. Robotics and Autonomous Systems, 170:104545, 2023. URL https://api.semanticscholar.org/CorpusID:259137688.

Yicheng Luo, Zhengyao Jiang, Samuel Cohen, Edward Grefenstette, and Marc Peter Deisenroth. Optimal transport for offline imitation learning. arXiv preprint arXiv:2303.13971, 2023.

Washim U Mondal and Vaneet Aggarwal. Improved sample complexity analysis of natural policy gradient algorithm with general parameterization for infinite horizon discounted reward Markov decision processes. In International Conference on Artificial Intelligence and Statistics, pp. 3097–3105. PMLR, 2024.

Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE, 2018.

Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In International Conference on Machine Learning, pp. 3878–3887. PMLR, 2018.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Gabriel Peyré and Marco Cuturi. Computational optimal transport, 2020. URL https://arxiv.org/abs/1803.00567.

James Queeney, Yannis Paschalidis, and Christos G Cassandras. Generalized proximal policy optimization with sample reuse. Advances in Neural Information Processing Systems, 34:11909–11919, 2021.

Roberta Raileanu and Rob Fergus. Decoupling value and policy for generalization in reinforcement learning. In International Conference on Machine Learning, pp. 8787–8798. PMLR, 2021.

Tim Salimans and Richard Chen. Learning Montezuma's Revenge from a single demonstration. arXiv preprint arXiv:1812.03381, 2018.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Zijing Shi, Yunqiu Xu, Meng Fang, and Ling Chen. Self-imitation learning for action generation in text-based games. In Conference of the European Chapter of the Association for Computational Linguistics, 2023. URL https://api.semanticscholar.org/CorpusID:258378233.

Sumedh Sontakke, Jesse Zhang, Séb Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. RoboCLIP: One demonstration is enough to learn robot policies. Advances in Neural Information Processing Systems, 36, 2024.

Yunhao Tang. Self-imitation learning via generalized lower bound Q-learning. Advances in Neural Information Processing Systems, 33:13964–13975, 2020.

Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun K G, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium, March 2023. URL https://zenodo.org/record/8127025.
Balis, Gianluca de Cola, T ristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun K G, Markus Krimmel, Ro drigo P erez-Vicente, Andrea Pierré, Sander Sc h ulhoff, Jun Jet T ai, Andrew T an Jin Shen, and Omar G. Y ounis. Gymnasium, March 2023. URL https://zenodo.org/record/8127025 . Ik ec h ukwu Uchendu, T ed Xiao, Y ao Lu, Bangh ua Zh u, Mengyuan Y an, Joséphine Simon, Matthew Bennice, Ch uyuan F u, Cong Ma, Jiantao Jiao, et al. Jump-start reinforcement learning. In International Confer enc e on Machine L e arning , pp. 34556–34583. PMLR, 2023. T eng Xiao, Mingxiao Li, Yige Y uan, Huaisheng Zhu, Chao Cui, and V.G. Hona v ar. Ho w to leverage demon- stration data in alignmen t for large language mo del? a self-imitation learning p erspective. In Confer enc e on Empiric al Metho ds in Natur al L anguage Pr o c essing , 2024. URL https://api.semanticscholar.org/ CorpusID:273345418 . Mao Xu, Sh uzhi Sam Ge, Dongjie Zhao, and Qian Zhao. Improv ed exploration with demonstrations in pro cedurally-generated environmen ts. IEEE T r ansactions on Games , 2023. Allan Zhou, Eric Jang, Daniel Kappler, Alex Herzog, Mohi Khansari, Paul W ohlhart, Y unfei Bai, Mrinal Kalakrishnan, Sergey Levine, and Chelsea Finn. W atch, try , learn: Meta-learning from demonstrations and rew ard. arXiv pr eprint arXiv:1906.03352 , 2019. 16 A Integrating SIPP with Hard Exploration T echniques In this section, we inv estigate in tegrating the Self-Imitating Proximal Policy (SIPP) with Random Net work Distillation (RND), a tec hnique that enhances exploration in reinforcement learning by encouraging agen ts to visit nov el states. This com bination leverages SIPP’s self-imitation mechanism, which reinforces past successful b eha viors, and RND’s intrinsic motiv ation, whic h promotes exploration, to improv e p erformance in en vironments with c hallenging exploration requirements. Our approac h combines t wo core comp onents: • Self-Imitating Proximal Policy (SIPP): SIPP enhances p olicy optimization by prioritizing high-return tra jectories from the agen t’s past exp eriences. In sparse reward settings, such as A tari games, we emplo y the Repla y strategy , main taining an imitation buffer of successful tra jectories (defined by cum ulativ e extrinsic rewards) that are selectively replay ed during p olicy up dates using Proximal P olicy Optimization (PPO). • Random Net work Distillation (RND): RND generates an in trinsic reward signal based on the pre- diction error b etw een a fixed random netw ork and a trainable net w ork, incentivizing the agent to explore no vel states b y quantifying their unfamiliarity . In the RND+SIPP framework, the agen t’s total reward at eac h timestep is the sum of the extrinsic rew ard from the en vironmen t and the in trinsic rew ard from RND. The p olicy is updated via PPO with the SIPP Repla y strategy , where the imitation buffer stores tra jectories based solely on their cumulativ e extrinsic rew ards. This ensures that SIPP reinforces b eha viors that lead to tangible en vironmental success, while RND indep enden tly explores nov el regions, mitigating p oten tial conflicts b et ween exploitation and exploration ob jectiv es. W e ev aluated RND+SIPP on three Atari 2600 games from the Arcade Learning Environmen t—Gra vitar, V enture, and Solaris—selected for their sparse rew ards and exploration challenges. The results of RND w ere tak en from (Burda et al., 2018). 
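To make the reward combination concrete, the sketch below illustrates the two pieces described above: an RND-style novelty bonus added to the environment reward before the PPO update, and an imitation buffer ranked only by extrinsic return. This is a minimal illustration under our own naming (RNDBonus, ImitationBuffer, intrinsic_coef); it is not the paper's reference implementation, and the actual SIPP/RND code may organize these steps differently.

```python
import numpy as np


class RNDBonus:
    """Minimal RND-style novelty bonus: a fixed random embedding (target) and a
    trainable predictor; the mean squared prediction error is the intrinsic reward."""

    def __init__(self, obs_dim, feat_dim=32, lr=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(obs_dim, feat_dim))     # fixed random network
        self.predictor = rng.normal(size=(obs_dim, feat_dim))  # trained online
        self.lr = lr

    def intrinsic_reward(self, obs):
        err = obs @ self.predictor - obs @ self.target
        return float(np.mean(err ** 2))

    def update(self, obs):
        # One gradient step on the mean squared prediction error.
        err = obs @ self.predictor - obs @ self.target          # shape: (feat_dim,)
        self.predictor -= self.lr * np.outer(obs, 2.0 * err) / err.size


class ImitationBuffer:
    """Keeps the best self-encountered trajectories, ranked purely by cumulative
    extrinsic reward; the intrinsic bonus never enters this ranking."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.items = []  # list of (extrinsic_return, trajectory)

    def maybe_add(self, trajectory, extrinsic_return):
        self.items.append((extrinsic_return, trajectory))
        self.items.sort(key=lambda item: item[0], reverse=True)
        del self.items[self.capacity:]


def total_reward(extrinsic, intrinsic, intrinsic_coef=1.0):
    """Per-timestep reward handed to PPO in the RND+SIPP setup: r_ext + c * r_int."""
    return extrinsic + intrinsic_coef * intrinsic
```

In this reading, total_reward is what the PPO advantage estimator consumes during training, while maybe_add is called once per finished episode with the un-bonused return, so the buffer reflects only environmental success.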
We did not perform an ablation study on the SIPP hyperparameters for this experiment; the IET coefficient was fixed at 0.1 across tasks and the imitation buffer size at 1. In the Gravitar task, RND+SIPP achieves an 11.7% improvement, leveraging SIPP's reinforcement of successful trajectories alongside RND's exploration. Venture performance is comparable, with a slight 2.5% decrease, suggesting that task-specific tuning of ϵ may be needed. In Solaris, a 9.3% gain highlights the benefit of combining imitation with exploration in complex state spaces.

Table 1: Performance comparison on hard-exploration tasks.

Task       RND    RND+SIPP
Gravitar   3906   4363
Venture    1859   1813
Solaris    3282   3589

The integration of SIPP with RND demonstrates that combining self-imitation learning with intrinsic motivation provides a dual benefit. On the one hand, SIPP ensures that the agent leverages its past successes to stabilize policy updates. On the other hand, RND continually drives the agent to explore unvisited or less familiar regions of the state space. The trade-off between these components is controlled by the Imitation-Exploration Trade-off coefficient, enabling task-specific tuning.

The analysis suggests that while RND alone can foster exploration, it does not by itself prevent the agent from drifting away from previously effective strategies. RND+SIPP overcomes this limitation by continually reinforcing high-value behaviors, thereby improving overall performance. Future work may involve dynamically adapting the balance between intrinsic and extrinsic rewards based on the observed learning dynamics, thereby further refining the exploration-exploitation trade-off.

B Guidelines for Tuning the ϵ-Greedy Parameter in SIPP

The hyperparameter ϵ governs the balance between imitation and exploration in SIPP. To tune it effectively, we suggest the following:

• Baseline setting: Start with ϵ = 0.3, which works reasonably well across diverse tasks.

• Dense-reward tasks: For environments with frequent rewards (e.g., MuJoCo), reduce ϵ to 0.1–0.2 to prioritize refining known behaviors over excessive exploration.

• Sparse-reward tasks: In hard-exploration scenarios (e.g., PointMaze), increase ϵ to 0.3–0.5 to leverage imitation of scarce successes.

• Stability check: If training shows high variance in success rates, consider reducing ϵ to stabilize learning via stronger imitation.

• Cross-validation: For best results, perform a grid search over values such as 0.1, 0.3, and 0.5, especially in critical applications.

These guidelines, grounded in our empirical findings, should enhance SIPP's reproducibility and usability.
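As one possible illustration of how ϵ could enter the sparse-reward Replay training loop, the sketch below performs an ϵ-greedy coin flip between an imitation update (replaying a uniformly sampled stored trajectory) and an exploratory on-policy update (a fresh rollout). We assume here that ϵ is the probability of taking the replay branch, following the replay-frequency reading in Figure 11; if the implementation instead treats ϵ as the exploration probability, the branches simply swap. The function name choose_update_source and the granularity of the coin flip are our own assumptions, not the paper's code.

```python
import random


def choose_update_source(epsilon, imitation_buffer, collect_rollout):
    """With probability epsilon, replay a uniformly sampled successful trajectory
    from the imitation buffer; otherwise collect a fresh on-policy rollout.

    `collect_rollout` is a zero-argument callable returning new on-policy data.
    This is an illustrative reading of the epsilon trade-off, not the reference
    implementation; the direction of the coin flip is an assumption.
    """
    if imitation_buffer and random.random() < epsilon:
        return "replay", random.choice(imitation_buffer)
    return "rollout", collect_rollout()


# Toy usage with a dummy buffer and rollout collector (purely illustrative).
buffer = [["stored trajectory 1"], ["stored trajectory 2"]]
source, data = choose_update_source(0.3, buffer, collect_rollout=lambda: ["fresh rollout"])
print(source, data)
```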
Figure 7: Results show the performance on 10 MuJoCo (Towers et al., 2023) continuous control tasks (MountainCarContinuous-v0, Ant-v4, HalfCheetah-v4, Hopper-v4, InvertedDoublePendulum-v4, InvertedPendulum-v4, HumanoidStandup-v4, Walker2d-v4, Pusher-v4, Humanoid-v4). The plots are learning curves with episodic rewards on the y-axis, evaluated under the current policy. The reported results are the mean across seven different seeds. The proposed algorithm is competitive with or better than all baselines across all tasks.

Table 2: Parameters for the Animal-AI Olympics environment.

Parameter                        Value
episode length                   1000
image size (RGB)                 84 × 84 × 3
initial reward threshold         0
frame-skip                       2
frame-stack                      4
discount factor (γ)              0.99
gae-lambda (λ)                   0.95
value loss coefficient (c_1)     0.1
entropy loss coefficient (c_2)   0.02
learning rate                    1e-4
ppo-epoch                        4
number-mini-batch                7
value-clip                       0.15
policy-clip                      0.15
buffer size (B_I)                10

Table 3: PPO hyper-parameters (a configuration sketch using these values is given after Figure 8).

Parameter              Value
learning rate          3e-4
n-steps                2048
batch size             64
n-epochs               10
discount factor        0.99
gae-lambda             0.95
clip-range             0.2
normalize advantage    True
vf-coef                0.5
max-grad-norm          0.5

Figure 8: Results show the performance on all 5 PointMaze multi-goal sparse reward tasks (PointMaze_Open_Diverse_GR-v3, PointMaze_Medium_Diverse_G-v3, PointMaze_Medium_Diverse_GR-v3, PointMaze_Large_Diverse_G-v3, PointMaze_Open_Diverse_G-v3). The plots show learning curves with episodic rewards on the y-axis, evaluated under the current policy. The reported results are across seven different seeds. The proposed algorithm outperforms all baselines by a significant margin.
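The parameter names in Table 3 (n-steps, n-epochs, clip-range, vf-coef, max-grad-norm, normalize advantage) match the Stable-Baselines3 PPO interface, so the snippet below shows how a PPO baseline with exactly these values might be instantiated. The use of Stable-Baselines3 and of HalfCheetah-v4 as the example environment are our assumptions for illustration; the text does not state which PPO implementation was used for the baseline.

```python
# Assumes Stable-Baselines3 and Gymnasium are installed; the library choice and the
# example task are our inferences, not stated in the paper.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("HalfCheetah-v4")
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,            # discount factor
    gae_lambda=0.95,       # GAE parameter
    clip_range=0.2,
    normalize_advantage=True,
    vf_coef=0.5,
    max_grad_norm=0.5,
)
model.learn(total_timesteps=1_000_000)
```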
Figure 9: Results show the performance on all 5 Animal-AI Olympics sparse reward tasks (Goal, Goal-behind wall, Goal-tunnel, Goal-occluded tunnel, Goal-on wall). The plots show learning curves, with episodic rewards (success rate) on the y-axis, evaluated under the current policy. The reported results are across five different seeds. The proposed algorithm outperforms all baselines by a significant margin.

Figure 10: Results show the ablation study on 8 MuJoCo (Towers et al., 2023) continuous control tasks (MountainCarContinuous-v0, Ant-v4, HalfCheetah-v4, Hopper-v4, InvertedDoublePendulum-v4, Humanoid-v4, Walker2d-v4, HumanoidStandup-v4). The parameter ϵ controls the balance between exploration and exploitation. The plots show learning curves with episodic rewards on the y-axis, evaluated under the current policy with different values of ϵ. The reported results are the mean across five different seeds.

Figure 11: Results show the ablation study on all 5 PointMaze multi-goal sparse reward tasks (PointMaze_Open_Diverse_GR-v3, PointMaze_Medium_Diverse_G-v3, PointMaze_Medium_Diverse_GR-v3, PointMaze_Large_Diverse_G-v3, PointMaze_Open_Diverse_G-v3). The parameter ϵ controls the replay frequency to balance exploration and exploitation. The plots show learning curves with episodic rewards on the y-axis, evaluated under the current policy with different values of ϵ. The reported results are across five different seeds.