Compositional Planning with Jumpy World Models


Jesse Farebrother²,³,*, Matteo Pirotta¹, Andrea Tirinzoni¹, Marc G. Bellemare²,³,†, Alessandro Lazaric¹, Ahmed Touati¹
¹FAIR at Meta, ²Mila – Québec AI Institute, ³McGill University
*Work done at Meta, †CIFAR AI Chair

The ability to plan with temporal abstractions is central to intelligent decision-making. Rather than reasoning over primitive actions, we study agents that compose pre-trained policies as temporally extended actions, enabling solutions to complex tasks that no constituent alone can solve. Such compositional planning remains elusive, as compounding errors in long-horizon predictions make it challenging to estimate the visitation distribution induced by sequencing policies. Motivated by the geometric policy composition framework introduced in Thakoor et al. (2022), we address these challenges by learning predictive models of multi-step dynamics (so-called jumpy world models) that capture state occupancies induced by pre-trained policies across multiple timescales in an off-policy manner. Building on Temporal Difference Flows (Farebrother et al., 2025), we enhance these models with a novel consistency objective that aligns predictions across timescales, improving long-horizon predictive accuracy. We further demonstrate how to combine these generative predictions to estimate the value of executing arbitrary sequences of policies over varying timescales. Empirically, we find that compositional planning with jumpy world models significantly improves zero-shot performance across a wide range of base policies on challenging manipulation and navigation tasks, yielding, on average, a 200% relative improvement over planning with primitive actions on long-horizon tasks.

1 Introduction

In recent years, the success of large-scale foundation models in domains such as computer vision (e.g., Radford et al., 2021; Ravi et al., 2025; Assran et al., 2025) and natural language processing (e.g., Meta, 2024; OpenAI, 2024; Google DeepMind, 2025) has inspired a similar shift in Reinforcement Learning (RL). Foundation policies pre-trained on diverse unlabeled data or via unsupervised objectives can now generalize to a wide range of downstream tasks without additional training or explicit planning. This has led to remarkable progress in areas like humanoid control (e.g., Peng et al., 2021, 2022; Tessler et al., 2023, 2024; Tirinzoni et al., 2025; Alegre et al., 2025) and real-world robotics (e.g., Brohan et al., 2023b,a; Luo et al., 2024; Ghosh et al., 2024; Black et al., 2024, 2025; Physical Intelligence et al., 2025; NVIDIA, 2025; Li et al., 2026). Despite these advances, a key limitation persists: while foundation policies can handle many tasks out of the box, they often fall short when faced with complex, long-horizon problems that require reasoning over extended sequences of decisions. In such cases, the planning horizon of a single policy is insufficient, and agents must compose policies to achieve their goals (Schmidhuber, 1991; Singh, 1992; Dayan and Hinton, 1992; Kaelbling, 1993b,a; Parr and Russell, 1997; Dietterich, 1998; Sutton et al., 1999; Precup, 2000). Hierarchical RL (Barto and Mahadevan, 2003; Klissarov et al., 2025), including the options framework (Sutton et al., 1999; Precup, 2000), aims to achieve compositionality by training task-specific high-level policies to leverage manual (e.g., Nachum et al., 2018; Barreto et al., 2019; Carvalho et al., 2023; Park et al., 2023, 2025b) or automatic (e.g., Bacon et al., 2017; Machado et al., 2017, 2018, 2023; Bagaria et al., 2021; Sutton et al., 2023) task decompositions.
In this paper, we take a fundamentally different approach: instead of learning task-specific hierarchies, we develop a framework for direct compositional planning over parameterized policies, requiring no task-specific training. By learning "jumpy" multi-step dynamics models, also known as "jumpy world models" (Murphy, 2024), we enable flexible composition of existing policies at planning time, transforming how agents tackle novel tasks through intelligent recombination of existing behavior.

To operationalize this idea, we propose learning a policy- and horizon-conditioned jumpy world model that captures the distribution of future states for all parameterized policies over a continuum of geometrically decaying time horizons (Janner et al., 2020; Thakoor et al., 2022). To make this possible, we first generalize the recent Temporal Difference Flow (Farebrother et al., 2025) framework using a novel consistency objective that enforces coherence between predictions at different timescales, consistently improving long-horizon predictions. Additionally, we develop a novel estimator of the value of executing an arbitrary sequence of policies, each with its own variable timescale. With these two pieces in place, we demonstrate how to plan over a wide range of parameterized policies, allowing us to flexibly compose behaviors to solve complex, long-horizon tasks without the need for further environment interaction or fine-tuning. Empirically, across multiple classes of base policies evaluated on a suite of OGBench navigation and manipulation tasks (Park et al., 2025a), behavior-level planning consistently improves over zero-shot performance, often by a large margin.
Finally, our approach outperforms state-of-the-art hierarchical baselines as well as alternative planning methods; in particular, planning with jumpy world models achieves a 200% relative improvement over action-level planning with a one-step world model on long-horizon tasks. These results demonstrate that planning with jumpy world models offers a powerful complement that is particularly effective for long-horizon decision-making.

2 Preliminaries

We model the environment as a reward-free discounted Markov Decision Process (MDP) defined as the 4-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, \gamma)$. Here, $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, respectively; $P : \mathcal{S} \times \mathcal{A} \to \mathscr{P}(\mathcal{S})$ characterizes the distribution over next states; and $\gamma \in [0, 1)$ is the discount factor. At each step $k$, the agent follows a policy $\pi : \mathcal{S} \to \mathscr{P}(\mathcal{A})$, generating a trajectory of state-action pairs $(S_k, A_k)_{k \ge 0}$ where $A_k \sim \pi(\cdot \mid S_k)$ and $S_k \sim P(\cdot \mid S_{k-1}, A_{k-1})$. When unambiguous, we use $S'$ for the immediate next state of $S$, and $S^+$ for a successor state at some future step $k > 0$. We use $\Pr(\cdot \mid S_0 = s, A_0 = a, \pi)$ and $\mathbb{E}[\cdot \mid S_0 = s, A_0 = a, \pi]$ to denote the probability and expectation over sequences induced by starting from $(s, a)$ and following $\pi$ thereafter.

Successor Measure. For a policy $\pi$ and an initial state-action pair $(s, a)$, the (normalized) successor measure (Dayan, 1993; Blier et al., 2021), denoted $m^\pi_\gamma(\cdot \mid s, a)$, is a probability measure over the state space $\mathcal{S}$. For any subset $X \subseteq \mathcal{S}$, it represents the cumulative discounted probability of visiting a state in $X$, with each visit discounted geometrically by its time of arrival. Formally, this is defined as:

$$m^\pi_\gamma(X \mid s, a) = (1 - \gamma) \sum_{k=0}^{\infty} \gamma^k \Pr(S_{k+1} \in X \mid S_0 = s, A_0 = a, \pi).$$

The normalization factor $1 - \gamma$ ensures $m^\pi_\gamma$ is a probability distribution.
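In the tabular case the definition above has a closed form, which makes the normalization easy to check numerically. The following is a minimal sketch (our own illustration, not code from the paper), using a Markov chain with the policy marginalized into a transition matrix $P$:

```python
import numpy as np

# Tabular successor measure: m = (1-gamma) * sum_k gamma^k P^{k+1},
# which equals the closed form (1-gamma) * P (I - gamma P)^{-1}.
rng = np.random.default_rng(0)
n, gamma = 4, 0.9
P = rng.dirichlet(np.ones(n), size=n)  # random row-stochastic matrix

m_series = (1 - gamma) * sum(gamma**k * np.linalg.matrix_power(P, k + 1)
                             for k in range(2000))  # truncated series
m_closed = (1 - gamma) * P @ np.linalg.inv(np.eye(n) - gamma * P)

assert np.allclose(m_series, m_closed)
assert np.allclose(m_closed.sum(axis=1), 1.0)  # each row is a distribution
```

The row-sum check confirms the role of the $1 - \gamma$ factor: without it, each row would sum to $1/(1-\gamma)$ rather than $1$.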
This admits an intuitive interpretation: rather than viewing $\gamma$ as a discount factor, one may equivalently consider an auxiliary process with a geometrically distributed lifetime, halting at each step with probability $1 - \gamma$ (Derman, 1970). Under this view, $m^\pi_\gamma(X \mid s, a)$ equivalently characterizes the probability that the state visited at the halting time lies in $X$. This yields a convenient reparameterization of the action-value function as:

$$Q^\pi_\gamma(s, a) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k r(S_{k+1}) \,\Big|\, S_0 = s, A_0 = a, \pi\Big] \equiv (1 - \gamma)^{-1}\, \mathbb{E}_{S^+ \sim m^\pi_\gamma(\cdot \mid s, a)}\big[r(S^+)\big], \tag{1}$$

expressing value as the expected reward at the geometrically distributed halting time, scaled by the average lifetime $(1 - \gamma)^{-1}$. Note that this equivalence is purely mathematical¹; the underlying MDP remains unchanged.

Geometric Horizon Model. A Geometric Horizon Model (GHM; Janner et al., 2020; Thakoor et al., 2022) instantiates a jumpy world model as a generative model of the successor measure. It can be learned off-policy via temporal-difference learning, exploiting the fact that $m^\pi_\gamma$ is a fixed point of the Bellman equation:

$$m^\pi_\gamma(\cdot \mid s, a) = (1 - \gamma)\, P(\cdot \mid s, a) + \gamma\, \mathbb{E}_{S' \sim P(\cdot \mid s, a),\, A' \sim \pi(\cdot \mid S')}\big[m^\pi_\gamma(\cdot \mid S', A')\big]. \tag{2}$$

Due to the Bellman equation's reliance on bootstrapping, the choice of generative model is critical for maintaining stability. Farebrother et al. (2025) demonstrate that prior approaches suffer from systemic bias at long horizons due to these bootstrapped predictions. To address this, Farebrother et al. (2025) propose the use of flow-matching techniques (Lipman et al., 2023; Albergo and Vanden-Eijnden, 2023), which construct probability paths that evolve smoothly from a source distribution to the desired target distribution.

¹This preserves expected cumulants and occupancy measures, but not trajectory-level statistics (Bellemare et al., 2023).
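The halting-time identity in Eq. (1) can be verified by direct simulation. Below is a hedged Monte Carlo sketch on a toy chain (our own illustration, with the policy folded into the transition matrix): the chain is run until a Geometric$(1-\gamma)$ clock halts, the reward at the halting state is recorded, and the rescaled average is compared against the exact discounted value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 3, 0.8
P = rng.dirichlet(np.ones(n), size=n)  # chain under the (fixed) policy
r = rng.uniform(size=n)

# Exact value: Q(s) = sum_k gamma^k E[r(S_{k+1})] = P (I - gamma P)^{-1} r
v_exact = P @ np.linalg.solve(np.eye(n) - gamma * P, r)

def halting_sample(s):
    """Step the chain; after each step, halt with probability 1 - gamma."""
    while True:
        s = rng.choice(n, p=P[s])
        if rng.random() < 1.0 - gamma:
            return r[s]

# Rescale the reward at the halting time by the mean lifetime 1/(1-gamma).
est = np.array([np.mean([halting_sample(s) for _ in range(10_000)])
                for s in range(n)]) / (1.0 - gamma)
print(np.max(np.abs(est - v_exact)))  # small, within Monte Carlo error
```

The agreement illustrates why a generative model of $m^\pi_\gamma$ suffices for policy evaluation: a single sample from the successor measure carries the same expected information as a full discounted rollout.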
By designing these paths to exploit structure in the temporal-difference target distribution, they show that bootstrapping bias can be controlled, enabling more accurate long-horizon predictions. In this framework, the GHM models a $d$-dimensional continuous state space² as an ordinary differential equation (ODE) parameterized by a time-dependent vector field $v_t : \mathbb{R}^d \times \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$. Sampling from the GHM begins by drawing initial noise $X_0 \in \mathbb{R}^d$ from a prior distribution $p_0 \in \mathscr{P}(\mathbb{R}^d)$ and subsequently following the flow $\psi_t : \mathbb{R}^d \times \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, defined by the following Initial Value Problem (IVP) for $t \in [0, 1]$:

$$\frac{\mathrm{d}}{\mathrm{d}t} \psi_t(X_0 \mid s, a) = v_t\big(\psi_t(X_0 \mid s, a) \mid s, a\big), \quad \psi_0(X_0 \mid s, a) = X_0 \iff \psi_t(X_0 \mid s, a) = X_0 + \int_0^t v_\tau\big(\psi_\tau(X_0 \mid s, a) \mid s, a\big)\, \mathrm{d}\tau.$$

We can solve this IVP using standard numerical integration techniques (Butcher, 2016). In doing so, we obtain an ODE-induced probability path defined as the pushforward $p_t := \psi_t(\cdot \mid S, A)_\sharp\, p_0(\cdot)$, i.e., the distribution of $\psi_t(X_0 \mid S, A)$ where $X_0 \sim p_0(\cdot)$. To ensure that $p_1$ coincides with the successor measure $m^\pi_\gamma$, Farebrother et al. (2025) propose to learn the parameterized vector field $v_t(\cdots; \theta)$ by minimizing the td-flow loss:

$$\ell_{\text{td-flow}}(\theta; \bar\theta) = (1 - \gamma)\, \mathbb{E}_{\substack{t \sim \mathcal{U}([0,1]),\, (S, A, S') \sim \mathcal{D} \\ X_0 \sim p_0(\cdot),\, X_t = (1 - t) X_0 + t S'}} \Big[ \big\| v_t(X_t \mid S, A; \theta) - (S' - X_0) \big\|^2 \Big] + \gamma\, \mathbb{E}_{\substack{t \sim \mathcal{U}([0,1]),\, (S, A, S') \sim \mathcal{D},\, A' \sim \pi(\cdot \mid S') \\ X_t \sim \psi_t(\cdot \mid S', A'; \bar\theta)_\sharp\, p_0(\cdot)}} \Big[ \big\| v_t(X_t \mid S, A; \theta) - v_t(X_t \mid S', A'; \bar\theta) \big\|^2 \Big], \tag{3}$$

over transitions sampled from the dataset $\mathcal{D}$ with non-trainable target parameters $\bar\theta$ (Mnih et al., 2015). Recall that the Bellman equation (2) defines the successor measure as a mixture distribution with weights $1 - \gamma$ and $\gamma$. The td-flow objective reflects this: the first term is a conditional flow-matching loss (Lipman et al., 2023) targeting the one-step transition kernel $P(\cdot \mid S, A)$, while the second is a marginal flow-matching term targeting the bootstrapped successor measure $m^\pi_\gamma(\cdot \mid S', A')$. Farebrother et al. (2025) show that jointly optimizing these components with the mixture weighting recovers the successor measure at convergence.

3 Planning via Geometric Policy Composition

In the sequel, we consider an agent equipped with a repertoire of pretrained policies $\{\pi_z\}_{z \in \mathcal{Z}}$ indexed by $z$ (e.g., a state for goal-conditioned policies or, more generally, a latent variable parameterizing diverse behaviors). Our objective is to learn a predictive model of the policies' behaviors that enables planning for arbitrary downstream tasks, without requiring further online interaction with the environment or additional fine-tuning. To this end, we first formalize how GHM predictions compose to evaluate plans that stochastically switch among a subset of policies, showing how the successor measures of constituent policies combine to yield the successor measure of the composite policy. We then address the challenge of learning accurate GHMs across timescales by introducing a consistency objective that enforces coherence across horizons, improving long-horizon predictions. Finally, with these tools we detail our complete compositional planning procedure.

3.1 Evaluating Geometric Switching Policies

A natural way to chain policies is through geometric switching: for each policy $\pi_{z_i}$ in a sequence, execution continues with probability $1 - \alpha_i \in [0, 1]$, or switches to the subsequent policy $\pi_{z_{i+1}}$ with probability $\alpha_i$. Such a policy can be written as $\nu := \pi_{z_1} \xrightarrow{\alpha_1} \pi_{z_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z_n}$. By definition, the final policy $\pi_{z_n}$ is absorbing, meaning the agent commits to it for the remainder of the episode, so its switching probability is $\alpha_n = 0$.
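Throughout, sampling from a GHM means solving the IVP of Section 2 with a numerical integrator. A minimal sketch (our own illustration, with a hypothetical vector field `v` standing in for the learned network) using plain Euler steps:

```python
import numpy as np

def sample_ghm(v, s, a, d, n_steps=32, rng=np.random.default_rng(0)):
    """Draw X_0 ~ p_0 = N(0, I), then Euler-integrate dx/dt = v_t(x | s, a)
    from t = 0 to t = 1; the result approximates a draw from p_1."""
    x = rng.standard_normal(d)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v(k * dt, x, s, a)  # one Euler step along the flow
    return x

# Sanity check: with a constant field v_t(x) = c, the flow is x + t*c,
# so integrating to t = 1 shifts the prior draw by exactly c.
c = np.array([2.0, -1.0])
x1 = sample_ghm(lambda t, x, s, a: c, s=None, a=None, d=2)
x0 = np.random.default_rng(0).standard_normal(2)
assert np.allclose(x1, x0 + c)
```

In practice a higher-order solver (e.g., midpoint or Runge-Kutta) trades more function evaluations for fewer integration steps; Euler is shown only for clarity.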
These non-Markovian policies are called Geometric Switching Policies (GSPs; Thakoor et al., 2022). The term "geometric" captures that each policy $\pi_{z_i}$ is followed for a geometrically distributed duration $T_i \sim \mathrm{Geom}(\alpha_i)$.

To analyze these policies, we must understand how this switching mechanism interacts with the MDP's global discount factor, $\gamma$. Recall that $\gamma$ can be interpreted as the probability the episode continues to the next step, while $1 - \gamma$ gives the probability of halting. When following policy $\pi_{z_k}$ within a GSP, there are two reasons it might stop executing that policy: (1) the episode halts, with probability $1 - \gamma$; or (2) the policy switches to $\pi_{z_{k+1}}$, with probability $\alpha_k$. Thus, continuing to follow $\pi_{z_k}$ for one more step requires that neither event occur, which happens with probability $\beta_k := \gamma(1 - \alpha_k)$. This quantity acts as an effective discount factor for the duration spent executing $\pi_{z_k}$. Note that $\beta_n = \gamma$, since the final policy is absorbing.

We can now characterize the successor measure of a GSP. Note that the agent might reach a successor state $s^+$ through multiple paths: arriving while still executing $\pi_{z_1}$, or switching to $\pi_{z_2}$ and reaching $s^+$ from there, and so on. Each path contributes to the overall successor measure and must be weighted appropriately.

Definition 1. Let $\nu := \pi_{z_1} \xrightarrow{\alpha_1} \pi_{z_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z_n}$ be a geometric switching policy with global discount factor $\gamma \in (0, 1)$ and effective discount factors $\beta_k := \gamma(1 - \alpha_k)$ for $k \in \llbracket n \rrbracket$.

²While we assume a continuous state space $\mathcal{S} \subseteq \mathbb{R}^d$ for ease of exposition, this flow-matching approach readily extends to non-Euclidean and discrete spaces (e.g., Huang et al., 2022; Chen and Lipman, 2024; Gat et al., 2024; Kapusniak et al., 2024).
The weight of the $k$-th policy is

$$w_k := \frac{1 - \gamma}{1 - \beta_k} \prod_{i=1}^{k-1} \frac{\gamma - \beta_i}{1 - \beta_i},$$

where an empty product equals $1$ (hence $w_1 = \tfrac{1 - \gamma}{1 - \beta_1}$).

These weights capture the relative contribution of each policy phase to the successor measure of the GSP. Intuitively, $w_k$ reflects the probability that the agent (i) survives the first $k - 1$ policy phases without the episode halting, and (ii) reaches states under policy $\pi_{z_k}$ rather than having already switched to a later policy. With these weights, we now define the successor measure of the composite GSP in the following result.

Theorem 1. Let $\nu := \pi_{z_1} \xrightarrow{\alpha_1} \pi_{z_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z_n}$ be a geometric switching policy with global discount factor $\gamma \in (0, 1)$, effective discount factors $\{\beta_k\}_{k=1}^n$, and weights $\{w_k\}_{k=1}^n$ from Definition 1. For any state-action pair $(s, a)$, the successor measure of $\nu$ decomposes as:

$$m^\nu_\gamma(\mathrm{d}s^+ \mid s, a) = \sum_{k=1}^{n} w_k \int_{\substack{s_1, \dots, s_{k-1} \\ a_1, \dots, a_{k-1}}} m^{\pi_{z_1}}_{\beta_1}(\mathrm{d}s_1 \mid s, a)\, \pi_{z_2}(\mathrm{d}a_1 \mid s_1) \cdots m^{\pi_{z_k}}_{\beta_k}(\mathrm{d}s^+ \mid s_{k-1}, a_{k-1}).$$

This theorem decomposes the successor measure of a GSP as a mixture distribution with $n$ components. The $k$-th component, weighted by $w_k$, captures the state distribution under policy $\pi_{z_k}$, having passed through all intermediate states visited under $\pi_{z_1}, \dots, \pi_{z_{k-1}}$³. Practically, such a decomposition yields a natural strategy for estimating the action-value function, similar to (1): sample a state from each component, evaluate its reward, and form a weighted sum using the weights $w_k$. Rather than sampling from each component independently, we can leverage their overlapping sequential structure to draw samples much more efficiently.
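Because the successor measure of the GSP is itself a probability measure, the weights $w_k$ of Definition 1 must form a probability distribution whenever the final policy is absorbing ($\alpha_n = 0$, so $\beta_n = \gamma$). A small numerical check of this fact (a sketch of our own, not the paper's code):

```python
import numpy as np

def gsp_weights(alphas, gamma):
    """Weights w_k of Definition 1; alphas[-1] must be 0 (absorbing)."""
    betas = gamma * (1.0 - np.asarray(alphas))
    w = []
    for k in range(len(betas)):
        prefix = np.prod((gamma - betas[:k]) / (1.0 - betas[:k]))  # empty = 1
        w.append((1.0 - gamma) / (1.0 - betas[k]) * prefix)
    return np.array(w)

rng = np.random.default_rng(1)
for _ in range(100):
    n = rng.integers(2, 8)
    alphas = np.append(rng.uniform(0, 1, size=n - 1), 0.0)  # alpha_n = 0
    w = gsp_weights(alphas, gamma=rng.uniform(0.5, 0.999))
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
print("weights sum to 1")
```

The identity follows from a telescoping argument: since $\tfrac{1-\gamma}{1-\beta_k} = 1 - \tfrac{\gamma-\beta_k}{1-\beta_k}$, each $w_k$ is the difference of consecutive prefix products, and the sum collapses to the empty product, $1$.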
We achieve this through composition: starting from $(s, a)$, we sample $S^+_1 \sim m^{\pi_{z_1}}_{\beta_1}(\cdot \mid s, a)$, then use $S^+_1$ to sample $S^+_2 \sim m^{\pi_{z_2}}_{\beta_2}(\cdot \mid S^+_1, A^+_1)$ where $A^+_1 \sim \pi_{z_2}(\cdot \mid S^+_1)$, and so on. As the following lemma formalizes, the weighted sum $\sum_k w_k\, r(S^+_k)$ yields an unbiased estimator of $Q^\nu_\gamma$.

Lemma 1. Let $\nu := \pi_{z_1} \xrightarrow{\alpha_1} \pi_{z_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z_n}$ be a geometric switching policy with global discount factor $\gamma \in (0, 1)$, effective discount factors $\{\beta_k\}_{k=1}^n$, and weights $\{w_k\}_{k=1}^n$ from Definition 1. For any reward function $r : \mathcal{S} \to \mathbb{R}$ and state-action pair $(s, a)$, set $(S^+_0, A^+_0) = (s, a)$ and for $k = 1, \dots, n$ sample $S^+_k \sim m^{\pi_{z_k}}_{\beta_k}(\cdot \mid S^+_{k-1}, A^+_{k-1})$ and $A^+_k \sim \pi_{z_{k+1}}(\cdot \mid S^+_k)$. Then the single-sample Monte Carlo estimator

$$\widehat{Q}^\nu_\gamma := (1 - \gamma)^{-1} \sum_{k=1}^{n} w_k\, r(S^+_k)$$

is an unbiased estimate of $Q^\nu_\gamma(s, a)$, i.e., $\mathbb{E}\big[\widehat{Q}^\nu_\gamma\big] = Q^\nu_\gamma(s, a)$, where the expectation is over the joint distribution of $(S^+_1, A^+_1, \dots, S^+_n)$ induced by the sampling procedure above.

This lemma generalizes several previous results: it extends Thakoor et al. (2022, Theorem 3.2), which assumed fixed switching probabilities, i.e., $\alpha_k = \alpha$ for all $k \in \llbracket n-1 \rrbracket$, and further generalizes results in Janner et al. (2020), which additionally impose a fixed policy throughout, i.e., $\pi_{z_k} = \pi$ for all $k \in \llbracket n \rrbracket$. By allowing both policies and switching probabilities to vary, this result enables the evaluation of a wider class of switching policies and brings us closer to the options framework (Sutton, 1995; Sutton et al., 1999; Precup, 2000).

³Each component can be seen as an application of the Chapman–Kolmogorov equation (Chapman, 1928; Kolmogoroff, 1931), retaining kernel composition via marginalization but replacing the single-step dynamics $P(\cdot \mid s, a)$ with jumpy dynamics $m^\pi_\beta(\cdot \mid s, a)$.
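In the tabular case both sides of this construction can be computed exactly, which makes the decomposition easy to verify. The sketch below (our own illustration, state-conditioned for simplicity) checks Theorem 1 and the value formula for a two-policy GSP $\nu := \pi_1 \xrightarrow{\alpha} \pi_2$ against dynamic programming on the (state, phase) chain:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.95, 0.3
P1 = rng.dirichlet(np.ones(n_states), size=n_states)  # chain under pi_1
P2 = rng.dirichlet(np.ones(n_states), size=n_states)  # chain under pi_2
r = rng.uniform(size=n_states)
beta1 = gamma * (1 - alpha)                            # effective discount

# Tabular successor measure: m_beta = (1-beta) P (I - beta P)^{-1}
sm = lambda P, b: (1 - b) * P @ np.linalg.inv(np.eye(n_states) - b * P)
m1, m2 = sm(P1, beta1), sm(P2, gamma)

# Theorem 1 (n = 2): m_nu = w1 * m1 + w2 * (m1 composed with m2),
# then the value via Eq. (1): Q = (1-gamma)^{-1} * m_nu r.
w1, w2 = (1 - gamma) / (1 - beta1), (gamma - beta1) / (1 - beta1)
V_mix = (w1 * m1 @ r + w2 * m1 @ m2 @ r) / (1 - gamma)

# Ground truth by dynamic programming: phase 2 is absorbing, phase 1
# switches to phase 2 with probability alpha after each step.
V2 = P2 @ np.linalg.solve(np.eye(n_states) - gamma * P2, r)
V1 = np.linalg.solve(np.eye(n_states) - beta1 * P1,
                     P1 @ (r + gamma * alpha * V2))
assert np.allclose(V_mix, V1)
print("Lemma 1 value matches dynamic programming")
```

With learned GHMs, `m1` and `m2` are replaced by generative samples and the matrix products by the sequential sampling procedure of Lemma 1.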
3.2 Learning Geometric Horizon Models Across Multiple Timescales

A core requirement of our planning framework is the ability to predict the behavior of many policies over multiple timescales, enabling us to calculate the expected return of candidate geometric switching policies. A natural extension of the td-flow objective (3) conditions the vector field $v$ on both the policy encoding $z$ and the discount factor $\gamma$, yielding a single unified model across policies and horizons. However, generalizing across many horizons is challenging: variance increases with horizon length, reducing per-horizon accuracy and destabilizing training (Petrik and Scherrer, 2008).

To address this challenge, we propose a generalization of td-flow that enforces consistency across horizons. Rather than learning each horizon independently, we exploit a Bellman-like relationship between the successor measure at two discount factors $\beta \le \gamma$ to bootstrap longer-horizon predictions from shorter-horizon ones:

$$m^\pi_\gamma(\cdot \mid s, a) = (1 - \gamma)\, P(\cdot \mid s, a) + \gamma \frac{1 - \gamma}{1 - \beta}\, \mathbb{E}_{S' \sim P(\cdot \mid s, a),\, A' \sim \pi(\cdot \mid S')}\big[m^\pi_\beta(\cdot \mid S', A')\big] + \gamma \frac{\gamma - \beta}{1 - \beta}\, \mathbb{E}_{\substack{S' \sim P(\cdot \mid s, a),\, A' \sim \pi(\cdot \mid S') \\ S^+ \sim m^\pi_\beta(\cdot \mid S', A'),\, A^+ \sim \pi(\cdot \mid S^+)}}\big[m^\pi_\gamma(\cdot \mid S^+, A^+)\big]. \tag{4}$$

This follows directly from Theorem 1 by considering the switching policy $\nu := \pi \xrightarrow{\alpha_1 = 1} \pi \xrightarrow{\alpha_2 = 1 - \beta/\gamma} \pi$ and noting that $m^\pi_{\gamma(1 - \alpha_1)} = m^\pi_0 = P$. Building on this result, we extend the derivation of td-flow from Farebrother et al. (2025) to the new Bellman equation in (4) (full derivation in Appendix C.1), arriving at what we call the Temporal Difference Horizon Consistency (td-hc) loss:

$$\begin{aligned} \ell_{\text{td-hc}}(\theta; \bar\theta, \beta, \gamma) ={}& (1 - \gamma)\, \mathbb{E}_{\substack{t \sim \mathcal{U}([0,1]),\, (S, A, S') \sim \mathcal{D} \\ X_0 \sim p_0(\cdot),\, X_t = (1 - t) X_0 + t S'}} \Big[ \big\| v_t(X_t \mid S, A, \gamma; \theta) - (S' - X_0) \big\|^2 \Big] \\ &+ \gamma \frac{1 - \gamma}{1 - \beta}\, \mathbb{E}_{\substack{t \sim \mathcal{U}([0,1]),\, (S, A, S') \sim \mathcal{D},\, A' \sim \pi(\cdot \mid S') \\ X_t \sim \psi_t(\cdot \mid S', A', \beta; \bar\theta)_\sharp\, p_0(\cdot)}} \Big[ \big\| v_t(X_t \mid S, A, \gamma; \theta) - v_t(X_t \mid S', A', \beta; \bar\theta) \big\|^2 \Big] \\ &+ \gamma \frac{\gamma - \beta}{1 - \beta}\, \mathbb{E}_{\substack{t \sim \mathcal{U}([0,1]),\, (S, A, S') \sim \mathcal{D},\, A' \sim \pi(\cdot \mid S') \\ S^+ \sim \psi_1(\cdot \mid S', A', \beta; \bar\theta)_\sharp\, p_0(\cdot),\, A^+ \sim \pi(\cdot \mid S^+) \\ X_t \sim \psi_t(\cdot \mid S^+, A^+, \gamma; \bar\theta)_\sharp\, p_0(\cdot)}} \Big[ \big\| v_t(X_t \mid S, A, \gamma; \theta) - v_t(X_t \mid S^+, A^+, \gamma; \bar\theta) \big\|^2 \Big]. \end{aligned} \tag{5}$$

When $\gamma = \beta$, the third term vanishes and we recover the original td-flow loss. In practice, we sample $\gamma$ uniformly from $[\gamma_{\min}, \gamma_{\max}]$ and $\beta$ uniformly from $[\gamma_{\min}, \gamma]$, but only apply horizon consistency (i.e., $\beta \ne \gamma$) to a small proportion of each mini-batch. This is motivated by the fact that the consistency term requires sampling from the model's own predictions at horizon $\beta$ and using these samples as conditioning for the longer horizon $\gamma$, meaning errors in the model's current predictions can compound and introduce bias. By restricting the consistency term to a small fraction of the batch, we gain the benefits of horizon alignment while ensuring the majority of updates come from td-flow. This design follows standard practice for consistency-like generative modeling (e.g., Frans et al., 2025; Geng et al., 2025; Boffi et al., 2025; Ai et al., 2026).

3.3 Compositional Planning

With the machinery for training policy-conditioned GHMs across multiple timescales now in place, we can turn to the central question: how do we use them to solve downstream tasks?
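As with the td-flow objective, the three coefficients in Eq. (4) define a mixture distribution, so they must be non-negative and sum to one for any $\beta \le \gamma$. A quick numerical check of this (our own sketch):

```python
import numpy as np

def hc_weights(beta, gamma):
    """Mixture weights of the horizon-consistency Bellman equation (4)."""
    return np.array([1 - gamma,
                     gamma * (1 - gamma) / (1 - beta),
                     gamma * (gamma - beta) / (1 - beta)])

rng = np.random.default_rng(0)
for _ in range(100):
    gamma = rng.uniform(0.5, 0.999)
    beta = rng.uniform(0.0, gamma)  # beta <= gamma
    w = hc_weights(beta, gamma)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)

# When beta == gamma, the third weight vanishes and we recover
# td-flow's (1 - gamma, gamma) mixture.
assert np.allclose(hc_weights(0.9, 0.9), [0.1, 0.9, 0.0])
```

These weights are exactly the $w_1, w_2, w_3$ of Definition 1 for the switching policy $\pi \xrightarrow{1} \pi \xrightarrow{1 - \beta/\gamma} \pi$ used in the derivation.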
Given a reward function $r : \mathcal{S} \to \mathbb{R}$, our goal is to find a sequence of policies that maximizes expected return. Since the learned GHMs already capture the full complexity of how policies evolve in the environment, evaluating $Q^\nu_\gamma$ from Lemma 1 requires only specifying the policy embeddings $z_1, \dots, z_n$; hence planning reduces to the following optimization problem:

$$\max_{a_1, z_1, \dots, z_n} Q^{\pi_{z_1} \xrightarrow{\alpha_1} \pi_{z_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z_n}}_\gamma(s, a_1). \tag{6}$$

The switching probabilities $\{\alpha_i\}$ control how long each policy executes before transitioning to the next, and are treated as hyperparameters. Once the optimal sequence $(a^*_1, z^*_1, \dots, z^*_n)$ is identified, we execute the first action $a^*_1$ followed by the policy $\pi_{z^*_1}$, replanning at future states as needed.

Notably, this planning objective unifies several existing approaches as special cases. By varying the switching probabilities $\{\alpha_i\}$, one can interpolate between action-level control and planning over policies:

• Setting $\alpha_1 = \cdots = \alpha_n = 1$ reduces to optimizing over sequences of primitive actions, equivalent to Model-Predictive Control with horizon $n$.
• Setting $\alpha_1 = 1$ and $\alpha_2 = \cdots = \alpha_n = 0$ yields Generalized Policy Improvement (GPI; Barreto et al., 2017).
• Setting $\alpha_1 = \cdots = \alpha_{n-1} = \alpha$ for some fixed $\alpha \in (0, 1)$ recovers Geometric Generalized Policy Improvement (GGPI; Thakoor et al., 2022).

We refer to our approach, which allows distinct switching probabilities $\alpha_1, \dots, \alpha_{n-1} \in (0, 1)$, as CompPlan. Likewise, we refer to action-level planning as ActionPlan. Crucially, the same pretrained GHMs power all of these methods. By conditioning on policies and a continuum of timescales, our framework spans one-step world models ($\alpha = 1$) through long-horizon policy composition ($\alpha_i \in (0, 1)$), unifying previously disparate paradigms.

Optimization via random shooting.
The maximization in (6) is tractable when policies are indexed by a finite set, but becomes challenging for large or continuous $\mathcal{Z}$. In this paper, we focus on goal-conditioned policies where $\mathcal{Z} = \mathcal{S}$ and $z \in \mathcal{S}$ represents a subgoal. Here, the key difficulty lies in proposing good candidate subgoals without searching over the entire state space. Our solution is to use the GHMs themselves as a proposal distribution. Given a goal $g \in \mathcal{S}$, we generate waypoints from $z_0 := s$ by composing $m^{\pi_g}_{\beta_i}$ over horizons $\{\beta_i\}$ as:

$$a_i \sim \pi_g(\cdot \mid z_i), \quad z_{i+1} \sim m^{\pi_g}_{\beta_{i+1}}(\cdot \mid z_i, a_i), \quad \text{for } i = 0, \dots, n-1.$$

This produces a sequence of subgoals $(z_1, \dots, z_n)$ that naturally guides progress towards the final objective. Alternatively, we can sample from an unconditional GHM that predicts plausible successor states under the data distribution (i.e., the behavior policy). To enable this, we stochastically mask the policy encoding as $z = \varnothing$ during training, bootstrapping with the dataset action at the next state in (3) when masked.

With a proposal distribution in hand, planning reduces to random shooting (Matyáš, 1965): we (1) sample $m$ candidate sequences $\{(z^{(i)}_1, \dots, z^{(i)}_n)\}_{i=1}^m$ from the proposal distribution; (2) evaluate $Q^{\nu^{(i)}}_\gamma(s, a^{(i)}_1)$ for each candidate switching policy $\nu^{(i)} := \pi_{z^{(i)}_1} \xrightarrow{\alpha_1} \pi_{z^{(i)}_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z^{(i)}_n}$ using Lemma 1, where $a^{(i)}_1 \sim \pi_{z^{(i)}_1}(\cdot \mid s)$; and (3) select the sequence $(a^*_1, z^*_1, \dots, z^*_n)$ with the highest value $Q^{\nu^*}_\gamma(s, a^*_1)$. The full method is summarized in Algorithm 2.

4 Experiments

Our empirical evaluation tests the core hypothesis of this work: that learning a jumpy world model over a diverse collection of parameterized policies enables effective and efficient compositional planning. We begin by describing the experimental setting and the training procedure for the base policies.
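The random-shooting loop above (steps 1-3) can be sketched in a few lines. Here `propose` and `q_estimate` are hypothetical toy stand-ins for the GHM-based proposal and the Lemma 1 value estimator, used only to make the control flow concrete:

```python
import numpy as np

def random_shooting(s, propose, q_estimate, m=64,
                    rng=np.random.default_rng(0)):
    candidates = [propose(s, rng) for _ in range(m)]         # (1) sample
    values = [q_estimate(s, z_seq) for z_seq in candidates]  # (2) evaluate
    return candidates[int(np.argmax(values))]                # (3) select

# Toy instance: subgoal sequences are noisy straight-line waypoints towards
# a goal; the "value" rewards plans whose final subgoal lands near it.
goal = np.array([1.0, 1.0])
propose = lambda s, rng: s + np.cumsum(
    rng.normal(0.25, 0.2, size=(4, 2)), axis=0)
q_estimate = lambda s, zs: -np.linalg.norm(zs[-1] - goal)
plan = random_shooting(np.zeros(2), propose, q_estimate, m=256)
print(np.linalg.norm(plan[-1] - goal))  # small residual distance
```

In the actual method, replanning re-runs this loop at each decision point, and only the first action of the selected plan is executed.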
We subsequently compare the performance of these policies against that of compositional planning over them. Additionally, we benchmark CompPlan against other test-time planning approaches and hierarchical methods. Finally, we examine the effect of the proposed consistency loss on both model accuracy and planning performance. Additional ablations on the replanning frequency, planning objective, and proposal distribution are provided in Appendix D.

4.1 Experimental Setup

Benchmark and Dataset. All experiments are conducted using the OGBench benchmark (Park et al., 2025a), which provides challenging long-horizon robotic manipulation and locomotion tasks structured as offline goal-conditioned reinforcement learning problems. We focus on ant navigation tasks across different maze topologies (medium, large, and giant) as well as multi-cube robotic arm manipulation. For both policy and GHM training, we use the standard navigate and play offline datasets for antmaze and cube, respectively.

Base Policies. The effectiveness of compositional planning depends on the quality and characteristics of the underlying policies. We train five distinct policy types, each exhibiting different tradeoffs between GHM learning and planning performance: 1) Goal-Conditioned TD3 (GC-TD3; Pirotta et al., 2024); 2) Goal-Conditioned 1-Step RL (GC-1S); 3) Contrastive RL (CRL; Eysenbach et al., 2022); 4) Goal-Conditioned Behavior Cloning (GC-BC; Lynch et al., 2019; Ghosh et al., 2021); and 5) Hierarchical Flow Behavior Cloning (HFBC; Park et al., 2025b). Additional details are provided in Appendix F.1.

4.2 How Do Policies Affect Compositional Planning?

For each policy family, we train a GHM in an off-policy manner using the td-hc loss described in §3.2. GHMs are trained for 3M gradient steps using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 256.
The model architecture follows a U-Net-style design, similar to Farebrother et al. (2025). Both the timestep $t$ and the discount factor $\gamma$ are embedded by first applying a sinusoidal embedding to increase dimensionality, followed by a two-layer MLP with mish activations (Misra, 2019). For the discount embedding, we further concatenate the vector $[\gamma,\, 1 - \gamma,\, -\log(1 - \gamma)]$, where $-\log(1 - \gamma)$ corresponds to the logarithm of the effective horizon; we find this improves the model's sensitivity to the discount factor. Other conditioning information, such as the state-action pair and policy embedding $z$, is processed through an additional MLP and added to both the time and discount embeddings. The network incorporates conditioning information via FiLM modulation (Perez et al., 2018). When training each GHM, we apply the horizon-consistency objective (5) (i.e., $\beta \ne \gamma$) to 25% of each mini-batch in antmaze and 12.5% in cube. We also train the unconditional model (i.e., $z = \varnothing$) on 10% of each mini-batch; these two proportions do not overlap.

During evaluation, we tailor the proposal distribution to each domain based on their distinct characteristics. In antmaze, states are separated by large temporal distances and physical barriers, so an unconditional proposal would waste most samples on irrelevant regions of the state space; we therefore sample 256 subgoal sequences from the goal-conditioned GHM. In cube, tasks consist of short pick-and-place sequences with many viable paths, making the unconditional GHM a natural fit; we sample 1024 sequences to cover the broader proposal space. Ablations of these choices are provided in Appendix D.5.

We begin by comparing the zero-shot performance of the base policies against our compositional planning approach. Table 1 details these results (averaged over three seeds), showing that compositional planning consistently improves upon the zero-shot policies.
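The discount featurization described above is simple to write down. Below is a hedged numpy sketch (the embedding dimension and frequency schedule are our own assumptions, not values from the paper):

```python
import numpy as np

def sinusoidal_embedding(x, dim=16, max_period=10_000.0):
    """Standard sin/cos positional embedding of a scalar."""
    freqs = np.exp(-np.log(max_period) * np.arange(dim // 2) / (dim // 2))
    angles = x * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def discount_features(gamma, dim=16):
    # Sinusoidal lift of gamma, concatenated with [gamma, 1-gamma,
    # -log(1-gamma)]; the last entry is the log of the effective
    # horizon 1/(1-gamma), which spreads out values of gamma near 1.
    extra = np.array([gamma, 1.0 - gamma, -np.log(1.0 - gamma)])
    return np.concatenate([sinusoidal_embedding(gamma, dim), extra])

feats = discount_features(0.99)
assert feats.shape == (19,) and np.isclose(feats[-1], np.log(100.0))
```

The $-\log(1-\gamma)$ term is the useful one near $\gamma \to 1$: discounts 0.99 and 0.999 differ by only 0.009 in raw value but by a full unit in log effective horizon.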
This demonstrates the effectiveness of our method in selecting policy sequences that exceed the performance of the best individual policy. While improvements are evident across all domains, the gains are particularly pronounced in complex, long-horizon tasks such as antmaze-giant and cube-{3,4}, where success rates can rise from 10% to 90% in the most extreme cases. Among the evaluated policy classes, HFBC emerges as the most consistent zero-shot performer. CompPlan further improves HFBC, notably in antmaze-giant and cube-4. CRL, in contrast, is effective in antmaze but underperforms in cube. We hypothesize this stems from the inductive bias in CRL's representation learning, which approximates the goal-conditioned value function as $Q(s, a, g) \approx \phi(s, a)^\top \psi(g)$. This factorization effectively captures the ant's spatial position, yielding robust navigation in antmaze, but fails to encode complex object-related features, resulting in weaker cube policies. Notably, incorporating planning not only improves upon CRL's strong baseline in antmaze, but also achieves non-trivial success rates in cube, demonstrating that our approach can extract utility from base policies otherwise limited by their inductive bias.

Table 1: Success rate (↑) of base policies $\pi_g$ (Zero Shot) and compositional planning with GHMs (CompPlan; ours), averaged over tasks. We report the mean and standard deviation over 3 seeds. We highlight relative increases and decreases in performance w.r.t. the base policies. Additionally, we bold the best performance for each domain.
Domain           CRL: ZS / CompPlan    GC-1S: ZS / CompPlan   GC-BC: ZS / CompPlan   GC-TD3: ZS / CompPlan  HFBC: ZS / CompPlan
antmaze-medium   0.88 / 0.97 (0.02)    0.56 / 0.87 (0.05)     0.49 / 0.85 (0.08)     0.65 / 0.65 (0.03)     0.94 / 0.94 (0.01)
antmaze-large    0.84 / 0.90 (0.00)    0.21 / 0.61 (0.04)     0.18 / 0.73 (0.02)     0.23 / 0.48 (0.05)     0.78 / 0.92 (0.02)
antmaze-giant    0.16 / 0.29 (0.03)    0.00 / 0.02 (0.00)     0.00 / 0.03 (0.01)     0.00 / 0.01 (0.01)     0.42 / 0.79 (0.04)
cube-1           0.28 / 0.86 (0.02)    0.37 / 0.66 (0.02)     0.90 / 0.99 (0.01)     0.58 / 0.91 (0.01)     0.80 / 0.97 (0.01)
cube-2           0.02 / 0.50 (0.03)    0.10 / 0.57 (0.09)     0.15 / 0.97 (0.01)     0.12 / 0.82 (0.01)     0.76 / 0.77 (0.02)
cube-3           0.01 / 0.73 (0.02)    0.01 / 0.67 (0.02)     0.09 / 0.92 (0.01)     0.12 / 0.83 (0.04)     0.64 / 0.83 (0.03)
cube-4           0.00 / 0.39 (0.04)    0.01 / 0.60 (0.02)     0.00 / 0.76 (0.03)     0.00 / 0.57 (0.03)     0.24 / 0.67 (0.03)

Interestingly, the other policies exhibit the opposite trend. When evaluated zero-shot, they show weak performance in medium to long-horizon tasks – antmaze-medium, antmaze-large, and cube tasks all prove challenging, with cube success rates falling below 10–15%. However, these policies prove remarkably effective when integrated into our compositional planning framework, with success rates climbing above 70% in many tasks. Taken together, these results suggest that zero-shot metrics tell us little about how well a policy will compose; its utility may only become clear when orchestrated with other policies.

4.3 How Does Compositional Planning Compare to Other Planning and Hierarchical Approaches?

Our results in §4.2 demonstrate that compositional planning unlocks capabilities beyond what any single policy achieves alone. We now situate our approach among existing planning and hierarchical methods, revealing that CompPlan's advantages stem from its unique combination of temporal abstraction and flexible composition. Recall from §3.3 that different choices of switching probabilities recover existing methods as special cases.
This observation suggests a natural ablation: comparing CompPlan against these special cases can disentangle the contribution of its key components. At one extreme lies Generalized Policy Improvement (GPI; Barreto et al., 2017), which sets α_1 = 1 and commits to a single policy for the remainder of the episode. GPI leverages our GHMs to estimate each policy's value, selecting the best in-class policy at each timestep according to:

max_{z, A ∼ π_z(·|s)} Q_γ^{π_z}(s, a) = (1 − γ)^{−1} E_{S ∼ m_γ^{π_z}(·|s,a)} [r(S)].

GPI thus serves as a test of whether composing policies offers an advantage over merely selecting among them. At the other extreme, we compare against action-level planning (ActionPlan), which sets α_1 = · · · = α_n = 1 and operates at the granularity of individual actions rather than policies. To fairly isolate the effect of temporal abstraction, we train a dedicated one-step world model p̃(·|s, a) using the same flow-matching framework, network architecture, and training procedure as our GHMs – differing only in the removal of policy and discount conditioning. Given this model, we optimize the following objective:

argmax_{A_1, …, A_n} Σ_{k=1}^{n} γ^k r(S_{k+1}), where S_{k+1} ∼ p̃(·|S_k, A_k) and A_k ∼ π_g(·|S_k).

This comparison directly tests whether planning over sequences of policies – enabled by jumpy predictions – confers an advantage over action-level planning. Figure 1 reveals a clear pattern: on long-horizon tasks, CompPlan substantially outperforms both alternatives, achieving an 89% relative improvement over GPI and a 201% gain over ActionPlan (averaged across policies and long-horizon domains). These gains indicate that neither policy selection alone nor action-level planning captures the full benefit of our compositional framework.
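To make the action-level objective concrete, a random-shooting version of ActionPlan might look as follows. This is a toy sketch on 1-D integrator dynamics: the reward, dynamics, candidate sampler, and all hyperparameters are our own stand-ins, not the paper's setup.

```python
import numpy as np

def reward(s, goal=5.0):
    # Stand-in for the task reward r(S): negative distance to a fixed goal.
    return -abs(s - goal)

def step_model(s, a, rng):
    # Toy one-step world model p~(.|s, a): noisy integrator dynamics.
    return s + a + 0.01 * rng.standard_normal()

def action_plan(s0, gamma=0.99, n=8, n_candidates=256, seed=0):
    """Random-shooting version of the ActionPlan objective:
    argmax over sampled action sequences of sum_k gamma^k r(S_{k+1})."""
    rng = np.random.default_rng(seed)
    best_return, best_seq = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=n)   # candidate A_1..A_n
        s, ret = s0, 0.0
        for k, a in enumerate(actions, start=1):
            s = step_model(s, a, rng)              # S_{k+1} ~ p~(.|S_k, A_k)
            ret += gamma**k * reward(s)
        if ret > best_return:
            best_return, best_seq = ret, actions
    return best_seq
```

In the paper's setting, candidates would instead be drawn from the base policies π_g and rolled through the learned flow-matching model, but the shooting structure is the same.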
Rather, the combination of planning over sequences of policies at multiple timescales unlocks strong long-horizon capabilities.

Table 2 Success rate (↑) of hierarchical baselines and compositional planning with HFBC policies (CompPlan; ours). For each domain, we highlight the best performance.

Domain           HIQL          SHARSA        HFBC    CompPlan
antmaze-medium   0.96 (0.01)   0.91 (0.03)   0.94    0.94 (0.01)
antmaze-large    0.91 (0.02)   0.88 (0.03)   0.78    0.92 (0.02)
antmaze-giant    0.65 (0.05)   0.56 (0.07)   0.42    0.79 (0.04)
cube-1           0.15 (0.03)   0.70 (0.03)   0.84    0.97 (0.01)
cube-2           0.06 (0.02)   0.60 (0.07)   0.70    0.77 (0.02)
cube-3           0.03 (0.01)   0.50 (0.09)   0.54    0.83 (0.03)
cube-4           0.00 (0.00)   0.09 (0.04)   0.34    0.67 (0.03)

Furthermore, we compare against methods that learn hierarchical structure during training: HIQL (Park et al., 2023) and SHARSA (Park et al., 2025b), the current state-of-the-art on OGBench. These approaches train high-level policies to select subgoals or skills, whereas CompPlan performs composition at test time using pre-trained GHMs. Table 2 presents this comparison where CompPlan employs HFBC base policies similar to SHARSA. Our approach consistently outperforms both hierarchical methods, with the largest margins on the most challenging tasks. In cube-4, CompPlan achieves 67% success compared to 9% for SHARSA and 0% for HIQL – demonstrating that test-time composition can surpass learned hierarchies when tasks demand flexible, long-horizon reasoning. Notably, CompPlan requires no task-specific training, suggesting a promising alternative to task-specific hierarchical methods.
Figure 1 Percentage point change (↑) over zero-shot policies on antmaze-large, antmaze-giant, cube-3, and cube-4 for each base policy (CRL, GC-1S, GC-BC, GC-TD3, HFBC). We compare: action-level planning (ActionPlan) with a world model; generalized policy improvement (GPI) and compositional planning (CompPlan; ours) with GHMs.

4.4 How Does Horizon Consistency Affect GHM Learning?

Having established that compositional planning yields strong empirical gains, we now turn inward to examine one of our methodological contributions: the horizon-consistency objective from §3.2. We investigate its impact through two complementary lenses – generative fidelity and downstream planning – revealing that consistency plays distinct roles at different stages of the pipeline.

Table 3 Accuracy (EMD; ↓) of GHMs trained with our horizon-consistency loss (td-hc) and without (td-flow) for discount factor γ = 0.995. We highlight the best performing method.
Domain           CRL: td-flow / td-hc         GC-1S: td-flow / td-hc
antmaze-medium   4.41 (0.05) / 4.22 (0.06)    4.40 (0.02) / 4.22 (0.03)
antmaze-large    5.24 (0.07) / 4.81 (0.03)    5.12 (0.18) / 4.67 (0.04)
antmaze-giant    6.77 (0.49) / 5.74 (0.06)    7.29 (0.69) / 5.25 (0.08)
cube-1           1.60 (0.02) / 1.57 (0.03)    1.43 (0.00) / 1.33 (0.03)
cube-2           2.36 (0.03) / 2.23 (0.02)    1.86 (0.04) / 1.71 (0.01)
cube-3           2.15 (0.02) / 2.10 (0.02)    1.80 (0.04) / 1.71 (0.03)
cube-4           2.41 (0.03) / 2.34 (0.01)    2.13 (0.03) / 2.05 (0.03)

We begin by asking whether enforcing consistency across timescales produces more accurate GHM predictions. To isolate this effect, we train two GHMs – one using td-flow and the other using td-hc – keeping all else fixed. We obtain ground-truth samples by executing 64 policy rollouts from 256 randomly selected (state, goal) pairs and resampling 2048 visited states according to t ∼ Geom(1 − γ). We then draw an equal number of samples from each GHM and compute the Earth Mover's Distance (EMD; Rubner et al., 2000) between the two sets. As Table 3 demonstrates, consistency systematically improves accuracy at long horizons. The effect is particularly striking in antmaze, where the maze structure imposes hard constraints on reachability. For example, employing td-hc with GC-1S policies in antmaze-giant results in a 28% reduction in EMD. Qualitatively, Figure 9 shows that td-hc on antmaze-giant leads to fewer samples erroneously traversing walls, a failure mode that compounds over long horizons. Given these accuracy improvements, one might expect correspondingly large gains in planning. Surprisingly, Table 9 tells a different story: planning success rates are nearly identical with and without consistency, averaging only a 5% relative improvement. These findings are not contradictory but rather illuminate when consistency matters most.
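For two equal-size, uniformly weighted sample sets like those drawn above, the EMD can be computed exactly as an optimal one-to-one assignment. A minimal sketch, assuming SciPy is available and using toy Gaussian clouds in place of GHM and rollout samples:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd(samples_a, samples_b):
    """Exact Earth Mover's Distance between two equal-size, uniformly
    weighted point clouds, via a min-cost perfect matching."""
    cost = cdist(samples_a, samples_b)        # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style matching
    return cost[rows, cols].mean()

# Toy sanity check: Gaussian clouds standing in for GHM / rollout samples.
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 2))
```

A useful property for checking the implementation: translating one cloud by a constant vector shifts the EMD by exactly the length of that vector.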
Our planning procedure evaluates candidate sequences using effective horizons {β_i} in the range of 50–100 steps (i.e., β_i ∈ [0.98, 0.99]), not the 200+ step horizons where consistency provides its largest accuracy gains. At these moderate timescales, the base td-flow objective already learns sufficiently accurate models to rank policies correctly. The consistency objective thus provides a margin of safety for long-horizon predictions without being strictly necessary for the planning horizons we employ in OGBench. This suggests that practitioners facing longer-horizon tasks would benefit most from the consistency loss introduced here.

5 Discussion

This work reframes pre-trained policies not as isolated controllers but as composable primitives – building blocks to be sequenced. Jumpy world models provide the mechanism: by predicting successor states for many policies across a continuum of horizons, they enable planning over behavior rather than primitive actions. Empirically, compositional planning consistently outperforms individual policies, hierarchical methods, and action-level planning, with striking gains at long horizons. Looking ahead, we see many promising directions: learning state-dependent switching probabilities, jointly learning policies and predictive models, employing more sample-efficient model-predictive control methods, and exploring jumpy world models in learned latent spaces.

Acknowledgements

The authors thank Harley Wiltzer, Arnav Jain, Pierluca D'oro, Nate Rahn, Michael Rabbat, Yann Ollivier, Marlos C. Machado, Michael Bowling, Adam White, and Doina Precup for useful discussions that helped improve this work. MGB is supported by the Canada CIFAR AI Chair program and NSERC. Finally, this work was made possible by the Python community, particularly NumPy (Harris et al.
, 2020), Matplotlib (Hunter, 2007), Seaborn (Waskom, 2021), Einops (Rogozhnikov, 2022), and MuJoCo (Todorov et al., 2012).

References

Siddhant Agarwal, Harshit Sikchi, Peter Stone, and Amy Zhang. Proto successor measure: Representing the behavior space of an RL agent. In International Conference on Machine Learning (ICML), 2025.

Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J. Zico Kolter, Nicholas M. Boffi, and Max Simchowitz. Joint distillation for fast likelihood evaluation and sampling in flow-based models. In International Conference on Learning Representations (ICLR), 2026.

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B. Tenenbaum, Tommi S. Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision making? In International Conference on Learning Representations (ICLR), 2023.

Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations (ICLR), 2023.

Lucas N. Alegre, Agon Serifi, Ruben Grandia, David Müller, Espen Knoop, and Moritz Bächer. AMOR: Adaptive character control through multi-objective reinforcement learning. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2025.

Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Neural Information Processing Systems (NeurIPS), 2017.

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew J.
Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, and Nicolas Ballas. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. CoRR, abs/2506.09985, 2025.

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI Conference on Artificial Intelligence, 2017.

Akhil Bagaria, Jason K. Senthil, and George Konidaris. Skill discovery for exploration and planning using deep skill graphs. In International Conference on Machine Learning (ICML), 2021.

Marco Bagatella, Matteo Pirotta, Ahmed Touati, Alessandro Lazaric, and Andrea Tirinzoni. TD-JEPA: Latent-predictive representations for zero-shot reinforcement learning. In International Conference on Learning Representations (ICLR), 2026.

André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, David Silver, and Hado van Hasselt. Successor features for transfer in reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2017.

André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, and Doina Precup. The option keyboard: Combining skills in reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2019.

Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

Marc G. Bellemare, Will Dabney, and Mark Rowland.
Distributional Reinforcement Learning. MIT Press, 2023.

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vision-language-action flow model for general robot control. CoRR, abs/2410.24164, 2024.

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, James Tanner, Quan Vuong, Homer Walke, Anna Walling, Haohuan Wang, Lili Yu, and Ury Zhilinsky. π0.5: A vision-language-action model with open-world generalization. In Conference on Robot Learning (CoRL), 2025.

Léonard Blier. Some Principled Methods for Deep Reinforcement Learning. PhD thesis, Université Paris-Saclay, 2022.

Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint. CoRR, abs/2101.07123, 2021.

Nicholas M. Boffi, Michael S. Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. In Neural Information Processing Systems (NeurIPS), 2025.

Diana Borsa, André Barreto, John Quan, Daniel J. Mankowitz, Hado van Hasselt, Rémi Munos, David Silver, and Tom Schaul. Universal successor features approximators. In International Conference on Learning Representations (ICLR), 2019.
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael S. Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong T. Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023a.

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael S. Ryoo, Grecia Salazar, Pannag R. Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong T. Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems, 2023b.

John C. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, 2016.
Wilka Carvalho, Andre Saraiva, Angelos Filos, Andrew K. Lampinen, Loic Matthey, Richard L. Lewis, Honglak Lee, Satinder Singh, Danilo Jimenez Rezende, and Daniel Zoran. Combining behaviors with the successor features keyboard. In Neural Information Processing Systems (NeurIPS), 2023.

Edoardo Cetin, Ahmed Touati, and Yann Ollivier. Finer behavioral foundation models via auto-regressive features and advantage weighting. In Reinforcement Learning Conference (RLC), 2025.

Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal-conditioned reinforcement learning with imagined subgoals. In International Conference on Machine Learning (ICML), 2021.

Sydney Chapman. On the Brownian displacements and thermal diffusion of grains suspended in a non-uniform fluid. Proceedings of the Royal Society of London, 119(781):34–54, 1928.

Chang Chen, Hany Hamed, Doojin Baek, Taegu Kang, Yoshua Bengio, and Sungjin Ahn. Extendable long-horizon planning via hierarchical multiscale diffusion. CoRR, abs/2503.20102, 2025.

Ricky T. Q. Chen and Yaron Lipman. Flow matching on general geometries. In International Conference on Learning Representations (ICLR), 2024.

Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2018.

Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 1993.

Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In Neural Information Processing Systems (NeurIPS), 1992.

Cyrus Derman. Finite State Markovian Decision Processes. Academic Press, 1970.

Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In International Conference on Machine Learning (ICML), 1998.

Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine.
Search on the replay buffer: Bridging planning and reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2019.

Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Ruslan Salakhutdinov. Contrastive learning as goal-conditioned reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2022.

Kuan Fang, Patrick Yin, Ashvin Nair, and Sergey Levine. Planning to practice: Efficient online fine-tuning by composing goals in latent space. In International Conference on Intelligent Robots and Systems (IROS), 2022.

Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, and Ahmed Touati. Temporal difference flows. In International Conference on Machine Learning (ICML), 2025.

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Unsupervised zero-shot reinforcement learning via functional reward encodings. In International Conference on Machine Learning (ICML), 2024.

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In International Conference on Learning Representations (ICLR), 2025.

Dror Freirich, Tzahi Shimkin, Ron Meir, and Aviv Tamar. Distributional multivariate policy evaluation and exploration with the Bellman GAN. In International Conference on Machine Learning (ICML), 2019.

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2021.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), 2018.

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. In Neural Information Processing Systems (NeurIPS), 2024.
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. In Neural Information Processing Systems (NeurIPS), 2025.

Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Manon Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals via iterated supervised learning. In International Conference on Learning Representations (ICLR), 2021.

Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Quan Vuong, Ted Xiao, Pannag R. Sanketi, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Robotics: Science and Systems, 2024.

Michael Gimelfarb, Andre Barreto, Scott Sanner, and Chi-Guhn Lee. Risk-aware transfer in reinforcement learning using successor features. In Neural Information Processing Systems (NeurIPS), 2021.

Google DeepMind. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR, abs/2507.06261, 2025.

Nico Gürtler and Georg Martius. Long-horizon planning with predictable skills. In Reinforcement Learning Conference (RLC), 2025.

Nico Gürtler, Dieter Büchler, and Georg Martius. Hierarchical reinforcement learning with timed subgoals. In Neural Information Processing Systems (NeurIPS), 2021.

Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. In Neural Information Processing Systems (NeurIPS), 2022.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H.
van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Neural Information Processing Systems (NeurIPS), 2020.

Chin-Wei Huang, Milad Aghajohari, Joey Bose, Prakash Panangaden, and Aaron C. Courville. Riemannian diffusion models. In Neural Information Processing Systems (NeurIPS), 2022.

John D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

Michael Janner, Igor Mordatch, and Sergey Levine. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. In Neural Information Processing Systems (NeurIPS), 2020.

Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning (ICML), 2022.

Yuu Jinnai, David Abel, David Hershkowitz, Michael Littman, and George Konidaris. Finding options that minimize planning time. In International Conference on Machine Learning (ICML), 2019.

Leslie Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. In International Conference on Machine Learning (ICML), 1993a.

Leslie Kaelbling. Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), 1993b.

Kacper Kapusniak, Peter Potaptchik, Teodora Reu, Leo Zhang, Alexander Tong, Michael M. Bronstein, Joey Bose, and Francesco Di Giovanni. Metric flow matching for smooth interpolations on the data manifold. In Neural Information Processing Systems (NeurIPS), 2024.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In International Conference on Learning Representations (ICLR), 2015.

Martin Klissarov, Akhil Bagaria, Ziyan Luo, George Konidaris, Doina Precup, and Marlos C. Machado. Discovering temporal structure: An overview of hierarchical reinforcement learning. CoRR, abs/2506.14045, 2025.

Andrei Kolmogoroff. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Mathematische Annalen, 104(1):415–458, 1931.

Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Neural Information Processing Systems (NeurIPS), 2016.

Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon. The principles of diffusion models. CoRR, abs/2510.21890, 2025.

Kyowoon Lee and Jaesik Choi. State-covering trajectory stitching for diffusion planners. In Neural Information Processing Systems (NeurIPS), 2025.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR, abs/2005.01643, 2020.

Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical reinforcement learning with hindsight. In International Conference on Learning Representations (ICLR), 2019.

Yitang Li, Zhengyi Luo, Tonghe Zhang, Cunxi Dai, Anssi Kanervisto, Andrea Tirinzoni, Haoyang Weng, Kris Kitani, Mateusz Guzek, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta, and Guanya Shi. BFM-Zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. In International Conference on Learning Representations (ICLR), 2026.

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.

Chunlok Lo, Kevin Roice, Parham Mohammad Panahi, Scott M.
Jordan, Adam White, Gabor Mihucz, Farzane Aminmansour, and Martha White. Goal-space planning with subgoal models. Journal of Machine Learning Research (JMLR), 25(330):1–57, 2024.

Yunhao Luo, Utkarsh A. Mishra, Yilun Du, and Danfei Xu. Generative trajectory stitching through diffusion composition. In Neural Information Processing Systems (NeurIPS), 2025.

Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris M. Kitani, and Weipeng Xu. Universal humanoid motion representations for physics-based control. In International Conference on Learning Representations (ICLR), 2024.

Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning (CoRL), 2019.

Marlos C. Machado, Marc G. Bellemare, and Michael H. Bowling. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations (ICLR), 2018.

Marlos C. Machado, André Barreto, Doina Precup, and Michael Bowling. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research (JMLR), 24:80:1–80:69, 2023.

Josef Matyáš. Random optimization. Automation and Remote Control, 26(2):246–253, 1965.

Meta. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024.

Utkarsh A. Mishra, Shangjie Xue, Yongxin Chen, and Danfei Xu. Generative skill chaining: Long-horizon skill planning with diffusion models. In Conference on Robot Learning (CoRL), 2023.

Diganta Misra. Mish: A self regularized non-monotonic neural activation function. CoRR, abs/1908.08681, 2019.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Kevin Murphy. Reinforcement learning: An overview. CoRR, abs/2412.05265, 2024.

Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2018.

Suraj Nair and Chelsea Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. In International Conference on Learning Representations (ICLR), 2020.

Soroush Nasiriany, Vitchyr Pong, Steven Lin, and Sergey Levine. Planning with goal-conditioned policies. In Neural Information Processing Systems (NeurIPS), 2019.

NVIDIA. GR00T N1: An open foundation model for generalist humanoid robots. CoRR, abs/2503.14734, 2025.

OpenAI. OpenAI o1 system card. CoRR, abs/2412.16720, 2024.

Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. HIQL: Offline goal-conditioned RL with latent states as actions. In Neural Information Processing Systems (NeurIPS), 2023.

Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with Hilbert representations. In International Conference on Machine Learning (ICML), 2024.

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. In International Conference on Learning Representations (ICLR), 2025a.

Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon reduction makes RL scalable. In Neural Information Processing Systems (NeurIPS), 2025b.
Seohong P ark, Qiyang Li, and Sergey Levine. Flow q-learning. In International Confer enc e on Machine L e arning (ICML) , 2025c. Ronald Parr and Stuart Russell. Reinforcement learning with hierarc hies of mac hines. In Neur al Information Pr o c essing Systems (NeurIPS) , 1997. Xue Bin Peng, Ze Ma, Pieter Abb eel, Sergey Levine, and Ang jo o Kanazaw a. AMP: adv ersarial motion priors for st ylized ph ysics-based c haracter control. ACM T r ansactions on Gr aphics , 40(4):144:1–144:20, 2021. Xue Bin P eng, Y unrong Guo, Lina Halp er, Sergey Levine, and Sanja Fidler. ASE: large-scale reusable adversarial skill em b eddings for physically sim ulated characters. ACM T r ansactions on Gr aphics , 41(4):94:1–94:17, 2022. Ethan Perez, Florian Strub, Harm de V ries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning lay er. In AAAI Conferenc e on Artificial Intel ligenc e , 2018. Marek Petrik and Bruno Sc herrer. Biasing approximate dynamic programming with a lo wer discoun t factor. In Neur al Information Pr o c essing Systems (NeurIPS) , 2008. Ph ysical In telligence, Ali Amin, Raichelle Aniceto, Ash win Balakrishna, Kevin Black, Ken Conley , Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Y unhao F ang, Chelsea Finn, Catherine Glossop, Thomas Godden, Iv an Goryac hev, Lach y Gro om, Hun ter Hanco ck, Karol Hausman, Gashon Hussein, Brian Ich ter, Szymon Jakubczak, Ro wan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Y ao Lu, Vishnu Mano, Mohith Mothukuri, Sura j Nair, Karl P ertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoy ang Shi, Laura Smith, Jost T obias Springenberg, Kyle Stac howicz, Will Stoeckle, Alex Swerdlo w, James T anner, Marcel T orne, Quan V uong, Anna W alling, Haoh uan W ang, Blake Williams, Sukwon Y o o, Lili Y u, Ury Zhilinsky , and Zhiyuan Zhou. π ∗ 0 . 
6 : a VLA that learns from exp erience. CoRR , abs/2511.14759, 2025. Matteo Pirotta, Andrea Tirinzoni, Ahmed T ouati, Alessandro Lazaric, and Y ann Ollivier. F ast imitation via b eha vior foundation mo dels. In International Confer enc e on L e arning R epr esentations (ICLR) , 2024. Doina Precup. T emp or al abstr action in r einfor c ement le arning . PhD thesis, Universit y of Massach usetts Amherst, 2000. Doina Precup and Richard S. Sutton. Multi-time models for temp orally abstract planning. In Neur al Information Pr o c essing Systems (NeurIPS) , 1997. Doina Precup, Richard S. Sutton, and Satinder Singh. Planning with closed-lo op macro actions. In AAAI F al l Symp osium on Mo del-dir e cte d Autonomous Systems , 1997. Doina Precup, Richard S. Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temp orally abstract options. In Eur op e an c onfer enc e on machine le arning , 1998. Alec Radford, Jong W ook Kim, Chris Hallacy , Adit ya Ramesh, Gabriel Goh, Sandhini Agarw al, Girish Sastry , Amanda Ask ell, P amela Mishkin, Jac k Clark, Gretc hen Krueger, and Ilya Sutsk ever. Learning transferable visual mo dels from natural language supervision. In International Confer enc e on Machine L e arning (ICML) , 2021. Nikhila Ravi, V alentin Gab eur, Y uan-Ting Hu, Ronghang Hu, Chaitan ya Ryali, T engyu Ma, Haitham Khedr, Roman Rädle, Chloé Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan V asudev Alwala, Nicolas Carion, Chao-Y uan W u, Ross B. Girshick, Piotr Dollár, and Christoph F eich tenhofer. SAM 2: Segment anything in images and videos. In International Confer enc e on L e arning R epr esentations (ICLR) , 2025. Rafael Ro driguez-Sanchez and George Konidaris. Learning abstract world mo del for v alue-preserving planning with options. In R einfor c ement L e arning Confer enc e (RLC) , 2024. Alex Rogozhniko v. Einops: Clear and reliable tensor manipulations with einstein-like notation. 
In International Confer enc e on L e arning R epr esentations (ICLR) , 2022. 15 Olaf Ronneb erger, Philipp Fischer, and Thomas Brox. U-net: Conv olutional net works for biomedical image segmen tation. In Me dic al Image Computing and Computer-Assiste d Intervention (MICCAI) , v olume 9351, pages 234–241, 2015. Y ossi Rubner, Carlo T omasi, and Leonidas J. Guibas. The earth mov er’s distance as a metric for image retriev al. International Journal of Computer Vision , 40(2):99–121, 2000. T om Schaul, Daniel Horgan, Karol Gregor, and Da vid Silver. Universal v alue function approximators. In International Confer enc e on Machine Le arning (ICML) , 2015. Jürgen Schmidh uber. Learning to generate sub-goals for action sequences. In Artificial neur al networks , pages 967–972, 1991. Liam Schramm and Ab deslam Boularias. Bellman diffusion mo dels. CoRR , abs/2407.12163, 2024. Lucy Xiao yang Shi, Joseph J Lim, and Y oungwoon Lee. Skill-based mo del-based reinforcement learning. In Confer enc e on R ob ot L e arning (CoRL) , 2022. Harshit Sikc hi, Siddhant Agarw al, Pranay a Ja jo o, Samy ak Para juli, Caleb Chuc k, Max Rudolph, Peter Stone, Amy Zhang, and Scott Niekum. RLZero: Direct p olicy inference from language without in-domain sup ervision. In Neur al Information Pr o c essing Systems (NeurIPS) , 2025a. Harshit Sikc hi, Andrea Tirinzoni, Ahmed T ouati, Yingc hen Xu, Anssi Kanervisto, Scott Niekum, Am y Zhang, Alessandro Lazaric, and Matteo Pirotta. F ast adaptation with b ehavioral foundation models. In R einfor c ement L e arning Confer enc e (RLC) , 2025b. Da vid Silv er and Kamil Ciosek. Comp ositional planning using optimal option models. In International Confer enc e on Machine L e arning (ICML) , 2012. Satinder Singh. T ransfer of learning by comp osing solutions of elemen tal sequen tial tasks. Machine L e arning , 8:323–339, 1992. Jasc ha Sohl-Dickstein, Eric W eiss, Niru Maheswaranathan, and Surya Ganguli. 
Deep unsupervised learning using nonequilibrium thermo dynamics. In International Confer enc e on Machine L e arning (ICML) , 2015. Y ang Song, Jascha Sohl-Dickstein, Diederik P . Kingma, Abhishek Kumar, Stefano Ermon, and Ben P o ole. Score- based generative mo deling through sto chastic differential equations. In International Confer enc e on L e arning R epr esentations (ICLR) , 2021. Ric hard S. Sutton. TD mo dels: Mo deling the world at a mixture of time scales. In International Confer enc e on Machine L e arning (ICML) , 1995. Ric hard S. Sutton, Doina Precup, and Satinder Singh. Betw een mdps and semi-mdps: A framework for temp oral abstraction in reinforcement learning. Artificial intel ligenc e , 112(1-2):181–211, 1999. Ric hard S. Sutton, Marlos C. Machado, G. Zacharias Holland, David Szepesv ari, Finbarr Timbers, Brian T anner, and Adam White. Reward-respecting subtasks for mo del-based reinforcement learning. Artificial Intel ligenc e , 324: 104001, 2023. Chen T essler, Y oni Kasten, Y unrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng. CALM: conditional adv ersarial laten t models for directable virtual characters. In A CM SIGGRAPH , 2023. Chen T essler, Y unrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based c haracter con trol through masked motion inpain ting. ACM T r ansactions on Gr aphics , 43(6):209:1–209:21, 2024. Shan tanu Thakoor, Mark Ro wland, Diana Borsa, Will Dabney , Rémi Munos, and André Barreto. Generalised p olicy impro vemen t with geometric p olicy comp osition. In International Confer enc e on Machine L e arning (ICML) , 2022. Andrea Tirinzoni, Ahmed T ouati, Jesse F arebrother, Mateusz Guzek, Anssi Kanervisto, Yingc hen Xu, Alessandro Lazaric, and Matteo Pirotta. Zero-shot whole-b o dy humanoid control via b ehavioral foundation mo dels. In International Confer enc e on Le arning R epr esentations (ICLR) , 2025. Eman uel T o dorov, T om Erez, and Y uv al T assa. 
Mujo co: A physics engine for model-based con trol. In International Confer enc e on Intel ligent R ob ots and Systems (IR OS) , 2012. Manan T omar, Philipp e Hansen-Estruch, Philip Bac hman, Alex Lamb, John Langford, Matthew E T aylor, and Sergey Levine. Video o ccupancy mo dels. CoRR , abs/2407.09533, 2024. Ahmed T ouati and Y ann Ollivier. Learning one represen tation to optimize all rewards. In Neur al Information Pr o c essing Systems (NeurIPS) , 2021. 16 Ahmed T ouati, Jérémy Rapin, and Y ann Ollivier. Does zero-shot reinforcement learning exist? In International Confer enc e on L e arning R epr esentations (ICLR) , 2023. Aäron v an den Oord, Y azhe Li, and Oriol Viny als. Representation learning with contrastiv e predictive co ding. CoRR , abs/1807.03748, 2018. Alexander Sasha V ezhnev ets, Simon Osindero, T om Schaul, Nicolas Heess, Max Jaderberg, David Silver, and K oray Ka vukcuoglu. F eudal net works for hierarc hical reinforcemen t learning. In International Confer enc e on Machine L e arning (ICML) , 2017. P ascal Vincent. A connection b et ween score matching and denoising auto enco ders. Neur al Computation , 23(7): 1661–1674, 2011. Mic hael L. W askom. Seab orn: Statistical data visualization. Journal of Op en Sour c e Software , 6(60):3021, 2021. Harley Wiltzer, Jesse F arebrother, Arth ur Gretton, and Mark Rowland. F oundations of multiv ariate distributional reinforcemen t learning. In Neur al Information Pro c essing Systems (NeurIPS) , 2024a. Harley Wiltzer, Jesse F arebrother, Arthur Gretton, Y unhao T ang, André Barreto, Will Dabney , Marc G. Bellemare, and Mark Rowland. A distributional analogue to the successor representation. In International Confer enc e on Machine L e arning (ICML) , 2024b. Runzhe W u, Masatosh Uehara, and W en Sun. Distributional offline p olicy ev aluation with predictive error guarantees. In International Confer enc e on Machine L e arning (ICML) , 2023. 
Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, and Florian Shkurti. Latent skill planning for exploration and transfer. In International Conference on Learning Representations (ICLR), 2021.
Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, and Sungjin Ahn. Monte Carlo tree diffusion for system 2 planning. In International Conference on Machine Learning (ICML), 2025a.
Jaesik Yoon, Hyeonseo Cho, Yoshua Bengio, and Sungjin Ahn. Fast Monte Carlo tree diffusion: 100× speedup via parallel and sparse planning. In Neural Information Processing Systems (NeurIPS), 2025b.
Jingwei Zhang, Jost Tobias Springenberg, Arunkumar Byravan, Leonard Hasenclever, Abbas Abdolmaleki, Dushyant Rao, Nicolas Heess, and Martin A. Riedmiller. Leveraging jumpy models for planning and fast learning in robotic domains. CoRR, abs/2302.12617, 2023.
Pushi Zhang, Xiaoyu Chen, Li Zhao, Wei Xiong, Tao Qin, and Tie-Yan Liu. Distributional reinforcement learning for multi-dimensional reward functions. In Neural Information Processing Systems (NeurIPS), 2021.
Chongyi Zheng, Seohong Park, Sergey Levine, and Benjamin Eysenbach. Intention-conditioned flow occupancy models. In International Conference on Learning Representations (ICLR), 2026.
Qinqing Zheng, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, and Ricky T. Q. Chen. Guided flows for generative modeling and decision making. CoRR, abs/2311.13443, 2023.
Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, and Abhishek Gupta. Distributional successor features enable zero-shot policy optimization. In Neural Information Processing Systems (NeurIPS), 2024.

Appendices

Appendix A  Extended Related Works  18
Appendix B  Algorithms  20
  Algorithm 1: Temporal Difference Flows with Horizon Consistency  20
  Algorithm 2: Compositional Planning with Jumpy World Models  21
  Algorithm 3: Goal-Conditioned Proposal  21
  Algorithm 4: Unconditional Proposal  21
Appendix C  Theoretical Results  22
  C.1 Multi-Timescale Temporal Difference Flows with Horizon Consistency  23
Appendix D  Additional Results  25
  D.1 Compositional Planning / Zero-Shot Results  25
  D.2 Compositional Planning / GPI / Action Planning Results  27
  D.3 Ablation on the Planning Frequency  29
  D.4 Ablation on the Planning Objective  30
  D.5 Ablation on the Proposal Distribution  31
  D.6 Ablation on the Consistency Objective  32
Appendix E  Qualitative Geometric Horizon Model Samples  34
Appendix F  Experimental Details  35
  F.1 Base Policies & Hyperparameters  35
  F.2 Geometric Horizon Model Hyperparameters  37
  F.3 Planning Hyperparameters  38

A Extended Related Works

Successor Measure  Methods that learn (discounted) state occupancies employing temporal difference learning (Sutton, 1995) date back to the successor representation (Dayan, 1993), with more recent extensions like successor features (Barreto et al., 2017) and the successor measure (Blier et al., 2021; Blier, 2022). Janner et al. (2020) was the first to introduce a generative model of the successor measure with γ-models, also referred to as geometric horizon models (Thakoor et al.
, 2022). Wiltzer et al. (2024b) additionally introduced δ-models that learn a distribution over γ-models, enabling applications in distributional RL (Bellemare et al., 2017, 2023; Dabney et al., 2018). Many generative modeling techniques have been applied to learn these models, including GANs (Janner et al., 2020; Wiltzer et al., 2024b), normalizing flows (Janner et al., 2020), VAEs (Thakoor et al., 2022; Tomar et al., 2024), flow matching (Farebrother et al., 2025; Zheng et al., 2026), and diffusion (Schramm and Boularias, 2024; Farebrother et al., 2025). Closely related is work on distributional successor features, also known as multivariate distributional RL (Freirich et al., 2019; Gimelfarb et al., 2021; Zhang et al., 2021; Wu et al., 2023; Wiltzer et al., 2024a; Zhu et al., 2024), which models the distribution over cumulative finite-dimensional features induced by a policy. From a generative modeling perspective, our work generalizes temporal difference flows (Farebrother et al., 2025) and shows how long-horizon predictions can be improved by training across multiple timescales with a novel horizon consistency objective.

Additionally, our compositional framework can be viewed as a generalization of Geometric Generalized Policy Improvement (GGPI; Thakoor et al., 2022) to arbitrary switching probabilities; however, our concrete formulation and empirical evaluation differ substantially from Thakoor et al. (2022). First, Thakoor et al. (2022) consider only four pre-trained policies in practice, and learn two separate GHMs with effective horizons of 5 and 10 steps, respectively. Their experiments also restrict composition to sequences of only two policies. In contrast, we learn GHMs conditioned on a continuous family of policies and timescales with horizons up to 25× longer, and evaluate GSPs with lengths ranging from 3 to 24 policies. As a result, our work not only enables a richer class of geometric switching policies but also performs a more comprehensive empirical validation of these techniques on challenging long-horizon tasks where temporal abstraction is most important.

Planning With Temporal Abstractions  Several prior works have explored planning over subgoals or waypoints to solve long-horizon tasks (e.g., Nasiriany et al., 2019; Eysenbach et al., 2019; Nair and Finn, 2020; Chane-Sane et al., 2021; Fang et al., 2022; Hafner et al., 2022; Lo et al., 2024; Gürtler and Martius, 2025). These methods typically involve learning subgoal-conditioned policies together with a high-level dynamics model that predicts the outcome of reaching these subgoals k steps in the future, then employ model predictive control to select subgoal sequences. In contrast, GHMs model the entire state-occupancy distribution rather than a fixed k-step lookahead, allowing planning over arbitrary reward functions rather than just goal-conditioned tasks. A parallel line of work employs diffusion models (Vincent, 2011; Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021; Lai et al., 2025) for trajectory-level planning (e.g., Janner et al., 2022; Ajay et al., 2023; Zheng et al., 2023; Chen et al., 2025; Yoon et al., 2025a,b; Luo et al., 2025; Lee and Choi, 2025). Rather than modeling policy-induced dynamics, these methods train generative models over trajectory segments and perform planning within the denoising process. Execution then relies on inverse dynamics models to extract actions from planned state sequences. While powerful, this paradigm has notable limitations: (i) it requires learning accurate inverse dynamics; (ii) planning quality depends heavily on the trajectory distribution in the training data rather than on the capabilities of any particular policy; and (iii) these methods often assume access to oracle goal representations during the denoising process. In contrast, our work is policy-grounded: it directly composes the behaviors of pre-trained policies by predicting their induced state occupancies, rather than planning over abstract trajectory segments. This makes our approach agnostic to the policy class and avoids inverse dynamics entirely, since actions are sampled directly from the base policies. Closer to our approach are methods that learn dynamics models over temporally extended behaviors (e.g., Xie et al., 2021; Shi et al., 2022; Zhang et al., 2023; Mishra et al., 2023; Gürtler and Martius, 2025). These methods learn general latent skills together with a high-level dynamics model predicting the outcomes of their execution (more precisely, the states reached a fixed number of steps in the future), and use MPC to plan over these skills. However, these approaches suffer from the same limitations as the k-step subgoal methods discussed above. While we only report experiments for goal-based policies, CompPlan also enables planning on top of any set of parameterized skills or policies, for example, those learned via unsupervised RL methods (Borsa et al., 2019; Touati and Ollivier, 2021; Touati et al., 2023; Park et al., 2024; Frans et al., 2024; Cetin et al., 2025; Agarwal et al., 2025; Tirinzoni et al., 2025; Sikchi et al., 2025b,a; Bagatella et al., 2026). Finally, our work connects to both the options framework (Sutton et al.
, 1999; Precup, 2000) and hierarchical RL (Schmidhuber, 1991; Kaelbling, 1993a; Parr and Russell, 1997; Barto and Mahadevan, 2003; Klissarov et al., 2025), which share a focus on temporal abstraction and multi-level decision making. Prior work has explored planning with options (e.g., Silver and Ciosek, 2012; Jinnai et al., 2019; Barreto et al., 2019; Carvalho et al., 2023; Rodriguez-Sanchez and Konidaris, 2024) or learning policies and value functions at multiple levels of abstraction (e.g., Precup and Sutton, 1997; Precup et al., 1997, 1998; Dayan and Hinton, 1992; Dietterich, 1998; Vezhnevets et al., 2017; Kulkarni et al., 2016; Gürtler et al., 2021; Nachum et al., 2018; Levy et al., 2019; Park et al., 2023, 2025b). Our compositional planning method can be viewed both as planning over options with a simple termination condition – where each policy terminates after a random number of steps – and as a hierarchical RL method that replaces the high-level policy with a test-time planning procedure.

B Algorithms

We present the full TD-HC algorithm along with CompPlan and the two proposal distributions outlined herein.

Algorithm 1  Temporal Difference Flows with Horizon Consistency
1:  Inputs: offline dataset D, policy π, batch size K, Polyak coefficient ζ, randomly initialized weights θ, learning rate η, maximum discount factor γ_max ∈ [0, 1), horizon consistency proportion τ_c ∈ [0, 1].
2:  for n = 1, . . . do
3:    Sample mini-batch {(S_k, A_k, S′_k, A′_k)}_{k=1}^K from D
4:    for k = 1, . . ., K do
5:      t_k ∼ U([0, 1])
6:      γ_k ∼ U([0, γ_max])
7:
8:      # One-Step Term
9:      X_0 ∼ p_0(·)
10:     X→_{t_k} ← (1 − t_k) X_0 + t_k S′_k
11:     ℓ→_k(θ) = ‖v_{t_k}(X→_{t_k} | S_k, A_k, γ_k; θ) − (S′_k − X_0)‖²
12:
13:     if k ≤ ⌈K · τ_c⌉ then
14:       β_k ∼ U([0, γ_k])
15:       # β-Bootstrap Term
16:       X_0 ∼ p_0(·)
17:       X↷β_{t_k} ← ψ_{t_k}(X_0 | S′_k, A′_k, β_k; θ̄)
18:       ℓ↷β_k(θ) = ‖v_{t_k}(X↷β_{t_k} | S_k, A_k, γ_k; θ) − v_{t_k}(X↷β_{t_k} | S′_k, A′_k, β_k; θ̄)‖²
19:       # γ-Bootstrap Term
20:       (X_0, X″_0) ∼ p_0(·)
21:       S″_k ← ψ_1(X″_0 | S′_k, A′_k, β_k; θ̄)
22:       A″_k ∼ π(· | S″_k)
23:       X↷γ_{t_k} ← ψ_{t_k}(X_0 | S″_k, A″_k, γ_k; θ̄)
24:       ℓ↷γ_k(θ) = ‖v_{t_k}(X↷γ_{t_k} | S_k, A_k, γ_k; θ) − v_{t_k}(X↷γ_{t_k} | S″_k, A″_k, γ_k; θ̄)‖²
25:       # Mixture Loss
26:       ℓ_k(θ) = (1 − γ_k) ℓ→_k(θ) + γ_k (1 − γ_k)/(1 − β_k) ℓ↷β_k(θ) + γ_k (γ_k − β_k)/(1 − β_k) ℓ↷γ_k(θ)
27:     else
28:       # γ-Bootstrap Term
29:       X_0 ∼ p_0(·)
30:       X↷γ_{t_k} ← ψ_{t_k}(X_0 | S′_k, A′_k, γ_k; θ̄)
31:       ℓ↷γ_k(θ) = ‖v_{t_k}(X↷γ_{t_k} | S_k, A_k, γ_k; θ) − v_{t_k}(X↷γ_{t_k} | S′_k, A′_k, γ_k; θ̄)‖²
32:       # Mixture Loss
33:       ℓ_k(θ) = (1 − γ_k) ℓ→_k(θ) + γ_k ℓ↷γ_k(θ)
34:     end if
35:   end for
36:   # Compute loss
37:   ℓ(θ) = (1/K) Σ_{k=1}^K ℓ_k(θ)
38:   # Perform gradient step
39:   θ ← θ − η ∇_θ ℓ(θ)
40:   # Update parameters of target vector field
41:   θ̄ ← ζ θ̄ + (1 − ζ) θ
42: end for

Algorithm 2  Compositional Planning with Jumpy World Models
1:  Inputs: parameterized class of policies {π_z}_{z∈Z}, geometric horizon model m^{π_z}_γ, policy sequence length K, proposal distribution ρ : S → P(Z^K), number of proposals M, number of Monte Carlo samples N, reward function r, effective discount factors {β_k}_{k=1}^K, mixture weights {w_k}_{k=1}^K
2:  function CompPlan(s)
3:    for i = 1, . . .
, M do
4:      # Sample policy sequence from proposal distribution
5:      (z^(i)_1, . . ., z^(i)_K) ∼ ρ(· | s)
6:
7:      # Sample initial action
8:      a^(i)_1 ∼ π_{z^(i)_1}(· | s)
9:
10:     # Monte Carlo Q-value estimation (Lemma 1)
11:     for j = 1, . . ., N do
12:       (S_0, A_0) ← (s, a^(i)_1)
13:       for k = 1, . . ., K do
14:         S_k ∼ m^{π_{z^(i)_k}}_{β_k}(· | S_{k−1}, A_{k−1})
15:         A_k ∼ π_{z^(i)_{k+1}}(· | S_k)
16:       end for
17:       Q̂^(i,j) ← (1 − γ)^{−1} Σ_{k=1}^K w_k · r(S_k)
18:     end for
19:     Q̂^(i) ← (1/N) Σ_{j=1}^N Q̂^(i,j)
20:   end for
21:
22:   # Select best candidate
23:   i* ← argmax_{i ∈ ⟦M⟧} Q̂^(i)
24:
25:   # Return optimal action and policy
26:   return (a^(i*)_1, z^(i*)_1)
27: end function

Algorithm 3  Goal-Conditioned Proposal
1:  Inputs: geometric horizon model m^{π_z}_γ, policy sequence length K, effective discount factors {β_k}_{k=1}^K
2:  function GoalCondProposal(s, g)
3:    # Chain GHM samples toward goal
4:    z_0 ← s
5:    for k = 1, . . ., K do
6:      A_{k−1} ∼ π_g(· | z_{k−1})
7:      z_k ∼ m^{π_g}_{β_k}(· | z_{k−1}, A_{k−1})
8:    end for
9:    return (z_1, . . ., z_K)
10: end function

Algorithm 4  Unconditional Proposal
1:  Inputs: unconditional (behavior policy µ) geometric horizon model m^µ_γ, policy sequence length K, effective discount factors {β_k}_{k=1}^K
2:  function UncondProposal(s)
3:    # Chain unconditional GHM samples
4:    z_0 ← s
5:    for k = 1, . . ., K do
6:      z_k ∼ m^µ_{β_k}(· | z_{k−1})
7:    end for
8:    return (z_1, . . ., z_K)
9:  end function

Figure 2  Two proposal distributions for goal-conditioned compositional planning. Left: GoalCondProposal samples subgoal sequences by chaining GHM predictions conditioned on the goal g ∈ S, guiding the agent toward the target. Right: UncondProposal samples from the behavior policy's GHM, which can be trained alongside m^{π_z}_γ by periodically setting z = ∅, a = ∅.

C Theoretical Results

Theorem 1.
Let ν := π_{z_1} \xrightarrow{\alpha_1} π_{z_2} \cdots \xrightarrow{\alpha_{n-1}} π_{z_n} be a geometric switching policy with global discount factor γ ∈ (0, 1), effective discount factors {β_k}_{k=1}^n, and weights {w_k}_{k=1}^n from Definition 1. For any state-action pair (s, a), the successor measure of ν decomposes as:

m^\nu_\gamma(\mathrm{d}s_+ \mid s, a) = \sum_{k=1}^{n} w_k \int_{\substack{s_1,\dots,s_{k-1} \\ a_1,\dots,a_{k-1}}} m^{\pi_{z_1}}_{\beta_1}(\mathrm{d}s_1 \mid s, a)\, \pi_{z_2}(\mathrm{d}a_1 \mid s_1) \cdots m^{\pi_{z_k}}_{\beta_k}(\mathrm{d}s_+ \mid s_{k-1}, a_{k-1}).

Proof. Denote by ν_{l:n} = π_{z_l} \xrightarrow{\alpha_l} π_{z_{l+1}} \cdots \xrightarrow{\alpha_{n-1}} π_{z_n} the geometric switching policy that starts with π_{z_l}. We proceed by induction over l ∈ {n, n−1, . . ., 1} to show that

m^{\nu_{l:n}}_\gamma(\mathrm{d}s' \mid s, a) = \sum_{k=l}^{n} \frac{1-\gamma}{1-\beta_k} \left( \prod_{i=l}^{k-1} \frac{\gamma-\beta_i}{1-\beta_i} \right) \int_{\substack{s_l,\dots,s_{k-1} \\ a_l,\dots,a_{k-1}}} m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s_l \mid s, a)\, \pi_{z_{l+1}}(\mathrm{d}a_l \mid s_l) \cdots m^{\pi_{z_k}}_{\beta_k}(\mathrm{d}s' \mid s_{k-1}, a_{k-1}), \tag{7}

where (s_{l−1}, a_{l−1}) = (s, a). For the base case l = n, the induction hypothesis (7) is immediately satisfied since m^{ν_{n:n}}_γ = m^{π_{z_n}}_γ. Now assume that the induction hypothesis (7) holds for l + 1 ∈ {2, . . ., n}; our goal is to demonstrate that it also holds for l. After executing a single step of ν_{l:n}, two outcomes are possible: with probability (1 − α_l), we remain committed to ν_{l:n}, or with probability α_l, we switch to the next policy π_{z_{l+1}}, thereby continuing the episode with ν_{l+1:n}. This leads to the following Bellman equation:

m^{\nu_{l:n}} = (1-\gamma) P + \gamma(1-\alpha_l) P^{\pi_{z_l}} m^{\nu_{l:n}} + \gamma \alpha_l P^{\pi_{z_{l+1}}} m^{\nu_{l+1:n}},

which implies (recalling the effective discount factor β_l = γ(1 − α_l)):

(I - \beta_l P^{\pi_{z_l}})\, m^{\nu_{l:n}} = (1-\gamma) P + \gamma \alpha_l P^{\pi_{z_{l+1}}} m^{\nu_{l+1:n}}
\implies m^{\nu_{l:n}} = (1-\gamma)(I - \beta_l P^{\pi_{z_l}})^{-1} P + \gamma \alpha_l (I - \beta_l P^{\pi_{z_l}})^{-1} P^{\pi_{z_{l+1}}} m^{\nu_{l+1:n}}
\implies m^{\nu_{l:n}}(\mathrm{d}s' \mid s, a) = \frac{1-\gamma}{1-\beta_l}\, m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s' \mid s, a) + \frac{\gamma-\beta_l}{1-\beta_l} \int_{s_l, a_l} m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s_l \mid s, a)\, \pi_{z_{l+1}}(\mathrm{d}a_l \mid s_l)\, m^{\nu_{l+1:n}}(\mathrm{d}s' \mid s_l, a_l).

Using the induction hypothesis for l + 1, we find:

m^{\nu_{l:n}}(\mathrm{d}s' \mid s, a) = \frac{1-\gamma}{1-\beta_l}\, m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s' \mid s, a) + \frac{\gamma-\beta_l}{1-\beta_l} \int_{s_l, a_l} m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s_l \mid s, a)\, \pi_{z_{l+1}}(\mathrm{d}a_l \mid s_l) \times \sum_{k=l+1}^{n} \frac{1-\gamma}{1-\beta_k} \left( \prod_{i=l+1}^{k-1} \frac{\gamma-\beta_i}{1-\beta_i} \right) \int_{\substack{s_{l+1},\dots,s_{k-1} \\ a_{l+1},\dots,a_{k-1}}} m^{\pi_{z_{l+1}}}_{\beta_{l+1}}(\mathrm{d}s_{l+1} \mid s_l, a_l)\, \pi_{z_{l+2}}(\mathrm{d}a_{l+1} \mid s_{l+1}) \cdots m^{\pi_{z_k}}_{\beta_k}(\mathrm{d}s' \mid s_{k-1}, a_{k-1})

= \frac{1-\gamma}{1-\beta_l}\, m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s' \mid s, a) + \sum_{k=l+1}^{n} \frac{1-\gamma}{1-\beta_k} \left( \prod_{i=l}^{k-1} \frac{\gamma-\beta_i}{1-\beta_i} \right) \int_{\substack{s_l,\dots,s_{k-1} \\ a_l,\dots,a_{k-1}}} m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s_l \mid s, a)\, \pi_{z_{l+1}}(\mathrm{d}a_l \mid s_l) \cdots m^{\pi_{z_k}}_{\beta_k}(\mathrm{d}s' \mid s_{k-1}, a_{k-1})

= \sum_{k=l}^{n} \frac{1-\gamma}{1-\beta_k} \left( \prod_{i=l}^{k-1} \frac{\gamma-\beta_i}{1-\beta_i} \right) \int_{\substack{s_l,\dots,s_{k-1} \\ a_l,\dots,a_{k-1}}} m^{\pi_{z_l}}_{\beta_l}(\mathrm{d}s_l \mid s, a)\, \pi_{z_{l+1}}(\mathrm{d}a_l \mid s_l) \cdots m^{\pi_{z_k}}_{\beta_k}(\mathrm{d}s' \mid s_{k-1}, a_{k-1}),

which shows the desired result.

C.1 Multi-Timescale Temporal Difference Flows with Horizon Consistency

The results in this section generalize those found in Farebrother et al. (2025) to arbitrary mixture distributions.

Lemma 2. Let {v^i_t}_{i∈⟦N⟧} be a family of N ∈ ℕ vector fields that generate the probability paths {p^i_t}_{i∈⟦N⟧}. Then the mixture probability path p_t = Σ_i λ_i p^i_t, where λ_i ∈ [0, 1] for each i ∈ ⟦N⟧ and Σ_i λ_i = 1, is generated by the vector field

v_t := \frac{\sum_i \lambda_i p^i_t v^i_t}{\sum_i \lambda_i p^i_t}. \tag{8}

Proof.
Since v^i_t generates p^i_t, we know from the continuity equation that

\forall i \in ⟦N⟧, \quad \frac{\partial p^i_t}{\partial t} = \mathrm{div}(p^i_t v^i_t),

where div denotes the divergence operator. Then, by linearity of div,

\frac{\partial p_t}{\partial t} = \frac{\partial \left( \sum_i \lambda_i p^i_t \right)}{\partial t} = \sum_i \lambda_i\, \mathrm{div}(p^i_t v^i_t) = \mathrm{div}\left( \sum_i \lambda_i p^i_t v^i_t \right) = \mathrm{div}\left( \frac{\sum_i \lambda_i p^i_t v^i_t}{\sum_i \lambda_i p^i_t} \sum_i \lambda_i p^i_t \right) = \mathrm{div}(v_t p_t).

Hence, (v_t, p_t) satisfies the continuity equation, which implies that v_t generates p_t.

Lemma 3. Let {v^i_t}_{i∈⟦N⟧} be a family of N ∈ ℕ vector fields that generate the probability paths {p^i_t}_{i∈⟦N⟧}. For λ_i ∈ [0, 1] such that Σ_i λ_i = 1, the vector field v_t = (Σ_i λ_i p^i_t v^i_t)/(Σ_i λ_i p^i_t) satisfies

v_t = \arg\min_{v : \mathbb{R}^d \to \mathbb{R}^d} \left\{ \sum_i \lambda_i\, \mathbb{E}_{x_t \sim p^i_t}\left[ \| v(x_t) - v^i_t(x_t) \|^2 \right] \right\}.

Proof. Let ℓ_t(v) := Σ_i λ_i E_{x_t ∼ p^i_t}[‖v(x_t) − v^i_t(x_t)‖²]. The functional derivative of this quantity with respect to v, evaluated at some point x, is

\nabla_v \ell_t(v)(x) = \sum_i \lambda_i p^i_t(x)\left( v(x) - v^i_t(x) \right).

Setting this to zero and solving for v(x) yields the result.

The consistency operator in equation (4) combines three different distributions. Lemmas 2 and 3 indicate that we can construct distinct probability paths for each distribution as follows.

1. For the first distribution, i.e., P(· | s, a), we apply the standard Conditional Flow Matching (CFM) approach, where the probability path is defined as the marginal over a simple conditional path (specifically, we use the Optimal Transport (OT) path):

q_t(x \mid s, a) = \mathbb{E}_{S' \sim P(\cdot \mid s, a)}\left[ \mathcal{N}(x;\, tS',\, (1-t)^2) \right], \tag{9}

where N(x; tS′, (1−t)²) is the Gaussian density with mean tS′ and variance (1−t)². This leads to the standard CFM objective:

\mathbb{E}_{\substack{t,\, (S, A, S') \\ X_0 \sim p_0,\, X_t = tS' + (1-t)X_0}}\left[ \left\| v_t(X_t \mid S, A, \gamma; \theta) - (S' - X_0) \right\|^2 \right]. \tag{10}

2.
For the second distribution, i.e., E_{S′∼P(·|s,a), A′∼π(·|S′)}[ m^π_β(· | S′, A′) ] = (P^π m^π_β)(· | s, a), we leverage the fact that m^π_β is parameterized by a flow matching model to define the probability path:

q_t(x \mid s, a) = \mathbb{E}_{S' \sim P(\cdot \mid s, a),\, A' \sim \pi(\cdot \mid S')}\left[ p_0 \# \psi_t(\cdot \mid S', A', \beta; \theta) \right]. \tag{11}

q_t is a valid probability path, satisfying the boundary conditions q_0(x | s, a) = E_{S′, A′}[p_0(x)] = p_0(x) and q_1 = P^π m^π_β. It can be interpreted as an aggregation of conditional paths p_0 # ψ_t(· | S′, A′, β; θ), for which we have access to their vector fields v_t(X_t | S′, A′, β; θ). Using the equivalence between marginal flow matching and conditional flow matching (Lipman et al., 2023), we arrive at the following objective:

\mathbb{E}_{\substack{t,\, (S, A, S'),\, A' \sim \pi(\cdot \mid S') \\ X_t \sim p_0 \# \psi_t(\cdot \mid S', A', \beta; \bar\theta)}}\left[ \left\| v_t(X_t \mid S, A, \gamma; \theta) - v_t(X_t \mid S', A', \beta; \bar\theta) \right\|^2 \right]. \tag{12}

3. Similarly, for the third distribution E_{S′∼P(·|s,a), A′∼π(·|S′), S″∼m^π_β(·|S′,A′), A″∼π(·|S″)}[ m^π_γ(· | S″, A″) ], we again leverage that m^π_γ is parameterized by a flow matching model to define the probability path:

q_t(x \mid s, a) = \mathbb{E}_{\substack{S' \sim P(\cdot \mid s, a),\, A' \sim \pi(\cdot \mid S') \\ S'' \sim p_0 \# \psi_1(\cdot \mid S', A', \beta; \bar\theta),\, A'' \sim \pi(\cdot \mid S'')}}\left[ p_0 \# \psi_t(\cdot \mid S'', A'', \gamma; \theta) \right]. \tag{13}

q_t can be interpreted as an aggregation of conditional paths p_0 # ψ_t(· | S″, A″, γ; θ), for which we have access to their vector fields v_t(X_t | S″, A″, γ; θ). Using the equivalence between marginal flow matching and conditional flow matching (Lipman et al., 2023), we arrive at the following objective:
\mathbb{E}_{\substack{t,\, (S, A, S'),\, A' \sim \pi(\cdot \mid S'),\, S'' \sim p_0 \# \psi_1(\cdot \mid S', A', \beta; \bar\theta),\, A'' \sim \pi(\cdot \mid S'') \\ X_t \sim p_0 \# \psi_t(\cdot \mid S'', A'', \gamma; \bar\theta)}}\left[ \left\| v_t(X_t \mid S, A, \gamma; \theta) - v_t(X_t \mid S'', A'', \gamma; \bar\theta) \right\|^2 \right]. \tag{14}

D Additional Results

D.1 Compositional Planning / Zero-Shot Results

We report the full planning results here. Table 4 summarizes the success rate for each task. We evaluate 7 domains with 5 tasks per domain, for a total of 35 tasks. For each task, we evaluate the base policy and CompPlan by rolling out 10 trajectories. We report the standard deviation over the 3 seeds used for GHM training; for the base policy we do not have multiple seeds. We see that CompPlan is better than the base policy on almost all the tasks. Figure 3 shows a bar plot comparing the domain-averaged success rates of CompPlan and the zero-shot baseline.

Table 4  Success rate (↑) per task for base policies π_g (Zero Shot) and CompPlan. Mean and standard deviation reported over 3 seeds. Blue and red denote an increase and decrease w.r.t. zero-shot, with gray indicating no significant difference.
| Domain | Task | CRL Zero Shot | CRL CompPlan | GC-1S Zero Shot | GC-1S CompPlan | GC-BC Zero Shot | GC-BC CompPlan | GC-TD3 Zero Shot | GC-TD3 CompPlan | HFBC Zero Shot | HFBC CompPlan |
|---|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-medium | 1 | 0.950 | 0.970 (0.030) | 0.400 | 0.930 (0.070) | 0.400 | 0.870 (0.090) | 0.650 | 0.800 (0.060) | 0.900 | 0.870 (0.030) |
| | 2 | 0.950 | 0.900 (0.100) | 0.800 | 1.000 (0.000) | 0.550 | 0.830 (0.090) | 0.900 | 0.530 (0.120) | 0.900 | 0.970 (0.030) |
| | 3 | 0.650 | 0.970 (0.030) | 0.100 | 0.500 (0.120) | 0.600 | 0.900 (0.060) | 0.450 | 0.770 (0.030) | 1.000 | 0.970 (0.030) |
| | 4 | 0.900 | 1.000 (0.000) | 0.650 | 0.930 (0.030) | 0.100 | 0.770 (0.090) | 0.550 | 0.600 (0.100) | 1.000 | 0.970 (0.030) |
| | 5 | 0.950 | 1.000 (0.000) | 0.850 | 0.970 (0.030) | 0.800 | 0.870 (0.090) | 0.700 | 0.570 (0.070) | 0.900 | 0.930 (0.030) |
| antmaze-large | 1 | 0.800 | 0.830 (0.030) | 0.100 | 0.770 (0.070) | 0.150 | 0.570 (0.070) | 0.300 | 0.500 (0.120) | 0.800 | 0.970 (0.030) |
| | 2 | 0.600 | 0.770 (0.030) | 0.150 | 0.630 (0.120) | 0.100 | 0.730 (0.030) | 0.100 | 0.530 (0.090) | 0.600 | 0.800 (0.000) |
| | 3 | 0.850 | 0.930 (0.030) | 0.800 | 0.830 (0.090) | 0.550 | 0.900 (0.060) | 0.750 | 0.500 (0.100) | 1.000 | 0.970 (0.030) |
| | 4 | 0.950 | 1.000 (0.000) | 0.000 | 0.430 (0.030) | 0.000 | 0.670 (0.090) | 0.000 | 0.400 (0.060) | 0.900 | 0.900 (0.060) |
| | 5 | 1.000 | 0.970 (0.030) | 0.000 | 0.370 (0.120) | 0.100 | 0.770 (0.090) | 0.000 | 0.470 (0.090) | 0.600 | 0.970 (0.030) |
| antmaze-giant | 1 | 0.000 | 0.070 (0.030) | 0.000 | 0.000 (0.000) | 0.000 | 0.000 (0.000) | 0.000 | 0.000 (0.000) | 0.400 | 0.730 (0.090) |
| | 2 | 0.000 | 0.670 (0.070) | 0.000 | 0.000 (0.000) | 0.000 | 0.000 (0.000) | 0.000 | 0.030 (0.030) | 0.300 | 0.830 (0.030) |
| | 3 | 0.000 | 0.100 (0.060) | 0.000 | 0.000 (0.000) | 0.000 | 0.000 (0.000) | 0.000 | 0.000 (0.000) | 0.200 | 0.800 (0.060) |
| | 4 | 0.500 | 0.230 (0.070) | 0.000 | 0.000 (0.000) | 0.000 | 0.000 (0.000) | 0.000 | 0.000 (0.000) | 0.500 | 0.730 (0.030) |
| | 5 | 0.300 | 0.400 (0.120) | 0.000 | 0.100 (0.000) | 0.000 | 0.130 (0.030) | 0.000 | 0.030 (0.030) | 0.700 | 0.830 (0.030) |
| cube-1 | 1 | 0.200 | 0.930 (0.030) | 0.300 | 0.470 (0.030) | 1.000 | 1.000 (0.000) | 0.500 | 0.900 (0.060) | 0.800 | 0.970 (0.030) |
| | 2 | 0.100 | 0.830 (0.090) | 0.400 | 0.470 (0.030) | 0.950 | 1.000 (0.000) | 0.850 | 1.000 (0.000) | 0.700 | 1.000 (0.000) |
| | 3 | 0.500 | 0.900 (0.000) | 0.300 | 0.900 (0.060) | 0.950 | 1.000 (0.000) | 0.600 | 0.900 (0.000) | 0.900 | 1.000 (0.000) |
| | 4 | 0.250 | 0.930 (0.030) | 0.500 | 0.770 (0.030) | 1.000 | 1.000 (0.000) | 0.550 | 0.900 (0.060) | 0.700 | 0.970 (0.030) |
| | 5 | 0.350 | 0.700 (0.120) | 0.350 | 0.700 (0.100) | 0.600 | 0.970 (0.030) | 0.400 | 0.870 (0.070) | 0.900 | 0.900 (0.060) |
| cube-2 | 1 | 0.100 | 0.870 (0.030) | 0.200 | 0.970 (0.030) | 0.750 | 1.000 (0.000) | 0.300 | 0.970 (0.030) | 1.000 | 1.000 (0.000) |
| | 2 | 0.000 | 0.500 (0.060) | 0.200 | 0.500 (0.170) | 0.000 | 1.000 (0.000) | 0.150 | 0.930 (0.070) | 0.900 | 0.930 (0.030) |
| | 3 | 0.000 | 0.570 (0.030) | 0.050 | 0.600 (0.100) | 0.000 | 1.000 (0.000) | 0.100 | 0.870 (0.030) | 0.900 | 1.000 (0.000) |
| | 4 | 0.000 | 0.070 (0.030) | 0.000 | 0.400 (0.170) | 0.000 | 0.870 (0.070) | 0.000 | 0.430 (0.130) | 0.300 | 0.200 (0.060) |
| | 5 | 0.000 | 0.500 (0.100) | 0.050 | 0.370 (0.070) | 0.000 | 0.970 (0.030) | 0.050 | 0.900 (0.060) | 0.700 | 0.700 (0.100) |
| cube-3 | 1 | 0.050 | 1.000 (0.000) | 0.050 | 0.930 (0.030) | 0.450 | 1.000 (0.000) | 0.400 | 1.000 (0.000) | 0.900 | 1.000 (0.000) |
| | 2 | 0.000 | 0.970 (0.030) | 0.000 | 1.000 (0.000) | 0.000 | 1.000 (0.000) | 0.050 | 1.000 (0.000) | 0.900 | 1.000 (0.000) |
| | 3 | 0.000 | 0.930 (0.030) | 0.000 | 0.870 (0.090) | 0.000 | 1.000 (0.000) | 0.150 | 0.930 (0.070) | 0.800 | 0.970 (0.030) |
| | 4 | 0.000 | 0.270 (0.090) | 0.000 | 0.300 (0.060) | 0.000 | 0.700 (0.060) | 0.000 | 0.700 (0.060) | 0.300 | 0.370 (0.120) |
| | 5 | 0.000 | 0.470 (0.030) | 0.000 | 0.230 (0.030) | 0.000 | 0.900 (0.000) | 0.000 | 0.530 (0.170) | 0.300 | 0.800 (0.000) |
| cube-4 | 1 | 0.000 | 0.530 (0.070) | 0.050 | 0.970 (0.030) | 0.000 | 1.000 (0.000) | 0.000 | 1.000 (0.000) | 0.600 | 1.000 (0.000) |
| | 2 | 0.000 | 0.870 (0.090) | 0.000 | 0.900 (0.060) | 0.000 | 1.000 (0.000) | 0.000 | 0.970 (0.030) | 0.400 | 0.830 (0.120) |
| | 3 | 0.000 | 0.270 (0.120) | 0.000 | 0.800 (0.000) | 0.000 | 1.000 (0.000) | 0.000 | 0.470 (0.030) | 0.200 | 0.770 (0.030) |
| | 4 | 0.000 | 0.030 (0.030) | 0.000 | 0.100 (0.060) | 0.000 | 0.170 (0.170) | 0.000 | 0.170 (0.120) | 0.000 | 0.200 (0.100) |
| | 5 | 0.000 | 0.230 (0.130) | 0.000 | 0.230 (0.070) | 0.000 | 0.630 (0.130) | 0.000 | 0.230 (0.030) | 0.000 | 0.570 (0.130) |
Figure 3 Success rate (↑) of base policies π_g (Zero Shot) and compositional planning as in Equation (6) with GHMs (CompPlan, ours), averaged over tasks. Bar plots shown per base policy (CRL, GC-1S, GC-BC, GC-TD3, HFBC) for antmaze-medium, antmaze-large, antmaze-giant, and cube-1 through cube-4.

D.2 Compositional Planning / GPI / Action Planning Results

Figure 4 Success rate (↑) of planning vs. zero shot. We consider action-level planning (ActionPlan) with a world model, generalized policy improvement (GPI), and compositional planning (CompPlan, ours) with GHMs.
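For reference, the GPI baseline compared above can be sketched in a few lines: at each state, every base policy proposes an action, and the agent executes the proposal scored highest by the corresponding Q-function. This is a minimal illustrative sketch; `base_policies` and `q_value` are hypothetical stand-ins, not the paper's actual interfaces.

```python
import numpy as np

def gpi_act(state, base_policies, q_value):
    """One generalized-policy-improvement step: query every base policy
    for an action at `state`, score each action with that policy's own
    Q-function, and return the argmax. `base_policies` is a list of
    callables state -> action; `q_value(state, action, z)` returns a
    scalar value estimate for policy index z (hypothetical interfaces)."""
    actions = [pi(state) for pi in base_policies]
    scores = [q_value(state, a, z) for z, a in enumerate(actions)]
    best = int(np.argmax(scores))
    return actions[best], best
```

Unlike CompPlan, this selection is myopic: it commits to a single base policy's proposal rather than evaluating sequences of policies over varying timescales.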
Figure 5 Percent point change (↑) of planning over zero shot. We consider action-level planning (ActionPlan) with a world model, generalized policy improvement (GPI), and compositional planning (CompPlan, ours) with GHMs.

D.3 Ablation on the Planning Frequency

In this section, we investigate whether the planning cost can be amortized over time. Specifically, we compare the performance of planning at each time step with planning every N steps. In the latter case, we execute the action maximizing the CompPlan objective for the first step and then follow the policy π_{z_1} (i.e., the trajectory is s, a*, s_1, π_{z_1}(s_1), s_2, ..., s_{N-1}, π_{z_1}(s_{N-1}), s_N). Figure 6 reports the average success rate for each domain and method. On average, planning at every step leads to about a 20% improvement compared to planning every 5 steps.
This is mostly due to a few cases (most notably, CRL policies in Cube) in which planning every 5 steps results in a large performance drop or complete unlearning. In the majority of experiments, planning every 5 steps does not substantially degrade overall performance while significantly reducing planning time. Overall, this is a lever that can be used to trade off speed and performance. Finally, Table 5 provides a tabular summary of these results.

Figure 6 Success rate (↑) of CompPlan with different base policies when planning at every step or every 5 steps. We report mean and standard deviation over the 3 seeds used for GHM training.

Table 5 Success rate (↑) of CompPlan with different base policies when replanning every 1 or 5 steps. We report mean and standard deviation over 3 seeds. Best method highlighted in blue; gray indicates no significant difference.
| Domain | CRL 1 Step | CRL 5 Steps | GC-1S 1 Step | GC-1S 5 Steps | GC-BC 1 Step | GC-BC 5 Steps | GC-TD3 1 Step | GC-TD3 5 Steps | HFBC 1 Step | HFBC 5 Steps |
|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-medium | 0.97 (0.02) | 0.94 (0.02) | 0.87 (0.05) | 0.83 (0.04) | 0.85 (0.08) | 0.75 (0.01) | 0.65 (0.03) | 0.59 (0.05) | 0.94 (0.01) | 0.99 (0.01) |
| antmaze-large | 0.90 (0.00) | 0.87 (0.02) | 0.61 (0.04) | 0.64 (0.03) | 0.73 (0.02) | 0.38 (0.01) | 0.48 (0.05) | 0.54 (0.01) | 0.92 (0.02) | 0.93 (0.04) |
| antmaze-giant | 0.29 (0.03) | 0.28 (0.05) | 0.02 (0.00) | 0.03 (0.02) | 0.03 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.79 (0.04) | 0.80 (0.02) |
| cube-1 | 0.86 (0.02) | 0.75 (0.03) | 0.66 (0.02) | 0.82 (0.03) | 0.99 (0.01) | 1.00 (0.00) | 0.91 (0.01) | 0.93 (0.02) | 0.97 (0.01) | 0.96 (0.02) |
| cube-2 | 0.50 (0.03) | 0.19 (0.03) | 0.57 (0.09) | 0.75 (0.05) | 0.97 (0.01) | 0.92 (0.02) | 0.82 (0.01) | 0.72 (0.00) | 0.77 (0.02) | 0.79 (0.04) |
| cube-3 | 0.73 (0.02) | 0.33 (0.02) | 0.67 (0.02) | 0.71 (0.05) | 0.92 (0.01) | 0.93 (0.02) | 0.83 (0.02) | 0.83 (0.02) | 0.83 (0.03) | 0.77 (0.01) |
| cube-4 | 0.39 (0.04) | 0.00 (0.00) | 0.60 (0.02) | 0.56 (0.03) | 0.76 (0.03) | 0.71 (0.02) | 0.57 (0.03) | 0.49 (0.01) | 0.67 (0.03) | 0.65 (0.02) |

D.4 Ablation on the Planning Objective

As mentioned in the main paper, we can leverage the GHM in several different planning approaches. In this section, we compare two strategies. The first is the approach presented in the paper, in which we optimize both the first action and the policy sequence (z_1, ..., z_n):

$$\max_{a_1, z_1, \ldots, z_n}\; Q^{\pi_{z_1} \xrightarrow{\alpha_1} \pi_{z_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z_n}}_{\gamma}(s, a_1) \tag{15}$$

while the second does not involve action optimization:

$$\max_{z_1, \ldots, z_n}\; V^{\pi_{z_1} \xrightarrow{\alpha_1} \pi_{z_2} \cdots \xrightarrow{\alpha_{n-1}} \pi_{z_n}}_{\gamma}(s) \tag{16}$$

The difference is that, when using (15), the first action is deterministic, whereas in (16) it is sampled according to π_{z_1}. Figure 7 shows that maximizing over the first action yields more consistent performance overall, with an average improvement of about 70%.⁴ In contrast, planning over the z sequence can fail when the base policy is diffuse (i.e., highly stochastic).
This occurs, for example, for CRL policies in the Cube domains and for GC-BC policies in AntMaze. Finally, Table 6 provides a tabular summary of these results.

Figure 7 Success rate (↑) of CompPlan with different base policies when maximizing over (a_1, z_1, ..., z_n) (Eq. 15) or (z_1, ..., z_n) (Eq. 16). We report mean and standard deviation over the 3 seeds used for GHM training.

Table 6 Success rate (↑) of CompPlan with (Eq. 15) and without (Eq. 16) action maximization. We report mean and standard deviation over 3 seeds. Best method highlighted in blue; gray indicates no significant difference.
| Domain | CRL max Q | CRL max V | GC-1S max Q | GC-1S max V | GC-BC max Q | GC-BC max V | GC-TD3 max Q | GC-TD3 max V | HFBC max Q | HFBC max V |
|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-medium | 0.97 (0.02) | 0.92 (0.03) | 0.87 (0.05) | 0.89 (0.03) | 0.85 (0.08) | 0.62 (0.05) | 0.65 (0.03) | 0.63 (0.10) | 0.94 (0.01) | 0.96 (0.01) |
| antmaze-large | 0.90 (0.00) | 0.88 (0.01) | 0.61 (0.04) | 0.55 (0.03) | 0.73 (0.02) | 0.24 (0.02) | 0.48 (0.05) | 0.44 (0.04) | 0.92 (0.02) | 0.94 (0.01) |
| antmaze-giant | 0.29 (0.03) | 0.27 (0.06) | 0.02 (0.00) | 0.01 (0.01) | 0.03 (0.01) | 0.00 (0.00) | 0.01 (0.01) | 0.01 (0.01) | 0.79 (0.04) | 0.83 (0.02) |
| cube-1 | 0.86 (0.02) | 0.73 (0.01) | 0.66 (0.02) | 0.61 (0.03) | 0.99 (0.01) | 1.00 (0.00) | 0.91 (0.01) | 0.95 (0.02) | 0.97 (0.01) | 0.97 (0.01) |
| cube-2 | 0.50 (0.03) | 0.13 (0.02) | 0.57 (0.04) | 0.57 (0.04) | 0.97 (0.01) | 0.92 (0.02) | 0.82 (0.01) | 0.77 (0.04) | 0.77 (0.02) | 0.80 (0.01) |
| cube-3 | 0.73 (0.02) | 0.05 (0.01) | 0.67 (0.02) | 0.69 (0.01) | 0.92 (0.01) | 0.95 (0.01) | 0.83 (0.04) | 0.90 (0.02) | 0.83 (0.03) | 0.79 (0.01) |
| cube-4 | 0.39 (0.04) | 0.00 (0.00) | 0.60 (0.02) | 0.54 (0.02) | 0.76 (0.03) | 0.72 (0.01) | 0.57 (0.03) | 0.42 (0.01) | 0.67 (0.03) | 0.69 (0.01) |

4 This is mostly due to unsuccessful outcomes (i.e., near-zero performance) on certain tasks when using Equation (16).

D.5 Ablation on the Proposal Distribution

In this section, we study the effect of the sampling distribution on the planning procedure. As mentioned in the main paper, GHMs are trained either with the policy condition z or with a learnable token ∅. In the latter case, the resulting GHM corresponds to the behavioral policy. We compare two planning strategies: (1) sampling from the GHM conditioned on the policy associated with the goal (conditional proposal); and (2) sampling from the GHM of the behavioral policy (unconditional proposal). Figure 8 shows that, in AntMaze, sampling from the unconditional distribution performs only marginally worse than the conditional proposal, highlighting the robustness of our planning procedure. Table 7 provides a tabular summary of these results.
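A single GHM can serve both proposals when the policy condition is occasionally replaced by the learnable ∅ token during training (the context drop probability of 0.1 in Table 11), in the spirit of classifier-free guidance. A minimal sketch of that conditioning switch, with a hypothetical integer `NULL_TOKEN` standing in for the learned ∅ embedding:

```python
import numpy as np

NULL_TOKEN = -1   # hypothetical stand-in for the learned null (∅) embedding
DROP_PROB = 0.1   # context drop probability, as in Table 11

def maybe_drop_context(z, rng):
    """With probability DROP_PROB, replace the policy latent z by the null
    token, so the same network also models the behavioral (unconditional)
    occupancy alongside the per-policy (conditional) occupancies."""
    return NULL_TOKEN if rng.random() < DROP_PROB else z
```

At planning time, conditioning on the goal's policy latent then yields the conditional proposal, while conditioning on `NULL_TOKEN` yields the unconditional one.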
Figure 8 Success rate (↑) of CompPlan with different base policies when using the conditional or unconditional GHMs to propose subgoals. We report mean and standard deviation over the 3 seeds used for GHM training.

Table 7 Success rate (↑) of CompPlan with conditional and unconditional proposals. We report mean and standard deviation over 3 seeds. Best method highlighted in blue; gray indicates no significant difference.

| Domain | CRL Cond | CRL Uncond | GC-1S Cond | GC-1S Uncond | GC-BC Cond | GC-BC Uncond | GC-TD3 Cond | GC-TD3 Uncond | HFBC Cond | HFBC Uncond |
|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-medium | 0.97 (0.02) | 0.91 (0.01) | 0.87 (0.05) | 0.93 (0.02) | 0.85 (0.04) | 0.85 (0.04) | 0.65 (0.03) | 0.52 (0.01) | 0.94 (0.01) | 0.84 (0.02) |
| antmaze-large | 0.90 (0.00) | 0.85 (0.01) | 0.61 (0.04) | 0.71 (0.02) | 0.73 (0.02) | 0.65 (0.05) | 0.48 (0.05) | 0.31 (0.03) | 0.92 (0.02) | 0.70 (0.07) |
| antmaze-giant | 0.29 (0.03) | 0.14 (0.00) | 0.02 (0.00) | 0.01 (0.01) | 0.03 (0.02) | 0.03 (0.02) | 0.01 (0.01) | 0.00 (0.00) | 0.79 (0.04) | 0.35 (0.07) |
| cube-1 | 0.75 (0.01) | 0.86 (0.02) | 0.72 (0.01) | 0.66 (0.02) | 1.00 (0.00) | 0.99 (0.01) | 0.97 (0.02) | 0.91 (0.01) | 0.93 (0.01) | 0.97 (0.01) |
| cube-2 | 0.51 (0.02) | 0.50 (0.03) | 0.57 (0.09) | 0.57 (0.09) | 0.77 (0.04) | 0.97 (0.01) | 0.71 (0.02) | 0.82 (0.01) | 0.77 (0.02) | 0.77 (0.02) |
| cube-3 | 0.73 (0.02) | 0.73 (0.02) | 0.67 (0.02) | 0.67 (0.02) | 0.87 (0.04) | 0.92 (0.01) | 0.75 (0.02) | 0.83 (0.04) | 0.45 (0.06) | 0.83 (0.03) |
| cube-4 | 0.36 (0.03) | 0.39 (0.04) | 0.55 (0.01) | 0.60 (0.02) | 0.77 (0.04) | 0.76 (0.03) | 0.39 (0.03) | 0.57 (0.03) | 0.09 (0.04) | 0.67 (0.03) |

D.6 Ablation on the Consistency Objective

This section
expands upon § 4.4, providing the complete empirical results for the Temporal Difference Horizon Consistency (td-hc) objective across generative fidelity (Table 8), qualitative predictions (Figure 9), and downstream planning performance (Table 9).

Generative Fidelity. Table 8 shows that td-hc systematically improves the generative accuracy of the td-flow baseline as measured by the Earth Mover's Distance (EMD; Rubner et al., 2000). These gains are especially pronounced in complex domains like antmaze-giant, where bootstrapping errors easily compound. Figure 9 visually confirms this effect: while both models perform comparably at shorter horizons (γ = 0.99), td-flow suffers from severe compounding errors at longer horizons (γ = 0.998) and fails to capture the tail of the distribution. By anchoring its predictions with shorter horizons, td-hc successfully respects the topological constraints and properly captures the tail of the true successor-state distribution.

Planning Performance. Despite the generative advantages of td-hc at extreme horizons, Table 9 reveals that downstream planning success rates remain broadly similar between the two methods. Because our planning procedure evaluates candidate sequences using moderate effective horizons (β_i ∈ [0.98, 0.99], or roughly 50 to 100 steps), it does not query the extreme timescales where td-flow breaks down. We expect td-hc to enable planning over more extreme horizons in the future as benchmarks evolve towards more complex tasks.

Table 8 Accuracy (EMD, ↓) of GHMs trained with (td-hc) and without (td-flow) our horizon consistency loss (§ 3.2). Best method highlighted in blue.
| Domain | CRL td-flow (✗) | CRL td-hc (✓) | GC-1S td-flow (✗) | GC-1S td-hc (✓) | GC-BC td-flow (✗) | GC-BC td-hc (✓) | GC-TD3 td-flow (✗) | GC-TD3 td-hc (✓) | HFBC td-flow (✗) | HFBC td-hc (✓) |
|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-medium | 4.41 (0.05) | 4.22 (0.06) | 4.40 (0.02) | 4.22 (0.03) | 4.56 (0.05) | 4.36 (0.03) | 4.68 (0.05) | 4.44 (0.05) | 3.38 (0.02) | 3.22 (0.02) |
| antmaze-large | 5.24 (0.07) | 4.81 (0.03) | 5.12 (0.18) | 4.67 (0.04) | 5.32 (0.03) | 4.73 (0.02) | 5.33 (0.07) | 4.83 (0.04) | 3.50 (0.05) | 3.18 (0.01) |
| antmaze-giant | 6.77 (0.49) | 5.74 (0.06) | 7.29 (0.69) | 5.25 (0.08) | 6.46 (0.10) | 4.95 (0.12) | 6.51 (0.14) | 5.24 (0.11) | 3.78 (0.03) | 3.00 (0.09) |
| cube-1 | 1.60 (0.02) | 1.57 (0.03) | 1.43 (0.00) | 1.33 (0.03) | 1.03 (0.01) | 0.98 (0.01) | 1.12 (0.01) | 1.06 (0.01) | 1.82 (0.06) | 1.27 (0.01) |
| cube-2 | 2.36 (0.03) | 2.23 (0.02) | 1.86 (0.04) | 1.71 (0.01) | 1.47 (0.00) | 1.42 (0.01) | 1.89 (0.04) | 1.80 (0.03) | 2.29 (0.09) | 1.54 (0.04) |
| cube-3 | 2.15 (0.02) | 2.10 (0.02) | 1.80 (0.04) | 1.71 (0.03) | 1.89 (0.02) | 1.85 (0.02) | 1.91 (0.06) | 1.84 (0.04) | 1.99 (0.09) | 1.55 (0.02) |
| cube-4 | 2.41 (0.03) | 2.34 (0.01) | 2.13 (0.03) | 2.05 (0.03) | 2.33 (0.03) | 2.22 (0.02) | 2.21 (0.02) | 2.15 (0.02) | 2.05 (0.05) | 1.61 (0.03) |

Table 9 Success rate (↑) of CompPlan with consistency (td-hc) and without consistency (td-flow). Mean and standard deviation over 3 seeds. Best method highlighted in blue; gray indicates no significant difference.
| Domain | CRL td-flow (✗) | CRL td-hc (✓) | GC-1S td-flow (✗) | GC-1S td-hc (✓) | GC-BC td-flow (✗) | GC-BC td-hc (✓) | GC-TD3 td-flow (✗) | GC-TD3 td-hc (✓) | HFBC td-flow (✗) | HFBC td-hc (✓) |
|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-medium | 0.95 (0.01) | 0.97 (0.02) | 0.85 (0.01) | 0.87 (0.05) | 0.91 (0.01) | 0.85 (0.08) | 0.71 (0.03) | 0.65 (0.03) | 0.95 (0.02) | 0.94 (0.01) |
| antmaze-large | 0.91 (0.03) | 0.90 (0.00) | 0.53 (0.09) | 0.61 (0.04) | 0.79 (0.01) | 0.73 (0.02) | 0.51 (0.03) | 0.48 (0.05) | 0.91 (0.03) | 0.92 (0.02) |
| antmaze-giant | 0.31 (0.05) | 0.29 (0.03) | 0.01 (0.01) | 0.02 (0.00) | 0.02 (0.00) | 0.03 (0.01) | 0.01 (0.01) | 0.01 (0.01) | 0.81 (0.02) | 0.79 (0.04) |
| cube-1 | 0.89 (0.03) | 0.86 (0.02) | 0.55 (0.02) | 0.66 (0.02) | 1.00 (0.00) | 0.99 (0.01) | 0.97 (0.01) | 0.91 (0.01) | 0.97 (0.01) | 0.97 (0.01) |
| cube-2 | 0.41 (0.02) | 0.50 (0.03) | 0.57 (0.09) | 0.57 (0.09) | 0.95 (0.02) | 0.97 (0.01) | 0.85 (0.02) | 0.82 (0.01) | 0.84 (0.03) | 0.77 (0.02) |
| cube-3 | 0.72 (0.02) | 0.73 (0.02) | 0.72 (0.01) | 0.67 (0.02) | 0.91 (0.02) | 0.92 (0.01) | 0.83 (0.04) | 0.83 (0.04) | 0.83 (0.03) | 0.83 (0.03) |
| cube-4 | 0.36 (0.00) | 0.39 (0.04) | 0.56 (0.02) | 0.60 (0.02) | 0.75 (0.02) | 0.76 (0.03) | 0.56 (0.03) | 0.57 (0.03) | 0.64 (0.04) | 0.67 (0.03) |

Figure 9 Qualitative plots of GHM samples on antmaze-giant task 1 at different horizons (γ = 0.99 ≈ 100 steps, γ = 0.995 ≈ 200 steps, γ = 0.998 ≈ 500 steps) from td-flow and td-hc, with the last row depicting the ground-truth discounted occupancy. Performance is comparable at smaller horizons, with td-hc doing a much better job of capturing the true distribution as the horizon increases.

E Qualitative Geometric Horizon Model Samples

Figure 10 Qualitative visualization for an episode of cube-4-task-5 (frames at t = 0 through t = 345, every 15 steps). The robots and cubes shown with transparency indicate GHM samples from the first policy in the GSP.
F Experimental Details

F.1 Base Policies & Hyperparameters

Each policy class is chosen to highlight a different aspect of the pipeline (i.e., GHM learning and skill planning).

1. Goal-Conditioned TD3 (GC-TD3; Pirotta et al., 2024): A standard goal-conditioned offline RL algorithm where we employ Flow Q-learning (FQL; Park et al., 2025c), a flow-based variant of TD3-BC (Fujimoto and Gu, 2021), for policy extraction. Due to bootstrapping on its own learned policy, actions may drift out-of-distribution (OOD) from the dataset, posing a challenge for off-policy predictive modeling (Levine et al., 2020). In particular, we use the standard TD3 critic loss (Fujimoto et al., 2018) to learn a goal-conditioned action-value function $Q_\phi(S, A, G)$ (Schaul et al., 2015):

$$\ell(\phi) = \mathbb{E}_{\substack{(S, A, S', G) \\ A' \sim \pi_G(\cdot \mid S')}}\Big[\big(Q_\phi(S, A, G) - \mathbb{I}\{S' = G\} - \gamma Q_{\bar\phi}(S', A', G)\big)^2\Big],$$

where $(S, A, S')$ are transitions sampled uniformly from the dataset, and $G$ is a goal state drawn from the following mixture distribution: with probability 0.2, set $G = S'$; with probability 0.3, sample $G$ uniformly from the dataset; and with probability 0.5, set $G$ to a randomly selected future state along the trajectory starting from $(S, A)$, where the selection time step is drawn from a γ-geometric distribution. For policy training, we use the Flow Q-learning procedure. We jointly learn a behavior policy $\pi(\cdot \mid S)$ and a goal-conditioned policy $\pi_G(\cdot \mid S)$, each parametrized by a flow-matching vector field. The behavioral policy is trained with a standard conditional flow-matching objective (Lipman et al., 2023) on state-action pairs uniformly sampled from the dataset. The goal-conditioned policy is modeled as a one-step flow map. We denote by $\mu_\omega(S, X_0)$ and $\mu_\psi(S, G, X_0)$ the flow maps of $\pi$ and $\pi_G$, respectively.
$$\ell(\psi) = \mathbb{E}_{\substack{(S, A, G) \\ X_0 \sim \mathcal{N}(0, I)}}\Big[-Q(S, \mu_\psi(S, G, X_0), G) + \lambda \|\mu_\psi(S, G, X_0) - \mu_\omega(S, X_0)\|^2\Big], \tag{17}$$

where λ is the distillation coefficient that controls the behavior-cloning regularization. Here $G$ is a goal state drawn from the following mixture distribution: with probability 0.5, sample $G$ uniformly from the dataset; and with probability 0.5, set $G$ to a randomly selected future state along the trajectory starting from $(S, A)$.

2. Goal-Conditioned 1-Step RL (GC-1S): A more conservative variant of GC-TD3 that bootstraps using the behavior policy via the dataset's actions. We expect this to yield easier-to-model occupancies, as we no longer query the learned value function with OOD actions. Policy extraction on the resulting value function is also performed using FQL (17). In particular, we follow the same training procedure as for GC-TD3 and change only the critic objective:

$$\ell(\phi) = \mathbb{E}_{(S, A, S', A', G)}\Big[\big(Q_\phi(S, A, G) - \mathbb{I}\{S' = G\} - \gamma Q_{\bar\phi}(S', A', G)\big)^2\Big],$$

where $A'$ is now sampled from the dataset rather than from the learned policy.

3. Contrastive RL (CRL; Eysenbach et al., 2022): An alternative value-based approach that uses contrastive learning to approximate the successor measure of the behavior policy (Eysenbach et al., 2022). In particular, CRL learns a state-action encoder $\phi(s, a)$ and a goal encoder $\psi(g)$ with the following Monte-Carlo InfoNCE (van den Oord et al., 2018) contrastive loss:

$$\ell(\phi, \psi) = \mathbb{E}_{(S, A, G)}\Big[-\phi(S, A)^\top \psi(G) + \log \sum_{(S, A, G')} \exp\big(\phi(S, A)^\top \psi(G')\big)\Big] \tag{18}$$

where $G$ is a future state along the trajectory starting from $(S, A)$, sampled with a γ-geometric time offset, and $G'$ is a state drawn uniformly from the dataset. Moreover, policy extraction is performed using FQL (17) with the Q-function $Q(S, A, G)$ estimated as $\phi(S, A)^\top \psi(G)$.

4. Goal-Conditioned Behavior Cloning (GC-BC; Lynch et al., 2019; Ghosh et al.
, 2021): A purely imitative policy trained via flow matching (Lipman et al., 2023) to mimic the high-quality trajectories in the dataset, reaching specified goals via hindsight relabeling (Andrychowicz et al., 2017). Specifically, we parametrize the policy by a vector field $v_\phi(t, S, G)$:

$$\ell(\phi) = \mathbb{E}_{\substack{t, S, A, G \\ X_0 \sim \mathcal{N}(0, I)}}\big[\|v_\phi(t, S, G) - (A - X_0)\|^2\big], \tag{19}$$

where $G$ is a future state along the trajectory starting from $(S, A)$, sampled with a γ-geometric time offset. A key consequence of this value-free approach is that the policy struggles to generalize to distant goals, which can limit the effectiveness of proposing goal-directed "waypoints" during planning.

5. Hierarchical Flow Behavior Cloning (HFBC; Park et al., 2025b): An imitative approach that trains two policies: a high-level policy trained to predict subgoals that are h steps away from the current state for a fixed lookahead h, and a low-level policy trained to predict actions to reach the given subgoal. Specifically, we parametrize the high-level and low-level policies by two vector fields $v_\phi(t, S, G)$ and $\nu_\psi(t, S, G)$, respectively:

$$\ell(\phi) = \mathbb{E}_{\substack{t, S_n, S_{n+h}, G \\ X_0 \sim \mathcal{N}(0, I)}}\big[\|v_\phi(t, S_n, G) - (S_{n+h} - X_0)\|^2\big], \qquad \ell(\psi) = \mathbb{E}_{\substack{t, S_n, A_n, S_{n+h} \\ X_0 \sim \mathcal{N}(0, I)}}\big[\|\nu_\psi(t, S_n, S_{n+h}) - (A_n - X_0)\|^2\big],$$

where $S_{n+h}$ is the state h steps ahead of the current state $S_n$, and $G$ is a future state along the trajectory starting from $S_n$, sampled with a γ-geometric time offset. Consequently, the high-level policy can be used as a proposal for the subgoal distribution at planning time.

Table 10 Base policy hyperparameters. Parameters in {} denote sweeps performed over the values inside brackets.

| Method | Hyperparameter | Antmaze | Cube |
|---|---|---|---|
| CRL / GC-TD3 / GC-1S (FQL) | Distillation coefficient | {0.1, 0.15, 0.2, 0.3, 0.4} | {0.7, 0.8, 0.9, 1, 3} |
| | Discount factor | {0.995, 0.997} for giant; 0.99 otherwise | {0.99, 0.995} for cube-4; 0.99 otherwise |
| | Gradient steps | {1M, 3M} | 500k for cube-{1,2}; 1M for cube-3; 3M for cube-4 |
| GC-BC | Discount factor | {0.99, 0.995} for giant; {0.98, 0.99} for medium; {0.98, 0.99} for large | {0.95, 0.96} for cube-1; {0.96, 0.97} for cube-2; {0.96, 0.97, 0.98} for cube-3; {0.96, 0.98, 0.99} for cube-4 |
| | Gradient steps | 125k | 500k for cube-1; 1M for cube-2; 2M for cube-3; 3M for cube-4 |
| HFBC (SHARSA) | Lookahead | [25, 50] | [25, 50] |
| | Discount factor | {0.99, 0.995} | {0.95, 0.99, 0.995} |
| | Gradient steps | 3M | 1M for cube-{1,2}; 3M for cube-{3,4} |

F.2 Geometric Horizon Model Hyperparameters

Table 11 Hyperparameters for Geometric Horizon Model pre-training.

| Group | Hyperparameter | Value |
|---|---|---|
| Flow Matching (Lipman et al., 2023) | Probability Path | Conditional OT (σ = 0) |
| | Time Sampler | U([0, 1]) |
| | ODE Solver | Euler |
| | ODE dt (train) / steps | 0.1 / 10 |
| | ODE dt (eval) / steps | 0.05 / 20 |
| Network (U-Net; Ronneberger et al., 2015) | t-Positional Embedding Dim. | 256 |
| | t-Positional Embedding MLP | (1024, 1024) |
| | Hidden Activation | mish (Misra, 2019) |
| | Blocks per Stage | 1 |
| | Block Dimensions | (1024, 1024, 1024) |
| Conditional Encoder | Encoder MLP | (1024, 1024, 1024) |
| | Encoder Activation | mish (Misra, 2019) |
| | Conditioning Mixing | additive |
| Optimizer | Optimizer | Adam (Kingma and Ba, 2015) |
| | Learning Rate | 1e-4 |
| | Weight Decay | 0 |
| | Gradient Norm Clip | — |
| Training | Max Discount γ_max | 0.996 |
| | Target Network EMA | 5e-4 |
| | Gradient Steps | 3M |
| | Batch Size | 256 |
| | Context Drop Probability (z = ∅) | 0.1 |
| | Consistency Proportion | 0.25 for antmaze; 0.15 for cube |
| GCRL Goal Sampling | p(trajectory goal) | 0.5 |
| | p(random goal) | 0.5 |
| | Geometric Trajectory Discount | 0.995 |

F.3 Planning Hyperparameters

Table 12 CompPlan hyperparameters for the main results of the paper.
| Method | Domain | Candidates | Effective horizons | Proposal distribution | Eval samples | Replan every | Discount |
|---|---|---|---|---|---|---|---|
| CRL | antmaze-medium | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| CRL | antmaze-large | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| CRL | antmaze-giant | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| CRL | cube-1 | 1024 | [20, 80] | Unconditional | 128 | 1 | 0.99 |
| CRL | cube-2 | 1024 | [20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| CRL | cube-3 | 1024 | [20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| CRL | cube-4 | 1024 | [20, 20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-TD3 | antmaze-medium | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-TD3 | antmaze-large | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-TD3 | antmaze-giant | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-TD3 | cube-1 | 1024 | [20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-TD3 | cube-2 | 1024 | [20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-TD3 | cube-3 | 1024 | [20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-TD3 | cube-4 | 1024 | [20, 20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-BC | antmaze-medium | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-BC | antmaze-large | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-BC | antmaze-giant | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-BC | cube-1 | 1024 | [20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-BC | cube-2 | 1024 | [20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-BC | cube-3 | 1024 | [20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-BC | cube-4 | 1024 | [20, 20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-1S | antmaze-medium | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-1S | antmaze-large | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-1S | antmaze-giant | 256 | [50, 50, 100, 100, 200] | Conditional | 256 | 1 | 0.999 |
| GC-1S | cube-1 | 1024 | [20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-1S | cube-2 | 1024 | [20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-1S | cube-3 | 1024 | [20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| GC-1S | cube-4 | 1024 | [20, 20, 20, 20, 80] | Unconditional | 128 | 1 | 0.99 |
| HFBC | antmaze-medium | 32 | [25] ∗ 24 | Conditional | 128 | 1 | 0.999 |
| HFBC | antmaze-large | 32 | [25] ∗ 24 | Conditional | 128 | 1 | 0.999 |
| HFBC | antmaze-giant | 32 | [25] ∗ 24 | Conditional | 128 | 1 | 0.999 |
| HFBC | cube-1 | 32 | [25] ∗ 4 | Unconditional | 128 | 1 | 0.99 |
| HFBC | cube-2 | 32 | [25] ∗ 5 | Unconditional | 128 | 1 | 0.99 |
| HFBC | cube-3 | 32 | [25] ∗ 6 | Unconditional | 128 | 1 | 0.99 |
| HFBC | cube-4 | 32 | [25] ∗ 7 | Unconditional | 128 | 1 | 0.99 |
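To make the role of the "Replan every" column concrete, the amortized control loop described in § D.3 can be sketched as follows. Here `plan`, `policy`, and `env_step` are hypothetical interfaces, with `plan` standing in for the CompPlan maximization that returns the first action together with the first policy latent z_1:

```python
def rollout(env_step, plan, policy, s0, horizon, replan_every=5):
    """Amortized planning loop: replan every `replan_every` steps; in
    between, follow the first policy latent chosen by the planner.
    Hypothetical interfaces:
      plan(s)        -> (a_star, z1): first action and first policy latent,
      policy(z, s)   -> action from base policy pi_z,
      env_step(s, a) -> next state."""
    s = s0
    traj = [s]
    for t in range(horizon):
        if t % replan_every == 0:
            a, z = plan(s)       # full CompPlan optimization
        else:
            a = policy(z, s)     # cheap: follow pi_{z_1} until next replan
        s = env_step(s, a)
        traj.append(s)
    return traj
```

Setting `replan_every=1` recovers the per-step planning used in the main results (Table 12), while larger values trade success rate for planning time as quantified in Table 5.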
