📝 Original Info
- ArXiv ID: 2512.17846
📝 Abstract
We present Planning as Descent (PaD), a framework for offline goal-conditioned reinforcement learning that grounds trajectory synthesis in verification. Instead of learning a policy or explicit planner, PaD learns a goal-conditioned energy function over entire latent trajectories, assigning low energy to feasible, goal-consistent futures. Planning is realized as gradient-based refinement in this energy landscape, using identical computation during training and inference to reduce train-test mismatch common in decoupled modeling pipelines. PaD is trained via self-supervised hindsight goal relabeling, shaping the energy landscape around the planning dynamics. At inference, multiple trajectory candidates are refined under different temporal hypotheses, and low-energy plans balancing feasibility and efficiency are selected. We evaluate PaD on OGBench cube manipulation tasks. When trained on narrow expert demonstrations, PaD achieves state-of-the-art 95% success, strongly outperforming prior methods that peak at 68%. Remarkably, training on noisy, suboptimal data further improves success and plan efficiency, highlighting the benefits of verification-driven planning. Our results suggest learning to evaluate and refine trajectories provides a robust alternative to direct policy learning for offline, reward-free planning.
📄 Full Content
We study offline goal-conditioned reinforcement learning (GCRL), where the objective is to reach arbitrary target states using only a static dataset of reward-free trajectories. Offline GCRL poses several fundamental difficulties: (i) extracting meaningful structure from unstructured and suboptimal data; (ii) composing disjoint behavioral fragments that may not co-occur within a single trajectory; (iii) propagating sparse goal information over long horizons; and (iv) reasoning about multi-modal futures under stochastic dynamics. Recent benchmarks such as OGBench (Park et al., 2024) highlight the difficulty of these challenges and show that many existing methods struggle to generalize robustly to unseen goals.
A common strategy for addressing offline decision making is to separate modeling and planning. Model-based methods learn forward dynamics and then perform trajectory optimization or model predictive control (MPC) at inference time (Zhou et al., 2024;Hansen et al., 2023;Sobal et al., 2025). While conceptually appealing, this separation often leads to train-test mismatches: powerful optimizers can exploit small inaccuracies in learned dynamics models, producing adversarial or physically implausible trajectories that fail at deployment time (Henaff et al., 2019).
An alternative line of work reframes control as trajectory generation, using sequence models such as Decision Transformers (Chen et al., 2021), masked trajectory models (Wu et al., 2023;Janner et al., 2021;Carroll et al., 2022), or diffusion-based policies (Chi et al., 2023;Janner et al., 2022). These models directly model the distribution of trajectories and can synthesize diverse, multimodal behaviors from offline datasets. However, their sampling-based nature often leads to reproducing undesirable behaviors when trained on noisy or suboptimal data, and they lack explicit mechanisms for enforcing long-horizon dynamical feasibility or goal satisfaction. More broadly, these approaches learn how to generate trajectories, but do not explicitly learn how to evaluate or verify them (West et al., 2023).
In this work, we propose Planning as Descent (PaD), a framework that rethinks offline goal-conditioned control through the lens of generation by verification. Rather than learning a policy, generator, or explicit planner, PaD learns a goal-conditioned energy landscape over entire future trajectories. This energy assigns low values to trajectories that are dynamically plausible and consistent with a desired goal, and high values to incompatible ones. Planning then arises implicitly as gradient descent in this learned energy landscape, iteratively refining candidate trajectories to minimize their energy.
Crucially, PaD explicitly enforces alignment between training and inference by using the same gradient-based refinement procedure in both phases. The energy landscape is shaped during training around the exact descent dynamics used at test time, ensuring that inference corresponds to optimization behavior the model has been trained to support. The forward pass of the model computes trajectory energies (verification), while the backward pass provides structured descent directions that refine trajectories toward feasible, goal-consistent futures (synthesis). This training-as-inference alignment helps mitigate the train-test discrepancies that arise in decoupled modeling and planning pipelines, and contrasts sharply with diffusion and masked trajectory models, which rely on stochastic sampling or reconstruction objectives. In PaD, planning is goal-directed and realized entirely through energy minimization, without autoregressive rollouts, learned noise schedules or value backups.
Moreover, our method is motivated by the most fundamental design bias in deep learning: depth enables composition. Just as deep networks compose pixels into objects and words into sentences, PaD allows primitive transitions to compose into subgoals and plans within the depth of a single learned energy-based model (EBM). Hierarchical planning structure emerges implicitly through gradient-based refinement in the representation space, potentially eliminating the need for hand-engineered abstractions.
We demonstrate that PaD performs robustly across qualitatively different data regimes, including narrow expert demonstrations and broad, highly suboptimal datasets. On challenging tasks from the OGBench single-cube manipulation suite, PaD achieves state-of-the-art performance, exhibits strong robustness to distribution shift, and, perhaps counterintuitively, produces more efficient plans when trained on diverse but highly suboptimal data. These results suggest that learning to verify trajectories, rather than directly generating them, provides a powerful foundation for offline goal-conditioned planning. An implementation of the proposed method will be made publicly available at: https://github.com/inescopresearch/pad

Contributions. Our main contributions are:
• We introduce an energy-based formulation of offline GCRL that unifies trajectory evaluation and synthesis, casting planning as energy minimization in latent trajectory space.
• We propose a self-supervised training scheme based on hindsight goal relabeling that aligns training-time verification with inference-time planning.
• We present a gradient-based planning procedure that iteratively refines noisy latent trajectories into coherent, goal-directed plans.
• We demonstrate state-of-the-art performance on OGBench single-cube tasks and provide an analysis showing that the learned energy landscape supports effective planning and generalization to unseen goals.
We next review relevant previous work in model-based control, sequence modeling, and energy-based approaches, and position our contribution in this landscape.
Offline Goal-Conditioned Reinforcement Learning (GCRL) studies goal-reaching from reward-free offline datasets, typically using hindsight relabeling (Andrychowicz et al., 2017) to enable unsupervised training. Early goal-conditioned behavioral cloning methods (GCBC) (Lynch et al., 2020; Ghosh et al., 2019) perform well when state coverage is sufficient but degrade sharply under distribution shift or narrow expert demonstrations. Subsequent approaches such as GCIVL, GCIQL, QRL, CRL, and HIQL incorporate value learning, contrastive learning, or explicit hierarchical structure to better propagate sparse goal information and improve compositionality (Kostrikov et al., 2021; Park et al., 2023; Eysenbach et al., 2022; Wang et al., 2023a). However, these methods remain sensitive to dataset quality and often perform well in either narrow or broad noisy regimes, but not both (Park et al., 2024).
A unifying characteristic of these methods is that they learn an explicit policy or value function, with planning either implicit in policy execution or mediated through value-based rollouts. This coupling exposes them to well-known offline RL failure modes, such as extrapolation and compounding action errors at inference time (Levine et al., 2020).
Planning as Descent (PaD) departs from this paradigm by avoiding policy and value learning altogether; instead, it learns a goal-conditioned energy function over entire future trajectories. Planning arises through gradient-based refinement in this energy landscape, providing a verification-driven alternative that enables robust goal-reaching across diverse offline data regimes.
Model-based approaches address offline control by learning a forward dynamics model and performing planning through trajectory optimization or model predictive control (MPC) (Rawlings et al., 2020;Luo et al., 2024). Recent methods (Zhou et al., 2024;Hansen et al., 2023;Sobal et al., 2025) leverage latent dynamics models and planning in representation space to improve scalability and robustness. However, decoupling model learning from planning often leads to a train-test mismatch, as powerful optimizers can exploit small modeling errors and produce physically implausible or brittle plans at inference time (Talvitie, 2017).
Several works attempt to mitigate this issue through uncertainty regularization (Henaff et al., 2019), short-horizon planning, or frequent replanning, but still rely on explicit forward rollouts through learned dynamics. As a result, planning quality remains tightly coupled to model accuracy, particularly in long-horizon and distribution-shifted settings.
Planning as Descent (PaD) avoids explicit dynamics modeling and forward simulation altogether. Instead, PaD learns a goal-conditioned energy function over entire future trajectories and performs planning directly via gradient-based refinement in trajectory space. By unifying trajectory evaluation and synthesis within a single learned energy landscape, PaD sidesteps model exploitation and reduces train-test mismatch inherent to model-based planning pipelines.
An alternative to model-based planning reframes control as a trajectory generation problem, using sequence models trained on offline data. Decision Transformers (Chen et al., 2021) condition autoregressive models on goals or returns to synthesize action sequences (Schmidhuber, 2019), while masked trajectory models reconstruct missing segments of trajectories using bidirectional context (Wu et al., 2023;Janner et al., 2021;Carroll et al., 2022). More recently, diffusion-based policies and planners (Janner et al., 2022;Chi et al., 2023) model the distribution of expert trajectories and have achieved state-of-the-art results through imitation learning on large-scale foundational models (Black et al., 2024;Liu et al., 2024;Bjorck et al., 2025).
Despite their success, these approaches fundamentally learn how to generate trajectories that resemble the training distribution (West et al., 2023). As a result, they are prone to reproducing suboptimal or undesirable behaviors present in the data and often rely on sampling procedures, noise schedules, or autoregressive rollouts that introduce variance and compounding errors over long horizons. Moreover, these methods lack an explicit mechanism for evaluating or verifying whether a synthesized trajectory is dynamically feasible or goal-consistent (Balim et al., 2025).
Planning as Descent (PaD) departs from generative trajectory modeling by learning a goal-conditioned energy function over entire future trajectories. Rather than sampling trajectories, PaD refines latent candidate future plans via gradient descent in the learned energy landscape, explicitly optimizing for feasibility and goal satisfaction through verification rather than likelihood or reconstruction.
Energy-Based Models (EBMs) learn an explicit energy function whose low-energy configurations correspond to compatible system states or actions (LeCun et al., 2006;Du and Mordatch, 2019). In control settings, EBMs have been used to model conditional action distributions, most notably in Implicit Behavior Cloning (IBC), where actions are obtained at inference time by minimizing a learned energy landscape (Liu et al., 2020;Florence et al., 2022;Davies et al., 2025). However, most EBM-based control methods rely on contrastive training objectives that require large numbers of negative samples, often resulting in sharp or poorly conditioned energy landscapes that make inference-time optimization difficult and unstable. This limitation has historically hindered the scalability and practical deployment of EBMs in long-horizon control and planning tasks.
Recent work has revisited EBM training from an optimization-based perspective, introducing objectives that implicitly regularize the energy landscape and avoid explicit negative sampling (Wang et al., 2023b; Gladstone et al., 2025). While these advances significantly improve training stability and scalability, their potential for addressing core challenges in world modeling, control, and planning, such as long-horizon reasoning and goal-conditioned trajectory synthesis, remains largely unexplored.
In contrast, PaD uses a single goal-conditioned energy function directly as a planner, where gradient-based minimization over entire latent trajectories constitutes the planning procedure itself. Action decoding is decoupled from trajectory planning and can be learned with substantially fewer labeled samples, as it only needs to model local state-to-action mappings rather than long-horizon planning behavior.
We consider a reward-free Markov decision process (MDP) M = (S, A, µ, p), where S is the state space, A is the action space, µ ∈ P(S) is the initial-state distribution, and p(s ′ | s, a) denotes the (possibly stochastic) transition dynamics kernel specifying the probability of transitioning to state s ′ ∈ S after taking action a ∈ A from state s ∈ S.
We study this problem in the offline setting, where the agent has access only to a fixed, reward-free dataset D = {τ^(n)}_{n=1}^N of state-action trajectories τ^(n) = (s_0^(n), a_0^(n), s_1^(n), a_1^(n), …, s_{T_n}^(n)) collected by an unknown behavior policy. Importantly, these trajectories are not necessarily optimal, reflecting the heterogeneous, partially exploratory, and potentially multi-modal nature of real-world data.
The objective of offline goal-conditioned RL is to learn a goal-conditioned policy π : S × S → A that enables the agent to reach any target state s_g ∈ S from any initial state s_0 ∈ S in as few steps as possible. We define the sparse goal-reaching reward r_g(s) = I[s = s_g] and seek a policy that maximizes the expected discounted return

J(π) = E_π [ Σ_{t≥0} γ^t r_g(s_t) ],

where γ ∈ (0, 1] is the discount factor.
A key property of this formulation is that the entire state space serves as the goal space. A goal is specified by a full state, not by a subset of features such as object positions. This choice yields a fully unsupervised objective, enabling domain-agnostic training from unlabeled, reward-free data (Park et al., 2024).
Energy-Based Models (EBMs) provide a flexible framework for modeling dependencies between variables by associating a scalar energy, a measure of compatibility, with each configuration of variables (LeCun et al., 2006). Let x ∈ X denote an observed variable and y ∈ Y denote a target variable. An EBM defines an energy function

E_θ : X × Y → R,

parameterized by θ, assigning a scalar energy E_θ(x, y) to every pair (x, y).
Inference. Given x, inference consists of finding the most compatible configuration of y by minimizing the energy:

ŷ = argmin_{y ∈ Y} E_θ(x, y).
Learning. Learning consists of shaping the energy landscape so that compatible pairs (x, y) lie in low-energy regions and incompatible pairs lie in high-energy regions. In practice, this is done by minimizing the energy of positive (observed) pairs while maximizing the energy of negative (corrupted) pairs, typically through contrastive or regularized training procedures that avoid computing the generally intractable partition function, i.e., the normalization constant Z_θ(x) = ∫_Y exp(−E_θ(x, y′)) dy′, which is prohibitive to compute in high-dimensional spaces. Following Wang et al. (2023b), we adopt an optimization-based perspective on EBM training that implicitly regularizes the energy landscape and enables scalable learning.
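To make the inference-as-optimization view concrete, here is a minimal sketch of gradient-based inference in a toy EBM; the quadratic energy, shapes, and step size are illustrative assumptions and are not taken from the paper.

```python
import torch

# Toy energy: E(x, y) = ||y - W x||^2, a stand-in for any learned scalar energy.
# W is an illustrative parameter; in practice the energy would be a neural network.
W = torch.randn(4, 8)

def energy(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return ((y - x @ W.T) ** 2).sum(dim=-1)

def infer_y(x: torch.Tensor, steps: int = 50, lr: float = 0.1) -> torch.Tensor:
    """Approximate argmin_y E(x, y) by plain gradient descent on y."""
    y = torch.zeros(x.shape[0], 4, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy(x, y).sum().backward()
        opt.step()
    return y.detach()

x = torch.randn(2, 8)
y_hat = infer_y(x)
print(energy(x, y_hat))  # low energies indicate compatible (x, y) pairs
```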
We introduce Planning as Descent (PaD), a goal-conditioned latent-space planning framework in which future trajectories are synthesized by descending a learned conditional energy landscape. The core idea is to learn an energy function over latent trajectories that simultaneously serves as a verifier of dynamical plausibility and goal satisfaction, and as a planner whose gradient field specifies how candidate trajectories should be refined. Given past observations and a desired goal specification, the model assigns low energy to latent futures that are feasible and goal-consistent, while the gradient of this energy provides the descent direction required to synthesize such trajectories.
Given a raw trajectory, states are independently encoded into latent representations using f_θ. Future latents are corrupted to form an initial trajectory z^(0)_future, which is iteratively refined by descending the conditional energy E_θ and projecting updates back onto the encoder-induced manifold through p_θ. At each refinement step, a denoising loss compares the intermediate trajectory to the clean future latents, while stop-gradient operations prevent (i) mode collapse and (ii) backpropagation through the refinement dynamics.
PaD consists of three main components trained jointly end-to-end: (i) a state encoder f θ that maps individual observations into a latent space without modeling temporal structure; (ii) a conditional energy function E θ that jointly evaluates and refines entire latent trajectories; and (iii) a projector network p θ that ensures refinement remains confined to the encoder-induced manifold. A defining property of PaD is that the same refinement mechanism is used during both training (Algorithm 1) and inference (Algorithm 2), allowing the energy landscape to be shaped around the planning dynamics.
Given an observation sequence (s 0 , . . . , s T ), each state s t is encoded independently as a latent vector z t = f θ (s t ), where the encoder f θ : S → R d captures only state-wise information and does not impose temporal dependencies. Given past latents z past = (z 0 , . . . , z k ) and a goal specification (s g , λ), where λ ∈ [0, 1] is a normalized time-to-reach variable specifying the desired relative point within the planning horizon at which the goal should be reached, PaD introduces the future latent sequence z future = (z k+1 , . . . , z k+H ), which serves as a free optimization variable during planning. Importantly, λ does not correspond to a physical time or fixed number of environment steps, but instead provides a relative temporal conditioning signal that allows the planner to reason jointly about trajectory feasibility and goal-reaching speed.
We define a conditional energy

E_θ(z_future | z_past, s_g, λ),
which assigns a scalar value to a candidate future trajectory z future , conditioned on the latent past z past , the goal state s g , and the continuous time-to-reach parameter λ ∈ [0, 1]. Low energy indicates that the trajectory is dynamically plausible and consistent with reaching the goal within the temporal budget encoded by λ.
PaD performs planning by iteratively refining the latent trajectory through gradient descent on this energy landscape. Given a current trajectory estimate z^(t)_future, the raw refinement step is

z̃^(t+1)_future = z^(t)_future − η ∇_{z_future} E_θ(z^(t)_future | z_past, s_g, λ),

where η denotes the refinement step size.
To ensure that refinement remains on or near the encoder-induced manifold, the updated trajectory is passed through a shallow learnable projector p_θ, producing the final update rule:

z^(t+1)_future = p_θ( z^(t)_future − η ∇_{z_future} E_θ(z^(t)_future | z_past, s_g, λ) ).    (1)
The projector matters. We empirically find that the projector p θ plays a critical role in stabilizing refinement. Figure 2 shows that, in the absence of the projector, training becomes unstable, as energy gradients alone may push latent states toward off-manifold regions that do not correspond to valid encoded observations, leading to degenerate latents and degraded planning performance. Conversely, the projector cannot produce meaningful refinements in the absence of the structured descent directions supplied by the energy model. Effective refinement therefore arises from the interaction between the two components: the energy function provides informed, goal-conditioned descent directions, while the projector enforces representational validity by mapping refined trajectories back onto the encoder-induced manifold. We further ablate the projector component in Section 5.5.
A defining property of PaD is that the refinement procedure used for planning is identical during training and inference. This “training-as-inference” principle regularizes the energy landscape such that its gradient flow implements the desired planning behavior. The lightweight 130K-parameter projector substantially stabilizes training and improves convergence with negligible computational overhead.
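To illustrate Eq. 1, the following is a minimal PyTorch-style sketch of the shared refinement procedure; the function signature, tensor shapes, and step size η are assumptions for illustration, not the authors' implementation. At training time, create_graph=True would additionally be passed to torch.autograd.grad so that the denoising loss can backpropagate through each refinement step.

```python
import torch
import torch.nn as nn

def refine(z_future, z_past, s_g, lam, energy_fn: nn.Module, projector: nn.Module,
           steps: int = 2, eta: float = 0.1):
    """Gradient-based refinement of a latent future trajectory (sketch of Eq. 1).

    z_future: (B, H, d) noisy future latents (free optimization variable)
    z_past:   (B, K, d) encoded past latents
    s_g:      goal specification; lam: normalized time-to-reach in [0, 1]
    """
    for _ in range(steps):
        z_future = z_future.detach().requires_grad_(True)     # stop-gradient between steps
        e = energy_fn(z_future, z_past, s_g, lam).sum()        # scalar energy per candidate
        grad = torch.autograd.grad(e, z_future)[0]             # descent direction from the energy
        z_future = projector(z_future - eta * grad)            # project back toward the latent manifold
    return z_future
```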
Hindsight goal relabeling with temporal targets. Given a trajectory of length L, we first sample a scalar r ∈ [0, 1] from the truncated arccos distribution p(r) = (2/π)(1 − r²)^(−1/2) and map it linearly to a past-window length P_past ∈ [1, P_max], where P_max is the maximum allowed past context. This biases training toward larger past windows while remaining aligned with inference-time conditions.
To specify the temporal target, we draw λ ∼ U(0, 1) and map it linearly to a future index G ∈ [P max , H], where H is the planning horizon. The state at index G becomes the goal s g . Conditioning on λ is essential: without an explicit temporal target, the model would treat all trajectories that eventually reach the goal as equivalent, making denoising from heavily corrupted latents considerably more difficult.
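A small sketch of this relabeling scheme follows. One way to obtain the arccos-distributed variable is inverse-transform sampling via r = sin(πu/2) with u ∼ U(0, 1), which has exactly the stated density; the linear maps and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_relabeling(P_max: int, H: int):
    """Hindsight relabeling with temporal targets (sketch).

    Returns a past-window length, a normalized time-to-reach lambda,
    and the trajectory index used as the relabeled goal. Assumes the
    sampled window fits within the source trajectory.
    """
    # r ~ p(r) = (2/pi) (1 - r^2)^(-1/2) on [0, 1] via inverse-transform sampling
    r = np.sin(0.5 * np.pi * rng.uniform())
    p_past = int(round(1 + r * (P_max - 1)))          # linear map to [1, P_max]

    lam = rng.uniform()                                # temporal target lambda ~ U(0, 1)
    goal_idx = int(round(P_max + lam * (H - P_max)))   # linear map to [P_max, H]
    return p_past, lam, goal_idx

print(sample_relabeling(P_max=8, H=64))
```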
Latent trajectory corruption. Let z_clean denote the clean future latent trajectory f_θ(s_future). To mimic uncertainty over future predictions, we adopt a corruption scheme inspired by diffusion models,
with the corruption level β uniformly sampled as β ∼ U(0, 1).
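The exact corruption rule is not reproduced here; a common diffusion-style choice consistent with the description is the noise interpolation z^(0) = √(1−β)·z_clean + √β·ε with ε ∼ N(0, I). The sketch below implements this assumed form and should be read as illustrative only.

```python
import torch

def corrupt_latents(z_clean):
    """Diffusion-style corruption of clean future latents (assumed form).

    z_clean: (B, H, d) clean future latent trajectory from the state encoder.
    Returns the corrupted initialization z0 and the sampled corruption level beta.
    """
    beta = torch.rand(z_clean.shape[0], 1, 1)            # beta ~ U(0, 1), per sample
    eps = torch.randn_like(z_clean)                       # Gaussian noise
    z0 = torch.sqrt(1.0 - beta) * z_clean + torch.sqrt(beta) * eps
    return z0, beta
```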
Denoising-based training objective. Starting from this noisy initialization, the model performs T refinement steps using Equation 1. At each refinement step, the intermediate trajectory z^(t)_future is compared to the clean target z_clean using a smooth-L1 distance. The target is treated as a constant, and stop-gradient is applied to z^(t)_future between refinement steps to prevent backpropagation through the refinement dynamics themselves, which empirically accelerates training and improves stability.
The overall training loss is therefore

L(θ) = Σ_{t=1}^{T} SmoothL1( z^(t)_future , sg(z_clean) ),

where sg(·) denotes the stop-gradient operator applied to the clean target.
Importantly, this loss is backpropagated through the entire optimization process, which requires second-order derivatives-specifically, gradients of gradients with respect to model parameters arising from the refinement steps. These second-order terms are computed efficiently as Hessian-vector products, which increases training cost about 1.66× compared to standard first-order backpropagation in a feed-forward model, assuming a single refinement step and all other factors held constant (Gladstone et al., 2025).
We summarize the training procedure of the proposed framework in Algorithm 1.
Algorithm 1 Training: Self-Supervised Denoising Refinement with Hindsight Relabeling.
Inputs: dataset D, encoder f_θ, energy model E_θ, projector p_θ, refinement steps T, step size η
1: Sample a trajectory from D; apply hindsight relabeling to obtain (s_past, s_future, s_g, λ)
2: Encode z_past ← f_θ(s_past) and z_clean ← f_θ(s_future)
3: Corrupt z_clean to obtain z^(0)_future with corruption level β ∼ U(0, 1)
4: for t = 1 to T do
5:   Compute energy E_θ(z^(t−1)_future | z_past, s_g, λ)    ▷ Compute energy
6:   z^(t)_future ← p_θ(z^(t−1)_future − η ∇ E_θ)            ▷ Refinement step (Eq. 1)
7:   L ← L + SmoothL1(z^(t)_future, z_clean)
8:   z^(t)_future ← StopGradient(z^(t)_future)               ▷ Prevent backprop-through-time
9: end for
10: Update θ using L
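As a complement to Algorithm 1, a hedged PyTorch-style sketch of a single training update is shown below; it assumes the diffusion-style corruption noted earlier and a smooth-L1 denoising loss summed over refinement steps, with detach implementing the between-step stop-gradient. Function names, shapes, and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def training_step(z_past, z_clean, s_g, lam, energy_fn, projector, optimizer,
                  T: int = 2, eta: float = 0.1):
    """One PaD-style training update (sketch).

    z_past: (B, K, d) encoded past latents; z_clean: (B, H, d) clean future latents.
    """
    # Diffusion-style corruption (assumed form, see the note above)
    beta = torch.rand(z_clean.shape[0], 1, 1)
    z = torch.sqrt(1.0 - beta) * z_clean + torch.sqrt(beta) * torch.randn_like(z_clean)

    loss = 0.0
    for _ in range(T):
        z = z.detach().requires_grad_(True)                  # stop-gradient between refinement steps
        e = energy_fn(z, z_past, s_g, lam).sum()
        # create_graph=True retains the graph so the loss can backpropagate
        # through this refinement step (the second-order terms mentioned above).
        grad = torch.autograd.grad(e, z, create_graph=True)[0]
        z = projector(z - eta * grad)
        loss = loss + F.smooth_l1_loss(z, z_clean.detach())  # clean target treated as a constant
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```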
At test time, PaD is provided only with a sequence of past observations s_past = (s_0, …, s_k) and a desired goal state s_g. Inference proceeds by synthesizing and refining multiple candidate future trajectories under different normalized time-to-reach hypotheses λ, as shown in Figure 3, allowing the model to jointly reason about how to reach the goal and how quickly it should be reached.
Given the encoded past z_past = f_θ(s_past), we initialize a batch of B candidate future trajectories z^(0)_future,b ∼ N(0, I), each paired with an independently sampled temporal hypothesis λ_b ∼ U(0, 1). As in training, λ serves as a relative temporal conditioning signal rather than an absolute number of environment steps, and conditions the planner on the intended fraction of the planning horizon at which the goal should be achieved. Every candidate is refined for T steps using the same update rule as in training (Eq. 1).
After refinement, each trajectory is scored by its final energy

E_b = E_θ( z^(T,b)_future | z_past, s_g, λ_b ),
where superscript (T, b) denotes the b-th candidate after T refinement steps. Lower energies correspond to trajectories that are both dynamically plausible and consistent with reaching the goal at the rate specified by λ b .
PaD selects a final plan in two stages. First, it identifies the K candidates with the lowest energies. Second, it samples one trajectory from this top-K set using a categorical distribution with logits proportional to −λ, introducing a mild bias toward plans that reach the goal sooner whenever doing so remains energetically feasible. This selection mechanism, summarized in Algorithm 2, balances feasibility and efficiency within the learned energy landscape and mirrors the temporal conditioning used during training.
Algorithm 2 Inference: Energy Planning with Multiple Time-to-Reach λ Candidates.
Inputs: past sequence s_past, goal state s_g, encoder f_θ, planner E_θ, projector p_θ, refinement steps T, number of samples B, top-K set size K
1: z_past ← f_θ(s_past)
2: Sample z^(0)_future,b ∼ N(0, I) and λ_b ∼ U(0, 1) for b = 1, …, B
3: Refine each candidate for T steps using Eq. 1
4: E_b ← E_θ(z^(T,b)_future | z_past, s_g, λ_b)      ▷ Compute final energies
5: (E_top, I_K) ← TopK(−E, K)                         ▷ Select K lowest-energy plans
6: i_final ∼ Categorical(logits = −λ_{I_K})            ▷ Sample index biased by lower time-to-reach
7: z_plan ← z^(T)_future[i_final]
8: return z_plan

More advanced inference schemes (e.g., proposal distributions conditioned on previous refinements) are possible but intentionally omitted to highlight the intrinsic capability of the learned energy landscape. We leave such extensions to future work.
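For concreteness, a PyTorch-style sketch of the full inference procedure (refinement plus top-K selection) follows; the planning horizon, latent width, step size, and function signatures are assumptions, and the energy function is assumed to return one scalar per candidate.

```python
import torch

def plan(s_past, s_g, encoder, energy_fn, projector, T=2, B=768, K=5, eta=0.1,
         horizon=64, latent_dim=128):
    """Sketch of Algorithm 2: sample, refine, and select a latent plan."""
    z_past = encoder(s_past).unsqueeze(0).expand(B, -1, -1)    # share encoded past across candidates
    lam = torch.rand(B)                                        # temporal hypotheses lambda_b ~ U(0, 1)
    z = torch.randn(B, horizon, latent_dim)                    # z^(0)_future ~ N(0, I)

    for _ in range(T):                                         # refinement (Eq. 1)
        z = z.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy_fn(z, z_past, s_g, lam).sum(), z)[0]
        z = projector(z - eta * grad)

    with torch.no_grad():
        energies = energy_fn(z.detach(), z_past, s_g, lam)     # (B,) final energies
        top = torch.topk(-energies, K).indices                 # K lowest-energy candidates
        probs = torch.softmax(-lam[top], dim=0)                # bias toward smaller lambda
        i_final = top[torch.multinomial(probs, 1)].item()      # sampled plan index
    return z[i_final].detach()
```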
The bottom row of Figure 3 shows the distribution of sampled temporal targets and their associated energies, with lower energies indicating more plausible plans for reaching the goal.

Online Replanning. At inference, PaD is deployed within an iterative replanning loop in which planning and execution alternate. After synthesizing a latent future trajectory using the refinement procedure described above, only the first N predicted transitions are decoded into actions via the inverse dynamics model g_ψ and executed open-loop in the environment. The agent then receives new observations, updates the past window, resamples temporal hypotheses, and invokes the same refinement mechanism to obtain a fresh plan. This plan-execute-replan structure enables PaD to continually correct plan inaccuracies and stochastic transitions while remaining computationally lightweight, as all refinement steps are fully parallelized across hypotheses. In Section 5.4, we ablate the choice of the replanning interval N and analyze how it affects task performance, stability, and computational cost.
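The plan-execute-replan loop can be sketched as follows, reusing the plan function from the previous sketch; the environment interface (env.reset, and env.step returning an observation and a done flag), the tensor handling, and all names are assumptions for illustration.

```python
import torch

def run_episode(env, encoder, energy_fn, projector, inv_dyn, s_goal,
                N=1, T=2, B=768, K=5, max_replans=200):
    """Plan-execute-replan loop (sketch): decode and execute only the first N transitions."""
    obs = env.reset()                                        # env interface is assumed
    past = [torch.as_tensor(obs, dtype=torch.float32)]
    for _ in range(max_replans):
        z_plan = plan(torch.stack(past), s_goal, encoder, energy_fn, projector, T=T, B=B, K=K)
        z_prev = encoder(past[-1])
        for z_next in z_plan[:N]:                            # open-loop execution of N steps
            action = inv_dyn(z_prev, z_next)                 # local inverse dynamics g_psi
            obs, done = env.step(action.detach().numpy())
            past.append(torch.as_tensor(obs, dtype=torch.float32))
            z_prev = encoder(past[-1])
            if done:
                return True
    return False
```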
Once a latent future trajectory has been synthesized by PaD, it must be translated into executable control inputs. To this end, we employ a separate inverse dynamics model

g_ψ : Z × Z → A,    a_t = g_ψ(z_t, z_{t+1}),

which maps consecutive latent states to the action responsible for the corresponding transition.
The inverse dynamics model g ψ is trained independently from the planner using supervised learning on action-labeled transitions drawn from the offline dataset. Importantly, gradients from g ψ do not propagate to the encoder or energy model; planning is therefore learned entirely from state-only trajectories, and action labels are not required for shaping the energy landscape or the refinement dynamics.
In our experiments, for simplicity, g ψ is trained using the full set of available actionlabeled transitions. However, this choice is not fundamental to the method. Since inverse dynamics only needs to model local, single-step state transitions rather than long-horizon planning behavior, it can in principle be trained from a substantially smaller labeled subset without affecting the planner itself. We leave a systematic study of the trade-off between action-label availability and execution performance to future work.
Decoupling action decoding from latent planning provides two key advantages. First, it enables PaD to learn goal-conditioned planning entirely from action-free data, which is particularly relevant in settings where only state observations are available. Second, it prevents imperfections in the inverse dynamics model from distorting the learned planning representation, as planning quality is determined solely by the energy-based refinement in latent space.
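A minimal sketch of such an inverse dynamics model and its supervised training objective is shown below; the two-hidden-layer MLP follows the implementation details reported later, while the latent dimension, hidden width, and dummy data are assumptions.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """g_psi: predicts the action linking two consecutive latent states."""
    def __init__(self, latent_dim=128, action_dim=5, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_t, z_next):
        return self.net(torch.cat([z_t, z_next], dim=-1))

# Supervised training on action-labeled transitions; gradients do not reach the planner.
inv_dyn = InverseDynamics()
opt = torch.optim.AdamW(inv_dyn.parameters(), lr=3e-4)
z_t, z_next, a_t = torch.randn(512, 128), torch.randn(512, 128), torch.randn(512, 5)
loss = nn.functional.mse_loss(inv_dyn(z_t, z_next), a_t)
opt.zero_grad(); loss.backward(); opt.step()
```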
We evaluate Planning as Descent (PaD) on two state-based manipulation tasks from the OGBench cube suite (Park et al., 2024). OGBench is a challenging multi-goal benchmark spanning diverse manipulation and locomotion tasks under both state and pixel observations. Compared to earlier offline suites such as D4RL (Fu et al., 2020), which has become largely saturated as of 2025 (Park et al., 2025), OGBench provides substantially more difficult evaluation scenarios and offers datasets with distinct state-coverage characteristics (e.g., narrow expert play vs. broad noisy demonstrations). These properties make it a suitable testbed for studying PaD under varying data distributions.
Cube-single task details. The cube-single task involves pick-and-place manipulation of cube blocks, where the goal is to control a robotic arm to move a cube into a target configuration. The robot is controlled at the end-effector level with a 5-dimensional action space, corresponding to displacements in the x, y, and z positions, as well as gripper yaw, and gripper opening. Task success is determined solely by the final cube configuration; the arm pose itself is not considered when evaluating success.
At test time, each task corresponds to one of five movement types illustrated in Figure 4. For each episode, both the initial and target cube poses are randomized within the constraints of the selected task movement type, ensuring diverse goal configurations during evaluation.
Figure 4: test-time movement types (task1: horizontal, task2: vertical1, task3: vertical2, task4: diagonal1, task5: diagonal2).

Evaluation protocol. We follow the official multi-goal evaluation structure of OGBench. Each policy is evaluated on the five test-time tasks, and for each task, we perform 20 randomized goal evaluation episodes, yielding 100 total rollouts per policy. Performance is measured using the binary success rate of Park et al. (2024). OGBench considers a score above 95 as indicative of successful goal achievement, providing a consistent threshold for evaluating agent performance across tasks. To account for training stochasticity, we independently train and evaluate eight policies with distinct random seeds, matching the multi-seed protocol used for state-based OGBench baselines. All results are reported as mean ± standard deviation across the eight runs. Unless otherwise stated, baseline results (GCBC, GCIVL, GCIQL, HIQL, CRL, QRL) are taken directly from the official OGBench benchmark (Park et al., 2024) and correspond to models trained and evaluated under the full benchmark protocol. In contrast, PaD is trained and evaluated under a reduced computational budget (200K training updates vs.
1M, a single evaluation checkpoint vs. three, and 20 episodes per goal vs. 50), making the comparison conservative with respect to PaD. Additional evaluation details are provided in Appendix A.
Implementation details. PaD is implemented using lightweight neural architectures. The state encoder f_θ is a two-layer MLP followed by LayerNorm, applied independently to each state without temporal mixing. The energy model E_θ consists of three stages: (i) a 2-layer 1D convolutional encoder that processes the concatenated past and future latent trajectories and reduces the temporal resolution by a factor of four; (ii) a decoder-only Transformer with three blocks, each configured with four attention heads and causal attention masks; and (iii) a final linear projection that maps the Transformer outputs to a scalar energy value. Goal conditioning is incorporated by appending two learned tokens: one corresponding to the goal state s_g and one to the temporal target λ. The goal token is produced by a shallow three-layer MLP. For the temporal target, λ is first encoded using a fixed sinusoidal embedding with logarithmically spaced frequencies, after which the resulting representation is passed through a three-layer MLP to produce the temporal token. The projector p_θ is implemented as a two-layer MLP. The inverse dynamics model g_ψ is an MLP with two hidden layers and is trained separately using supervised learning on action-labeled transitions; importantly, gradients from g_ψ do not propagate to the encoder, energy model, or projector, ensuring that planning is learned entirely independently of action decoding.
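The description above translates into roughly the following module skeletons; the latent width, observation dimension, hidden sizes, token handling, and the omitted causal mask are assumptions for illustration, not the authors' exact architecture (the goal is assumed to be encoded to a latent before producing its token).

```python
import math
import torch
import torch.nn as nn

D = 128  # latent width (assumed)

class StateEncoder(nn.Module):
    """f_theta: per-state MLP encoder with LayerNorm, no temporal mixing."""
    def __init__(self, obs_dim=40, d=D):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, d), nn.ReLU(),
                                 nn.Linear(d, d), nn.LayerNorm(d))

    def forward(self, s):                                  # (..., obs_dim) -> (..., d)
        return self.net(s)

class EnergyModel(nn.Module):
    """E_theta: conv downsampling, Transformer blocks, scalar head."""
    def __init__(self, d=D, heads=4, blocks=3):
        super().__init__()
        self.conv = nn.Sequential(                         # temporal downsampling by 4
            nn.Conv1d(d, d, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d, d, kernel_size=4, stride=2, padding=1), nn.GELU())
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=blocks)
        self.goal_token = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                        nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.time_token = nn.Sequential(nn.Linear(64, d), nn.ReLU(),
                                        nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.head = nn.Linear(d, 1)

    def sinusoidal(self, lam, dim=64):                     # fixed log-spaced frequencies
        freqs = torch.exp(torch.linspace(0, math.log(1000.0), dim // 2))
        ang = lam[:, None] * freqs[None, :]
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    def forward(self, z_future, z_past, z_goal, lam):
        z = torch.cat([z_past, z_future], dim=1)           # (B, K+H, d)
        z = self.conv(z.transpose(1, 2)).transpose(1, 2)   # reduce temporal resolution
        tokens = torch.stack([self.goal_token(z_goal),
                              self.time_token(self.sinusoidal(lam))], dim=1)
        h = self.blocks(torch.cat([z, tokens], dim=1))     # causal mask omitted in this sketch
        return self.head(h[:, -1]).squeeze(-1)             # scalar energy per sample

class Projector(nn.Module):
    """p_theta: shallow MLP applied after each refinement step."""
    def __init__(self, d=D):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, z):
        return self.net(z)
```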
The full PaD model contains almost 6.5M trainable parameters, which is lightweight by contemporary standards. All PaD planning components (i.e., f θ , E θ , and p θ ) are trained end-to-end for 200K gradient updates using the AdamW Optimizer with a batch size of 512 on a single NVIDIA RTX 3090 GPU, resulting in roughly 9 hours of wall-clock training time. The inverse dynamics model g ψ is trained separately under the same dataset but does not influence planner optimization. Unless otherwise specified, we adopt the following default hyperparameters throughout our main experiments: refinement steps T = 2, number of temporal hypotheses B = 768, top-K candidate set size K = 5, and replanning interval N = 1.
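For reference, the default settings stated above can be collected into a single configuration; this dictionary simply restates the reported hyperparameters, with field names chosen for the sketches in this section.

```python
PAD_DEFAULTS = dict(
    # planner training
    batch_size=512,
    gradient_updates=200_000,
    optimizer="AdamW",
    refinement_steps_T=2,
    # inference
    num_temporal_hypotheses_B=768,
    top_k_K=5,
    replanning_interval_N=1,
)
```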
We evaluate PaD on the goal-conditioned manipulation task cube-single-play-v0. The dataset contains 1M state transitions collected from open-loop scripted demonstrations with temporally correlated action noise. Although these demonstrations achieve consistently high success, they exhibit extremely narrow state coverage and provide no corrective behavior, resulting in a highly biased training distribution.
This experiment tests whether PaD can (i) extract meaningful planning structure from such narrow expert demonstrations and (ii) remain robust under inference-time distribution shift, where small deviations can push the planner outside the limited expert manifold.

Results in Table 1 show that PaD achieves a remarkable overall success rate of 95±2, substantially outperforming all competing baselines. Across the five test tasks, PaD consistently achieves individual task success rates in the 93-98% range, with the lowest being 93±4 and the highest 98±3. In comparison, the strongest baseline (GCIQL) achieves an overall success rate of 68±6, while GCIVL achieves 53±4, and other baselines like GCBC, QRL, CRL, and HIQL all remain below 20%. The results highlight a consistently large gap between PaD and all other approaches, with standard deviations across seeds remaining low for PaD, indicating stable performance. Notably, these results are obtained despite PaD being trained for only 200K updates, compared to 1M updates used for the reported baselines.

Table 1: Success rates on cube-single-play-v0. PaD substantially outperforms all baselines under narrow expert demonstrations by a large margin, achieving high success across tasks and an overall score at the OGBench 95-point success threshold.

Task GCBC GCIVL GCIQL QRL CRL HIQL PaD
Interpretation. These results demonstrate that PaD can effectively leverage even narrowly distributed, expert-level demonstrations to achieve highly robust and generalizable planning performance. Despite being trained solely on demonstrations with limited state coverage and lacking corrective behaviors, PaD is able to synthesize coherent, goal-directed trajectories at test time and generalize to unseen goals. This suggests that the energybased planning-by-descent framework is able to extract and recombine meaningful behavioral primitives from expert data, overcoming the limitations of direct imitation and static behavioral cloning. The strong and stable performance also indicates that PaD’s iterative refinement and planning mechanism confers robustness to distribution shift, allowing the model to recover from deviations and avoid cascading errors that typically hinder baselines under narrow data regimes.
We next evaluate PaD on the goal-conditioned manipulation task cube-single-noisy-v0. This dataset also contains 1M state transitions, collected using closed-loop Markovian policies perturbed by per-episode Gaussian action noise. In contrast to the expert dataset, these demonstrations provide broad state coverage but exhibit inconsistent and highly suboptimal behavior, resulting in a diverse yet noisy training distribution.
This setting examines whether PaD can (i) leverage wide but noisy coverage to improve generalization and (ii) maintain planning robustness in the presence of inconsistent demonstration behavior.
Table 2: Success rates on cube-single-noisy-v0. PaD achieves consistently high performance across all tasks when trained on highly suboptimal, noisy demonstrations, outperforming most baselines by a large margin and matching the strongest competing method, with both surpassing the 95-point success threshold. Results are reported as mean ± standard deviation over eight seeds.
Results in Table 2 show that PaD achieves an overall success rate of 98±2, reliably surpassing the OGBench success threshold. Across all individual tasks, PaD’s results range from 94±5 to 100±0, with most tasks meeting or exceeding the 95-point mark. In comparison, the best baseline (GCIQL) achieves 99±1 overall, while other baselines fall short of the threshold, reinforcing PaD’s robust and consistent ability to achieve successful outcomes even when trained on highly suboptimal and inconsistent demonstration data. Notably, these results are obtained despite PaD being trained for only 200K updates, compared to 1M updates used for the reported baselines.
Interpretation. These results indicate that PaD is capable of robustly learning from datasets characterized by high diversity and suboptimal, inconsistent behaviors. The planning-as-descent framework leverages the increased state-space coverage present in noisy demonstrations to synthesize effective and generalizable goal-conditioned plans, rather than overfitting to specific behavioral patterns. PaD’s high and stable success rates suggest that its energy-based planning mechanism can extract useful structural information from widely varying suboptimal trajectories, enabling flexible recombination of behavioral fragments to achieve new goals. This robustness to demonstration noise highlights a key advantage of PaD over direct imitation methods, which often struggle when faced with inconsistent or low-quality supervision. PaD’s ability to generalize from such data further underscores the strength of its iterative refinement and verification-based synthesis approach.
Beyond success rates, we evaluate the efficiency of PaD's planning by measuring the average number of steps required to solve each task, comparing models trained on expert demonstrations (PaD-play) versus those trained on highly suboptimal behavior (PaD-noisy). For each of the five OGBench single-cube manipulation tasks, we record the episode lengths (i.e., the number of environment steps to reach the goal) over successful rollouts and report results in Table 3.

Results. Counterintuitively, we find that PaD-noisy consistently solves tasks in fewer steps than the variant trained on expert demonstrations (PaD-play). Across all five tasks, PaD-noisy achieves goal completion with an overall average of 63±6 steps per episode, compared to 78±7 for PaD-play. On each individual task, PaD-noisy achieves lower mean episode lengths, with per-task averages ranging from 53±7 to 69±8, while PaD-play episode lengths range from 66±5 to 84±7. Importantly, these step counts are computed only over successful rollouts. The corresponding task success rates are reported separately: PaD-play success rates are given in Table 1, and PaD-noisy success rates are given in Table 2. Thus, the efficiency comparison in Table 3 is conditioned on successful task completion and should be interpreted jointly with the success-rate results reported earlier.
Episode-length distributions. To complement the mean ± std statistics in Table 3, Figure 5 visualizes the full distribution of episode lengths over successful rollouts for both PaD-play and PaD-noisy, aggregated across all tasks. The distribution for PaD-noisy is clearly left-shifted relative to PaD-play, indicating that the reduction in average step count is not driven by a small number of unusually short episodes. Instead, PaD-noisy consistently produces shorter plans across the majority of successful rollouts, despite being trained on highly suboptimal and inconsistent demonstrations. In contrast, PaD-play exhibits a tighter but right-shifted distribution, closer to the demonstrator behavior, with a substantial fraction of episodes requiring more steps and a heavier tail corresponding to longer successful trajectories. This distributional view confirms that the efficiency gains observed for PaD-noisy reflect a systematic improvement in planning efficiency rather than an artifact of outliers or selective averaging.
Interpretation. This counterintuitive result suggests that PaD is able to generalize more efficiently when trained on data with higher state-space coverage, even if the demonstrations themselves are highly suboptimal and inconsistent. The expert dataset appears to upper-bound the planner: since it only observes nearly optimal but narrowly distributed behaviors, PaD-play is restricted to reproducing these trajectories and struggles to discover more efficient solutions that are not present in the data. In contrast, the diverse transitions present in the noisy dataset allow PaD-noisy to observe a wider range of behaviors and transitions, enabling the planner to synthesize more efficient trajectories at test time, even outperforming the expert demonstrations themselves in terms of step efficiency. This finding highlights the strength of the planning-by-descent framework, which can leverage suboptimal and diverse demonstrations to achieve superior planning efficiency that goes beyond direct imitation, and underscores the importance of state-space diversity for enabling generalizable and efficient goal-reaching behavior.
5.4 RQ4: What is the Impact of the Replanning Interval N on Planning Robustness and Efficiency?
We analyze how varying the replanning interval N affects both the robustness and efficiency of PaD when trained on narrow expert demonstrations (PaD-play) versus highly suboptimal, diverse demonstrations (PaD-noisy). Figure 6 summarizes the success rate, average steps to solve, and normalized computation time per environment step as N increases from 1 (frequent replanning) to 16 (infrequent replanning).
Both variants exhibit a clear trade-off: increasing N leads to substantial reductions in computational cost, with per-step time dropping from 100 ms at N = 1 to 10 ms at N = 16. However, this efficiency gain comes at the expense of task performance. For PaD-noisy, the success rate remains high for moderate N values, decreasing from 98 ± 2 at N = 1 to 90 ± 2 at N = 4, and only dropping sharply beyond N = 8. The average steps to solve increase gradually, suggesting the model maintains efficient planning across a wider range of N .
In contrast, PaD-play is markedly less robust to infrequent replanning. Its success rate falls rapidly, dropping to 64% at N = 4 and reaching just 10% at N = 16. Similarly, average episode length increases from 83 ± 10 steps at N = 1 to 111 ± 6 at N = 16, with a significant performance collapse at moderate N . These trends indicate that PaD-play is highly sensitive to the replanning interval and requires frequent replanning to maintain reliable behavior.
Comparing the two, PaD-noisy consistently outperforms PaD-play at all values of N , exhibiting both higher success rates and greater resilience to longer open-loop horizons. The results suggest that exposure to diverse, suboptimal behavior during training enables PaD-noisy to recover from off-distribution states more effectively, making it less reliant on frequent replanning. In contrast, the narrow state coverage of PaD-play limits its ability to adapt when replanning is infrequent, leading to rapid degradation in both robustness and efficiency.
Taken together, these findings highlight the importance of training data diversity for planning robustness and suggest that moderate values of N (such as N = 2 or N = 4) offer an effective balance between computational efficiency and reliable goal-reaching performance, particularly when models are trained on broad, suboptimal demonstrations.

A key innovation in PaD is the introduction of a lightweight manifold projector network p_θ, comprising just 130K parameters, which is applied after each gradient-based refinement step to ensure that latent trajectories remain close to the encoder-induced data manifold. To evaluate the necessity and effectiveness of this component, we train a version of PaD with the projector removed.
Figure 2 shows that removing the projector leads to degraded training dynamics, while Table 4 confirms that it translates into a dramatic drop in final task performance: without the projector, success rates fall from above 95% to just 50%. This large performance gap highlights the projector’s crucial role in mediating between optimization and representation. By preventing energy descent from pushing latent states off-manifold, the projector helps maintain valid and dynamically consistent plans throughout refinement. Notably, this stabilization is achieved with minimal computational overhead, as the projector adds only a small fraction of the total model parameters. These results indicate that the manifold projector is not merely a lightweight architectural addition, but an essential component for robust and effective planning in the PaD framework.
We assess the impact of the Top-K selection parameter, which determines how many of the lowest-energy candidate trajectories are considered during plan selection, on PaD's performance and efficiency. Across both PaD-noisy and PaD-play, we find that varying K across a broad range (K = 1, 5, 25, 50, 100, with B = 768 candidates) has a negligible effect on success rate, average steps to solve, or computational cost. Performance remains statistically indistinguishable across all tested values of K. This insensitivity indicates that the energy landscape learned by PaD consistently yields robust, low-energy plans, rendering additional selection diversity largely unnecessary. Our findings underscore PaD's robustness to the choice of K, thereby simplifying hyperparameter tuning and practical deployment.
This work introduces Planning as Descent (PaD), a framework that reframes offline goal-conditioned planning as gradient-based refinement in a learned energy landscape. By unifying trajectory evaluation and synthesis and enforcing training-inference alignment, PaD achieves strong and consistent performance across both narrow expert and highly suboptimal data regimes. The results suggest that learning to verify trajectories can provide a stable alternative to policy learning or sampling-based trajectory generation in offline settings where distribution shift and noisy demonstrations are prevalent.
Despite these strengths, several limitations remain. First, our experiments are conducted in a relatively modest computational regime, using lightweight models (6.5M parameters) and short planning horizons (on the order of 10² steps). While this demonstrates that PaD can be effective under limited resources, it prevents a systematic investigation of how the method scales with increased model capacity, longer horizons, and larger datasets. Understanding these scaling properties is an important direction for future work.
Second, we restrict our studies to state-based inputs and do not address learning directly from high-dimensional visual observations. Extending PaD to pixel-based or multimodal settings would require integrating visual representation learning in jointly learned latent spaces. Applying energy-based trajectory refinement in such settings remains an open and promising challenge that would significantly broaden the applicability of the framework.

Third, our empirical evaluation is limited to simulated manipulation tasks from the OGBench single-cube suite. While these tasks provide a controlled and challenging benchmark for offline goal-conditioned planning, they do not capture the full complexity of real-world robotic systems, including sensing noise, unmodeled dynamics, and actuation delays. Evaluating PaD on physical robotic platforms is a necessary step to assess the practical viability of energy-based planning-by-descent in real-world settings.
Finally, our experiments do not explore other important control domains such as continuous-control locomotion or navigation. Broadening empirical coverage to these domains would help clarify the generality and limitations of the planning-as-descent paradigm.
Overall, this work represents a step toward verification-driven planning frameworks that operate effectively from offline, reward-free data. We view PaD not as a final solution, but as a foundation for exploring how learned energy landscapes can support scalable, general-purpose planning and reasoning in increasingly complex environments.
We presented Planning as Descent (PaD), a framework for offline goal-conditioned planning that casts trajectory synthesis as gradient-based refinement in a learned energy landscape. By learning to verify entire future trajectories rather than directly generating them, and by enforcing alignment between training and inference through shared refinement dynamics, PaD provides a robust alternative to policy- and sampling-based approaches in reward-free, offline settings.
Empirically, PaD achieves strong performance on challenging OGBench single-cube manipulation tasks, including state-of-the-art results when trained on narrow expert demonstrations. Moreover, we observe that training on diverse but highly suboptimal data can further improve both success rates and planning efficiency. We interpret this effect as evidence that broad state-space coverage is particularly beneficial for verification-driven planning, enabling the energy landscape to support flexible trajectory refinement beyond the behaviors explicitly demonstrated in the data.
Taken together, these results suggest that learning to evaluate and refine trajectories may offer a scalable and principled foundation for offline goal-conditioned planning. We hope this work motivates further investigation into energy-based formulations as a unifying framework for planning, representation learning, and control.
Refinement details. We use T = 2 refinement steps; stop-gradient operations are applied between refinement steps as described in Section 4.
Training details. Models are trained with batch size 512 for 200K gradient updates. Unless otherwise stated, all hyperparameters are fixed across experiments and datasets.
Source code and reproducibility. An implementation of the proposed method will be made publicly available at: https://github.com/inescopresearch/pad

Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024.