Iterative Deployment Improves Planning Skills in LLMs

Other IPC 2023 Domains

We briefly describe seven of the 10 domains used in our second experiment. The other three domains — Blocksworld, Rovers, and Sokoban — were explained in the main text. Table 1 shows the distribution of the main parameter for each domain used in this experiment.

Childsnack

This domain involves making a number of sandwiches, some with gluten‐free ingredients for children with allergies, placing them on trays, and delivering them to tables. Key constraints include production (gluten and non‐gluten items), packaging (quantity of trays), and serving order.

Ferry

Cars are distributed across locations, and a ferry (which can fit one car) must transport them to their target destinations. The domain requires deciding the ferry’s schedule, routing, and when to move each car, all under capacity constraints.

Floortile

A grid of tiles must be painted by robots that move in the four cardinal directions. Robots can only move over tiles that are not yet painted, so they must paint the tiles in a carefully chosen order.

Miconic

Elevator domain: there are multiple floors, passengers on origin floors, and destinations; an elevator must move and pick up / drop off passengers, respecting boarding and alighting constraints. Though conceptually simple, it tests planner efficiency regarding sequencing and movement decisions, especially as passenger numbers and floor counts increase.

Satellite

Satellites have instruments (which may need calibration) and different modes of observation; they must orient, turn, switch instruments, and gather images to satisfy a given goal. Only one instrument can be active at a time, and switching or calibrating requires actions.

Spanner

An agent traverses a path with $`k`$ spanners placed along it. At the end of the path is a gate with $`m \leq k`$ loose nuts. Since spanners break after a single use, the agent must collect at least $`m`$ spanners along the way to tighten all the nuts and open the gate.

Transport

This is a traditional logistics problem with capacity constraints. Vehicles must deliver packages to certain locations on a graph, subject to vehicle capacities (i.e., number of packages that can be transported at once).

| Domain | Param. Dist. | Param. Meaning |
|---|---|---|
| Blocksworld | $`n \in [2,10]`$ | $`n`$ blocks to be rearranged |
| Childsnack | $`c \in [4,12]`$ | $`c`$ children to be served |
| Ferry | $`a \in [3,30]`$ | $`a`$ cars to be transported |
| Floortile | $`t \in [2,25]`$ | $`t`$ tiles to paint |
| Miconic | $`p \in [1,10]`$ | $`p`$ passengers to deliver |
| Rovers | $`r \in [1,4]`$ | $`r`$ rovers to drive and collect items |
| Satellite | $`s \in [1,8]`$ | $`s`$ satellites to synchronize |
| Sokoban | $`b \in [1,4]`$ | $`b`$ boxes to push to their goals |
| Spanner | $`k \in [1,20]`$ | $`k`$ spanners to collect |
| Transport | $`v \in [2,10]`$ | $`v`$ vehicles to drive |

Table 1: Task distribution and plan lengths for each domain, showing only the main parameter for each domain.

We study Iterative Deployment, a training mechanism that leads Large Language Models (LLMs) to bootstrap their planning capabilities without requiring external expert demonstrations or additional teacher models. The core intuition is that a model can improve by learning from the simple tasks it has already successfully solved, effectively using its own valid outputs as training data for subsequent generations (provided a reliable curation/verification mechanism is available).

We hypothesize that LLMs improve their planning abilities by simply fine-tuning over their own curated traces. In other words, they can bootstrap by first solving easy tasks themselves, and then being fine-tuned using the traces of these solved tasks. Later generations can then start solving larger tasks, and these new traces can be used so future generations solve even larger ones. Repeating this process many times can then gradually improve their planning skills. For example, if the current generation of a model can solve Sokoban problems with one single box, then it can learn from its own traces to solve problems with two boxes, and then with three boxes, etc. So by exploiting its own current capabilities, the model can improve and solve harder tasks in future generations.

Formally, let $`M_n`$ denote the model at generation $`n`$, parameterized by $`\theta_n`$. We assume access to a dataset of planning tasks without solutions, $`\mathcal{D}_{tasks}`$, and a deterministic external validator $`V(x, y)`$ that returns true if trace $`y`$ is a valid solution for task $`x`$ and false otherwise, with an ability to measure solution efficiency. This validator can be seen as a correction mechanism for reasoning tasks, or simply as a proxy for user preferences. We also evaluate model performance on $`\mathcal{D}_{tasks}`$ as a test set, since we are interested in how iterative deployment improves the model’s ability to solve longer tasks with access only to its own previously curated solutions. The iterative deployment process then proceeds as follows:

  • Deployment and Trace Collection: In each iteration $`n`$, we prompt the current model $`M_n`$ to solve the tasks in $`\mathcal{D}_{tasks}`$. For each task input $`x`$, the model generates a trace $`y`$ according to its policy $`\pi_{\theta_n}(y|x)`$. This trace includes the chain-of-thought and the tentative solution generated by the model. This emulates a standard deployment scenario where the model interacts with users or environments.

  • Validation: The generated traces are passed to the external validator $`V`$. We filter the outputs to retain only the subset of valid traces, $`\mathcal{D}_{valid}^{(n)} = \{(x, y) |V(x, y) = \text{True}\}`$. Invalid plans, which typically constitute the majority of outputs in early generations, are discarded.

  • Curation and Aggregation: To prevent catastrophic forgetting, reduce model collapse, and further improve generalization, we aggregate the valid traces from the current generation with those from all previous generations. The training dataset for the next step is $`\mathcal{T}_{n+1} = \bigcup_{i=0}^{n} \mathcal{D}_{valid}^{(i)}`$.

  • During this aggregation, we apply a second curation step: a selection mechanism to ensure data quality. If multiple valid traces exist for the same task (e.g., from different generations), we retain only the highest-quality solution. In our experiments, quality is defined by plan efficiency – we select the trace with the shortest plan length, breaking ties by selecting the one with fewer reasoning tokens, but in principle other task-specific metrics can be used.

  • Supervised Fine-Tuning (SFT): Finally, we produce the next generation $`M_{n+1}`$ by fine-tuning $`M_n`$ on the curated dataset $`\mathcal{T}_{n+1}`$ using the standard supervised learning objective (next token prediction).
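The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate`, `validate`, and `fine_tune` are hypothetical placeholders for the model rollout, the external validator $`V`$, and the SFT step, and the trace representation is assumed.

```python
def plan_length(trace):
    # Quality metric from the text: primary key is plan length,
    # ties broken by number of reasoning tokens.
    return (len(trace["plan"]), trace["reasoning_tokens"])

def curate(pool):
    # Keep only the best valid trace per task, across all generations.
    best = {}
    for task, trace in pool:
        if task not in best or plan_length(trace) < plan_length(best[task]):
            best[task] = trace
    return best

def iterate(model, tasks, validate, generate, fine_tune, n_generations=5):
    pool = []  # aggregated valid traces from all generations so far
    for n in range(n_generations):
        for x in tasks:
            y = generate(model, x)           # deployment: sample a trace
            if validate(x, y):               # external validator V(x, y)
                pool.append((x, y))          # keep only valid traces
        train_set = curate(pool)             # curation + aggregation
        model = fine_tune(model, train_set)  # SFT -> next generation M_{n+1}
    return model
```

Note that `curate` operates over the union of all generations' valid traces, matching $`\mathcal{T}_{n+1} = \bigcup_{i=0}^{n} \mathcal{D}_{valid}^{(i)}`$, so an older, shorter plan survives even if the current generation produces a longer one.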

The curation and aggregation phase is important here, as the likelihood of a user making an interaction with an LLM publicly available is not uniform; it depends on the nature of the interaction and the user’s intention. For instance, when a user interacts with an LLM to solve a coding task, they are more likely to integrate a response that solves their task into their codebase. They are even more likely to use it if the solution is particularly elegant or simple, akin to our selection mechanism. This creates an effective curation mechanism controlled by the user’s revealed preference (which is not necessarily their stated preference). This is a key difference from the assumptions used in the study of model collapse. As LLMs are trained on more and more recursively generated data, their performance can deteriorate and the models eventually collapse. However, curation might delay or prevent this. In this work, we study the effect of this additional assumption compared to the model-collapse assumptions, and argue for the importance of understanding its impact on future model generations.

Figure [fig:pipeline] illustrates one iteration of this process. When the $`n`$-th generation of an LLM is deployed, it is prompted by its users. The produced traces are filtered by an external validator, and the traces judged valid are used to fine-tune the $`n+1`$-th generation of the model. Note that we are not building a learning curriculum ourselves; rather, the LLM together with the validator builds one: we prompt the model with all tasks from the test set $`\mathcal{D}_{tasks}`$, discard the invalid traces, add the valid ones to the train set, and then fine-tune the next generation.

Related Work

Reasoning with LLMs.

Chain-of-Thought (CoT) uses “reasoning tokens” to improve the performance of LLMs on many problems. Many approaches build on this idea. For example, different methods explore multiple reasoning paths to find a valid solution. Others show that using more informative reasoning traces, such as those based on action sequence similarity, can also enhance planning performance. However, these approaches often have limited success in generalizing to out-of-distribution reasoning tasks and longer-horizon planning problems. For instance, show that while CoT can help with planning problems, it does not consistently lead to generalization, particularly for out-of-distribution tasks.

Improving Reasoning Capabilities of LLMs.

There are several methods in the literature that try to improve the reasoning capabilities of LLMs. The iterative deployment mechanism we study here shares conceptual similarities with the Self-Taught Reasoner (STaR) framework, which iteratively improves a model’s reasoning ability by fine-tuning on traces that lead to valid answers. However, while STaR tries to improve model performance, we focus on a very different question: we want to understand whether performance improvement could emerge unintentionally from repeated model deployments. We use planning problems to ground this question in a controlled setting, which allows us to study this phenomenon and to prove that repeated deployment can be seen as a form of RL. On the technical level, there are key differences between the two as well. First, STaR has an extra step for producing rationales for wrong answers, which are often not available from scraped user-curated traces. Second, we use valid traces from older generations as well as the current one, emulating the property of web-scraped data containing traces from multiple former model generations, while STaR focuses on traces from the current model generation.

Reinforcement learning has been widely used to improve the reasoning capabilities of large language models. Techniques such as Group Relative Policy Optimization (GRPO) allow us to fine-tune models without supervised data, using only an internal reward function. By correctly modeling the reward functions, we can also influence the RL training so that the models better align with safety conditions or goal specifications.

Iterative deployment is also related to test-time scaling methods. For example, propose a simple test-time scaling technique where they fine-tune a small model on traces generated by a much larger, more capable teacher model—in their paper, the small model is a 32B model, while the teacher model is a model from the Gemini family. In contrast, iterative deployment does not rely on a separate, more powerful model. Instead, the LLM generates its own training data, bootstrapping its performance by solving progressively harder tasks. The model effectively becomes its own teacher, curating its own experience to improve in the next generations.

Model Collapse.

One important connection to our analysis is ‘model collapse’. show that iteratively training models on their own synthetic output eventually collapses the model (i.e., the model’s distribution shrinks until its tails disappear). The key difference in our analysis is the explicit curation step: only valid traces, as determined by an external validator, are used for fine-tuning. lacks a curation step, and assumes all data is kept between generations. We show that curation can improve reasoning, but it is unknown whether the curation step completely avoids, or simply delays, model collapse. In our context, we ran our experiments for up to 10 generations and, despite smaller improvements past the fifth generation, did not observe signs of imminent model collapse within the planning domains.

Planning using LLMs.

show that early LLMs do not reliably solve simple planning tasks. show that problems simpler than computing a plan, such as verifying if a given plan is valid, are already challenging for LLMs.

In the context of classical planning, several fine-tuning strategies have been explored. train domain-specific GPT models for each planning domain, achieving strong performance but only for in-distribution problems. report that supervised fine-tuning with optimal plans produced by an off-the-shelf planner did not lead to out-of-distribution generalization. apply RL to end-to-end plan generation, but observed only a small performance improvement. Our work differs from these by showing that generalization can be achieved without access to an existing planner or explicit reward modeling, relying instead on an iterative process with a simple validation filter.

propose an instruction-tuning framework that improves symbolic planning by teaching LLMs explicit logical CoT reasoning with feedback from VAL. While conceptually related to the iterative deployment mechanism, their method relies on carefully handcrafted training data and a multi-phase fine-tuning using structured reasoning feedback from a validator, whereas the mechanism studied here achieves self-improvement through repeated deployment and fine-tuning solely on valid traces, requiring just a binary signal.

Implications to AI Safety

While iterative deployment shares links with RL fine-tuning, it also brings new concerns about AI safety. Unlike standard RL training, where reward functions can be explicitly designed to encode human preferences and safety constraints, iterative deployment of models in the wild relies on implicit signals from post-deployment usage. Curation done indirectly through user interactions with previously deployed models behaves as an opaque reward function which can be difficult to understand and control. In RL for alignment, reward functions are often used to align the model with safety constraints and goal specifications. In iterative model deployment, the implicit reward functions could clash with the explicitly specified rewards in model alignment. This could lead to unexpected problems in future model generations, as the indirect curation used might lead to large gradient steps in the opposite direction to the gradients induced by the safety training, for example. Overall, this raises new alignment challenges.

Another concern is bias during validation. If the external validator (e.g., a tool, or a human curator) has unintended or malicious biases, these biases might accumulate over the generations. Later models might then optimize for harmful properties that diverge from the original goals and also from safety constraints.

A final property to note is model collapse. While data curation might delay collapse by filtering for valid traces, it is still unknown whether this fully prevents it. If collapse occurs, important capabilities can degrade in ways that are difficult to detect until deployment. Note, however, that it is also unclear whether longer RL training with techniques such as PPO or GRPO leads to model collapse.

Formalizing the Connection to Reinforcement Learning

Iterative deployment can be interpreted as RL fine-tuning but with the reward signal left implicit. We prove next that SFT using only valid traces can be seen as a special case of REINFORCE with an implicit binary reward function.

To show this, we start from the following result:

The update directions of the gradients for SFT using only valid traces and REINFORCE with binary rewards are identical.

The proof is included in Appendix 12.7.
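A sketch of the identity, consistent with the definitions above (the full argument is in Appendix 12.7): SFT on the valid subset maximizes the log-likelihood of valid traces, while REINFORCE weights each sampled trace by its reward; with an implicit binary reward, the two gradients coincide in direction.

```latex
% SFT gradient over the curated set of valid traces:
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = \mathbb{E}_{(x,y) \sim \mathcal{D}_{valid}}
    \left[ \nabla_\theta \log \pi_\theta(y \mid x) \right].
% REINFORCE with the implicit binary reward r(x,y) = \mathbb{1}[V(x,y)]:
\nabla_\theta J
  = \mathbb{E}_{x \sim \mathcal{D}_{tasks},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ r(x,y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right].
% Since r(x,y) = 0 exactly on the invalid traces, only valid traces
% contribute to the second expectation, matching the SFT gradient direction.
```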

In addition to the valid traces generated by the current model, we assume access to a collection of traces from previous generations of the model. We refer to the traces produced by current policy $`\pi_\theta`$ as the on-policy traces, and those produced by a behavior policy $`\pi_\beta`$ (earlier generations or any other external source) as off-policy traces.

SFT on a mixture of on- and off-policy valid traces is equivalent to REINFORCE with binary rewards augmented by importance-weighted contributions from the behavior policy.

The proof for Proposition [thm:importance-weighted] is also included in Appendix 12.7.

In turn, this proves our original claim that SFT using only valid traces can be seen as a special case of REINFORCE.

SFT using only valid traces following the iterative deployment mechanism described in Section 5 is a special case of REINFORCE with implicitly defined binary rewards.

Proof. Follows directly from Proposition [thm:importance-weighted]. ◻

Conclusion

In this work, we showed that iterative deployment of LLMs, where each generation is fine-tuned on curated traces from previous deployments, improves planning capabilities. Our experimental results on classical planning domains show that models trained under this paradigm more than doubled their performance within five generations, with evidence of out-of-distribution generalization to longer-horizon plans.

Iterative deployment can be an effective alternative to RL for improving the reasoning capabilities of LLMs. We proved theoretically that our method is equivalent to a special case of REINFORCE with binary rewards. Unlike RL fine-tuning, our approach does not rely on carefully designed reward functions; instead, it uses external postprocessing tools to validate and curate LLM answers as an implicit signal. However, the absence of an internal reward function raises concerns about potential biases and unintended effects introduced during training. The use of external validators, which might be outside our control, can be damaging to AI safety.

Overall, our results suggest that iterative deployment is a viable method for self-improving LLMs. In future work, we plan to study how model collapse can affect iterative deployment, and to develop theoretical results connecting our method with RL fine-tuning more explicitly.