Dreaming in Code for Curriculum Learning in Open-Ended Worlds
Open-ended learning frames intelligence as emerging from continual interaction with an ever-expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open-ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, “dreaming” takes the form of materializing code-level variations of the world. We instantiate DiCode in Craftax, a challenging open-ended benchmark characterized by rich mechanics and long-horizon progression. Empirically, DiCode enables agents to acquire long-horizon skills, achieving a 16% improvement in mean return over the strongest baseline and non-zero success on late-game combat tasks where prior methods fail. Our results suggest that code-level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments that bridge competence gaps in open-ended worlds. Project page and source code are available at https://konstantinosmitsides.github.io/dreaming-in-code and https://github.com/konstantinosmitsides/dreaming-in-code.
💡 Research Summary
Dreaming in Code (DiCode) introduces a novel approach to Unsupervised Environment Design (UED) that leverages large pre‑trained foundation models (FMs) to generate executable Python code defining new training environments. Unlike prior UED methods that manipulate low‑dimensional parameters or treat a level as a static seed, DiCode treats a level as a full program that can modify world topology, entity interactions, transition dynamics, and goal specifications. The framework operates in a closed‑loop curriculum: after each training cycle, the agent’s performance metrics are stored in an archive; a parent level is selected based on its learnability score p · (1 − p) and its status (A/B); and the FM is prompted with a natural‑language description of the parent, the parent’s performance profile, the target environment’s profile, and mutation instructions. The FM first produces a textual description of the desired new level, and a second inference step then synthesizes the corresponding Python code. The code is compiled and, if compilation succeeds, added to the training batch together with the target environment and previously archived levels sampled via Prioritized Level Replay (PLR).
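The learnability-driven parent selection described above can be sketched as follows. This is a minimal illustration, not the released implementation: the archive structure, field names, and the leaf-only filter are assumptions inferred from the summary (levels with no existing children are preferred, as noted below).

```python
def learnability(p: float) -> float:
    """Learnability score p * (1 - p): maximized at p = 0.5,
    i.e. levels the agent solves about half the time."""
    return p * (1.0 - p)

def select_parent(archive: list[dict]) -> dict:
    """Pick the childless ("leaf") level with the highest learnability
    score, so mutation targets levels at the edge of competence."""
    leaves = [lvl for lvl in archive if not lvl["children"]]
    return max(leaves, key=lambda lvl: learnability(lvl["success_rate"]))

# Illustrative archive: level 0 is mastered and already has a child,
# level 2 is far too hard, so level 1 (p = 0.55) is selected.
archive = [
    {"id": 0, "success_rate": 0.95, "children": [1]},
    {"id": 1, "success_rate": 0.55, "children": []},
    {"id": 2, "success_rate": 0.10, "children": []},
]
parent = select_parent(archive)
```

The selected parent would then be rendered into a natural-language description and handed to the FM for mutation.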
Key technical contributions include: (1) a graph‑based level archive that records executable code, metadata, and success‑rate statistics, enabling diverse evolutionary lineages; (2) a parent‑selection mechanism that favors high‑learnability levels while ensuring that each selected parent has no existing children, thus promoting structural diversity; (3) a reward augmentation scheme where the generated level’s reward equals the target environment’s reward plus a dynamic bonus Bₜ that scales with the agent’s previous average return, encouraging continual skill acquisition; (4) a curriculum feedback loop where the FM’s generation is conditioned on the agent’s current competence, effectively making the FM act as an adaptive teacher that can “remove resources” or “increase enemy damage” to keep the agent in its zone of proximal development.
The authors instantiate DiCode in Craftax, a procedurally generated, open‑ended RL benchmark featuring rich mechanics and long‑horizon progression. Empirical results show that agents trained with DiCode achieve a 16% improvement in mean return over the strongest baseline and obtain non‑zero success rates on late‑game combat tasks on which prior methods (including PLR and random level generation) fail entirely. Qualitative analysis reveals emergent “teacher‑like” strategies: the FM automatically adjusts difficulty by modifying resource availability or combat formulas based on the agent’s recent success rates. Ablation studies confirm that removing the curriculum loop eliminates these benefits; the FM alone cannot sustain progress without conditioning on the agent’s performance.
Overall, DiCode demonstrates that code‑level environment synthesis provides a practical and scalable mechanism for curriculum control in open‑ended worlds. By using open‑weight foundation models, the approach remains reproducible and extensible to other domains such as physics‑based simulators (e.g., MuJoCo) or different procedural games. The work bridges the gap between environment generation and curriculum learning, offering a pathway toward continual, long‑term skill acquisition in complex, evolving environments.