COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game


Authors: Alkis Sygkounas, Rishi Hazra, Andreas Persson, Pedro Zuidberg Dos Martires, Amy Loutfi

Alkis Sygkounas, Machine Perception and Interaction Lab, Örebro University, Sweden (alkis.sygkounas@oru.se)
Rishi Hazra, Machine Perception and Interaction Lab, Örebro University, Sweden (rishi.hazra@oru.se)
Andreas Persson, Machine Perception and Interaction Lab, Örebro University, Sweden (andreas.persson@oru.se)
Pedro Zuidberg Dos Martires, Machine Perception and Interaction Lab, Örebro University, Sweden (pedro.zuidberg-dos-martires@oru.se)
Amy Loutfi, Machine Perception and Interaction Lab, Örebro University, Sweden (amy.loutfi@oru.se)

Abstract

A central challenge in building continually improving agents is that training environments are typically static or manually constructed. This restricts continual learning and generalization beyond the training distribution. We address this with COvolve, a co-evolutionary framework that leverages large language models (LLMs) to generate both environments and agent policies, expressed as executable Python code. We model the interaction between environment and policy designers as a two-player zero-sum game, ensuring adversarial co-evolution in which environments expose policy weaknesses and policies adapt in response. This process induces an automated curriculum in which environments and policies co-evolve toward increasing complexity. To guarantee robustness and prevent forgetting as the curriculum progresses, we compute the mixed-strategy Nash equilibrium (MSNE) of the zero-sum game, thereby yielding a meta-policy. This MSNE meta-policy ensures that the agent does not forget how to solve previously seen environments while learning to solve previously unseen ones. Experiments in urban driving, symbolic maze-solving, and geometric navigation showcase that COvolve produces progressively more complex environments.
Our results demonstrate the potential of LLM-driven co-evolution to achieve open-ended learning without predefined task distributions or manual intervention.

Keywords: Co-evolution, Unsupervised Environment Design, Mixed-strategy Nash Equilibrium, Large Language Models

1 Introduction

Developing agents that continually acquire new skills in dynamic and unpredictable settings remains a core challenge in AI. Most current training pipelines still depend on large amounts of human-curated data, which is costly and often produces agents that generalize poorly beyond their training distribution [47]. While reinforcement learning (RL) offers an appealing alternative by allowing agents to learn through extensive interaction in a simulator [41], it inherits a fundamental limitation: the environments used for training are either fixed and/or manually designed. Constructing an environment distribution that captures the diversity and variability of real-world conditions is inherently difficult [5], and RL agents often fail to generalize beyond the narrow distribution they encounter during training [15, 22, 23]. Achieving robustness and transferability, therefore, requires exposing agents to a diverse and continually evolving curriculum of environments that adapts to their capabilities and expands the range of behaviors they must master [6, 23].

Unsupervised environment design (UED) [9, 21] addresses these limitations in training environments by automatically generating a curriculum of environments that adapts in difficulty based on the agent's performance. By dynamically tailoring environments to expose and challenge the agent's weaknesses, UED encourages continual learning. However, UED typically generates environments via randomization or simple heuristics, which limits the diversity and relevance of the resulting tasks. We overcome this by introducing COvolve, a co-evolutionary framework that frames UED as a two-player zero-sum game.
COvolve leverages LLM-based code generation with rich priors to imaginatively design both environments and policies. In COvolve, an LLM-powered environment designer and policy designer compete adversarially to co-evolve more challenging levels and more capable policies, respectively, as conceptually illustrated in Figure 1.¹

¹ Illustrative videos are available at: https://anonymous.4open.science/r/covolve-6187/. Code will be released upon acceptance.

Figure 1: A conceptual overview of the proposed COvolve, comprised of an Environment Designer and a Policy Designer that co-evolve by playing a two-player zero-sum game. The Environment Designer generates increasingly challenging environments (as code), while the Policy Designer creates policies (as code) to solve them. A mixed-strategy Nash equilibrium enables robust, open-ended learning through continual adaptation.

Previous research has explored LLM-driven environment generation [12] and policy design [25] using code-based outputs, where both environments and policies are represented directly as executable programs. Such programmatic representations provide advantages over neural encodings, including improved generalizability to unseen scenarios [19, 45], greater modularity and reuse of behavioral components [11, 48], and stronger verifiability and interpretability [2, 46]. They also allow Python to express arbitrarily complex environment and policy logic, while LLMs contribute priors that enable the automatic synthesis of diverse tasks without hand-crafted templates [12]. However, existing approaches typically address either environment generation without promoting robust
agent learning [12], or policy design without continual adaptation to new challenges [26]. In contrast, COvolve integrates both aspects into a closed-loop LLM-driven co-evolution process that simultaneously advances environment complexity and policy capability. Concretely, we make the following contributions:

(1) Game-theoretic Framework for Robust Policy Design. We frame the co-evolution as a two-player zero-sum game between a policy player and an environment player, where the payoff is the policy's success rate in each environment. At each iteration, COvolve maintains populations of policies and environments, evaluates all pairs to form an empirical payoff matrix, and computes the mixed-strategy Nash equilibrium of this matrix game. The resulting meta-policy distribution solves the max-min objective within the empirical meta-game [32], improving worst-case performance against the current environment set and guiding the environment player to generate levels that exploit weaknesses of the equilibrium distribution [24]. In contrast, prior approaches [50] train independent policies per environment, thereby compromising population-level robustness and causing catastrophic forgetting.

(2) Empirical Evidence of Emergent Curriculum and Generalization. We empirically demonstrate that COvolve produces increasingly challenging environments across diverse domains (urban driving, maze-solving, and 2D navigation), with generated levels exhibiting escalating complexity and diversity over time. Crucially, our evaluation shows that computing the MSNE is essential to prevent catastrophic forgetting, unlike approaches such as Eurekaverse [26], which retain only the latest best policy and finetune on new environments, leading to forgetting.

2 Related Works

Domain randomization (DR) exposes agents to a broad distribution of environments [20, 44] but lacks adaptivity and often produces trivial or unsolvable tasks [9].
Unsupervised environment design (UED) addresses this by automatically generating curricula tailored to agent performance. For instance, minimax adversarial training selects environments that minimize the agent's reward [31, 35, 39]. However, it can produce overly difficult tasks unless constrained [9]. Regret-based methods like PAIRED [9] address this by defining regret relative to an approximated optimal policy to ensure solvability. While our work uses a minimax adversary, future directions could also incorporate regret-based strategies to avoid generating unsolvable levels. Crucially, our LLM-driven co-evolution introduces data-driven priors that enable the design of more challenging and relevant environments than classical, heuristic-based UED.

Recent work uses LLMs to generate and automate environment design [12, 49], world model generation [8, 43], and reward specification in RL [17, 29]. However, most frameworks either decouple environment and agent learning or focus only on environment generation, limiting agent robustness. Our framework enables fully closed-loop co-evolution, automatically generating a curriculum that adapts to both the agent and the environment. We use game-theoretic principles to maintain a diverse policy population via a mixed-strategy Nash equilibrium, yielding a meta-policy that is robust across the evolving set of generated environments and provides a principled population-level objective for continual adaptation.

In parallel with environment design, LLMs have been used to synthesize modular, generalizable, and interpretable code-based policies. Approaches like Code-as-Policies [25], RL-GPT [28], and ProgPrompt [42] leverage LLMs to generate executable plans or combine code with RL controllers, but are typically limited to narrow task distributions. In contrast, our approach constructs robust, continually adaptive policies that learn within an open-ended, co-evolving curriculum.
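As a toy illustration of such a code-as-policy representation, a policy can simply be an executable Python function mapping observations to actions. The observation format and action names below are hypothetical, chosen for a minimal gridworld sketch; they are not the interface of any of the cited systems:

```python
def greedy_policy(obs):
    """A hypothetical code-as-policy: step greedily toward the goal.

    `obs` is assumed to be a dict with "agent" and "goal" (x, y) positions,
    with y growing downward as in typical gridworlds.
    """
    ax, ay = obs["agent"]
    gx, gy = obs["goal"]
    if gx > ax:
        return "right"
    if gx < ax:
        return "left"
    if gy > ay:
        return "down"
    if gy < ay:
        return "up"
    return "stay"  # already at the goal
```

A policy in this form can be inspected, unit-tested, and structurally mutated (e.g., by an LLM rewriting a branch), which is exactly the property programmatic representations are valued for.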
Our work is also related to the self-play paradigm, in which models play dual roles to create a self-improvement loop. Here, LLMs create copies of themselves with different roles to improve without relying on human data. This has been used in domains like coding (Coder-Tester agents) [27, 51] and reasoning (Challenger-Solver agents) [3, 18]. The improvement step is directly applied to the LLMs, which can be inefficient for domains where solutions can be represented by compact policies rather than large, monolithic models. In contrast, COvolve harnesses LLMs to drive the design of specialized agents that are modular, interpretable, and easier to deploy. A more concurrent work is Bachrach et al. [1], where LLMs produce strategies as code for playing against a Nash equilibrium mixture over the current population of strategies.

3 Preliminaries

3.1 Unsupervised Environment Design (UED)

Formally, UED is defined over an underspecified partially observable Markov decision process (UPOMDP) [9], given by the 8-tuple M = (Θ, S, A, O, T, O, R, γ), with the last seven elements having the same meaning as in a standard POMDP: S, A, and O are the sets of states, actions, and observations, respectively; T and O denote the transition and observation functions; R is the reward function; and γ ∈ [0, 1] is the discount factor.

The first element Θ of a UPOMDP M denotes the space of underspecified environment parameters (e.g., number and position of obstacles, size of the grid). Picking a specific θ ∈ Θ materializes a concrete POMDP. A UPOMDP can hence be viewed as a set of POMDPs. A concrete set of parameters θ ∈ Θ is also referred to as a level.
The choice of θ may influence the reward function R : S × A × Θ → ℝ, the transition function T : S × A × Θ → Δ(S), and the observation function O : S × Θ → O, where Δ(S) is the set of all probability distributions over S. Given a level θ ∈ Θ, the expected discounted return (i.e., utility) of a policy π on level θ is denoted U_θ(π) = E_{τ∼(π,θ)}[G_τ], with τ denoting trajectories sampled under the policy π at level θ. G_τ is the sum of discounted rewards along a trajectory, G_τ = Σ_{t=0}^{T} γ^t r_t, with r_t being the reward collected at time step t. The optimal policy for level θ is then given by π*_θ = argmax_π U_θ(π).

The goal of UED is to train a policy that performs well across a broad distribution of environments. To this end, UED is typically framed as a two-player game with an adversary Λ from which we can sample levels given a policy: θ ∼ Λ(π). The adversary's goal is to identify levels that challenge the policy π by optimizing the utility function U_θ(π) in a way that exposes its weaknesses. A simple example is the minimax adversary [31, 35, 39], a point distribution Λ(π) = argmin_{θ∈Θ} U_θ(π) that proposes new levels to minimize the policy's performance. In response, a maximin policy is one that tries to perform well under the most adversarial level: π* = argmax_{π∈Π} min_{θ∈Θ} U_θ(π). However, solving this exactly is computationally intractable. In the following section, we introduce an efficient approximation method.

3.2 Policy Space Response Oracles (PSRO)

PSRO [24] is a general framework for multi-agent learning that addresses the fundamental challenge of non-stationarity in multi-agent environments. In such settings, the optimal policy for one agent depends on the policies of other agents, creating a moving target that makes traditional single-agent reinforcement learning approaches ineffective.
Rather than attempting to learn a single "best" policy, PSRO builds and maintains a diverse population of policies over time. This approach provides robustness against various opponent strategies and reduces exploitability in competitive scenarios. In a 2-player setting, the PSRO framework operates through the following four-step iterative process: (1) Each player i ∈ {1, 2} maintains a growing set of policies P_i = {π^i_1, π^i_2, ..., π^i_t}, creating a library of strategies for that player. (2) PSRO constructs a payoff matrix M ∈ ℝ^{|P_1|×|P_2|} by evaluating all pairwise policy combinations, where each entry represents the expected payoff for player 1 when player 1 uses policy π^1_i and player 2 uses policy π^2_j. (3) The framework computes a meta-policy for player 1 that determines how to mix the existing policies in the population. (4) For player 2, a new best-response policy π^2_{t+1} is trained to maximize performance against player 1's meta-policy, and is subsequently added to player 2's policy population P_2. This iterative process continues until convergence, resulting in a diverse and robust policy population.

4 Methodology

We adapt PSRO to UED by formulating environment and policy generation as a co-evolutionary process between an Environment Designer and a Policy Designer. The two designers iteratively generate, evaluate, and retain populations of environments and policies. At each iteration, new candidates are produced via structural program mutations [14, 52] of previously generated environments and policies, followed by fitness-based selection. This interaction is governed by a minimax objective, yielding a two-player zero-sum game summarized in Algorithm 1, with additional algorithmic details provided in Appendix A. Since both environments and policies are represented as executable Python code, Figure 2 illustrates their co-evolution across successive iterations.
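To make the meta-game solve concrete, the sketch below approximates the MSNE of an empirical zero-sum payoff matrix by fictitious play. The paper does not specify which solver SolveNash in Algorithm 1 uses, so this stdlib-only routine is illustrative; exact linear-programming solvers are equally applicable:

```python
def solve_nash_fp(payoff, iters=20000):
    """Approximate the mixed-strategy Nash equilibrium of a two-player
    zero-sum game via fictitious play. `payoff[i][j]` is the row player's
    payoff (here: policy i's utility on environment j); the row player
    maximises, the column player minimises. Returns the empirical mixtures
    over rows (policies) and columns (environments)."""
    n, m = len(payoff), len(payoff[0])
    row_counts, col_counts = [0] * n, [0] * m
    row_counts[0], col_counts[0] = 1, 1  # arbitrary initial pure strategies
    for _ in range(iters):
        # Row player best-responds to the column player's empirical mixture.
        row_vals = [sum(payoff[i][j] * col_counts[j] for j in range(m))
                    for i in range(n)]
        row_counts[max(range(n), key=lambda i: row_vals[i])] += 1
        # Column player best-responds (minimises) against the row mixture.
        col_vals = [sum(payoff[i][j] * row_counts[i] for i in range(n))
                    for j in range(m)]
        col_counts[min(range(m), key=lambda j: col_vals[j])] += 1
    rt, ct = sum(row_counts), sum(col_counts)
    return [c / rt for c in row_counts], [c / ct for c in col_counts]
```

In the COvolve loop only the row mixture p* is consumed: it weights the evolved policies both when acting and when the environment designer searches for a best-response level against the mixture.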
Algorithm 1 COvolve
1: Require: Initial environment θ_0
2: Hyperparameters: Total iterations T, candidates generated per level K
3: Initialize environment levels L ← {θ_0}, policy sequence P ← ( ), payoff matrix M = [ ]  ▷ Initialize co-evolving populations
4: for t = 0 to T do
5:   # 1. Policy design (structural mutation + selection)
6:   Generate K candidates: {π̃_1, ..., π̃_K} = Ψ(π_{t-1}, O_{θ_{t-1}}, A_{θ_{t-1}})
7:   π_t = argmax_j U_{θ_{t-1}}(π̃_j)
8:   Append P ← π_t
9:   # 2. Update payoff matrix M (fitness evaluation)
10:  for i, j = 0 to t do
11:    m_ij ← U_{θ_j}(π_i)
12:  end for  ▷ Cross-population fitness computation
13:  # 3. Recompute MSNE (population-level update)
14:  p* ← SolveNash(M)  ▷ Eq. 1; mixture over evolved policies
15:  # 4. Best-response environment design (structural mutation + selection)
16:  Generate K candidates: {θ̃_1, ..., θ̃_K} = Λ(θ_{t-1}, p*)
17:  θ_{t+1} = argmin_j E_{π_i ∼ p*}[U_{θ̃_j}(π_i)]
18:  Add L ← θ_{t+1}
19: end for
20: Return: MSNE policy distribution p*, environment levels L

Example of an LLM-generated environment, expressed as executable Python code (cf. Figure 2):

from minigrid.core.world_object import Goal, Door, Key, Wall
from minigrid.core.grid import Grid
from minigrid.minigrid_env import MiniGridEnv
from minigrid.core.mission import MissionSpace
import numpy as np
import random
from collections import deque

VALID_COLORS = ["red", "green", "blue", "purple", "yellow", "grey"]

# -------------------- BFS utils --------------------
def _passable(env, x, y, keys):
    cell = env.grid.get(x, y)
    if cell is None:
        return True
    if isinstance(cell, Wall):
        return False
    if isinstance(cell, Door) and cell.is_locked and (cell.color not in keys):
        return False
    return True

def flood_reachable(env, start, keys):
    """4-neighborhood flood respecting walls/locked doors."""
    q, vis = deque([start]), {start}
    while q:
        x, y = q.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < env.width and 0 <= ny < env.height and (nx, ny) not in vis:
                if _passable(env, nx, ny, keys):
                    vis.add((nx, ny)); q.append((nx, ny))
    return vis

def shortest_path(env, start, goal, keys):
    """BFS shortest path (list of (x, y)) or None."""
    q, par = deque([start]), {start: None}
    while q:
        x, y = q.popleft()
        if (x, y) == goal:
            path, cur = [], (x, y)
            while cur is not None:
                path.append(cur); cur = par[cur]
            return list(reversed(path))
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < env.width and 0 <= ny < env.height and (nx, ny) not in par:
                if _passable(env, nx, ny, keys):
                    par[(nx, ny)] = (x, y); q.append((nx, ny))
    return None

def bfs_ignore_doors(env, start, goal):
    """Scaffold path ignoring locks (doors passable, walls block)."""
    def pass_ign(cell):
        if cell is None:
            return True
        return not isinstance(cell, Wall)
    q, par = deque([start]), {start: None}
    while q:
        x, y = q.popleft()
        if (x, y) == goal:
            path, cur = [], (x, y)
            while cur is not None:
                path.append(cur); cur = par[cur]
            return list(reversed(path))
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < env.width and 0 <= ny < env.height and (nx, ny) not in par:
                if pass_ign(env.grid.get(nx, ny)):
                    par[(nx, ny)] = (x, y); q.append((nx, ny))
    return None

# -------------------- small helpers --------------------
def find_random_empty(env, exclude=None):
    ex = set(exclude or [])
    choices = [(x, y) for x in range(1, env.width - 1) for y in range(1, env.height - 1)
               if env.grid.get(x, y) is None and (x, y) not in ex]
    return random.choice(choices) if choices else None

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# -------------------- Environment --------------------
class CustomEnv(MiniGridEnv):
    """
    Relaxed chokepoints + robust key placement.
    Knobs:
      - num_doors (int): number of locked doors on the main path
      - fraction_hard (float): fraction of doors as true chokepoints
      - soft_wing_len (int): soft door wing length (0 = no wings)
      - complexity (float): random obstacle density [0..1]
    Fixes:
      - Each key i is placed on the AGENT SIDE of door i.
      - We require: path(agent -> key_i) and path(key_i -> door_i_side).
      - We reserve those corridors (with a 1-cell halo) so later obstacles
        cannot block them.
    """
    def __init__(self, size=28, max_steps=None, complexity=0.55,
                 num_doors=6, fraction_hard=0.25, soft_wing_len=1):
        self.size = size
        self.complexity = complexity
        self.num_doors = num_doors
        self.fraction_hard = fraction_hard
        self.soft_wing_len = soft_wing_len
        if max_steps is None:
            max_steps = 5 * (size ** 2)
        mission_space = MissionSpace(lambda: "Reach the green goal.")
        import gymnasium as gym
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(self.size * self.size * 3,), dtype=np.float32)
        super().__init__(mission_space=mission_space, grid_size=self.size,
                         see_through_walls=True, max_steps=max_steps)
        self.failure_feedback = ""

    # -------------------- generation --------------------
    def _gen_grid(self, width, height):
        if self.num_doors > len(VALID_COLORS):
            self.failure_feedback = f"Requested {self.num_doors} doors > available colors"
            return
        max_tries, solved = 800, False
        for _ in range(max_tries):
            # clear grid
            self.grid = Grid(width, height)
            self.grid.wall_rect(0, 0, width, height)
            for x in range(1, width - 1):
                for y in range(1, height - 1):
                    self.grid.set(x, y, None)
            # agent / goal
            agent_pos = find_random_empty(self)
            goal_pos = find_random_empty(self, exclude=[agent_pos])
            if not agent_pos or not goal_pos or agent_pos == goal_pos:
                continue
            self.agent_pos = agent_pos
            self.agent_dir = random.randint(0, 3)
            self.put_obj(Goal(), goal_pos[0], goal_pos[1])
            # scaffold path
            main_path = bfs_ignore_doors(self, agent_pos, goal_pos)
            if not main_path or len(main_path) < (2 * self.num_doors + 5):
                continue
            path_set = set(main_path)
            # door slots & colors
            door_idxs = self._choose_sequential_door_indices(main_path, self.num_doors)
            colors = random.sample(VALID_COLORS, k=self.num_doors)
            # pick hard vs soft doors
            n_hard = max(1, min(self.num_doors,
                                int(round(self.fraction_hard * self.num_doors))))
            hard_mask = set(random.sample(range(self.num_doors), k=n_hard))
            # place doors + structures
            door_positions = []
            for i, idx in enumerate(door_idxs):
                x, y = main_path[idx]
                if self.grid.get(x, y) is not None:
                    break
                self.put_obj(Door(colors[i], is_locked=True), x, y)
                door_positions.append((x, y))
                orient = self._local_path_orient(main_path, idx)
                if i in hard_mask:
                    self._add_barrier_line(x, y, orient)
                    self._add_door_jamb(x, y, orient)
                else:
                    self._add_soft_wings(x, y, orient, self.soft_wing_len)
                    self._add_door_jamb(x, y, orient)
            else:  # success placing doors
                # --- KEY PLACEMENT with corridor protection ---
                reserve = set(path_set) | {agent_pos, goal_pos} | set(door_positions)
                protected = set()
                key_positions = []
                for i, (dx, dy) in enumerate(door_positions):
                    unlocked = set(colors[:i])  # colors before door i are usable
                    # region reachable before opening door i
                    region = flood_reachable(self, agent_pos, unlocked)
                    # door-side anchors (neighbors that are in region)
                    side_neighbors = [(nx, ny) for (nx, ny) in self._neighbors(dx, dy)
                                      if (0 <= nx < self.width and 0 <= ny < self.height)
                                      and _passable(self, nx, ny, unlocked)
                                      and (nx, ny) in region]
                    if not side_neighbors:
                        break  # no approach side -> reject layout
                    # candidate cells = region minus reserves; bias off-path, not cramped
                    candidates = [c for c in region
                                  if c not in reserve
                                  and self.grid.get(*c) is None
                                  and self._free_neighbors_count(*c, keys=unlocked) >= 2]
                    if not candidates:
                        break
                    # prefer off-path
                    far = [c for c in candidates if self._dist_to_set(c, path_set) >= 2]
                    pool = far if far else candidates
                    random.shuffle(pool)
                    # pick a candidate that validates both paths and protect corridors
                    placed = False
                    for cx, cy in pool:
                        p1 = shortest_path(self, agent_pos, (cx, cy), unlocked)
                        if not p1:
                            continue
                        # path from key to ANY door-side neighbor
                        p2 = None
                        for s in side_neighbors:
                            p2 = shortest_path(self, (cx, cy), s, unlocked)
                            if p2:
                                break
                        if not p2:
                            continue
                        # accept; place key and protect both corridors (with 1-cell halo)
                        self.put_obj(Key(colors[i]), cx, cy)
                        key_positions.append((cx, cy))
                        for cell in p1 + p2:
                            protected.add(cell)
                            for nb in self._neighbors(*cell):
                                if 1 <= nb[0] < self.width - 1 and 1 <= nb[1] < self.height - 1:
                                    protected.add(nb)
                        reserve |= set(p1) | set(p2) | set(key_positions)
                        placed = True
                        break
                    if not placed:
                        break  # fail placing a valid, reachable key
                else:
                    # scatter obstacles but NEVER on protected/reserve
                    self._place_obstacles(reserve | protected, door_positions, key_positions)
                    # final solvability check (collecting keys in order)
                    if self._check_solvable_ordered(agent_pos, goal_pos, colors,
                                                   door_positions, key_positions):
                        solved = True
            if solved:
                break
            # if any step failed, try again
            continue
        if not solved:
            self.failure_feedback = ("No solvable layout found. Lower complexity or doors, "
                                     "or reduce fraction_hard.")

    def gen_obs(self):
        return self.grid.encode().astype(np.float32)

    # -------------------- internals --------------------
    def _neighbors(self, x, y):
        return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

    def _free_neighbors_count(self, x, y, keys=frozenset()):
        cnt = 0
        for nx, ny in self._neighbors(x, y):
            if 0 <= nx < self.width and 0 <= ny < self.height and _passable(self, nx, ny, keys):
                cnt += 1
        return cnt

    def _dist_to_set(self, c, S):
        # Manhattan distance to closest element in S (bounded small loop)
        x, y = c
        best = 1e9
        for px, py in S:
            d = abs(px - x) + abs(py - y)
            if d < best:
                best = d
            if best == 0:
                break
        return best

    def _check_solvable_ordered(self, start, goal, colors, door_positions, key_positions):
        """Simulate picking keys in order; at each stage verify reachability to the
        next key and finally the goal."""
        pos = start
        have = set()
        for i, color in enumerate(colors):
            kpos = key_positions[i]
            p = shortest_path(self, pos, kpos, have)
            if not p:
                return False
            have.add(color)  # pick key i
            pos = kpos
            # After picking, verify we can reach the approach side of door i (and thus pass it)
            dpos = door_positions[i]
            # passing through the door cell is allowed now
            p2 = shortest_path(self, pos, dpos, have)
            if not p2:
                return False
            pos = dpos
        # finally to goal
        p_last = shortest_path(self, pos, goal, have)
        return p_last is not None

    def _choose_sequential_door_indices(self, path, k):
        n = len(path)
        start, end = 2, n - 3
        if end - start + 1 < k:
            step = max(1, (end - start + 1) // k)
            idxs = [min(end, start + i * step) for i in range(k)]
            return sorted(set(idxs))[:k]
        base = [start + (i + 1) * (end - start) // (k + 1) for i in range(k)]
        out, last = [], start - 1
        for b in base:
            j = max(last + 1, b + random.randint(-2, 2))
            j = min(j, end - (k - len(out) - 1))
            out.append(j); last = j
        return out

    def _local_path_orient(self, path, idx):
        a = path[max(0, idx - 1)]
        c = path[min(len(path) - 1, idx + 1)]
        dx, dy = c[0] - a[0], c[1] - a[1]
        return 'x' if abs(dx) >= abs(dy) else 'y'

    def _add_barrier_line(self, x, y, orient):
        if orient == 'x':
            for yy in range(1, self.height - 1):
                if yy == y:
                    continue
                if self.grid.get(x, yy) is None:
                    self.put_obj(Wall(), x, yy)
        else:
            for xx in range(1, self.width - 1):
                if xx == x:
                    continue
                if self.grid.get(xx, y) is None:
                    self.put_obj(Wall(), xx, y)

    def _add_soft_wings(self, x, y, orient, wing_len=2):
        wing_len = max(0, int(wing_len))
        if wing_len == 0:
            return
        if orient == 'x':
            for d in range(1, wing_len + 1):
                for yy in (y - d, y + d):
                    if 1 <= yy < self.height - 1 and self.grid.get(x, yy) is None:
                        self.put_obj(Wall(), x, yy)
        else:
            for d in range(1, wing_len + 1):
                for xx in (x - d, x + d):
                    if 1 <= xx < self.width - 1 and self.grid.get(xx, y) is None:
                        self.put_obj(Wall(), xx, y)

    def _add_door_jamb(self, x, y, orient):
        sides = [(x, y - 1), (x, y + 1)] if orient == 'x' else [(x - 1, y), (x + 1, y)]
        for sx, sy in sides:
            if 1 <= sx < self.width - 1 and 1 <= sy < self.height - 1:
                if self.grid.get(sx, sy) is None:
                    self.put_obj(Wall(), sx, sy)

    def _place_obstacles(self, reserved, doors, keys):
        """Scatter obstacles with density self.complexity, never on reserved cells
        or adjacent to doors/keys."""
        reserved = set(reserved) | set(doors) | set(keys)
        # 1-cell halo around doors/keys
        for d in doors + keys:
            for nb in self._neighbors(*d):
                if 1 <= nb[0] < self.width - 1 and 1 <= nb[1] < self.height - 1:
                    reserved.add(nb)
        free = [(x, y) for x in range(1, self.width - 1) for y in range(1, self.height - 1)
                if (x, y) not in reserved and self.grid.get(x, y) is None]
        n_obs = int(len(free) * self.complexity)
        random.shuffle(free)
        placed = 0
        for (x, y) in free:
            if placed >= n_obs:
                break
            # avoid sealing narrow 1-wide corridors: keep at least 2 passable neighbors
            if self._free_neighbors_count(x, y, keys=frozenset()) <= 1:
                continue
            self.put_obj(Wall(), x, y)
            placed += 1
append(cur);cur = par[cur]  return  list ( reversed (path))  for dx,dy in (( 1 , 0 ),( -1 , 0 ),( 0 , 1 ),( 0 , -1 )): nx,ny = x + dx,y + dy  if  0  <= nx < env . width and  0  <= ny < env . height and (nx,ny) not  in par:  if _passable(env,nx,ny,keys): par[(nx,ny)] = (x,y);q . append((nx,ny))  return  None def  bfs_ignore_doors (env,start,goal):  """Scaffoldpathignoringlocks(doorspassable,wallsblock)."""  def  pass_ign (cell):  if cell is  None : return  True  return  not  isinstance (cell,Wall) q,par = deque([start]),{start: None }  while q: x,y = q . popleft()  if (x,y) == goal: path,cur = [],(x,y)  while cur is  not  None :path . append(cur);cur = par[cur]  return  list ( reversed (path))  for dx,dy in (( 1 , 0 ),( -1 , 0 ),( 0 , 1 ),( 0 , -1 )): nx,ny = x + dx,y + dy  if  0  <= nx < env . width and  0  <= ny < env . height and (nx,ny) not  in par:  if pass_ign(env . grid . get(nx,ny)): par[(nx,ny)] = (x,y);q . append((nx,ny))  return  None #--------------------smallhelpers-------------------- def  find_random_empty (env,exclude = None ): ex =  set (exclude or []) choices = [(x,y)  for x in  range ( 1 ,env . width -1 )  for y in  range ( 1 ,env . height -1 )  if env . grid . get(x,y) is  None  and (x,y) not  in ex]  return random . choice(choices) if choices else  None def  manhattan (a,b): return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ]) #--------------------Environment-------------------- class  CustomEnv (MiniGridEnv):  """ Relaxedchokepoints+robustkeyplacement. 
    Knobs:
      - num_doors (int): number of locked doors on the main path
      - fraction_hard (float): fraction of doors as true chokepoints
      - soft_wing_len (int): soft door wing length (0 = no wings)
      - complexity (float): random obstacle density [0..1]
    Fixes:
      - Each key i is placed on the AGENT SIDE of door i.
      - We require: path(agent -> key_i) and path(key_i -> door_i_side).
      - We reserve those corridors (with a 1-cell halo) so later obstacles cannot block them.
    """

    def __init__(self, size=28, max_steps=None, complexity=0.55,
                 num_doors=6, fraction_hard=0.25, soft_wing_len=1):
        self.size = size
        self.complexity = complexity
        self.num_doors = num_doors
        self.fraction_hard = fraction_hard
        self.soft_wing_len = soft_wing_len
        if max_steps is None:
            max_steps = 5 * (size ** 2)
        mission_space = MissionSpace(lambda: "Reach the green goal.")
        import gymnasium as gym
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(self.size * self.size * 3,), dtype=np.float32
        )
        super().__init__(mission_space=mission_space, grid_size=self.size,
                         see_through_walls=True, max_steps=max_steps)
        self.failure_feedback = ""

    # -------------------- generation --------------------
    def _gen_grid(self, width, height):
        if self.num_doors > len(VALID_COLORS):
            self.failure_feedback = f"Requested {self.num_doors} doors > available colors"
            return
        max_tries, solved = 800, False
        for _ in range(max_tries):
            # clear grid
            self.grid = Grid(width, height)
            self.grid.wall_rect(0, 0, width, height)
            for x in range(1, width - 1):
                for y in range(1, height - 1):
                    self.grid.set(x, y, None)
            # agent/goal
            agent_pos = find_random_empty(self)
            goal_pos = find_random_empty(self, exclude=[agent_pos])
            if not agent_pos or not goal_pos or agent_pos == goal_pos:
                continue
            self.agent_pos = agent_pos
            self.agent_dir = random.randint(0, 3)
            self.put_obj(Goal(), goal_pos[0], goal_pos[1])
            # scaffold path
            main_path = bfs_ignore_doors(self, agent_pos, goal_pos)
            if not main_path or len(main_path) < (2 * self.num_doors + 5):
                continue
            path_set = set(main_path)
            # door slots & colors
            door_idxs = self._choose_sequential_door_indices(main_path, self.num_doors)
            colors = random.sample(VALID_COLORS, k=self.num_doors)
            # pick hard vs soft doors
            n_hard = max(1, min(self.num_doors, int(round(self.fraction_hard * self.num_doors))))
            hard_mask = set(random.sample(range(self.num_doors), k=n_hard))
            # place doors + structures
            door_positions = []
            for i, idx in enumerate(door_idxs):
                x, y = main_path[idx]
                if self.grid.get(x, y) is not None:
                    break
                self.put_obj(Door(colors[i], is_locked=True), x, y)
                door_positions.append((x, y))
                orient = self._local_path_orient(main_path, idx)
                if i in hard_mask:
                    self._add_barrier_line(x, y, orient)
                    self._add_door_jamb(x, y, orient)
                else:
                    self._add_soft_wings(x, y, orient, self.soft_wing_len)
                    self._add_door_jamb(x, y, orient)
            else:  # success placing doors
                # --- KEY PLACEMENT with corridor protection ---
                reserve = set(path_set) | {agent_pos, goal_pos} | set(door_positions)
                protected = set()
                key_positions = []
                for i, (dx, dy) in enumerate(door_positions):
                    unlocked = set(colors[:i])  # colors before door i are usable
                    # region reachable before opening door i
                    region = flood_reachable(self, agent_pos, unlocked)
                    # door-side anchors (neighbors that are in region)
                    side_neighbors = [(nx, ny) for (nx, ny) in self._neighbors(dx, dy)
                                      if (0 <= nx < self.width and 0 <= ny < self.height)
                                      and _passable(self, nx, ny, unlocked)
                                      and (nx, ny) in region]
                    if not side_neighbors:
                        break  # no approach side -> reject layout
                    # candidate cells = region minus reserves; bias off-path and not cramped
                    candidates = [c for c in region
                                  if c not in reserve
                                  and self.grid.get(*c) is None
                                  and self._free_neighbors_count(*c, keys=unlocked) >= 2]
                    if not candidates:
                        break
                    # prefer off-path
                    far = [c for c in candidates if self._dist_to_set(c, path_set) >= 2]
                    pool = far if far else candidates
                    random.shuffle(pool)
                    # pick a candidate that validates both paths and protect corridors
                    placed = False
                    for cx, cy in pool:
                        p1 = shortest_path(self, agent_pos, (cx, cy), unlocked)
                        if not p1:
                            continue
                        # path from key to ANY door-side neighbor
                        p2 = None
                        for s in side_neighbors:
                            p2 = shortest_path(self, (cx, cy), s, unlocked)
                            if p2:
                                break
                        if not p2:
                            continue
                        # accept; place key and protect both corridors (with 1-cell halo)
                        self.put_obj(Key(colors[i]), cx, cy)
                        key_positions.append((cx, cy))
                        for cell in p1 + p2:
                            protected.add(cell)
                            for nb in self._neighbors(*cell):
                                if 1 <= nb[0] < self.width - 1 and 1 <= nb[1] < self.height - 1:
                                    protected.add(nb)
                        reserve |= set(p1) | set(p2) | set(key_positions)
                        placed = True
                        break
                    if not placed:
                        break  # fail placing a valid, reachable key
                else:
                    # scatter obstacles but NEVER on protected/reserve
                    self._place_obstacles(reserve | protected, door_positions, key_positions)
                    # final solvability check (collecting keys in order)
                    if self._check_solvable_ordered(agent_pos, goal_pos, colors,
                                                    door_positions, key_positions):
                        solved = True
            if solved:
                break
            # if any step failed, try again
            continue
        if not solved:
            self.failure_feedback = ("No solvable layout found. Lower complexity or doors, "
                                     "or reduce fraction_hard.")

    def gen_obs(self):
        return self.grid.encode().astype(np.float32)

    # -------------------- internals --------------------
    def _neighbors(self, x, y):
        return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

    def _free_neighbors_count(self, x, y, keys=frozenset()):
        cnt = 0
        for nx, ny in self._neighbors(x, y):
            if 0 <= nx < self.width and 0 <= ny < self.height and _passable(self, nx, ny, keys):
                cnt += 1
        return cnt

    def _dist_to_set(self, c, S):
        # Manhattan distance to closest element in S (bounded small loop)
        x, y = c
        best = 1e9
        for px, py in S:
            d = abs(px - x) + abs(py - y)
            if d < best:
                best = d
            if best == 0:
                break
        return best

    def _check_solvable_ordered(self, start, goal, colors, door_positions, key_positions):
        """Simulate picking keys in order; at each stage verify reachability to next key and finally goal."""
        pos = start
        have = set()
        for i, color in enumerate(colors):
            kpos = key_positions[i]
            p = shortest_path(self, pos, kpos, have)
            if not p:
                return False
            have.add(color)  # pick key i
            pos = kpos
            # After picking, verify we can reach the approach side of door i (and thus pass it)
            dpos = door_positions[i]
            # passing through the door cell is allowed now
            p2 = shortest_path(self, pos, dpos, have)
            if not p2:
                return False
            pos = dpos
        # finally to goal
        p_last = shortest_path(self, pos, goal, have)
        return p_last is not None

    def _choose_sequential_door_indices(self, path, k):
        n = len(path)
        start, end = 2, n - 3
        if end - start + 1 < k:
            step = max(1, (end - start + 1) // k)
            idxs = [min(end, start + i * step) for i in range(k)]
            return sorted(set(idxs))[:k]
        base = [start + (i + 1) * (end - start) // (k + 1) for i in range(k)]
        out, last = [], start - 1
        for b in base:
            j = max(last + 1, b + random.randint(-2, 2))
            j = min(j, end - (k - len(out) - 1))
            out.append(j)
            last = j
        return out

    def _local_path_orient(self, path, idx):
        a = path[max(0, idx - 1)]
        c = path[min(len(path) - 1, idx + 1)]
        dx, dy = c[0] - a[0], c[1] - a[1]
        return 'x' if abs(dx) >= abs(dy) else 'y'

    def _add_barrier_line(self, x, y, orient):
        if orient == 'x':
            for yy in range(1, self.height - 1):
                if yy == y:
                    continue
                if self.grid.get(x, yy) is None:
                    self.put_obj(Wall(), x, yy)
        else:
            for xx in range(1, self.width - 1):
                if xx == x:
                    continue
                if self.grid.get(xx, y) is None:
                    self.put_obj(Wall(), xx, y)

    def _add_soft_wings(self, x, y, orient, wing_len=2):
        wing_len = max(0, int(wing_len))
        if wing_len == 0:
            return
        if orient == 'x':
            for d in range(1, wing_len + 1):
                for yy in (y - d, y + d):
                    if 1 <= yy < self.height - 1 and self.grid.get(x, yy) is None:
                        self.put_obj(Wall(), x, yy)
        else:
            for d in range(1, wing_len + 1):
                for xx in (x - d, x + d):
                    if 1 <= xx < self.width - 1 and self.grid.get(xx, y) is None:
                        self.put_obj(Wall(), xx, y)

    def _add_door_jamb(self, x, y, orient):
        sides = [(x, y - 1), (x, y + 1)] if orient == 'x' else [(x - 1, y), (x + 1, y)]
        for sx, sy in sides:
            if 1 <= sx < self.width - 1 and 1 <= sy < self.height - 1:
                if self.grid.get(sx, sy) is None:
put_obj(Wall(),sx,sy)  def  _place_obstacles ( self ,reserved,doors,keys):  """Scatterobstacleswithdensityself.complexity,neveronreservedoradjacenttodoors/keys.""" reserved =  set (reserved) |  set (doors) |  set (keys)  #1-cellhaloarounddoors/keys  for d in doors + keys:  for nb in  self . _neighbors( * d):  if  1  <= nb[ 0 ] <  self . width -1  and  1  <= nb[ 1 ] <  self . height -1 : reserved . add(nb) free = [(x,y)  for x in  range ( 1 , self . width -1 )  for y in  range ( 1 , self . height -1 )  if (x,y) not  in reserved and  self . grid . get(x,y) is  None ] n_obs =  int ( len (free) *  self . complexity) random . shuffle(free) placed =  0  for (x,y) in free:  if placed >= n_obs: break  #avoidsealingnarrow1-widecorridors:keepatleast2passableneighbors  if  self . _free_neighbors_count(x,y,keys = frozenset ()) <=  1 :  continue  self . put_obj(Wall(),x,y);placed +=  1 from  minigrid.core.world_object  import Goal,Door,Key,Wall from  minigrid.core.grid  import Grid from  minigrid.core.constants  import COLOR_NAMES from  minigrid.minigrid_env  import MiniGridEnv from  minigrid.core.mission  import MissionSpace import  numpy  as  np import  random from  collections  import deque VALID_COLORS = [ "red" , "green" , "blue" , "purple" , "yellow" , "grey" ] def  bfs_ignore_doors (env,start,goal): queue = deque([start]) parents = {start: None }  while queue: cx,cy = queue . popleft()  if (cx,cy) == goal: path = [] cur = goal  while cur is  not  None : path . 
append(cur) cur = parents[cur] path . reverse()  return path  for dx,dy in [( 1 , 0 ),( -1 , 0 ),( 0 , 1 ),( 0 , -1 )]: nx,ny = cx + dx,cy + dy  if  0  <= nx < env . width and  0  <= ny < env . height:  if (nx,ny) not  in parents: cell = env . grid . get(nx,ny)  if _bfs_passable_ignore_door(cell): parents[(nx,ny)] = (cx,cy) queue . append((nx,ny))  return  None def  _bfs_passable_ignore_door (cell):  if cell is  None :  return  True  if  isinstance (cell,Wall):  return  False  return  True def  bfs_block_locked (env,start): queue = deque([start]) visited =  set ([start])  while queue: cx,cy = queue . popleft()  for dx,dy in [( 1 , 0 ),( -1 , 0 ),( 0 , 1 ),( 0 , -1 )]: nx,ny = cx + dx,cy + dy  if  0  <= nx < env . width and  0  <= ny < env . height:  if (nx,ny) not  in visited: cell = env . grid . get(nx,ny)  if  isinstance (cell,Wall):  continue  if  isinstance (cell,Door) and cell . is_locked:  continue visited . add((nx,ny)) queue . append((nx,ny))  return visited def  check_solvable (env,start,goal): queue = deque() visited =  set () start_state = (start[ 0 ],start[ 1 ], frozenset ()) queue . append(start_state) visited . add(start_state)  while queue: x,y,have_keys = queue . popleft()  if (x,y) == goal:  return  True cell = env . grid . get(x,y) new_keys = have_keys  if  isinstance (cell,Key) and (cell . 
color not  in have_keys): new_keys =  frozenset ( set (have_keys) | {cell . color})  for dx,dy in [( 1 , 0 ),( -1 , 0 ),( 0 , 1 ),( 0 , -1 )]: nx,ny = x + dx,y + dy  if  0  <= nx < env . width and  0  <= ny < env . height: nxt = (nx,ny,new_keys)  if nxt not  in visited:  if can_pass(env,nx,ny,new_keys): visited . add(nxt) queue . append(nxt)  return  False def  can_pass (env,x,y,keys): cell = env . grid . get(x,y)  if cell is  None :  return  True  if  isinstance (cell,Wall):  return  False  if  isinstance (cell,Door):  if cell . is_locked and (cell . color not  in keys):  return  False  return  True def  find_random_empty (env,exclude = None ):  if exclude is  None : exclude = [] empties = []  for x in  range ( 1 ,env . width -  1 ):  for y in  range ( 1 ,env . height -  1 ):  if env . grid . get(x,y) is  None  and (x,y) not  in exclude: empties . append((x,y))  if  not empties:  return  None  return random . choice(empties) class  CustomEnv (MiniGridEnv):  def  __init__ ( self ,size =29 ,max_steps = None ,complexity =0.65 ,num_doors =6 ):  self . size = size  self . complexity = complexity  self . num_doors = num_doors  if max_steps is  None : max_steps =  3  * (size **  2 ) mission_space = MissionSpace( lambda : "Gettothegreengoalsquare." )  import  gymnasium  as  gym  self . observation_space = gym . spaces . Box( low =0.0 ,high =1.0 , shape = ( self . size *  self . size *  3 ,), dtype = np . float32 )  super () . 
__init__ ( mission_space = mission_space, grid_size = self . size, see_through_walls = True , max_steps = max_steps )  self . failure_feedback =  ""  def  _gen_grid ( self ,width,height): max_tries =  1000 solved =  False  self . failure_feedback =  ""  if  self . num_doors >  len (VALID_COLORS):  self . failure_feedback =  f"Requested { self . num_doors } doors>availablecolors"  return  for attempt in  range (max_tries):  self . grid = Grid(width,height)  self . grid . wall_rect( 0 , 0 ,width,height)  for xx in  range ( 1 ,width -  1 ):  for yy in  range ( 1 ,height -  1 ):  self . grid . set(xx,yy, None ) agent_pos = find_random_empty( self ) goal_pos = find_random_empty( self ,exclude = [agent_pos])  if  not agent_pos or  not goal_pos or agent_pos == goal_pos:  continue  self . agent_pos = agent_pos  self . agent_dir = random . randint( 0 , 3 )  self . put_obj(Goal(),goal_pos[ 0 ],goal_pos[ 1 ]) colors = random . sample(VALID_COLORS,k = self . num_doors)  if  not  self . _place_doors_and_keys(agent_pos,goal_pos,colors):  continue  self . _place_strategic_obstacles(agent_pos,goal_pos) path_ign = bfs_ignore_doors( self ,agent_pos,goal_pos)  if  not path_ign:  continue  if  not check_solvable( self ,agent_pos,goal_pos):  self . failure_feedback +=  f"Attempt { attempt  +  1 } :BFSunsolvable."  continue solved =  True  break  if  not solved:  self . 
failure_feedback += (  f" Nosolvablelayoutfoundafter ❌ { max_tries } tries."  "Possiblytoomanyobstaclesorunluckyplacements." )  def  gen_obs ( self ): encoded =  self . grid . encode() . astype(np . float32)  return encoded  def  _place_doors_and_keys ( self ,agent_pos,goal_pos,door_colors):  for color in door_colors: path_ign = bfs_ignore_doors( self ,agent_pos,goal_pos)  if  not path_ign or  len (path_ign) <  4 :  return  False success =  False  for _ in  range ( 10 ): door_idx = random . randint( 2 , len (path_ign) -  2 ) door_pos = path_ign[door_idx]  if  self . grid . get(door_pos[ 0 ],door_pos[ 1 ]) is  None :  self . put_obj(Door(color,is_locked = True ),door_pos[ 0 ],door_pos[ 1 ]) success =  True  break  if  not success:  return  False alt_path = bfs_ignore_doors( self ,agent_pos,goal_pos)  if alt_path:  for (xx,yy) in alt_path[ 1 : -1 ]:  if (xx,yy) not  in path_ign and  self . grid . get(xx,yy) is  None :  self . put_obj(Wall(),xx,yy) visited_blk = bfs_block_locked( self ,agent_pos) skip = {agent_pos,door_pos,goal_pos} candidates = [c for c in visited_blk  if c not  in skip and  self . grid . get( * c) is  None ]  if  not candidates:  return  False key_spot = random . choice(candidates)  self . 
put_obj(Key(color),key_spot[ 0 ],key_spot[ 1 ])  return  True  def  _place_strategic_obstacles ( self ,agent_pos,goal_pos): path_ign = bfs_ignore_doors( self ,agent_pos,goal_pos) path_set =  set (path_ign) if path_ign else  set () skip = path_set | {agent_pos}  for xx in  range ( 1 , self . width -  1 ):  for yy in  range ( 1 , self . height -  1 ): obj =  self . grid . get(xx,yy)  if  isinstance (obj,(Goal,Door,Key)): skip . add((xx,yy)) interior = []  for xx in  range ( 1 , self . width -  1 ):  for yy in  range ( 1 , self . height -  1 ):  if (xx,yy) not  in skip and  self . grid . get(xx,yy) is  None : interior . append((xx,yy)) n_obs =  int ( len (interior) *  self . complexity) random . shuffle(interior)  for i,cell in  enumerate (interior):  if i >= n_obs:  break  self . put_obj(Wall(),cell[ 0 ],cell[ 1 ]) from  minigrid.core.world_object  import Goal,Door,Key,Wall from  minigrid.core.grid  import Grid from  minigrid.core.constants  import COLOR_NAMES from  minigrid.minigrid_env  import MiniGridEnv from  minigrid.core.mission  import MissionSpace import  numpy  as  np import  random from  collections  import deque VALID_COLORS = [ "red" , "green" , "blue" , "purple" , "yellow" , "grey" ] def  bfs_ignore_doors (env,start,goal): queue = deque([start]) parents = {start: None }  while queue: cx,cy = queue . popleft()  if (cx,cy) == goal: path = [] cur = goal  while cur is  not  None : path . append(cur) cur = parents[cur] path . 
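The first listing's `_check_solvable_ordered` validates a candidate layout by replaying the intended solution: walk to key i, pick it up, walk to door i, repeat, then reach the goal. The same logic can be sketched without the MiniGrid dependency; the ASCII encoding below (`'#'` = wall, lowercase letter = key, uppercase letter = the matching locked door) and both function names are illustrative assumptions, not part of the generated listings.

```python
from collections import deque

def shortest_path(grid, start, goal, keys):
    """Plain BFS on an ASCII grid; '#' blocks, an uppercase door blocks
    unless its lowercase key is held. Returns a list of (x, y) or None."""
    h, w = len(grid), len(grid[0])
    par = {start: None}
    q = deque([start])
    while q:
        x, y = q.popleft()
        if (x, y) == goal:
            path, cur = [], (x, y)
            while cur is not None:
                path.append(cur)
                cur = par[cur]
            return path[::-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h and (nx, ny) not in par:
                c = grid[ny][nx]
                if c == '#' or (c.isupper() and c.lower() not in keys):
                    continue  # wall, or a door we cannot open yet
                par[(nx, ny)] = (x, y)
                q.append((nx, ny))
    return None

def check_solvable_ordered(grid, start, goal, key_cells, door_cells):
    """Simulate collecting keys in order: agent -> key_i -> door_i, then -> goal."""
    pos, have = start, set()
    for kpos, dpos in zip(key_cells, door_cells):
        if shortest_path(grid, pos, kpos, have) is None:
            return False
        have.add(grid[kpos[1]][kpos[0]])  # pick up key i
        pos = kpos
        if shortest_path(grid, pos, dpos, have) is None:
            return False
        pos = dpos
    return shortest_path(grid, pos, goal, have) is not None
```

A layout is accepted only if every stage of the prescribed key order is reachable; a key placed behind its own door fails the first leg.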
failure_feedback += (  f" Nosolvablelayoutfoundafter ❌ { max_tries } tries."  "Possiblytoomanyobstaclesorunluckyplacements." )  def  gen_obs ( self ): encoded =  self . grid . encode() . astype(np . float32)  return encoded  def  _place_doors_and_keys ( self ,agent_pos,goal_pos,door_colors):  for color in door_colors: path_ign = bfs_ignore_doors( self ,agent_pos,goal_pos)  if  not path_ign or  len (path_ign) <  4 :  return  False success =  False  for _ in  range ( 10 ): door_idx = random . randint( 2 , len (path_ign) -  2 ) door_pos = path_ign[door_idx]  if  self . grid . get(door_pos[ 0 ],door_pos[ 1 ]) is  None :  self . put_obj(Door(color,is_locked = True ),door_pos[ 0 ],door_pos[ 1 ]) success =  True  break  if  not success:  return  False alt_path = bfs_ignore_doors( self ,agent_pos,goal_pos)  if alt_path:  for (xx,yy) in alt_path[ 1 : -1 ]:  if (xx,yy) not  in path_ign and  self . grid . get(xx,yy) is  None :  self . put_obj(Wall(),xx,yy) visited_blk = bfs_block_locked( self ,agent_pos) skip = {agent_pos,door_pos,goal_pos} candidates = [c for c in visited_blk  if c not  in skip and  self . grid . get( * c) is  None ]  if  not candidates:  return  False key_spot = random . choice(candidates)  self . 
put_obj(Key(color),key_spot[ 0 ],key_spot[ 1 ])  return  True  def  _place_strategic_obstacles ( self ,agent_pos,goal_pos): path_ign = bfs_ignore_doors( self ,agent_pos,goal_pos) path_set =  set (path_ign) if path_ign else  set () skip = path_set | {agent_pos}  for xx in  range ( 1 , self . width -  1 ):  for yy in  range ( 1 , self . height -  1 ): obj =  self . grid . get(xx,yy)  if  isinstance (obj,(Goal,Door,Key)): skip . add((xx,yy)) interior = []  for xx in  range ( 1 , self . width -  1 ):  for yy in  range ( 1 , self . height -  1 ):  if (xx,yy) not  in skip and  self . grid . get(xx,yy) is  None : interior . append((xx,yy)) n_obs =  int ( len (interior) *  self . complexity) random . shuffle(interior)  for i,cell in  enumerate (interior):  if i >= n_obs:  break  self . put_obj(Wall(),cell[ 0 ],cell[ 1 ]) door_idx = random . randint( 2 , len (path_ign) -  2 door_pos = path_ign[door_idx] if  self . grid . get(door_pos[ 0 ],door_pos[ 1 ]) is  None :  self . put_obj(Door(color,is_locked = True ),door_pos[ 0 ],door_pos[ 1 ]) success =  True  break door_idxs =  self . _choose_sequential_door_indices(main_path, self . num_doors) colors = random . sample(VALID_COLORS,k = self . num_doors) n_hard =  max ( 1 , min ( self . num_doors, int ( round ( self . fraction_hard *  self . num_doors)))) hard_mask =  set (random . sample( range ( self . num_doors),k = n_hard)) door_positions = [] for i,idx in  enumerate (door_idxs): x,y = main_path[idx]  if  self . grid . get(x,y) is  not  None : break  self . put_obj(Door(colors[i],is_locked = True ),x,y) door_positions . 
append((x,y))  orient =  self . _local_path_orient(main_path,idx)   if i in hard_mask:   self . _add_barrier_line(x,y,orient)   self . _add_door_jamb(x,y,orient)   else :   self . _add_soft_wings(x,y,orient, self . soft_wing_len)   self . _add_door_jamb(x,y,orient) from  heapq  import heappop,heappush from  collections  import deque #====Actions==== TURN_LEFT =  0 TURN_RIGHT =  1 MOVE_FORWARD =  2 PICK_UP =  3 DROP =  4 TOGGLE =  5 #====Objects==== WALL =  2 GOAL =  8 DOOR =  4 KEY =  5 #====Doorstate(typicalMiniGridencoding)==== DOOR_OPEN =  0 DOOR_CLOSED =  1 DOOR_LOCKED =  2 #====Directions==== RIGHT =  0 DOWN =  1 LEFT =  2 UP =  3 DIRECTION_OFFSETS = { RIGHT:( 1 , 0 ), DOWN:( 0 , 1 ), LEFT:( -1 , 0 ), UP:( 0 , -1 ), } #----Runner-persistentstate---- carrying_key_color =  None drop_cooldown =  0 last_drop_front =  None  #avoiddroppingtwiceinsamefrontcell #==========PUBLICENTRYPOINT========== def  policy (obs,agent_pos,agent_dir):  """ obs:NxNx3array(obj,color,state) agent_pos:(x,y) agent_dir:0:RIGHT,1:DOWN,2:LEFT,3:UP returns:actionint """  global carrying_key_color,drop_cooldown  if drop_cooldown >  0 : _dec_drop_cooldown() front = get_facing(agent_pos,agent_dir) goal = find_goal(obs)  #----1)Immediatefrontinteractions----  if in_bounds(front,obs): fobj,fcol,fstate = tile(obs,front)  #Doors:openclosed-unlocked;unlocklockedifwecarrymatchingkey  if fobj == DOOR:  if is_door_closed_unlocked(fstate):  return TOGGLE  if is_door_locked(fstate) and carrying_key_color == fcol:  return TOGGLE  #Keyswithsingle-keycapacity  if fobj == 
KEY:  if carrying_key_color is  None  and drop_cooldown ==  0 : carrying_key_color = fcol  return PICK_UP  elif carrying_key_color is  not  None  and carrying_key_color != fcol: act = drop_key_somewhere(obs,agent_pos,agent_dir,goal)  if act is  not  None :  return act  return TURN_RIGHT  #----2)Trydirectpathtogoal(open-only)----  if goal: path = a_star_open_only(obs,agent_pos,goal)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  #----3)Handleblockingdoors(lockedorclosed)---- blocking = find_blocking_doors(obs,agent_pos,goal) #(x,y,color,state) blocking . sort(key = lambda d:manhattan(agent_pos,d[: 2 ]))  for (dx,dy,dcol,dstate) in blocking:  if is_door_locked(dstate):  #Needkeyofcolordcol  if carrying_key_color not  in ( None ,dcol): act = drop_key_somewhere(obs,agent_pos,agent_dir,goal)  if act is  not  None :  return act  return TURN_RIGHT  if carrying_key_color != dcol: kpos = find_key_of_color(obs,dcol)  if kpos: adj = nearest_adjacent_open_only(obs,kpos,agent_pos)  if adj: path = a_star_open_only(obs,agent_pos,adj)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  return TURN_RIGHT  #Havecorrectkey→goadjacenttodoor adj = nearest_adjacent_open_only(obs,(dx,dy),agent_pos)  if adj: path = a_star_open_only(obs,agent_pos,adj)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir) 
 return TURN_RIGHT  #Closed-unlocked:justapproachandtoggle  if is_door_closed_unlocked(dstate): adj = nearest_adjacent_open_only(obs,(dx,dy),agent_pos)  if adj: path = a_star_open_only(obs,agent_pos,adj)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  return TURN_RIGHT  #----4)Mildexploration----  return TURN_RIGHT #==================HELPERS================== def  get_facing (pos,dir_): dx,dy = DIRECTION_OFFSETS[dir_]  return (pos[ 0 ] + dx,pos[ 1 ] + dy) def  in_bounds (pos,obs): n = obs . shape[ 0 ]  return  0  <= pos[ 0 ] < n and  0  <= pos[ 1 ] < n def  tile (obs,pos):  return obs[pos[ 0 ],pos[ 1 ], 0 ],obs[pos[ 0 ],pos[ 1 ], 1 ],obs[pos[ 0 ],pos[ 1 ], 2 ] def  is_door_locked (state): return state == DOOR_LOCKED def  is_door_closed_unlocked (state): return state == DOOR_CLOSED def  is_open_door (state): return state == DOOR_OPEN def  is_passable_open_only (obj,state):  if obj == WALL:  return  False  if obj == DOOR:  return is_open_door(state)  return  True  #GOAL,empty,key,etc.(wedon'tstepintokeys,butallowroutingaround) def  find_goal (obs): n = obs . shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == GOAL:  return (x,y)  return  None def  manhattan (a,b):  return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ]) def  step_to (next_pos,agent_pos,agent_dir): dx,dy = next_pos[ 0 ] - agent_pos[ 0 ],next_pos[ 1 ] - agent_pos[ 1 ]  for dir_,(ox,oy) in DIRECTION_OFFSETS . 
items():  if (dx,dy) == (ox,oy):  if agent_dir == dir_:  return MOVE_FORWARD  elif (agent_dir - dir_) %  4  ==  1 :  return TURN_LEFT  else :  return TURN_RIGHT  return TURN_LEFT #----------A*(opendoorsonly)---------- def  a_star_open_only (obs,start,goal):  if goal is  None :  return  None  def  h (a,b): return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ]) open_set = [( 0 ,start)] came_from = {} g = {start: 0 } f = {start:h(start,goal)}  while open_set: _,current = heappop(open_set)  if current == goal:  return _reconstruct_path(came_from,current)  for ox,oy in DIRECTION_OFFSETS . values(): nx,ny = current[ 0 ] + ox,current[ 1 ] + oy nxt = (nx,ny)  if  not in_bounds(nxt,obs):  continue o,c,s = tile(obs,nxt)  if  not is_passable_open_only(o,s):  continue tg = g[current] +  1  if nxt not  in g or tg < g[nxt]: g[nxt] = tg came_from[nxt] = current fn = tg + h(nxt,goal) f[nxt] = fn heappush(open_set,(fn,nxt))  return  None def  _reconstruct_path (came_from,cur): path = []  while cur in came_from: path . append(cur) cur = came_from[cur] path . reverse()  return path #----------Blockingdoors(firstfrontierofclosed/locked)---------- def  find_blocking_doors (obs,start,goal): n = obs . shape[ 0 ] q = deque([start]) seen = {start} blockers = []  while q: x,y = q . popleft()  if goal and (x,y) == goal:  return []  for ox,oy in DIRECTION_OFFSETS . 
values(): nx,ny = x + ox,y + oy  if  not ( 0  <= nx < n and  0  <= ny < n):  continue o,c,s = obs[nx,ny] nxt = (nx,ny)  if (o == DOOR) and (is_door_locked(s) or is_door_closed_unlocked(s)):  if nxt not  in [b[: 2 ] for b in blockers]: blockers . append((nx,ny,c,s))  continue  if nxt in seen:  continue  if  not is_passable_open_only(o,s):  continue seen . add(nxt) q . append(nxt)  return blockers #----------Adjacentpassabletileneartarget---------- def  nearest_adjacent_open_only (obs,target_pos,from_pos): adjs = []  for ox,oy in DIRECTION_OFFSETS . values(): p = (target_pos[ 0 ] + ox,target_pos[ 1 ] + oy)  if in_bounds(p,obs): o,c,s = tile(obs,p)  if is_passable_open_only(o,s): adjs . append(p)  if  not adjs:  return  None best =  None best_len =  1e9  for a in adjs: path = a_star_open_only(obs,from_pos,a)  if path and  len (path) < best_len: best_len =  len (path) best = a  return best def  find_key_of_color (obs,color): n = obs . 
shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == KEY and obs[x,y, 1 ] == color:  return (x,y)  return  None #----------Important-cellawareDROPlogic---------- def  drop_key_somewhere (obs,agent_pos,agent_dir,goal):  """ Avoiddroppingonimportantsquares: -GOALtile -tilesadjacenttoanyDOOR -tilesadjacenttoGOAL -cellsoncurrentshortestpathtoGOAL -chokepoints(<=2passableneighbors) """  global carrying_key_color,last_drop_front imp = compute_important_cells(obs,agent_pos,goal)  #1)TrydroppingintoFRONTifsafe&notimportant front = get_facing(agent_pos,agent_dir)  if in_bounds(front,obs) and is_safe_drop_target(obs,front,imp) and front != last_drop_front: carrying_key_color =  None set_drop_cooldown() last_drop_front = front  return DROP  #2)Rotatetofaceasafe&non-importantfrontcell  for turns,next_dir in (( 0 ,agent_dir), ( 1 ,(agent_dir +  1 ) %  4 ), ( 2 ,(agent_dir +  2 ) %  4 ), ( 3 ,(agent_dir +  3 ) %  4 )): f = get_facing(agent_pos,next_dir)  if in_bounds(f,obs) and is_safe_drop_target(obs,f,imp) and f != last_drop_front:  if turns ==  0 : carrying_key_color =  None set_drop_cooldown() last_drop_front = f  return DROP  if (agent_dir - next_dir) %  4  ==  1 :  return TURN_LEFT  else :  return TURN_RIGHT  #3)Gostandonaplatformfromwhichsomefacinghasasafedroptarget platform,face_dir = nearest_drop_platform(obs,agent_pos,imp)  if platform:  if agent_pos != platform: path = 
a_star_open_only(obs,agent_pos,platform)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  #onplatform:orientthendrop  if agent_dir != face_dir:  #one-steprotatetowardface_dir  if (agent_dir - face_dir) %  4  ==  1 :  return TURN_LEFT  else :  return TURN_RIGHT  #nowfrontissafe&non-important front2 = get_facing(agent_pos,agent_dir)  if in_bounds(front2,obs) and is_safe_drop_target(obs,front2,imp): carrying_key_color =  None set_drop_cooldown() last_drop_front = front2  return DROP  #fallback  return TURN_RIGHT def  is_safe_drop_target (obs,pos,important_set):  """Emptyfloor,notimportant,notadoor/key/goal/wall,andnotachokepoint.""" o,c,s = tile(obs,pos)  if o in (WALL,DOOR,KEY,GOAL):  return  False  if pos in important_set:  return  False  if is_chokepoint(obs,pos):  return  False  return  True def  compute_important_cells (obs,agent_pos,goal):  """Marksquaresweshouldavoiddroppingon.""" important =  set () n = obs . shape[ 0 ]  #Goalitself  if goal: important . add(goal)  #Adjacenttogoal  for ox,oy in DIRECTION_OFFSETS . values(): g2 = (goal[ 0 ] + ox,goal[ 1 ] + oy)  if in_bounds(g2,obs): important . add(g2)  #Adjacenttoanydoor  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == DOOR:  for ox,oy in DIRECTION_OFFSETS . values(): d2 = (x + ox,y + oy)  if in_bounds(d2,obs): important . 
add(d2)  #Cellsonthecurrentshortestpathtogoal(ifreachable)  if goal: path = a_star_open_only(obs,agent_pos,goal)  if path:  for p in path: important . add(p)  return important def  is_chokepoint (obs,pos):  """Atilewith<=2passableneighbors→likelyacorridor/bottleneck.""" cnt =  0  for ox,oy in DIRECTION_OFFSETS . values(): p = (pos[ 0 ] + ox,pos[ 1 ] + oy)  if  not in_bounds(p,obs):  continue o,c,s = tile(obs,p)  if is_passable_open_only(o,s): cnt +=  1  return cnt <=  2 def  nearest_drop_platform (obs,from_pos,important_set):  """ BFSoverstandpositions.Return(platform_pos,facing_dir)suchthat fromplatform_pos,thefrontcell(infacing_dir)isasafedroptarget. """ q = deque([from_pos]) seen = {from_pos}  while q: c = q . popleft()  for face_dir,(ox,oy) in DIRECTION_OFFSETS . items(): f = (c[ 0 ] + ox,c[ 1 ] + oy)  if in_bounds(f,obs) and is_safe_drop_target(obs,f,important_set) and f != last_drop_front:  return c,face_dir  #expand  for ox,oy in DIRECTION_OFFSETS . values(): nx,ny = c[ 0 ] + ox,c[ 1 ] + oy nxt = (nx,ny)  if  not in_bounds(nxt,obs) or nxt in seen:  continue o,c_,s = tile(obs,nxt)  if is_passable_open_only(o,s): seen . add(nxt) q . 
append(nxt)  return  None , None def  set_drop_cooldown ():  global drop_cooldown drop_cooldown =  2 def  _dec_drop_cooldown ():  global drop_cooldown drop_cooldown =  max ( 0 ,drop_cooldown -  1 ) from  heapq  import heappop,heappush from  collections  import deque #====Actions==== TURN_LEFT =  0 TURN_RIGHT =  1 MOVE_FORWARD =  2 PICK_UP =  3 DROP =  4 TOGGLE =  5 #====Objects==== WALL =  2 GOAL =  8 DOOR =  4 KEY =  5 #====Doorstate(typicalMiniGridencoding)==== DOOR_OPEN =  0 DOOR_CLOSED =  1 DOOR_LOCKED =  2 #====Directions==== RIGHT =  0 DOWN =  1 LEFT =  2 UP =  3 DIRECTION_OFFSETS = { RIGHT:( 1 , 0 ), DOWN:( 0 , 1 ), LEFT:( -1 , 0 ), UP:( 0 , -1 ), } #----Runner-persistentstate---- carrying_key_color =  None drop_cooldown =  0 last_drop_front =  None  #avoiddroppingtwiceinsamefrontcell #==========PUBLICENTRYPOINT========== def  policy (obs,agent_pos,agent_dir):  """ obs:NxNx3array(obj,color,state) agent_pos:(x,y) agent_dir:0:RIGHT,1:DOWN,2:LEFT,3:UP returns:actionint """  global carrying_key_color,drop_cooldown  if drop_cooldown >  0 : _dec_drop_cooldown() front = get_facing(agent_pos,agent_dir) goal = find_goal(obs)  #----1)Immediatefrontinteractions----  if in_bounds(front,obs): fobj,fcol,fstate = tile(obs,front)  #Doors:openclosed-unlocked;unlocklockedifwecarrymatchingkey  if fobj == DOOR:  if is_door_closed_unlocked(fstate):  return TOGGLE  if is_door_locked(fstate) and carrying_key_color == fcol:  return TOGGLE  #Keyswithsingle-keycapacity  if fobj == KEY:  if carrying_key_color is  None  and drop_cooldown ==  0 : carrying_key_color = 
fcol  return PICK_UP  elif carrying_key_color is  not  None  and carrying_key_color != fcol: act = drop_key_somewhere(obs,agent_pos,agent_dir,goal)  if act is  not  None :  return act  return TURN_RIGHT  #----2)Trydirectpathtogoal(open-only)----  if goal: path = a_star_open_only(obs,agent_pos,goal)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  #----3)Handleblockingdoors(lockedorclosed)---- blocking = find_blocking_doors(obs,agent_pos,goal) #(x,y,color,state) blocking . sort(key = lambda d:manhattan(agent_pos,d[: 2 ]))  for (dx,dy,dcol,dstate) in blocking:  if is_door_locked(dstate):  #Needkeyofcolordcol  if carrying_key_color not  in ( None ,dcol): act = drop_key_somewhere(obs,agent_pos,agent_dir,goal)  if act is  not  None :  return act  return TURN_RIGHT  if carrying_key_color != dcol: kpos = find_key_of_color(obs,dcol)  if kpos: adj = nearest_adjacent_open_only(obs,kpos,agent_pos)  if adj: path = a_star_open_only(obs,agent_pos,adj)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  return TURN_RIGHT  #Havecorrectkey→goadjacenttodoor adj = nearest_adjacent_open_only(obs,(dx,dy),agent_pos)  if adj: path = a_star_open_only(obs,agent_pos,adj)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  return TURN_RIGHT  #Closed-unlocked:justapproachandtoggle  if 
is_door_closed_unlocked(dstate): adj = nearest_adjacent_open_only(obs,(dx,dy),agent_pos)  if adj: path = a_star_open_only(obs,agent_pos,adj)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  return TURN_RIGHT  #----4)Mildexploration----  return TURN_RIGHT #==================HELPERS================== def  get_facing (pos,dir_): dx,dy = DIRECTION_OFFSETS[dir_]  return (pos[ 0 ] + dx,pos[ 1 ] + dy) def  in_bounds (pos,obs): n = obs . shape[ 0 ]  return  0  <= pos[ 0 ] < n and  0  <= pos[ 1 ] < n def  tile (obs,pos):  return obs[pos[ 0 ],pos[ 1 ], 0 ],obs[pos[ 0 ],pos[ 1 ], 1 ],obs[pos[ 0 ],pos[ 1 ], 2 ] def  is_door_locked (state): return state == DOOR_LOCKED def  is_door_closed_unlocked (state): return state == DOOR_CLOSED def  is_open_door (state): return state == DOOR_OPEN def  is_passable_open_only (obj,state):  if obj == WALL:  return  False  if obj == DOOR:  return is_open_door(state)  return  True  #GOAL,empty,key,etc.(wedon'tstepintokeys,butallowroutingaround) def  find_goal (obs): n = obs . shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == GOAL:  return (x,y)  return  None def  manhattan (a,b):  return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ]) def  step_to (next_pos,agent_pos,agent_dir): dx,dy = next_pos[ 0 ] - agent_pos[ 0 ],next_pos[ 1 ] - agent_pos[ 1 ]  for dir_,(ox,oy) in DIRECTION_OFFSETS . 
items():  if (dx,dy) == (ox,oy):  if agent_dir == dir_:  return MOVE_FORWARD  elif (agent_dir - dir_) %  4  ==  1 :  return TURN_LEFT  else :  return TURN_RIGHT  return TURN_LEFT #----------A*(opendoorsonly)---------- def  a_star_open_only (obs,start,goal):  if goal is  None :  return  None  def  h (a,b): return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ]) open_set = [( 0 ,start)] came_from = {} g = {start: 0 } f = {start:h(start,goal)}  while open_set: _,current = heappop(open_set)  if current == goal:  return _reconstruct_path(came_from,current)  for ox,oy in DIRECTION_OFFSETS . values(): nx,ny = current[ 0 ] + ox,current[ 1 ] + oy nxt = (nx,ny)  if  not in_bounds(nxt,obs):  continue o,c,s = tile(obs,nxt)  if  not is_passable_open_only(o,s):  continue tg = g[current] +  1  if nxt not  in g or tg < g[nxt]: g[nxt] = tg came_from[nxt] = current fn = tg + h(nxt,goal) f[nxt] = fn heappush(open_set,(fn,nxt))  return  None def  _reconstruct_path (came_from,cur): path = []  while cur in came_from: path . append(cur) cur = came_from[cur] path . reverse()  return path #----------Blockingdoors(firstfrontierofclosed/locked)---------- def  find_blocking_doors (obs,start,goal): n = obs . shape[ 0 ] q = deque([start]) seen = {start} blockers = []  while q: x,y = q . popleft()  if goal and (x,y) == goal:  return []  for ox,oy in DIRECTION_OFFSETS . 
values(): nx,ny = x + ox,y + oy  if  not ( 0  <= nx < n and  0  <= ny < n):  continue o,c,s = obs[nx,ny] nxt = (nx,ny)  if (o == DOOR) and (is_door_locked(s) or is_door_closed_unlocked(s)):  if nxt not  in [b[: 2 ] for b in blockers]: blockers . append((nx,ny,c,s))  continue  if nxt in seen:  continue  if  not is_passable_open_only(o,s):  continue seen . add(nxt) q . append(nxt)  return blockers #----------Adjacentpassabletileneartarget---------- def  nearest_adjacent_open_only (obs,target_pos,from_pos): adjs = []  for ox,oy in DIRECTION_OFFSETS . values(): p = (target_pos[ 0 ] + ox,target_pos[ 1 ] + oy)  if in_bounds(p,obs): o,c,s = tile(obs,p)  if is_passable_open_only(o,s): adjs . append(p)  if  not adjs:  return  None best =  None best_len =  1e9  for a in adjs: path = a_star_open_only(obs,from_pos,a)  if path and  len (path) < best_len: best_len =  len (path) best = a  return best def  find_key_of_color (obs,color): n = obs . 
shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == KEY and obs[x,y, 1 ] == color:  return (x,y)  return  None #----------Important-cellawareDROPlogic---------- def  drop_key_somewhere (obs,agent_pos,agent_dir,goal):  """ Avoiddroppingonimportantsquares: -GOALtile -tilesadjacenttoanyDOOR -tilesadjacenttoGOAL -cellsoncurrentshortestpathtoGOAL -chokepoints(<=2passableneighbors) """  global carrying_key_color,last_drop_front imp = compute_important_cells(obs,agent_pos,goal)  #1)TrydroppingintoFRONTifsafe&notimportant front = get_facing(agent_pos,agent_dir)  if in_bounds(front,obs) and is_safe_drop_target(obs,front,imp) and front != last_drop_front: carrying_key_color =  None set_drop_cooldown() last_drop_front = front  return DROP  #2)Rotatetofaceasafe&non-importantfrontcell  for turns,next_dir in (( 0 ,agent_dir), ( 1 ,(agent_dir +  1 ) %  4 ), ( 2 ,(agent_dir +  2 ) %  4 ), ( 3 ,(agent_dir +  3 ) %  4 )): f = get_facing(agent_pos,next_dir)  if in_bounds(f,obs) and is_safe_drop_target(obs,f,imp) and f != last_drop_front:  if turns ==  0 : carrying_key_color =  None set_drop_cooldown() last_drop_front = f  return DROP  if (agent_dir - next_dir) %  4  ==  1 :  return TURN_LEFT  else :  return TURN_RIGHT  #3)Gostandonaplatformfromwhichsomefacinghasasafedroptarget platform,face_dir = nearest_drop_platform(obs,agent_pos,imp)  if platform:  if agent_pos != platform: path = 
a_star_open_only(obs,agent_pos,platform)  if path:  return step_to(path[ 0 ],agent_pos,agent_dir)  #onplatform:orientthendrop  if agent_dir != face_dir:  #one-steprotatetowardface_dir  if (agent_dir - face_dir) %  4  ==  1 :  return TURN_LEFT  else :  return TURN_RIGHT  #nowfrontissafe&non-important front2 = get_facing(agent_pos,agent_dir)  if in_bounds(front2,obs) and is_safe_drop_target(obs,front2,imp): carrying_key_color =  None set_drop_cooldown() last_drop_front = front2  return DROP  #fallback  return TURN_RIGHT def  is_safe_drop_target (obs,pos,important_set):  """Emptyfloor,notimportant,notadoor/key/goal/wall,andnotachokepoint.""" o,c,s = tile(obs,pos)  if o in (WALL,DOOR,KEY,GOAL):  return  False  if pos in important_set:  return  False  if is_chokepoint(obs,pos):  return  False  return  True def  compute_important_cells (obs,agent_pos,goal):  """Marksquaresweshouldavoiddroppingon.""" important =  set () n = obs . shape[ 0 ]  #Goalitself  if goal: important . add(goal)  #Adjacenttogoal  for ox,oy in DIRECTION_OFFSETS . values(): g2 = (goal[ 0 ] + ox,goal[ 1 ] + oy)  if in_bounds(g2,obs): important . add(g2)  #Adjacenttoanydoor  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == DOOR:  for ox,oy in DIRECTION_OFFSETS . values(): d2 = (x + ox,y + oy)  if in_bounds(d2,obs): important . 
add(d2)  #Cellsonthecurrentshortestpathtogoal(ifreachable)  if goal: path = a_star_open_only(obs,agent_pos,goal)  if path:  for p in path: important . add(p)  return important def  is_chokepoint (obs,pos):  """Atilewith<=2passableneighbors→likelyacorridor/bottleneck.""" cnt =  0  for ox,oy in DIRECTION_OFFSETS . values(): p = (pos[ 0 ] + ox,pos[ 1 ] + oy)  if  not in_bounds(p,obs):  continue o,c,s = tile(obs,p)  if is_passable_open_only(o,s): cnt +=  1  return cnt <=  2 def  nearest_drop_platform (obs,from_pos,important_set):  """ BFSoverstandpositions.Return(platform_pos,facing_dir)suchthat fromplatform_pos,thefrontcell(infacing_dir)isasafedroptarget. """ q = deque([from_pos]) seen = {from_pos}  while q: c = q . popleft()  for face_dir,(ox,oy) in DIRECTION_OFFSETS . items(): f = (c[ 0 ] + ox,c[ 1 ] + oy)  if in_bounds(f,obs) and is_safe_drop_target(obs,f,important_set) and f != last_drop_front:  return c,face_dir  #expand  for ox,oy in DIRECTION_OFFSETS . values(): nx,ny = c[ 0 ] + ox,c[ 1 ] + oy nxt = (nx,ny)  if  not in_bounds(nxt,obs) or nxt in seen:  continue o,c_,s = tile(obs,nxt)  if is_passable_open_only(o,s): seen . add(nxt) q . 
append(nxt)  return  None , None def  set_drop_cooldown ():  global drop_cooldown drop_cooldown =  2 def  _dec_drop_cooldown ():  global drop_cooldown drop_cooldown =  max ( 0 ,drop_cooldown -  1 ) from  heapq  import heappop,heappush from  collections  import deque #Actions TURN_LEFT =  0 TURN_RIGHT =  1 MOVE_FORWARD =  2 PICK_UP =  3 DROP =  4 TOGGLE =  5 #Objects WALL =  2 GOAL =  8 DOOR =  4 KEY =  5 #Directions RIGHT =  0 DOWN =  1 LEFT =  2 UP =  3 DIRECTION_OFFSETS = { RIGHT:( 1 , 0 ), DOWN:( 0 , 1 ), LEFT:( -1 , 0 ), UP:( 0 , -1 ), } #Single-keycapacity carrying_key_color =  None def  policy (obs,agent_pos,agent_dir):  global carrying_key_color n = obs . shape[ 0 ] act = maybe_toggle_if_in_front(obs,agent_pos,agent_dir)  if act is  not  None :  return act act = maybe_pickup_if_in_front(obs,agent_pos,agent_dir)  if act is  not  None :  return act goal_pos = find_goal(obs)  if  not goal_pos:  return TURN_LEFT path_goal = a_star(obs,agent_pos,goal_pos,avoid_locked = True )  if path_goal:  return step_to(path_goal[ 0 ],agent_pos,agent_dir) blocking_doors = find_all_blocking_doors(obs,agent_pos,goal_pos)  if  not blocking_doors:  return TURN_LEFT  for (dx,dy,door_col) in blocking_doors:  if  not can_open_door(obs,agent_pos,agent_dir,door_col,exploring_colors = set ()):  continue  #ifweholdadifferentkey=>dropit  if carrying_key_color and carrying_key_color != door_col: drop_act = maybe_drop_wrong_key(obs,agent_pos,agent_dir,goal_pos)  if drop_act:  return drop_act  #ifwedontholddoor_col=>pickitup  if carrying_key_color != door_col: key_loc = 
find_key(obs,door_col)  if  not key_loc:  continue chain_path = path_with_chain_unlock(obs,agent_pos,agent_dir,key_loc,exploring_colors = set ())  if  not chain_path:  continue  return step_to(chain_path[ 0 ],agent_pos,agent_dir)  #nowweholddoor_col=>pathnextto(dx,dy) adj = find_adjacent_tile(obs,(dx,dy),agent_pos)  if  not adj:  continue chain_path2 = path_with_chain_unlock(obs,agent_pos,agent_dir,adj,exploring_colors = set ())  if chain_path2:  if  len (chain_path2) ==0  or chain_path2[ 0 ] == agent_pos:  return TURN_LEFT  return step_to(chain_path2[ 0 ],agent_pos,agent_dir)  return TURN_LEFT def  maybe_toggle_if_in_front (obs,agent_pos,agent_dir): front = get_facing(agent_pos,agent_dir)  if in_bounds(front,obs . shape[ 0 ]): fobj,fcol,fstate = obs[front[ 0 ],front[ 1 ]]  if fobj == DOOR and fstate ==2  and carrying_key_color == fcol:  return TOGGLE  return  None def  maybe_pickup_if_in_front (obs,agent_pos,agent_dir):  """ IffrontisKEY: -Ifnokey=>pickitup -Ifholddifferentkey=>mustdropoldfirst=>donothing -Ifholdsamecolor=>donothing """  global carrying_key_color f = get_facing(agent_pos,agent_dir)  if in_bounds(f,obs . shape[ 0 ]): fo,fc,fs = obs[f[ 0 ],f[ 1 ]]  if fo == KEY:  if carrying_key_color is  None : carrying_key_color = fc  return PICK_UP  return  None def  maybe_drop_wrong_key (obs,agent_pos,agent_dir,goal_pos):  global carrying_key_color  if carrying_key_color is  None :  return  None front = get_facing(agent_pos,agent_dir)  if in_bounds(front,obs . 
shape[ 0 ]): fo,fc,fs = obs[front[ 0 ],front[ 1 ]]  if fo in ( 1 , 3 ): #floor carrying_key_color =  None  return DROP non_important = find_non_important_floor(obs,agent_pos,goal_pos)  if non_important: path_ni = a_star(obs,agent_pos,non_important,avoid_locked = False )  if path_ni:  if  len (path_ni) ==0  or path_ni[ 0 ] == agent_pos:  #ifBFSisdegenerate=>trydroppingonourowntileifsafe  if can_drop_on_own_tile(obs,agent_pos): carrying_key_color =  None  return DROP  return TURN_LEFT  return step_to(path_ni[ 0 ],agent_pos,agent_dir) fallback_floor = find_any_floor(obs,agent_pos)  if fallback_floor: path_floor = a_star(obs,agent_pos,fallback_floor,avoid_locked = False )  if path_floor:  if  len (path_floor) ==0  or path_floor[ 0 ] == agent_pos:  if can_drop_on_own_tile(obs,agent_pos): carrying_key_color =  None  return DROP  return TURN_LEFT  return step_to(path_floor[ 0 ],agent_pos,agent_dir)  if can_drop_on_own_tile(obs,agent_pos): carrying_key_color =  None  return DROP  return TURN_LEFT def  can_drop_on_own_tile (obs,agent_pos): x,y = agent_pos o,c,s = obs[x,y]  return (o in ( 1 , 3 )) #---------------------------------------------------------------------- def  find_non_important_floor (obs,agent_pos,goal_pos): visited_path =  set ()  #BFSfromagent_pos=>goal_posignoringlocked n = obs . shape[ 0 ]  from  collections  import deque queue = deque([agent_pos]) visited_path . add(agent_pos) found_goal =  False  while queue: cur = queue . 
popleft()  if cur == goal_pos: found_goal = True cx,cy = cur  for (dx,dy) in DIRECTION_OFFSETS . values(): nx,ny = cx + dx,cy + dy  if  0<= nx < n and  0<= ny < n:  if (nx,ny) not  in visited_path: oo,cc,ss = obs[nx,ny]  if oo == WALL:  continue  if oo == DOOR and ss ==2 :  continue visited_path . add((nx,ny)) queue . append((nx,ny))  for x2 in  range (n):  for y2 in  range (n):  if (x2,y2) not  in visited_path: oo,cc,ss = obs[x2,y2]  if oo in ( 1 , 3 ): #floor  return (x2,y2)  return  None def  find_goal (obs): n = obs . shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == GOAL:  return (x,y)  return  None def  find_any_floor (obs,agent_pos): n = obs . shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] in ( 1 , 3 ):  return (x,y)  return  None def  get_facing (agent_pos,agent_dir): dx,dy = DIRECTION_OFFSETS[agent_dir]  return (agent_pos[ 0 ] + dx,agent_pos[ 1 ] + dy) def  in_bounds (pos,n):  return ( 0<= pos[ 0 ] < n and  0<= pos[ 1 ] < n) def  a_star (obs,start,goal,avoid_locked = True ):  def  heuristic (a,b): return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ])  from  heapq  import heappop,heappush n = obs . 
shape[ 0 ] open_set = [] heappush(open_set,( 0 ,start)) came_from = {} g_score = {start: 0 } f_score = {start:heuristic(start,goal)}  while open_set: _,current = heappop(open_set)  if current == goal: path = []  while current in came_from: path . append(current) current = came_from[current] path . reverse()  return path cx,cy = current  for (dx,dy) in DIRECTION_OFFSETS . values(): nx,ny = cx + dx,cy + dy  if  not ( 0<= nx < n and  0<= ny < n):  continue o = obs[nx,ny, 0 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2  and avoid_locked:  continue cost = g_score[current] +1  if (nx,ny) not  in g_score or cost < g_score[(nx,ny)]: g_score[(nx,ny)] = cost f_score[(nx,ny)] = cost + heuristic((nx,ny),goal) came_from[(nx,ny)] = current heappush(open_set,(f_score[(nx,ny)],(nx,ny)))  return  None def  step_to (next_pos,agent_pos,agent_dir): dx = next_pos[ 0 ] - agent_pos[ 0 ] dy = next_pos[ 1 ] - agent_pos[ 1 ] desired_dir = None  for d,(ox,oy) in DIRECTION_OFFSETS . 
items():  if (dx,dy) == (ox,oy): desired_dir = d  break  if desired_dir is  None :  return TURN_LEFT turn_act = turn_toward(agent_dir,desired_dir)  if turn_act is  not  None :  return turn_act  else :  return MOVE_FORWARD def  turn_toward (current_dir,target_dir):  if current_dir == target_dir:  return  None left_turns = (current_dir - target_dir) %4  return TURN_LEFT if left_turns ==1  else TURN_RIGHT def  find_all_blocking_doors (obs,agent_pos,goal_pos):  """ BFSignoringwalls&opendoors.Wenotelockeddoorsbutkeepexploringtoseeifthereisapatharoundthem. Ifwefindthegoal=>nodooristrulyblocking=>return[]. Ifnot=>thelockeddoorswesawareindeedblocking. """ n = obs . shape[ 0 ] visited = set ([agent_pos])  from  collections  import deque q = deque([agent_pos]) blocking = []  while q: cx,cy = q . popleft()  if (cx,cy) == goal_pos:  return []  for (dx,dy) in DIRECTION_OFFSETS . values(): nx,ny = cx + dx,cy + dy  if  0<= nx < n and  0<= ny < n:  if (nx,ny) not  in visited: o = obs[nx,ny, 0 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2 : color = obs[nx,ny, 1 ]  if (nx,ny,color) not  in blocking: blocking . append((nx,ny,color))  continue visited . add((nx,ny)) q . append((nx,ny))  return blocking def  find_key (obs,color): n = obs . 
shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == KEY and obs[x,y, 1 ] == color:  return (x,y)  return  None def  can_open_door (obs,agent_pos,agent_dir,needed_color,exploring_colors = None ):  if exploring_colors is  None : exploring_colors = set ()  if needed_color in exploring_colors:  return  False  if carrying_key_color == needed_color:  return  True key_loc = find_key(obs,needed_color)  if  not key_loc:  return  False exploring_colors . add(needed_color) path = path_with_chain_unlock(obs,agent_pos,agent_dir,key_loc,exploring_colors) exploring_colors . remove(needed_color)  return (path is  not  None ) def  path_with_chain_unlock (obs,agent_pos,agent_dir,target,exploring_colors):  """ BFSthattreatslockeddoorcolorXaspassableifcan_open_door=>True """ n = obs . shape[ 0 ] visited = set ([agent_pos]) came_from = {}  from  collections  import deque queue = deque([agent_pos])  def  reconstruct (e): path = []  while e in came_from: path . append(e) e = came_from[e] path . reverse()  return path  while queue: cur = queue . popleft()  if cur == target:  return reconstruct(cur) cx,cy = cur  for (dx,dy) in DIRECTION_OFFSETS . 
values(): nx,ny = cx + dx,cy + dy  if  not ( 0<= nx < n and  0<= ny < n):  continue  if (nx,ny) in visited:  continue o = obs[nx,ny, 0 ] c = obs[nx,ny, 1 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2 :  if  not can_open_door(obs,(cx,cy),agent_dir,c,exploring_colors):  continue  elif o == DOOR and s !=0 :  #closed=>nokeyneeded=>pass  pass visited . add((nx,ny)) came_from[(nx,ny)] = cur queue . append((nx,ny))  return  None def  find_adjacent_tile (obs,door_pos,agent_pos): n = obs . shape[ 0 ] (dx,dy) = door_pos candidates = []  for (sx,sy) in DIRECTION_OFFSETS . values(): nx,ny = dx + sx,dy + sy  if  0<= nx < n and  0<= ny < n: o = obs[nx,ny, 0 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2 :  continue candidates . append((nx,ny))  if  not candidates:  return  None  #pickwhicheverisclosesttoagent candidates . 
sort(key = lambda c: abs (c[ 0 ] - agent_pos[ 0 ]) +  abs (c[ 1 ] - agent_pos[ 1 ]))  return candidates[ 0 ] from  heapq  import heappop,heappush from  collections  import deque #Actions TURN_LEFT =  0 TURN_RIGHT =  1 MOVE_FORWARD =  2 PICK_UP =  3 DROP =  4 TOGGLE =  5 #Objects WALL =  2 GOAL =  8 DOOR =  4 KEY =  5 #Directions RIGHT =  0 DOWN =  1 LEFT =  2 UP =  3 DIRECTION_OFFSETS = { RIGHT:( 1 , 0 ), DOWN:( 0 , 1 ), LEFT:( -1 , 0 ), UP:( 0 , -1 ), } #Single-keycapacity carrying_key_color =  None def  policy (obs,agent_pos,agent_dir):  global carrying_key_color n = obs . shape[ 0 ] act = maybe_toggle_if_in_front(obs,agent_pos,agent_dir)  if act is  not  None :  return act act = maybe_pickup_if_in_front(obs,agent_pos,agent_dir)  if act is  not  None :  return act goal_pos = find_goal(obs)  if  not goal_pos:  return TURN_LEFT path_goal = a_star(obs,agent_pos,goal_pos,avoid_locked = True )  if path_goal:  return step_to(path_goal[ 0 ],agent_pos,agent_dir) blocking_doors = find_all_blocking_doors(obs,agent_pos,goal_pos)  if  not blocking_doors:  return TURN_LEFT  for (dx,dy,door_col) in blocking_doors:  if  not can_open_door(obs,agent_pos,agent_dir,door_col,exploring_colors = set ()):  continue  #ifweholdadifferentkey=>dropit  if carrying_key_color and carrying_key_color != door_col: drop_act = maybe_drop_wrong_key(obs,agent_pos,agent_dir,goal_pos)  if drop_act:  return drop_act  #ifwedontholddoor_col=>pickitup  if carrying_key_color != door_col: key_loc = find_key(obs,door_col)  if  not key_loc:  continue chain_path = 
path_with_chain_unlock(obs,agent_pos,agent_dir,key_loc,exploring_colors = set ())  if  not chain_path:  continue  return step_to(chain_path[ 0 ],agent_pos,agent_dir)  #nowweholddoor_col=>pathnextto(dx,dy) adj = find_adjacent_tile(obs,(dx,dy),agent_pos)  if  not adj:  continue chain_path2 = path_with_chain_unlock(obs,agent_pos,agent_dir,adj,exploring_colors = set ())  if chain_path2:  if  len (chain_path2) ==0  or chain_path2[ 0 ] == agent_pos:  return TURN_LEFT  return step_to(chain_path2[ 0 ],agent_pos,agent_dir)  return TURN_LEFT def  maybe_toggle_if_in_front (obs,agent_pos,agent_dir): front = get_facing(agent_pos,agent_dir)  if in_bounds(front,obs . shape[ 0 ]): fobj,fcol,fstate = obs[front[ 0 ],front[ 1 ]]  if fobj == DOOR and fstate ==2  and carrying_key_color == fcol:  return TOGGLE  return  None def  maybe_pickup_if_in_front (obs,agent_pos,agent_dir):  """ IffrontisKEY: -Ifnokey=>pickitup -Ifholddifferentkey=>mustdropoldfirst=>donothing -Ifholdsamecolor=>donothing """  global carrying_key_color f = get_facing(agent_pos,agent_dir)  if in_bounds(f,obs . shape[ 0 ]): fo,fc,fs = obs[f[ 0 ],f[ 1 ]]  if fo == KEY:  if carrying_key_color is  None : carrying_key_color = fc  return PICK_UP  return  None def  maybe_drop_wrong_key (obs,agent_pos,agent_dir,goal_pos):  global carrying_key_color  if carrying_key_color is  None :  return  None front = get_facing(agent_pos,agent_dir)  if in_bounds(front,obs . 
shape[ 0 ]): fo,fc,fs = obs[front[ 0 ],front[ 1 ]]  if fo in ( 1 , 3 ): #floor carrying_key_color =  None  return DROP non_important = find_non_important_floor(obs,agent_pos,goal_pos)  if non_important: path_ni = a_star(obs,agent_pos,non_important,avoid_locked = False )  if path_ni:  if  len (path_ni) ==0  or path_ni[ 0 ] == agent_pos:  #ifBFSisdegenerate=>trydroppingonourowntileifsafe  if can_drop_on_own_tile(obs,agent_pos): carrying_key_color =  None  return DROP  return TURN_LEFT  return step_to(path_ni[ 0 ],agent_pos,agent_dir) fallback_floor = find_any_floor(obs,agent_pos)  if fallback_floor: path_floor = a_star(obs,agent_pos,fallback_floor,avoid_locked = False )  if path_floor:  if  len (path_floor) ==0  or path_floor[ 0 ] == agent_pos:  if can_drop_on_own_tile(obs,agent_pos): carrying_key_color =  None  return DROP  return TURN_LEFT  return step_to(path_floor[ 0 ],agent_pos,agent_dir)  if can_drop_on_own_tile(obs,agent_pos): carrying_key_color =  None  return DROP  return TURN_LEFT def  can_drop_on_own_tile (obs,agent_pos): x,y = agent_pos o,c,s = obs[x,y]  return (o in ( 1 , 3 )) #---------------------------------------------------------------------- def  find_non_important_floor (obs,agent_pos,goal_pos): visited_path =  set ()  #BFSfromagent_pos=>goal_posignoringlocked n = obs . shape[ 0 ]  from  collections  import deque queue = deque([agent_pos]) visited_path . add(agent_pos) found_goal =  False  while queue: cur = queue . 
popleft()  if cur == goal_pos: found_goal = True cx,cy = cur  for (dx,dy) in DIRECTION_OFFSETS . values(): nx,ny = cx + dx,cy + dy  if  0<= nx < n and  0<= ny < n:  if (nx,ny) not  in visited_path: oo,cc,ss = obs[nx,ny]  if oo == WALL:  continue  if oo == DOOR and ss ==2 :  continue visited_path . add((nx,ny)) queue . append((nx,ny))  for x2 in  range (n):  for y2 in  range (n):  if (x2,y2) not  in visited_path: oo,cc,ss = obs[x2,y2]  if oo in ( 1 , 3 ): #floor  return (x2,y2)  return  None def  find_goal (obs): n = obs . shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == GOAL:  return (x,y)  return  None def  find_any_floor (obs,agent_pos): n = obs . shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] in ( 1 , 3 ):  return (x,y)  return  None def  get_facing (agent_pos,agent_dir): dx,dy = DIRECTION_OFFSETS[agent_dir]  return (agent_pos[ 0 ] + dx,agent_pos[ 1 ] + dy) def  in_bounds (pos,n):  return ( 0<= pos[ 0 ] < n and  0<= pos[ 1 ] < n) def  a_star (obs,start,goal,avoid_locked = True ):  def  heuristic (a,b): return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ])  from  heapq  import heappop,heappush n = obs . 
shape[ 0 ] open_set = [] heappush(open_set,( 0 ,start)) came_from = {} g_score = {start: 0 } f_score = {start:heuristic(start,goal)}  while open_set: _,current = heappop(open_set)  if current == goal: path = []  while current in came_from: path . append(current) current = came_from[current] path . reverse()  return path cx,cy = current  for (dx,dy) in DIRECTION_OFFSETS . values(): nx,ny = cx + dx,cy + dy  if  not ( 0<= nx < n and  0<= ny < n):  continue o = obs[nx,ny, 0 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2  and avoid_locked:  continue cost = g_score[current] +1  if (nx,ny) not  in g_score or cost < g_score[(nx,ny)]: g_score[(nx,ny)] = cost f_score[(nx,ny)] = cost + heuristic((nx,ny),goal) came_from[(nx,ny)] = current heappush(open_set,(f_score[(nx,ny)],(nx,ny)))  return  None def  step_to (next_pos,agent_pos,agent_dir): dx = next_pos[ 0 ] - agent_pos[ 0 ] dy = next_pos[ 1 ] - agent_pos[ 1 ] desired_dir = None  for d,(ox,oy) in DIRECTION_OFFSETS . 
items():  if (dx,dy) == (ox,oy): desired_dir = d  break  if desired_dir is  None :  return TURN_LEFT turn_act = turn_toward(agent_dir,desired_dir)  if turn_act is  not  None :  return turn_act  else :  return MOVE_FORWARD def  turn_toward (current_dir,target_dir):  if current_dir == target_dir:  return  None left_turns = (current_dir - target_dir) %4  return TURN_LEFT if left_turns ==1  else TURN_RIGHT def  find_all_blocking_doors (obs,agent_pos,goal_pos):  """ BFSignoringwalls&opendoors.Wenotelockeddoorsbutkeepexploringtoseeifthereisapatharoundthem. Ifwefindthegoal=>nodooristrulyblocking=>return[]. Ifnot=>thelockeddoorswesawareindeedblocking. """ n = obs . shape[ 0 ] visited = set ([agent_pos])  from  collections  import deque q = deque([agent_pos]) blocking = []  while q: cx,cy = q . popleft()  if (cx,cy) == goal_pos:  return []  for (dx,dy) in DIRECTION_OFFSETS . values(): nx,ny = cx + dx,cy + dy  if  0<= nx < n and  0<= ny < n:  if (nx,ny) not  in visited: o = obs[nx,ny, 0 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2 : color = obs[nx,ny, 1 ]  if (nx,ny,color) not  in blocking: blocking . append((nx,ny,color))  continue visited . add((nx,ny)) q . append((nx,ny))  return blocking def  find_key (obs,color): n = obs . 
shape[ 0 ]  for x in  range (n):  for y in  range (n):  if obs[x,y, 0 ] == KEY and obs[x,y, 1 ] == color:  return (x,y)  return  None def  can_open_door (obs,agent_pos,agent_dir,needed_color,exploring_colors = None ):  if exploring_colors is  None : exploring_colors = set ()  if needed_color in exploring_colors:  return  False  if carrying_key_color == needed_color:  return  True key_loc = find_key(obs,needed_color)  if  not key_loc:  return  False exploring_colors . add(needed_color) path = path_with_chain_unlock(obs,agent_pos,agent_dir,key_loc,exploring_colors) exploring_colors . remove(needed_color)  return (path is  not  None ) def  path_with_chain_unlock (obs,agent_pos,agent_dir,target,exploring_colors):  """ BFSthattreatslockeddoorcolorXaspassableifcan_open_door=>True """ n = obs . shape[ 0 ] visited = set ([agent_pos]) came_from = {}  from  collections  import deque queue = deque([agent_pos])  def  reconstruct (e): path = []  while e in came_from: path . append(e) e = came_from[e] path . reverse()  return path  while queue: cur = queue . popleft()  if cur == target:  return reconstruct(cur) cx,cy = cur  for (dx,dy) in DIRECTION_OFFSETS . 
values(): nx,ny = cx + dx,cy + dy  if  not ( 0<= nx < n and  0<= ny < n):  continue  if (nx,ny) in visited:  continue o = obs[nx,ny, 0 ] c = obs[nx,ny, 1 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2 :  if  not can_open_door(obs,(cx,cy),agent_dir,c,exploring_colors):  continue  elif o == DOOR and s !=0 :  #closed=>nokeyneeded=>pass  pass visited . add((nx,ny)) came_from[(nx,ny)] = cur queue . append((nx,ny))  return  None def  find_adjacent_tile (obs,door_pos,agent_pos): n = obs . shape[ 0 ] (dx,dy) = door_pos candidates = []  for (sx,sy) in DIRECTION_OFFSETS . values(): nx,ny = dx + sx,dy + sy  if  0<= nx < n and  0<= ny < n: o = obs[nx,ny, 0 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2 :  continue candidates . append((nx,ny))  if  not candidates:  return  None  #pickwhicheverisclosesttoagent candidates . sort(key = lambda c: abs (c[ 0 ] - agent_pos[ 0 ]) +  abs (c[ 1 ] - agent_pos[ 1 ]))  return candidates[ 0 ] def  maybe_drop_wrong_key (obs,agent_pos,agent_dir,goal_pos):  global carrying_key_color  if carrying_key_color is  None :  return  None front = get_facing(agent_pos,agent_dir)  if in_bounds(front,obs . shape[ 0 ]): fo,fc,fs = obs[front[ 0 ],front[ 1 ]]  if fo in ( 1 , 3 ): #floor carrying_key_color =  None  return DROP  ... 
 if can_drop_on_own_tile(obs,agent_pos): carrying_key_color =  None  return DROP  return TURN_LEFT def  drop_key_somewhere (obs,agent_pos,agent_dir,goal):  global carrying_key_color,last_drop_front imp = compute_important_cells(obs,agent_pos,goal)  #1)TrydroppingintoFRONTifsafe&notimportant  front = get_facing(agent_pos,agent_dir)   if in_bounds(front,obs) and is_safe_drop_target(obs,front,imp)   and front != last_drop_front:  carrying_key_color =  None  set_drop_cooldown()  last_drop_front = front   return DR OP   #2)Rotatetofaceasafe&non-importantfrontcell   for turns,next_dir in (( 0 ,agent_dir),  ( 1 ,(agent_dir +  1 ) %  4 ),  ( 2 ,(agent_dir +  2 ) %  4 ),  ( 3 ,(agent_dir +  3 ) %  4 )):  f = get_facing(agent_pos,next_dir)   . . .   #3)Gostandonaplatformfromwhichsomefacinghasasafedroptarget  platform,face_dir = nearest_drop_platform(obs,agent_pos,imp)   if platform:   if agent_pos != platform:  path = a_star_open_only(obs,agent_pos,platform)   if path:   return step_to(path[ 0 ],agent_pos,agent_dir)   . . .   return TUR N_R IGHT def  a_star (obs,start,goal,avoid_locked = True ):  def  heuristic (a,b): return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ])  from  heapq  import heappop,heappush  ...  while open_set:  ...  for (dx,dy) in DIRECTION_OFFSETS . 
values(): nx,ny = cx + dx,cy + dy  if  not ( 0<= nx < n and  0<= ny < n):  continue o = obs[nx,ny, 0 ] s = obs[nx,ny, 2 ]  if o == WALL:  continue  if o == DOOR and s ==2  and avoid_locked:  continue cost = g_score[current] +1  if (nx,ny) not  in g_score or cost < g_score[(nx,ny)]: g_score[(nx,ny)] = cost f_score[(nx,ny)] = cost + heuristic((nx,ny),goal) came_from[(nx,ny)] = current heappush(open_set,(f_score[(nx,ny)],(nx,ny)))  return  None def  a_star_open_only (obs,start,goal):  if goal is  None :  return  None  def  h (a,b): return  abs (a[ 0 ] - b[ 0 ]) +  abs (a[ 1 ] - b[ 1 ])  ...  while open_set:  ...  for ox,oy in DIRECTION_OFFSETS . values(): nx,ny = current[ 0 ] + ox,current[ 1 ] + oy nxt = (nx,ny)  if  not in_bounds(nxt,obs):  continue o,c,s = tile(obs,nxt)  if  not is_passable_open_only(o,s):  continue tg = g[current] +  1  if nxt not  in g or tg < g[nxt]: g[nxt] = tg came_from[nxt] = current fn = tg + h(nxt,goal) f[nxt] = fn heappush(open_set,(fn,nxt))  return  None visited_blk = bfs_block_locked( self ,agent_pos) skip = {agent_pos,door_pos,goal_pos} candidates = [c for c in visited_blk if c not  in skip and  self . grid . get( * c) is  None ] ... key_spot = random . choice(candidates) self . 
put_obj(Key(color),key_spot[ 0 ],key_spot[ 1 ]) reserve =  set (path_set) | {agent_pos,goal_pos} |  set (door_positions) protected =  set () key_positions = [] for i,(dx,dy) in  enumerate (door_positions): unlocked =  set (colors[:i]) #colorsbeforedooriareusable region = flood_reachable( self ,agent_pos,unlocked) side_neighbors = [(nx,ny) for (nx,ny) in  self . _neighbors(dx,dy)  if ( 0  <= nx <  self . width and  0  <= ny <  self . height)  and _passable( self ,nx,ny,unlocked)  and (nx,ny) in region]  ... candidates = [c for c in region  if c not  in reserve  and  self . grid . get( * c) is  None  and  self . _free_neighbors_count( * c,keys = unlocked) >=  2 ]  ...  for cx,cy in pool: p1 = shortest_path( self ,agent_pos,(cx,cy),unlocked)  ...  p2 = shortest_path( self ,(cx,cy),s,unlocked)  ...  self . put_obj(Key(colors[i]),cx,cy) key_positions . append((cx,cy))  for cell in p1 + p2: protected . add(cell)  for nb in  self . _neighbors( * cell):  if  1  <= nb[ 0 ] <  self . width -1  and  1  <= nb[ 1 ] <  self . height -1 : protected . add(nb) Figure 2: Example of co evolution in COvolve. Left: successiv e environment implementations, progr essing from ad hoc generation to a structured, parameterize d design with explicit solvability checks and controllable chokepoints. Right: successive policy implementations, progressing from basic navigation to improved handling of keys and doors, together with renements to the A * based planner for more reliable action selection. Highlighted code blocks indicate the changes introduced. 
4.1 Policy Designer

For the current level, the Policy Designer Ψ synthesizes a policy via an iterative best-response procedure. Given a level θ, the policy designer generates K candidate policy mutations by applying LLM-guided program transformations to the current best policy, conditioning on the level-specific observation and action spaces O_θ and A_θ. Each candidate policy π̃_k is evaluated on θ, and the highest-performing candidate according to the utility U_θ is retained. The selected policy π constitutes an approximate best response for level θ and is appended to the growing policy sequence P.

While each policy is tailored to an individual level, our broader goal is to obtain an approximation of the optimal policy π* that performs well across levels. Since the interaction between the policy designer and the environment designer forms a two-player zero-sum game, a mixed-strategy Nash equilibrium (MSNE) provides a principled solution, ensuring that the obtained policy is robust under adversarial conditions. Building on PSRO, we achieve this by maintaining a growing sequence of policies P = {π_1, ..., π_t} and previously generated levels L = {θ_1, ..., θ_t}, where each policy π_i is optimized for its corresponding level θ_i using the process described above. Let the payoff matrix be M ∈ R^{r×t}, where each entry m_ij = U_{θ_j}(π_i) denotes the expected return of policy π_i on level θ_j. The MSNE is then computed via a minimax optimization in which the policy agent maximizes its worst-case expected payoff [34]:

    p^\star = \arg\max_{p \in \Delta_r} \; \min_{j \in \{1,\dots,t\}} \; \sum_{i=1}^{r} p_i \, m_{ij},
    \quad \text{where } \Delta_r = \Big\{ p \in \mathbb{R}^r \;\Big|\; \sum_{i=1}^{r} p_i = 1,\ p_i \ge 0 \ \forall i \Big\}.   (1)

Here, p* is the mixture over policies that maximizes the worst-case expected return across all levels, defining the MSNE policy distribution: each policy π_i is sampled from π_MSNE with probability p*_i.
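The maximin problem in Eq. (1) is a standard linear program. As a minimal, self-contained sketch (the paper does not specify its solver; `scipy.optimize.linprog` is an assumption here), the mixture weights can be computed by introducing a scalar value variable v:

```python
import numpy as np
from scipy.optimize import linprog

def msne_weights(M):
    """Maximin mixture p* over row policies for an r x t payoff matrix M.

    LP reformulation of Eq. (1): maximize a scalar v subject to
    v <= sum_i p_i * M[i, j] for every environment column j, with p in the simplex.
    """
    r, t = M.shape
    c = np.zeros(r + 1)
    c[-1] = -1.0                                           # minimize -v == maximize v
    A_ub = np.hstack([-M.T, np.ones((t, 1))])              # v - p @ M[:, j] <= 0
    b_ub = np.zeros(t)
    A_eq = np.hstack([np.ones((1, r)), np.zeros((1, 1))])  # sum_i p_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * r + [(None, None)]            # p_i >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:r], res.x[-1]                            # (mixture weights, game value)
```

For a symmetric game such as matching pennies, `msne_weights` returns the uniform mixture with game value 0, as expected from the minimax theorem.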
At the start of each episode, the agent samples a policy π_i ∼ p* and acts according to a ∼ π_i in the given level.

4.2 Environment Designer

The Environment Designer generates a new level as a best response to the current MSNE policy. Its goal is to minimize the expected return of the mixture policy, thereby revealing its weaknesses. This adversarial loop encourages curriculum-like progression, automatically increasing the environment's difficulty in response to the agent's improvement.

Using a best-response update, the environment designer employs an LLM-based adversary Λ to generate K candidate mutations of the current environment, {θ̃_1, ..., θ̃_K} = Λ(θ, p*). The candidate that minimizes performance under π_MSNE is selected. The selected environment is added to the level set L, after which a new policy is synthesized in response and appended to the policy sequence P. This procedure repeats across iterations, expanding both the environment and policy sets.

5 Experiments

We evaluate COvolve across three complementary domains that capture distinct challenges in agent learning. MiniGrid [4] requires symbolic planning in procedurally generated mazes with sequential dependencies such as keys and doors. PyGame [36] emphasizes continuous 2D navigation with geometric reasoning and collision constraints, with difficulty scaling through denser obstacles and narrower traversable passages. CARLA [10] provides a high-fidelity urban driving setting with partial observability, dynamic vehicles, pedestrians, and traffic lights. Together, these domains encompass symbolic planning, geometric navigation, and realistic multi-agent driving, forming a principled testbed for evaluating curriculum emergence and robustness under co-evolution.
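Both designers consume the empirical payoff matrix M introduced in Section 4.1. A minimal sketch of how it is populated (the `rollout` function is a hypothetical stand-in for executing a generated policy in a generated level for one episode):

```python
import numpy as np

def payoff_matrix(policies, levels, rollout, episodes=100):
    """Empirical payoff matrix with M[i, j] = mean return of policy i on level j.

    `rollout(policy, level)` is a hypothetical stand-in that runs one episode
    and returns a score in [0, 1]; averaging over episodes estimates U_theta(pi).
    """
    M = np.empty((len(policies), len(levels)))
    for i, pi in enumerate(policies):
        for j, theta in enumerate(levels):
            M[i, j] = np.mean([rollout(pi, theta) for _ in range(episodes)])
    return M
```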
5.1 Environments and Tasks

All main experiments use GPT-5.2 [33] as the generative model for both environment and policy synthesis. Prompts are provided in Appendix C. All generated code is dynamically validated and executed using exec(). For each policy–environment pair, we evaluate the payoff U_θ(π) ∈ [0, 1], averaged over 100 episodes, to populate the empirical payoff matrix M. Complete environment specifications (action and observation spaces, task scaling, termination, and feasibility checks) are provided in Appendix B.

MiniGrid Symbolic Maze Solving. We use the MiniGrid-Empty environment as a base for generating symbolic maze-solving tasks. It is a fully observable environment with n × n × 3 grids, augmented with the agent's absolute position and orientation, resulting in an observation vector of size n × n × 3 + 2, where n is the grid width and height. Each cell encodes object type, color identifier, and state (e.g., open/closed doors). The agent acts in a 5-action discrete space (turn, move, pick up, drop, toggle). Difficulty is scaled by enlarging grids, adding walls, and introducing locked doors with keys that must be retrieved and used in sequence, enforcing multi-step planning even in small mazes. We use handcrafted heuristics to validate whether the generated environments are feasible (cf. §6 Limitations). Episodes terminate when the agent reaches the goal tile or when the step horizon is reached. A selection of evolved environments is shown in Figure 3, with further implementation details in Appendix B.1.

PyGame Geometric 2D Navigation. To test the LLM's ability to deal with continuous action spaces, we use a custom 2D navigation environment in which a circular agent must reach a rectangular goal zone while avoiding fixed rectangular obstacles. States are fully observable, consisting of the agent's position and a list of all objects (obstacles and goals) with their positions and sizes (growing in size with each new level).
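The collision and goal checks in such a geometry reduce to a standard circle–rectangle overlap test. A self-contained sketch, assuming axis-aligned rectangles (function and parameter names are illustrative, not the paper's implementation):

```python
def circle_rect_overlap(cx, cy, radius, rx, ry, rw, rh):
    """True if a circle (center (cx, cy), radius) intersects an axis-aligned
    rectangle with top-left corner (rx, ry), width rw, and height rh.

    Clamp the circle's center onto the rectangle to find the closest point,
    then compare that distance against the radius.
    """
    nearest_x = min(max(cx, rx), rx + rw)
    nearest_y = min(max(cy, ry), ry + rh)
    dx, dy = cx - nearest_x, cy - nearest_y
    return dx * dx + dy * dy <= radius * radius
```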
The agent acts in a continuous space through 2D velocity commands. Difficulty increases as obstacles are added, the agent–goal distance grows, and narrow passages are created that may block traversal. This requires agents to identify corridors traversable relative to their size and to plan long detours when direct routes are infeasible. Episodes terminate when the agent overlaps the goal zone or when the step horizon is reached. A selection of evolved environments is shown in Figure 4, with further details in Appendix B.2.

CARLA Urban Driving. We evaluate urban driving in CARLA Town01, a high-fidelity simulator with vehicles, pedestrians, and traffic lights. The vehicle follows a prescribed route using continuous steering, throttle, and brake controls. Observations are egocentric and partial, consisting of the vehicle's kinematics, the nearest traffic light, and compact features of nearby vehicles and pedestrians. Task difficulty increases with varying traffic density and pedestrian activity, introducing adversarial behaviors such as abrupt braking or traffic-light violations. Episodes terminate upon route completion (success) or any infraction (collision, red-light violation, or timeout).

Figure 3: A selection of evolved MiniGrid environments produced by COvolve. Complexity increases from empty grids to larger mazes with dense walls and locked doors requiring corresponding keys. The agent must reach the green goal tile, often by planning multi-step sequences of key retrieval and door unlocking.

Figure 4: A selection of evolved PyGame environments produced by COvolve. Tasks progress from open arenas to cluttered maps with dense obstacles and narrow corridors. The agent must reach the rectangular goal zone while navigating collision-free paths through increasingly constrained layouts.
This setting tests policy robustness under partial observability and multi-agent interactions with stochastic and sometimes adversarial actors. A selection of evolved environments is shown in Figure 5, with further details in Appendix B.3.

Figure 5: Selected CARLA environments produced by COvolve. Tasks progress from urban driving on empty roads to crowded streets with increasingly aggressive actor behaviors. The agent must drive along the street while following traffic rules (such as stopping at red lights) and, at the same time, adjust to increasingly unpredictable behaviors of fellow drivers and pedestrians.

5.2 Results

We evaluate whether adversarial co-evolution with equilibrium reasoning yields policies that both adapt to increasingly difficult environments and retain performance on previously generated ones. At each iteration, we compare three strategies: (i) UED-Greedy, which retains only the latest best policy; (ii) UED-Uniform, which samples uniformly from all policies generated up to the current iteration; and (iii) COvolve, which computes a mixed-strategy Nash equilibrium (MSNE) over the policy population. At iteration k, all strategies are evaluated on the full environment archive {θ_0, ..., θ_k}.

UED-Greedy evaluates only the latest policy when generating new environments, discarding earlier policies. UED-Uniform evaluates a uniform mixture over all policies generated up to the current iteration, controlling for mixture size without optimizing mixture weights. COvolve instead computes a mixed-strategy Nash equilibrium over the policy population, selecting mixture weights that maximize the minimum return across the environment archive.
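A toy illustration of how these mixtures are scored against the archive (the payoff matrix below is fabricated purely for illustration, with policy i strong on environment i but weaker on earlier ones):

```python
import numpy as np

def worst_case_return(M, p):
    """Minimum expected return over the environment archive for mixture p
    (rows of M index policies, columns index environments)."""
    return float(np.min(p @ M))

# Illustrative payoff matrix: each policy masters its own environment
# but partially forgets earlier ones.
M = np.array([[1.0, 0.2, 0.1],
              [0.4, 1.0, 0.2],
              [0.3, 0.5, 1.0]])
greedy = np.array([0.0, 0.0, 1.0])   # UED-Greedy: all mass on the latest policy
uniform = np.full(3, 1 / 3)          # UED-Uniform: equal weights
```

Here the uniform mixture already beats the latest-only strategy in the worst case; the MSNE weights would, by construction, achieve a worst-case return at least as high as either.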
Because co-evolution produces distinct environment archives across runs, results from different random seeds are not directly comparable: averaging would mix performance over non-identical evaluation sets. We therefore report results from a single representative run and provide results for a second seed in Appendix D.3.

The results, presented in Figure 6, report three views: (i) UED-Greedy, where the latest policy π_k is evaluated on all environments generated up to iteration k; (ii) COvolve, where policies are sampled from the MSNE and evaluated on the same environments; and (iii) a direct comparison between UED-Greedy and COvolve at iteration k, with performance averaged across all environments.

UED-Greedy policies are optimized for the most recently generated environment and exhibit reduced performance on earlier environments (Figure 6, left). In contrast, the MSNE mixture maintains performance across the full environment set as it grows over iterations (Figure 6, center). The aggregated comparison shows that, when the equilibrium is non-trivial, MSNE selection yields higher average performance than the latest-only UED-Greedy strategy when evaluated across all environments (Figure 6, right).

5.3 Generalization

We evaluate generalization beyond co-evolution on unseen standardized benchmark environments that preserve the same underlying task structure as the evolved environments. For MiniGrid, we consider MiniGrid-MultiRoom-N6-v0 (six rooms), MiniGrid-LockedRoom-v0, and MiniGrid-DoorKey-16x16-v0. For CARLA, we evaluate on Town02, which is not encountered during co-evolution. At iteration k, we compare three strategies under identical rollout settings: UED-Greedy (latest-only policy), UED-Uniform, and the COvolve MSNE policy distribution. For UED-Uniform and COvolve, a policy is sampled at the beginning of each episode according to the corresponding mixture distribution.
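The per-episode draw unifies all three strategies; only the weight vector differs (a minimal sketch):

```python
import random

def sample_policy(policies, weights):
    """Draw one policy for the episode according to a mixture distribution.

    UED-Greedy corresponds to a one-hot weight vector on the latest policy,
    UED-Uniform to equal weights, and COvolve to the MSNE weights p*.
    """
    return random.choices(policies, weights=weights, k=1)[0]
```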
Detailed environment specifications and their differences from the evolved tasks are provided in Appendix D.1. For each evolutionary seed, we run 100 evaluation episodes and compute the mean return. We then report the mean and standard deviation across two seeds. We do not report PyGame generalization results, as no standardized evaluation benchmarks exist for this domain.

Environment                    UED-Greedy    UED-Uniform   COvolve
MiniGrid-MultiRoom-N6-v0       1.00 ± 0.00   0.86 ± 0.06   1.00 ± 0.00
DoorKey-16x16-v0 (MiniGrid)    1.00 ± 0.00   0.62 ± 0.24   1.00 ± 0.00
LockedRoom-v0 (MiniGrid)       1.00 ± 0.00   0.66 ± 0.16   1.00 ± 0.00
Town02 (CARLA)                 0.62 ± 0.09   0.13 ± 0.06   0.71 ± 0.05

Table 1: Generalization to unseen environments. Results are reported as mean ± standard deviation across evolutionary seeds, with 100 evaluation episodes per seed.

Importantly, these standardized environments are substantially simpler than the environments generated during co-evolution. In contrast to the benchmark tasks considered here, which involve at most a single locked door and a single key with significantly less constrained geometry, the evolved environments typically contain multiple sequential key–door dependencies, narrow chokepoints, and adversarial obstacle placements. As a result, strong performance on these unseen benchmarks, as indicated by Table 1, does not indicate overfitting to the specific environments, but rather that the learned policies have generalized.
This also highlights the role of the environment designer in constructing tasks that are strictly harder than canonical benchmarks, effectively inducing a challenging curriculum during co-evolution.

Figure 6: Performance during environment–policy co-evolution. Left: Success rates of all discovered policies evaluated on all environments generated during evolution (policy–environment payoff matrix). Center: Comparison between the mixed-strategy Nash equilibrium (MSNE) policy mixture, the best single policy π_argmax, and the latest policy π_k, evaluated on the environment archive {θ_0, ..., θ_k}. Here, π_argmax denotes the policy that maximizes mean performance over the entire archive; for MiniGrid, π_argmax = π_k. Right: Mean success over {θ_0, ..., θ_k} for three strategies: UED-Greedy (latest policy only), UED-Uniform (uniform mixture over all policies), and COvolve (MSNE mixture). As evolution progresses and the latest policy forgets earlier environments, the MSNE mixture assigns probability to earlier policies to preserve worst-case performance over the archive.

5.4 Ablation Studies

Is a curriculum necessary? We perform a zero-shot ablation in which the LLM generates policies directly for the hardest environment in each domain, without exposure to intermediate environments. Starting from an initial policy, the LLM applies up to k mutation steps and we retain the best-performing policy. As shown in Figure 7, zero-shot generation consistently fails, demonstrating that progressive curriculum construction is necessary for effective policy synthesis.

How do neural network-based RL approaches compare to LLM-generated code-as-policies in solving fully observable environments? To test whether standard RL can solve our fully observable evaluation tasks, we trained representative Stable-Baselines3 agents [37]: PPO [40] and SAC [16] for PyGame (continuous control), and PPO and QR-DQN [7] for MiniGrid (discrete). Despite full observability, performance degrades sharply with task complexity: PPO and QR-DQN achieve near-zero rewards and success rates in the harder environments, and SAC shows only limited improvements. See Appendix D.2 for details on training settings, curves, and success rates.

How do weaker language models perform compared to the latest, state-of-the-art models? We repeat the co-evolution procedure using GPT-4.1 as a weaker generative model, keeping prompts and evaluation settings the same. Figure 8 reports results for a single run, comparing MSNE and UED-Greedy across domains. Although GPT-4.1 produces simpler environments and weaker policies, MSNE consistently outperforms UED-Greedy and remains robust over the environment archive.
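The zero-shot ablation described at the start of this subsection is a simple mutate-and-retain loop. A sketch under the assumption that `mutate` and `evaluate` stand in for the LLM mutation operator and the utility U_θ (names are illustrative):

```python
def zero_shot_best(initial_policy, mutate, evaluate, hardest_level, k=10):
    """Zero-shot ablation (sketch): mutate directly against the hardest level,
    with no intermediate curriculum, and keep the best candidate seen.
    """
    best, best_utility = initial_policy, evaluate(initial_policy, hardest_level)
    candidate = initial_policy
    for _ in range(k):
        candidate = mutate(candidate)
        utility = evaluate(candidate, hardest_level)
        if utility > best_utility:
            best, best_utility = candidate, utility
    return best, best_utility
```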
6 Conclusions

We introduced COvolve, a framework in which large language models generate environments and policies in a closed loop. The interaction between environment design and policy design is formulated as a two-player zero-sum game, and learning is performed over growing populations of environments and policies. Solving for a mixed-strategy Nash equilibrium yields a meta-policy that optimizes worst-case performance across the empirical set of generated environments and provides a population-level objective for continual adaptation.

Figure 7: Results on curriculum learning in (a) MiniGrid, (b) PyGame, and (c) CARLA. Direct training on the hardest environment ("Zero-shot") fails, scoring 0.00 in all domains (as does MSNE_0), while co-evolutionary MSNE with curriculum (MSNE_9, right bars) yields non-trivial performance: 0.91 on MiniGrid, 0.85 on PyGame, and 0.69 on CARLA.

Experiments in urban driving, symbolic maze solving, and geometric navigation show that COvolve produces an emergent curriculum with environments that exhibit increasing structural complexity. These results demonstrate that a game-theoretic formulation of policy and environment generation enables robust co-evolution and automated curriculum construction without predefined task distributions.

Limitations and Future Work. Unconstrained LLM-generated environments can be infeasible. To prevent this, the environment designer is restricted to using predefined helper functions that ensure feasibility in each domain.
Environments that violate these constraints are never sampled or evaluated. Details of these helper functions are provided in Appendix B. Future work could focus on more principled control of environment difficulty, for example, by incorporating minimax regret [9], to provide formal guarantees on curriculum progression rather than relying on domain-specific heuristics. Additional directions include strengthening diversity checks during environment generation.

Figure 8: Ablation with a weaker generative model (GPT-4.1) in (a) MiniGrid, (b) PyGame, and (c) CARLA, comparing UED-Greedy and COvolve (MSNE) across domains. While overall performance degrades due to weaker generation, MSNE consistently mitigates forgetting and maintains robustness across the environment archive, with per-iteration gains over UED-Greedy of up to +0.23 (MiniGrid), +0.09 (PyGame), and +0.12 (CARLA).

Acknowledgments

This work is supported by the Knut and Alice Wallenberg Foundation via the Wallenberg AI Autonomous Sensors Systems and the Wallenberg Scholars Grant.

References

[1] Yoram Bachrach, Edan Toledo, Karen Hambardzumyan, Despoina Magka, Martin Josifoski, Minqi Jiang, Jakob Foerster, Roberta Raileanu, Tatiana Shavrina, Nicola Cancedda, Avraham Ruderman, Katie Millican, Andrei Lupu, and Rishi Hazra. 2025. Combining Code Generating Large Language Models and Self-Play to Iteratively Refine Strategies in Games. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25. International Joint Conferences on Artificial Intelligence Organization, 10999–11003. doi:10.24963/ijcai.2025/1249 Demo Track.
[2] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable reinforcement learning via policy extraction. Advances in Neural Information Processing Systems 31 (2018).
[3] Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. 2025. Self-Questioning Language Models. arXiv:2508.03682 [cs.LG] https://arxiv.org/abs/2508.03682
[4] Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry. 2023. Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. Advances in Neural Information Processing Systems 36 (2023), 73383–73394.
[5] Jeff Clune. 2020. AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv:1905.10985 [cs.AI] https://arxiv.org/abs/1905.10985
[6] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. 2020. Leveraging procedural generation to benchmark reinforcement learning (ICML'20). JMLR.org, Article 191, 9 pages.
[7] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. 2017. Distributional Reinforcement Learning with Quantile Regression. arXiv:1710.10044 [cs.AI] https://arxiv.org/abs/1710.10044
[8] Nicola Dainese, Matteo Merler, Minttu Alakuijala, and Pekka Marttinen. 2024. Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=9SpWvX9ykp
[9] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine. 2020. Emergent complexity and zero-shot transfer via unsupervised environment design. In Advances in Neural Information Processing Systems.
[10] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In Conference on Robot Learning. PMLR, 1–16.
[11] Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. 2023. DreamCoder: growing generalizable, interpretable knowledge with wake–sleep Bayesian program learning. Philosophical Transactions of the Royal Society A 381, 2251 (2023), 20220050.
[12] Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. 2025. OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=Y1XkzMJpPd
[13] John Forrest and Ted Ralphs. 2005. CBC: COIN-OR Branch and Cut Solver. https://github.com/coin-or/Cbc. Version accessed: 2024.
[14] Léo Françoso Dal Piccol Sotto, Paul Kaufmann, Timothy Atkinson, Roman Kalkreuth, and Márcio Porto Basgalupp. 2021. Graph representations in genetic programming. Genetic Programming and Evolvable Machines 22, 4 (2021), 607–636.
[15] Dibya Ghosh, Jad Rahme, Aviral Kumar, Amy Zhang, Ryan P Adams, and Sergey Levine. 2021. Why generalization in RL is difficult: Epistemic POMDPs and implicit partial observability. Advances in Neural Information Processing Systems 34 (2021), 25502–25515.
[16] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290 [cs.LG] https://arxiv.org/abs/1801.01290
[17] Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi, and Pedro Zuidberg Dos Martires. 2025. REvolve: Reward Evolution with Large Language Models using Human Feedback. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=cJPUpL8mOw
[18] Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. 2025. R-Zero: Self-Evolving Reasoning LLM from Zero Data. arXiv:2508.05004 [cs.LG] https://arxiv.org/abs/2508.05004
[19] Jeevana Priya Inala, Osbert Bastani, Zenna Tavares, and Armando Solar-Lezama. 2020. Synthesizing programmatic policies that inductively generalize. In 8th International Conference on Learning Representations.
[20] Nick Jakobi. 1997. Evolutionary robotics and the radical envelope-of-noise hypothesis. Adaptive Behavior 6, 2 (1997), 325–368.
[21] Minqi Jiang, Edward Grefenstette, and Tim Rocktäschel. 2021. Prioritized level replay. In Advances in Neural Information Processing Systems (NeurIPS).
[22] Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. 2023. A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. J. Artif. Int. Res. 76 (May 2023), 64 pages. doi:10.1613/jair.1.14174
[23] Ezgi Korkmaz. 2024. A Survey Analyzing Generalization in Deep Reinforcement Learning. arXiv:2401.02349 [cs.LG] https://arxiv.org/abs/2401.02349
[24] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3323fe11e9595c09af38fe67567a9394-Paper.pdf
[25] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023. Code as Policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 9493–9500.
[26] William Liang, Sam W ang, Hung-Ju Wang, Osbert Bastani, Dinesh Jayara- man, and Y echeng Jason Ma. 2024. Environment Curriculum Generation via Large Language Models. In 8th A nnual Conference on Robot Learning . https: //openreview .net/forum?id=F0r WEID2gb [27] Zi Lin, Sheng Shen, Jingbo Shang, Jason W eston, and Yixin Nie. 2025. Learn- ing to Solve and V erify: A Self-Play Framework for Code and T est Generation. arXiv:2502.14948 [cs.SE] https://arxiv .org/abs/2502.14948 [28] Shaoteng Liu, Hao qi Yuan, Minda Hu, Y anwei Li, Yukang Chen, Shu Liu, Zongqing Lu, and Jiaya Jia. 2024. RL-GPT: Integrating Reinforcement Learning and Code-as-policy. In The Thirty-eighth A nnual Conference on Neural Information Processing Systems . https://openreview .net/forum?id=LEzx6QRkRH [29] Y echeng Jason Ma, William Liang, Guanzhi W ang, De- An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar . 2024. Eureka: Human-Level Re ward Design via Coding Large Language Models. In The T welfth International Conference on Learning Representations . https://openreview .net/ forum?id=IEduRUO55F [30] Stuart Mitchell, Michael OSullivan, and Iain Dunning. 2011. Pulp: a linear programming toolkit for python. The University of Auckland, Auckland, New Zealand 65 (2011), 25. [31] Jun Morimoto and Kenji Doya. 2005. Robust reinforcement learning. Neural computation 17, 2 (2005), 335–359. [32] John Nash. 1951. Non-Coop erative Games. Annals of Mathematics 54, 2 (1951), 286–295. http://www .jstor .org/stable/1969529 [33] OpenAI. 2025. Introducing GPT -5.2. https://openai.com/index/introducing- gpt- 5- 2/ [34] Martin J Osborne and Ariel Rubinstein. 1994. A course in game theory . MI T press. [35] Lerrel Pinto, James Davidson, and Abhinav Gupta. 2017. Sup ervision via compe- tition: Robot adversaries for learning tasks. In 2017 IEEE International Conference on Robotics and Automation (ICRA) . IEEE, 1601–1608. [36] PyGame Community. 2000–2024. 
PyGame: Python Game Development. https://www.pygame.org/. Accessed: 2025-05-09.
[37] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. 2021. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research 22, 268 (2021), 1–8. http://jmlr.org/papers/v22/20-1364.html
[38] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. 2021. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research 22, 268 (2021), 1–8.
[39] Mikayel Samvelyan, Akbir Khan, Michael D Dennis, Minqi Jiang, Jack Parker-Holder, Jakob Nicolaus Foerster, Roberta Raileanu, and Tim Rocktäschel. 2023. MAESTRO: Open-Ended Environment Design for Multi-Agent Reinforcement Learning. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=sKWlRDzPfd7
[40] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG] https://arxiv.org/abs/1707.06347
[41] David Silver and Richard S. Sutton. 2025. Welcome to the Era of Experience. In Designing an Intelligence. MIT Press.
[42] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. 2022. ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. In Workshop on Language and Robotics at CoRL 2022. https://openreview.net/forum?id=3K4-U_5cRw
[43] Hao Tang, Darren Key, and Kevin Ellis. 2024. WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 70148–70212.
https://proceedings.neurips.cc/paper_files/paper/2024/file/820c61a0cd419163ccbd2c33b268816e-Paper-Conference.pdf
[44] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 23–30.
[45] Dweep Trivedi, Jesse Zhang, Shao-Hua Sun, and Joseph J Lim. 2021. Learning to synthesize programs as interpretable and generalizable policies. Advances in Neural Information Processing Systems 34 (2021), 25146–25163.
[46] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning. PMLR, 5045–5054.
[47] Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. 2024. Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv:2211.04325 [cs.LG] https://arxiv.org/abs/2211.04325
[48] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=ehfRiF0R3a
[49] Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, and Xiaolong Wang. 2024. GenSim: Generating Robotic Simulation Tasks via Large Language Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=OI3RoHoWAN
[50] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. 2019.
Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions. arXiv:1901.01753 [cs.NE] https://arxiv.org/abs/1901.01753
[51] Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. 2025. Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning. arXiv:2506.03136 [cs.CL] https://arxiv.org/abs/2506.03136
[52] Yao Wen Yang, Jian Feng Xu, and Chee Kiong Soh. 2006. An evolutionary programming algorithm for continuous global optimization. European Journal of Operational Research 168, 2 (2006), 354–369.

Appendix

The appendix is organized as follows. Section A details the co-evolution algorithm and Nash distribution computation. Section B provides environment implementation specifics for the simulators used (i.e., MiniGrid, PyGame, and CARLA). Section C lists the exact prompts used for environment and policy generation. Section D reports additional experimental results, including generalization and reinforcement learning baselines. Section E presents the best-performing policies per domain. Finally, Section F illustrates examples of evolved environments and their mutation progress.

A Algorithmic Details

Co-Evolution Loop. At each generation t, the system performs a mutation-based search to synthesize a new environment θ_t and a corresponding policy π_t. Environment mutation generates K candidate environments by perturbing the previous one, θ_{t−1}. Each candidate is evaluated under the Nash distribution w_{t−1} over policies from earlier iterations, and the candidate with the lowest expected return is selected. Policy mutation is initialized from the highest-weighted policy π_best under w_{t−1}.
The policy LLM generates K mutated versions of this base policy, which are evaluated solely on θ_t, and the best-performing policy is selected. The pair (θ_t, π_t) is added to the archive, and the payoff matrix M ∈ [0, 1]^{t×t} is updated accordingly.

Nash Distribution Computation. To determine the current policy mixture, we solve a two-player zero-sum game defined by the empirical payoff matrix M. We compute the Nash equilibrium policy distribution by solving the dual linear program of this zero-sum game. The optimization is formulated using PuLP [30] and solved with the CBC backend solver [13]. The solution yields a probability distribution over policies that maximizes the policy mixture's worst-case return across environments:

    w_t = argmax_{w ∈ Δ_t} min_i Σ_j w_j M_{ij}.

B Environment Details

All environments are provided, at generation 0, with heuristics that ensure solvability for the specific task. For an environment to be accepted, at least one solution is required.

B.1 MiniGrid Implementation Details

The MiniGrid environment represents a 2D grid-world where each cell encodes the presence of objects, walls, keys, doors, and other entities. The environment supports flexible configurations of size, object placement, and symbolic dependencies, making it suitable for general planning tasks.

Action Space. The agent interacts with the environment using a discrete action space of six primitive actions:
• TURN_LEFT (0): Rotate the agent 90° counterclockwise.
• TURN_RIGHT (1): Rotate the agent 90° clockwise.
• MOVE_FORWARD (2): Advance one tile forward, if the path is free.
• PICK_UP (3): Pick up an object in front, used for collecting keys.
• DROP (4): Drop the currently carried object onto the tile in front.
• TOGGLE (5): Interact with doors in front of the agent:
  – Open a closed door (STATE = 1).
  – Unlock a locked door (STATE = 2) if carrying the correct key.

Tile Encoding.
Each grid tile is encoded as a 3-tuple of integers: (OBJECT_IDX, COLOR_IDX, STATE). This structured representation is provided in a fully observable grid array. The indexing is spatial, with (x, y) referring to grid row and column, respectively.

Table 2: MiniGrid OBJECT_IDX Mappings

0 Unseen | 1 Empty | 2 Wall | 3 Floor | 4 Door | 5 Key | 6 Ball | 7 Box | 8 Goal | 9 Lava | 10 Agent

Table 3: Door State Field

0 Open (passable) | 1 Closed (toggle to open) | 2 Locked (requires key to unlock and toggle)

Environment Logic. Doors and keys are linked by color indices, with up to six distinct colors available. Locked doors block the agent's path until the corresponding key is acquired. The environment enforces procedural placement constraints, ensuring at least one feasible path exists through BFS-based solvability checks. Walls and other obstacles further complicate navigation. The agent maintains a single-key capacity, necessitating key management and path re-planning in multi-door configurations.

Observations. At each timestep, the agent receives a fully observable grid state represented as a flattened tensor of shape (grid_size × grid_size × 3), normalized to [0, 1]. Each tile encodes the object type, color index, and dynamic state (e.g., door status) as defined by the environment's tile encoding scheme. In addition, the policy receives the agent's absolute position (agent_pos) and current orientation (agent_dir), enabling precise spatial reasoning and orientation-dependent actions. This structured input enables policies to perform symbolic reasoning without perceptual ambiguity, allowing them to focus solely on decision-making and planning.

Figure 9: Example of a generated MiniGrid environment (cf. Fig. 3). For this environment, the agent (red arrow) must reach the green goal tile by unlocking the intermediate colored doors using the corresponding keys.
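The BFS-based solvability check described above can be sketched in a few lines. This is a minimal illustration only: the function name and the `ignore_doors` flag mirror the `check_solvable` / `bfs_ignore_doors` helpers referenced in the prompts, but the signature here is hypothetical and the actual implementation is part of our code release.

```python
from collections import deque

# Tile codes from the OBJECT_IDX mapping (Table 2).
WALL, DOOR, LAVA = 2, 4, 9

def check_solvable(grid, start, goal, ignore_doors=True):
    """BFS over an (H, W, 3) tile array; returns True iff `goal` is reachable
    from `start`. With ignore_doors=True this is the relaxed check that treats
    every door as passable (used to prune candidate layouts early)."""
    h, w = len(grid), len(grid[0])
    blocked = {WALL, LAVA} if ignore_doors else {WALL, LAVA, DOOR}
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (0 <= nx < h and 0 <= ny < w and (nx, ny) not in seen
                    and grid[nx][ny][0] not in blocked):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return False
```

The strict variant (`ignore_doors=False`) underestimates reachability, which is exactly what a layout-acceptance test needs: a layout that passes the strict check is solvable regardless of key placement.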
B.2 PyGame Implementation Details

In the PyGame environment, each instance defines a bounded 2D plane in pixel space, with task-specific width and height parameters. The agent is modeled as a circular body with a fixed physical radius of 15 pixels, while the goal zone is a rectangular target area guaranteed to fully contain the agent's circle upon successful completion. Obstacles are axis-aligned rectangles with randomly positioned and sized dimensions. Their placement follows strict feasibility constraints:
• Obstacles must not overlap with the goal zone.
• Obstacles must not overlap with each other.
• New obstacles are placed only if their inflated bounding box (expanded by the agent's radius) does not intersect existing obstacles, ensuring local non-overlap and feasible placement.

Action and Observation Spaces. The agent selects a continuous 2D velocity vector [dx, dy] ∈ [−1.0, 1.0]² at each timestep. This vector is scaled by an environment-defined speed factor to determine pixel-wise displacement. Collision detection is performed for each proposed movement; invalid moves that would result in obstacle penetration or leaving environment bounds are rejected, leaving the agent stationary.

Observations are provided as a structured dictionary containing:
• agent_pos: The agent's center coordinates in pixels.
• objects: A list describing the goal zone and each obstacle, with entries specifying type, pos, size, and (for the goal zone) purpose.
• step_count: The current timestep within the episode.

Figure 10: Example of a generated PyGame environment (cf. Fig. 4). In this environment, the agent (blue circle) must navigate the environment spatially to reach the goal (green rectangle).

Task Parameters. Task difficulty is progressively scaled by modifying environment parameters, including:
• The number of obstacles, increasing clutter and requiring more deliberate path planning.
• The environment's width and height, expanding navigation complexity.
• The agent's movement speed, reducing maneuverability.
• The minimum agent–goal start distance, forcing longer traversal paths.

These parameters are dynamically adjusted by the environment generator to produce increasingly challenging, yet solvable, task instances.

Episode Termination and Reward. An episode terminates when the agent's circular body is entirely within the goal zone or when the maximum allowed steps are exhausted.

Feasibility Guarantees. To ensure the agent can navigate to the goal, the environment performs a reachability check using a discretized occupancy grid that inflates obstacle regions by the agent's radius. This guarantees that all generated tasks are physically feasible for the agent to complete. Invalid placements of obstacles or agent start positions are rejected during the generation process. This process ensures that every evaluation involves meaningful, solvable navigation challenges with non-trivial spatial reasoning requirements.

B.3 CARLA Implementation Details

Simulator and Map. We use CARLA Town01 in synchronous mode with a fixed time step. The route is a recorded closed polyline. The roadway is a two-way single carriageway: one lane per direction, each ≈ 4 m wide (total ≈ 8 m). Each episode spawns the ego at a fixed start; non-ego vehicles and pedestrians are randomized. Episodes terminate on collision, red-light violation, timeout, or loop completion.

Figure 11: Example of a generated CARLA environment (cf. Fig. 5). From the ego viewpoint, the car perceives pedestrians, traffic lights, and other vehicles as described in Appendix B.3. The red car to the right of the ego vehicle (Tesla v3) was intentionally placed by the LLM to confuse the policy.

Frenet Geometry and Progress.
Let the route be a looped polyline {P_i}_{i=1}^N. For a world point p ∈ R², we project onto each segment:

    t_i = clip( (p − P_i)ᵀ (P_{i+1} − P_i) / ‖P_{i+1} − P_i‖₂², 0, 1 ),    p̂_i = P_i + t_i (P_{i+1} − P_i).

Let k = argmin_i ‖p − p̂_i‖₂, let the segment length be ℓ_k = ‖P_{k+1} − P_k‖₂, and let the cumulative arclength up to segment k be s_k. We define the arclength and lateral offset as:

    s(p) = s_k + t_k ℓ_k,    ℓ⊥(p) = (p − p̂_k)ᵀ n_k,    n_k = (1/ℓ_k) (−d_{k,y}, d_{k,x})ᵀ.

Progress from the episode start s_0 wraps on the loop: Δs = (s(p) − s_0) mod L, with loop length L. We express relative positions and velocities in the ego frame via the rotation

    R_we(ψ) = [ cos ψ, sin ψ ; −sin ψ, cos ψ ].

The yaw error is Δψ = ((ψ − ψ_path + π) mod 2π) − π.

Observation Space. We expose only the features the policy needs to drive on a prescribed path while interacting with traffic and pedestrians:

Ego kinematics. Speed speed_mps, yaw rate yaw_rate_rps, lateral error ℓ⊥, yaw error Δψ. Short histories (length 4) for {speed, lateral error, yaw error, past steer/throttle/brake} stabilize control.

Traffic light. Nearest traffic light ahead on the route: exists, dist_m, state ∈ {Red, Green, Yellow}. For simplicity, Yellow is treated as Red.

Vehicles. We keep a small, ordered snapshot (top-2) per class with ego-frame gaps and simple surrogates:

    THW = g_x / max(0.5, v_ego),    TTC = g_x / (−Δv_x) if Δv_x < 0, and null otherwise.

Lead cars are those with Frenet lateral |ℓ⊥| ≤ 2 m (ego lane). Opposite cars fall in −6 ≤ ℓ⊥ < −2 m (oncoming lane).

Pedestrians. Within a forward window along the route, we classify a walker as: (i) in-lane if |g_y| ≤ 2 m; (ii) approaching if 2 < |g_y| ≤ 3 m and moving toward the lane (Δv_y g_y < 0). For approaching walkers, we estimate the time to enter the near lane edge, t_enter = (y★ − g_y) / Δv_y when |Δv_y| is non-negligible, where y★ = ±2 m.

Action Space.
A continuous 3-vector (steer, throttle, brake) with steer ∈ [−1, 1], throttle ∈ [0, 1], brake ∈ [0, 1].

Notes. This design yields a small, interpretable state while covering path tracking (via ℓ⊥, Δψ), car-following and oncoming interactions (via THW/TTC and lane bands), signal compliance (traffic-light snapshot), and pedestrian crossing risk (in-lane vs. approaching with t_enter). All constants and implementation details (e.g., horizons, smoothing) are provided in our code release.

C Prompts

We provide the exact prompts used for environment and policy generation in each domain. These are instantiated dynamically at each iteration, reflecting task-specific parameters and environment configurations.

C.1 Environment Generation Prompts

Box 1: MiniGrid Environment Prompt

GOAL
Minimize the scalar "Actual Score" in [0,1] evaluated on the Nash-weighted policy mix:
{Weights}
{Policies}

You will return a SINGLE Python class that replaces the existing:
class CustomEnv(MiniGridEnv):

REQUIREMENTS (MANDATORY)
1) Class:
- Keep "class CustomEnv(MiniGridEnv):" and its public API exactly.
- Do not modify the base class or inheritance.
2) Helpers:
- If you add helpers, define them inside the same file.
- Do not rely on undefined globals or external dependencies.
3) Fixed knobs in __init__
- Set once (and only here):
    self.size = {Size}
    self.num_doors = min({NumDoors}, 6)
- "_gen_grid()" must use these fixed values directly (no dynamic rescaling).
4) Structured placement
- Place perimeter, rooms/corridors, doors, keys, and goal via explicit logic.
- Do not place objects blindly or randomly.
5) Solvability check
- After placements, call "check_solvable(self, start, goal)" exactly as provided.
- Accept the layout only if it returns True.
6) Episode diversity
- "_gen_grid()" must generate different layouts across episodes (vary partitions, door indices, key pockets) while using the fixed knobs.
7) Termination
- Episode ends when (a) the agent reaches the goal, or (b) max_steps is exceeded.
8) Retry policy
- If unsolvable, retry up to 1000 times.
- If still unsolvable, set "self.failure_feedback" and return.

FEASIBILITY FUNCTIONS (DO NOT MODIFY)
- "_verify_mandatory_door_keys(self)"
- "bfs_ignore_doors(self, start, goal)"
- "bfs_block_locked(self, start, goal)"
- "_find_key_spot_block_locked(self, agent_pos, door_pos, unlocked_colors)"
Use these as-is. You may add helpers, but never replace or alter them.

OUTPUT
Return ONLY the updated "CustomEnv" class (no commentary).

Box 2: PyGame Environment Prompt

GOAL
Minimize the scalar "Actual Score" in [0,1] evaluated on the Nash-weighted policy mix:
{Weights}
{Policies}

REQUIREMENTS (MANDATORY)
1) Class/API
- Keep "class CustomEnv:" and its public API exactly.
- Do not change the class name or inheritance.
2) Implement only these methods:
- reset(self)
- step(self, action)
- draw_objects(self)
- task_description(self)
- _get_obs(self)
- render(self)
- Any private helpers defined inside the class (e.g., _handle_quit, _sample_pos).
3) task_description(self)
- Must return a plain string describing:
  - The task objective (agent must reach the goal zone).
  - The action space (continuous [dx,dy] in [-1.0,1.0]^2).
  - Key parameters (sizes, margins, speeds).
  - The full observation dictionary structure.
4) Episode termination
- Ends if agent reaches the goal (checked externally by _check_done()).
- Ends if max_steps is exceeded.
- Do not call _check_done() inside step().
5) Structured placement
- Place agent, goal, and obstacles via explicit rules.
- Ensure no overlaps; keep all objects inside bounds.
- Guarantee solvability (always at least one valid path).
6) Randomization
- Use structured randomness (np.random.randint, np.random.uniform) in reset().
- Every reset() must produce a distinct environment instance.
- Randomness must contribute to meaningful diversity.
7) Safety
- The goal zone must be large enough: width,height >= 2*agent_radius + margin.
8) Behavior
- step(action): interpret action as 2D continuous move.
- Call self._handle_quit() at the start to process quit events.
- Return (obs, reward, done) where obs=self._get_obs(), reward=0.0, done=self.done.
9) Observations
- _get_obs() must return:
  {
    "agent_pos": [x,y],
    "agent_radius": r,
    "objects": [
      { "type": "zone", "pos": [cx,cy], "size": [w,h], "purpose": "goal" },
      { "type": "obstacle", "shape": "rect", "pos": [cx,cy], "size": [w,h] },
      ...
    ],
    "bounds": [W,H],
    "step_count": N,
    "max_steps": M
  }
10) Rendering
- draw_objects(): use pygame primitives; only draw if self.render_mode==True.
- render(): create/update PyGame surface, call draw_objects(), flip buffers if render_mode==True.

TASK EVOLUTION
- Increase distance between agent and goal.
- Add obstacles or tighter passages.
- Increase W,H and proportionally increase max_steps.
- You are free and motivated to introduce new difficulties as long as the task remains the same: The agent must reach the goal zone.

CONSTRAINTS
- Do NOT add symbolic puzzles (no keys, doors, colors).
- Do NOT use MiniGrid tile logic.
- Do NOT add irrelevant randomness.
- Do NOT remove or rename required methods.
- Do NOT alter the external _check_done().

OUTPUT
Return ONLY the updated "CustomEnv" class (no commentary).

Box 3: CARLA Environment Prompt

GOAL
Minimize the scalar "Actual Score" in [0,1] evaluated on the Nash-weighted policy mix:
{Weights}
{Policies}
Current environment performance: {ActualScore}

You will return a SINGLE Python class that replaces the existing:
{Actual_Class}

MUTATION CONSTRAINT
- Apply exactly ONE structural change to the class.
- A structural change is defined as:
  (i) adding or removing a single environment component or behavior, OR
  (ii) modifying the logic of a single existing component (within a single method).
- All other code must remain semantically identical.

REQUIREMENTS (MANDATORY)
1) Class
- Keep the class name "CarlaTown01Env" exactly.
- Preserve all existing methods; you may add new helpers inside the class.
2) Task identity
- Ego vehicle is always vehicle.tesla.model3.
- Start and goal follow the same Town01 loop.
- Do not spawn oversized vehicles (buses, trucks) or actors that may block solvability.
3) task_description(self)
- Must return a plain string describing:
  - The driving objective (complete the loop without collisions).
  - Key environment elements (traffic, pedestrians, lights, dynamic behaviors).
  - The observation dictionary fields provided to the policy.
- If new fields are added to obs (via get_obs), they must be explicitly documented in this string.
- Do not remove fields; only extend if needed.
4) Observations
- get_obs() must remain consistent with the description.
- The observation dictionary includes ego state, histories, traffic lights, lead cars, opposite cars, pedestrians.
- New factors (e.g., jaywalkers, lane changes) must be added carefully and described in task_description.
5) Solvability and safety
- Always ensure at least one feasible driving strategy exists.
- Pedestrian side-hit guard must remain intact.
- Adjust max_steps proportionally if difficulty increases.
- No actor may force unsolvable collisions.
6) Constraints
- Do not add global code or side effects outside the class.
- Do not remove feasibility checks already in place.
- Do not change the class name.

OUTPUT
Return ONLY the updated CarlaTown01Env class (no commentary).

C.2 Policy Generation Prompts

Box 4: PyGame Policy Prompt

GOAL
Maximize the scalar "Actual Score" in [0,1] for the current policy by improving the given function.
YOU MUST RETURN A SINGLE Python function that replaces the existing:
def policy(obs):  # -> [dx, dy] in [-1.0, 1.0]^2

MUTATION CONSTRAINT
- Apply exactly ONE structural change to the class.
- A structural change is defined as:
  (i) adding or removing a single environment component or behavior, OR
  (ii) modifying the logic of a single existing component (within a single method).
- All other code must remain semantically identical.

INPUTS
- Actual Score = {ActualScore}
- Given Policy = {Policy}
- Observation dictionary schema (exact field names and meaning) = {obs_dict}

FUNCTION CONTRACT
- Keep the exact signature: def policy(obs).
- Return a 2D continuous action [dx, dy] with each component in [-1.0, 1.0].
- Do not use randomness, globals, I/O, or external libs beyond numpy.

OBSERVATION DICTIONARY
- Use only fields provided in obs. The runner supplies {obs_dict}. The typical structure is:
  - agent_pos: [x, y] current agent center position in pixels
  - agent_radius: r agent circle radius in pixels
  - objects: list of dicts describing scene items. Each item:
    - type: "zone" or "obstacle"
    - pos: [cx, cy] center position
    - size: [w, h] rectangle width, height
    - purpose: optional string, e.g., "goal" for the goal zone
  - bounds: [W, H] canvas width and height in pixels
  - step_count: N steps elapsed
  - max_steps: M step limit before timeout

STRICT REQUIREMENTS
1) Goal-directed motion
- Move toward the goal zone whenever a safe direction exists.
- Prefer progress that reduces distance-to-goal.
2) Obstacle and bounds safety
- Do not output actions that would cause collision with any obstacle rectangle.
- Do not move outside [0, W] x [0, H].
- Respect agent_radius clearance when deciding direction.
3) Action validity
- Clamp or normalize output so each component stays within [-1.0, 1.0].
- Avoid jitter or oscillation near the goal.
4) Robustness
- If the direct route is blocked, choose a safe detour around obstacles.
- Avoid loops by preferring actions that reduce distance-to-goal over time.

EDGE CASES
- Goal directly reachable: head straight toward the goal center or safe entry edge.
- Narrow passage: align with passage axis and pass through without scraping boundaries.
- Stuck against obstacle: choose an alternate heading that increases free-space margin.
- Near goal zone edge: reduce overshoot and enter the zone cleanly.

QUALITY TARGETS
- Short time-to-goal.
- Minimal wasted motion and reversals.
- Collision-free trajectories across diverse layouts.

FORBIDDEN
- Changing the function name, arguments, or return type.
- Returning values outside [-1.0, 1.0].
- Ignoring obstacles, bounds, or agent_radius in decisions.
- Using randomness, global state, file or network I/O, or non-numpy libraries.

OUTPUT
Return ONLY the improved function "def policy(obs):"
No explanations, no comments, no extra text.

Box 5: MiniGrid Policy Prompt

GOAL
You are tasked with improving an existing policy function for navigating MiniGrid environments by applying a macro-mutation operator. The policy must analyze the grid, reason about objects, plan an optimal path, and execute actions efficiently. The objective is to reach the goal tile (OBJECT_IDX=8).

You are provided with:
- Actual Score = {ActualScore}, a scalar in [0,1] that reflects the performance of the given policy.
- Policy = {Policy}, the current implementation of the policy function.

YOUR TASK
Analyze the given policy together with its score and modify it to improve performance. The output must be a new version of the same function with improvements.

MUTATION CONSTRAINT
- Apply exactly ONE structural change to the class.
- A structural change is defined as:
  (i) adding or removing a single environment component or behavior, OR
  (ii) modifying the logic of a single existing component (within a single method).
- All other code must remain semantically identical.

OUTPUT
Return a single Python function:
def policy(obs, agent_pos, agent_dir):  # -> int in {0,1,2,3,4,5}

ENVIRONMENT FORMAT
- obs is a 2D NumPy array of shape (grid_size, grid_size, 3).
- Each tile is encoded as (OBJECT_IDX, COLOR_IDX, STATE).
- Indexing is (x=row, y=column).

OBJECT_IDX MAP:
0=Unseen, 1=Empty, 2=Wall, 3=Floor, 4=Door, 5=Key, 6=Ball, 7=Box, 8=Goal, 9=Lava, 10=Agent

DOOR STATE:
0=Open (free to pass), 1=Closed (requires Toggle action=5 when facing), 2=Locked (requires correct key + Toggle=5)

ACTIONS:
0=Turn Left, 1=Turn Right, 2=Move Forward, 3=Pick Up, 4=Drop, 5=Toggle

STRICT REQUIREMENTS
1) Goal-Oriented Navigation
- Always plan and execute a valid path to the Goal (OBJECT_IDX=8).
- Avoid unnecessary detours unless a locked door blocks the path.
2) Door Handling
- Open doors (STATE=0) act as free space.
- Closed doors (STATE=1): face the door, Toggle (5) to open, then Move Forward (2).
- Locked doors (STATE=2): only approach after collecting the correct key. Face the door, Toggle (5), then Move Forward (2).
3) Key Handling
- Keys are only collected if required to unlock a blocking door.
- The agent can hold exactly one key at a time.
- If already holding a different key, Drop (4) into the front cell (if empty) before picking up the new one.
- Keys must be picked up with Pick Up (3) when the agent is adjacent and facing the key.
- Dropped keys must remain accessible.
4) Safety and Obstacles
- Never Move Forward (2) into a Wall (2) or Lava (9).
- Treat Unseen tiles (0) as blocked until explored.
5) Orientation
- Before any interaction (Move Forward, Pick Up, Drop, Toggle), ensure the agent is facing the correct adjacent cell.
- Rotate (0=Left, 1=Right) until aligned, then act.
6) Termination
- The episode ends when the agent reaches the Goal or exceeds max_steps.
- The policy must minimize wasted actions and maximize efficiency.
EDGE CASES
- If the agent needs a key but already holds another, drop the held key before pickup.

FORBIDDEN
- Changing the function name, arguments, or return type.
- Returning values outside {0,1,2,3,4,5}.
- Using randomness, global state, or external libraries.

OUTPUT FORMAT
- Return only the improved function `def policy(...)` in valid Python.
- No explanations, no comments, no extra text.

Box 6: CARLA Policy Prompt

GOAL
Maximize the scalar "Actual Score" in [0,1] by improving the current driving policy.

You will return a SINGLE Python class that replaces the existing:
{Actual_Policy}

MUTATION CONSTRAINT
- Apply exactly ONE structural change to the class.
- A structural change is defined as:
  (i) adding or removing a single environment component or behavior, OR
  (ii) modifying the logic of a single existing component (within a single method).
- All other code must remain semantically identical.

INPUTS
- Actual Score = {ActualScore}
- Previous Policy = {Policy}
- Path = np.ndarray (N,2) lane-center polyline

STRICT REQUIREMENTS
1) Class/API
- Keep the class name "Policy".
- Implement __init__(self) and compute_action(self, obs, path).
- Return (steering, throttle, brake) as floats.
- steering in [-1,1], throttle in [0,1], brake in [0,1].
- If brake > 0 then throttle must equal 0.
2) Determinism and smoothness
- No randomness or learning.
- Ensure gradual changes, avoid jerks.
3) Robustness
- Handle None or NaN conservatively.
- On invalid input, default to safe stop (steer=0, throttle=0, brake>0).
- No prints or logging.
4) Sign conventions
- lateral_hist4: right-positive meters.
- yaw_error_hist4: ego yaw - path yaw, right-positive.
- yaw_rate_rps: right-positive radians/s.
OBSERVATION FORMAT
obs is a dictionary containing:
- Ego state:
  - speed_mps: current speed in m/s
  - yaw_rate_rps: yaw rate in rad/s (right-positive)
- Ego histories (arrays of length 4):
  - speed_hist4: past speeds
  - lateral_hist4: lateral errors (m, right-positive)
  - yaw_error_hist4: yaw errors (rad, right-positive)
  - steer_cmd_hist4: previous steering commands
  - throttle_cmd_hist4: previous throttle commands
  - brake_cmd_hist4: previous brake commands
- Traffic light:
  - exists: boolean
  - state: int {0=unknown, 1=green, 2=yellow, 3=red}
  - dist_m: distance to stop line (m)
- Lead cars (up to 2, same schema each):
  - gap_long_m: longitudinal gap (m)
  - gap_lat_m: lateral gap (m)
  - rel_long_mps: relative longitudinal speed (m/s)
  - ttc_s: time-to-collision (s)
  - thw_s: time headway (s)
- Opposite cars (up to 2, same schema as lead cars)
- Pedestrians (variable count):
  - lane: lane index
  - state: int encoding motion state
  - gap_long_m: longitudinal gap (m)
  - gap_lat_m: lateral gap (m)
  - rel_lat_mps: relative lateral speed (m/s)
  - t_enter_lane_s: predicted time to enter lane (s)
  - side: which side of road (left/right)
- All dynamic actors truncated to 35 m ahead of ego

OBJECTIVES
- Lateral: minimize lateral and yaw errors relative to the centerline.
- Longitudinal: track target speed up to 6.94 m/s (25 km/h) if unimpeded.
- Traffic lights: stop smoothly before stop line on red; never cross on red.
- Pedestrians: yield to pedestrians in or entering ego lane.
- Lead vehicles: maintain safe following distance; avoid indefinite blocking.
- Precedence order: red light stop > pedestrian yielding > lead vehicle following > cruising.
- Fail-safe: if uncertain, perform controlled stop.
- Comfort: avoid abrupt oscillations; prioritize smooth steering and braking.

FORBIDDEN
- Changing the class name or method signatures.
- Returning values outside steering/throttle/brake ranges.
- Simultaneous throttle and brake > 0.
- Using randomness, logging, prints, or external dependencies.
OUTPUT
Return ONLY the new improved class "Policy".

D Additional Results

D.1 Generalization Across Environments
We compare standardized, unseen environments to those produced during co-evolution. The standardized set comprises MiniGrid DoorKey-16x16-v0 and LockedRoom-v0, and CARLA Town02 (trained on Town01). These differ from our evolved environments in three respects: (i) structure (fixed layouts and goal semantics rather than co-evolved variants), (ii) scale (grid/world size and path lengths), and (iii) sequential dependencies (e.g., key–door ordering and room unlocking). For CARLA, Town02 diverges from Town01 in road-network density and traffic complexity: it has sharper turns, narrower lanes, and more intersections and pedestrian crossings, requiring longer detours and tighter maneuvers compared to the more regular Town01 layout. We evaluate with identical rollout settings.

D.2 Reinforcement Learning Results

D.2.1 MiniGrid Maze-solving. We evaluate two representative algorithms using Stable-Baselines3 [38]: PPO, a policy-gradient method, and QRDQN, a value-based method for discrete domains.

Reward shaping. We use the default MiniGrid reward function:

$$R(s, a) = \begin{cases} 1 - 0.9 \cdot \frac{t}{T_{\max}}, & \text{if the agent reaches the goal at step } t, \\ 0, & \text{otherwise,} \end{cases}$$

where $T_{\max}$ is the maximum episode length. Thus, faster completion yields a higher return.

Algorithm | Env 0        | Env 2      | Env 6
PPO       | 12.0 ± 1.4%  | 0.0 ± 0.0% | 0.0 ± 0.0%
QRDQN     | 68.5 ± 12.0% | 0.0 ± 0.0% | 0.0 ± 0.0%

Table 4: Success rates (%, mean ± std over two runs) across MiniGrid environments.

D.2.2 PyGame 2D Navigation. We evaluate two representative algorithms using Stable-Baselines3 [38]: PPO and SAC.

Reward shaping.
In the PyGame environments, the reward is sparse with a per-step penalty:

$$R(s, a) = \begin{cases} +1, & \text{if the agent reaches the goal}, \\ -0.01, & \text{otherwise (each step)}. \end{cases}$$

This encourages agents to minimize path length while ensuring sparse success feedback.

D.3 Additional Seed Run

E Best Performing Policies
The final evolved policies are too extensive to analyze line by line. Instead, we provide high-level summaries of their algorithmic structure and key heuristics.

Algorithm | Env 0       | Env 2       | Env 6
PPO       | 61.7 ± 2.3% | 6.2 ± 0.7%  | 0.0 ± 0.0%
SAC       | 88.2 ± 3.0% | 22.5 ± 5.9% | 0.0 ± 0.0%

Table 5: Success rates (%, mean ± std over two runs) across PyGame environments.

E.1 Best Performing MiniGrid Policy
The best performing policy (policy_9, see Figure 6) is a fully model-based planning agent that formulates MiniGrid navigation as a discrete A* search over agent position, orientation, held key color, and door-open states. The policy operates as follows: (1) it parses the grid to identify the goal, doors, and keys, (2) computes a relaxed reachability region that ignores door and key semantics to conservatively identify which doors and key colors can lie on a valid start–goal corridor, (3) performs lexicographic-cost A* planning over a factored state space with explicit door-toggling, key-pickup, key-drop, and movement actions, and (4) executes only the first action of the optimal plan, replanning at every timestep.

To reduce the search space without sacrificing correctness, the planner reasons only about useful doors (doors that lie in the relaxed start–goal corridor) and useful key colors (keys that can open such doors). Keys are treated as blocking cells during planning, and keys are only dropped when the agent is holding one and the cell ahead is empty, ensuring safe and deterministic key management.
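The factored-state search described above can be illustrated in miniature. The sketch below is not the paper's policy_9: it collapses orientation and multiple key colors into a single has_key flag, treats the key cell as a pickup rather than a blocking cell, and uses hypothetical names (GRID, find, astar). It shows only the core idea of A* over a state that couples position with door/key status.

```python
import heapq

# Toy layout: '#'=wall, '.'=free, 'D'=locked door, 'K'=key, 'G'=goal, 'S'=start.
GRID = [
    "S.#G",
    "..D.",
    "K.#.",
]

def find(ch):
    """Locate the first cell containing character ch."""
    for r, row in enumerate(GRID):
        for c, cell in enumerate(row):
            if cell == ch:
                return (r, c)

def neighbors(state):
    """Successor states: move into free cells; the door requires the key."""
    (r, c), has_key, door_open = state
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])):
            continue
        cell = GRID[nr][nc]
        if cell == '#':
            continue
        if cell == 'D' and not door_open:
            if has_key:  # unlock and enter in one step
                yield ((nr, nc), has_key, True)
            continue
        yield ((nr, nc), has_key or cell == 'K', door_open)

def astar(start, goal):
    """A* over the factored state (position, has_key, door_open)."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan, admissible
    frontier = [(h(start), 0, (start, False, False))]
    best = {}
    while frontier:
        f, g, state = heapq.heappop(frontier)
        if state[0] == goal:
            return g  # number of moves on the optimal plan
        if best.get(state, float("inf")) <= g:
            continue
        best[state] = g
        for nxt in neighbors(state):
            heapq.heappush(frontier, (g + 1 + h(nxt[0]), g + 1, nxt))
    return None  # goal unreachable
```

On this toy grid the direct route is walled off, so the planner must detour to the key before opening the door, exactly the kind of door–key dependency the evolved policy resolves.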
The A* heuristic combines Manhattan distance to the goal with a lower bound on the number of turns required to face a direction that reduces this distance, improving guidance while preserving admissibility.

Together, these components yield a deterministic planning agent that can reliably resolve door–key dependencies, minimize unnecessary interactions, and avoid key-handling loops in complex MiniGrid environments.

E.2 PyGame Policy
The PyGame agent is implemented as a planning–reactive navigation policy that combines global path planning with local, feasibility-aware motion selection at every timestep.
• Global planning: The agent computes a global path to the goal using A* search on a coarse occupancy grid. Obstacles are mildly inflated based on the agent radius to ensure collision-free paths. The resulting path is cached and only recomputed when the goal changes or when progress stalls.
• Waypoint tracking with visibility lookahead: The agent follows the planned path using waypoints, advancing when sufficiently close. If multiple upcoming waypoints are directly visible, the agent skips intermediate points and targets the farthest visible waypoint.
• Local motion candidates: At each step, the policy samples a set of candidate motion directions around the desired path direction, including small angular deviations and obstacle-aligned tangents when near walls.

(a) DoorKey-16x16-v0 (MiniGrid)  (b) LockedRoom-v0 (MiniGrid)  (c) MultiRoom-N6-v0 (MiniGrid)  (d) Town02 (CARLA)
Figure 12: Examples of previously unseen standardized environments used to validate generalization. MiniGrid snapshots (top): DoorKey, LockedRoom, and the hardest goal-reaching benchmark ObstructedMaze-Full. CARLA Town02 (bottom).
• Predictive collision checking: Each candidate direction is validated using one-step forward collision checks that match the environment's continuous collision model. Infeasible directions are discarded before scoring.
• Directional scoring and selection: Feasible candidates are scored based on path-aligned progress, goal alignment, continuity with the previous action, and local obstacle clearance. The highest-scoring direction is selected.
• Oscillation control: The policy maintains a short-term memory of recent motion directions and penalizes rapid directional sign changes. A turn-rate limiter further constrains angular changes between successive actions.
• Gap-centering behavior: When near obstacles, lateral ray probes estimate free space on either side of the agent, biasing motion toward the center of locally available free space.
• Execution: The final action is a normalized 2D velocity direction returned to the environment. If no feasible direction exists, the agent temporarily halts and triggers replanning.

This structure allows the agent to consistently alternate between global path guidance and locally feasible motion execution in continuous PyGame navigation environments.

E.3 CARLA Policy
The best performing controller (policy_9, see Fig. 6) augments a smooth cruise/follow core with a clearance-aware passing routine and stricter intersection handling.

Four-stage loop.
(1) Signal gating: strict traffic-light guard (Yellow = Red), stop-line latch, and pedestrian holds; approach speed is limited by both stop-line distance and queued-lead gap.
(2) Lead classification: distinguishes a right-curb parked blocker from an in-lane stopped lead using lateral intrusion and relative-speed cues.
(3) Clearance-aware pass: if the blocker is parked, oncoming traffic is clear, and distance gates are met, the agent enters a bounded left-offset pass.
It maintains a minimum oset and a small, gated opposite-lane incursion, holds the oset while alongside, and only recenters after front-clearance (with a brief hold if the lead vanishes). (4) Smooth tracking : target-speed smo othing with curvature/heading caps COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Steps 1e6 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 R ewar d R ewar d (MA=50) ±1 std (a) PPO Env 0 0.0 0.2 0.4 0.6 0.8 1.0 Steps 1e6 0.04 0.02 0.00 0.02 0.04 R ewar d R ewar d (MA=50) ±1 std (b) PPO Env 2 0.0 0.2 0.4 0.6 0.8 1.0 Steps 1e6 0.04 0.02 0.00 0.02 0.04 R ewar d R ewar d (MA=50) ±1 std (c) PPO Env 6 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Steps 1e6 0.1 0.2 0.3 0.4 0.5 R ewar d R ewar d (MA=50) ±1 std (d) QRDQN Env 0 0.0 0.2 0.4 0.6 0.8 1.0 Steps 1e6 0.04 0.02 0.00 0.02 0.04 R ewar d R ewar d (MA=50) ±1 std (e) QRDQN Env 2 0.0 0.2 0.4 0.6 0.8 1.0 Steps 1e6 0.04 0.02 0.00 0.02 0.04 R ewar d R ewar d (MA=50) ±1 std (f ) QRDQN Env 6 Figure 13: Training curves of PPO ( top ) and QRDQN ( boom ) across MiniGrid environments. The y-axis represents reward, and the x-axis represents total training steps. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Steps 1e6 0.2 0.4 0.6 0.8 1.0 R ewar d (a) PPO Env 0 0.0 0.2 0.4 0.6 0.8 1.0 Steps 1e7 0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.175 R ewar d (b) PPO Env 2 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Steps 1e6 0.04 0.02 0.00 0.02 0.04 R ewar d (c) PPO Env 6 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Steps 1e6 0.0 0.2 0.4 0.6 0.8 1.0 R ewar d (d) SAC Env 0 0.0 0.2 0.4 0.6 0.8 1.0 Steps 1e7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 R ewar d (e) SAC Env 2 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Steps 1e6 0.04 0.02 0.00 0.02 0.04 R ewar d (f ) SAC Env 6 Figure 14: Training curves of PPO (top) and SAC (b ottom) across Py Game environments. The y-axis represents reward, and the x-axis represents total training steps. 
and a damped-lookahead lateral controller; unstick logic provides a gentle creep when safe.

Key heuristics.
• Stop-line priority: combined stop-line/lead-gap caps and a near-line latch prevent creeping over the line on non-green states.

Figure 15: Performance during environment–policy co-evolution for the second seed. Left: Success rates of all discovered policies evaluated on all environments generated during evolution (the policy–environment payoff matrix). Center: Comparison between the mixed-strategy Nash equilibrium (MSNE) policy mixture, the best single policy (π_argmax), and the latest policy (π_k), evaluated on all environments {θ_0, ..., θ_k}; for PyGame, π_argmax coincides with π_k. Right: Mean success over environments {θ_0, ..., θ_k} for three strategies: UED-Greedy (latest policy only), UED-Uniform (uniform mixture over all policies so far), and COvolve (MSNE mixture).
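An MSNE mixture over such a policy–environment payoff matrix can, in principle, be recovered by solving the row player's maximin linear program. The following is a minimal sketch, assuming SciPy is available; the function name and example matrix are illustrative, not taken from the paper's code:

```python
import numpy as np
from scipy.optimize import linprog

def msne_row_strategy(payoff):
    """Maximin mixed strategy for the row player of a zero-sum game.

    Solve max_x min_j (x^T A)_j subject to sum(x) = 1, x >= 0,
    rewritten as an LP over variables [x_1, ..., x_n, v].
    """
    n_rows, n_cols = payoff.shape
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                   # maximize v -> minimize -v
    # One inequality per column j: v - (x^T A)_j <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])  # sum(x) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_rows + [(None, None)]  # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    x, v = res.x[:-1], res.x[-1]
    return x, v  # mixture over row policies and the game value
```

For matching pennies, the classic two-by-two zero-sum game, this returns the uniform mixture with game value zero; in the COvolve setting, rows would be discovered policies, columns generated environments, and entries success rates.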
• Right-edge pass safety: minimum pass offset, centerline guard, and an oncoming no-pass gate; a tiny opposite-lane incursion is permitted only when clear.
• Stability under occlusion: offset-hold on brief lead dropouts avoids snap-back; post-clear recentering includes a short hysteresis.

F Evolved Environments

F.1 MiniGrid Environment Mutations
In the MiniGrid maze-solving task, the LLM mutates discrete gridworlds where an agent must navigate to a goal while avoiding obstacles, doors, and keys. Difficulty increases through the following mechanisms:
• Grid scaling: larger grids extend path length and increase exploration requirements.
• Obstacle density: additional walls create more complex mazes and reduce direct visibility of the goal.
• Sequential key–door dependencies: locked doors are introduced along the main corridor, requiring keys to be collected and used in the correct order.
• Hard vs. soft chokepoints: some doors are reinforced by barrier walls that force strict bottlenecks, while others include short wings or detours that add complexity without fully blocking the corridor.
• Protected corridors: a one-cell halo ensures that critical key–door paths remain open even as random obstacles are added, guaranteeing solvability.

This progression transforms initially trivial layouts into structured mazes that demand multi-step reasoning, ordered dependencies, and long-horizon planning while ensuring that every environment remains solvable by construction.

F.2 PyGame Environment Mutations
In the PyGame navigation task, the LLM mutates a continuous 2D arena where a circular agent must reach a rectangular goal zone while avoiding collisions. While early generations adjust simple parameters such as arena size or obstacle counts, later environments evolve into structured mazes with corridor-like passages and long detours.
Key mutation axes include:
• Corridor formation: long rectangular bars are placed to partition the arena into corridors, forcing agents to identify traversable passages rather than rely on direct routes.
• Bottleneck and detour creation: increasing bar thickness and obstacle density narrows passageways and introduces dead ends, requiring agents to plan long, non-greedy paths.
• Start–goal separation: minimum-distance constraints push the agent to begin far from the goal, ensuring that navigation requires multiple turns and obstacle avoidance.
• Precision termination: the goal region remains small relative to the agent's size, demanding careful alignment to trigger success.
• Scalable horizons: enlarging arenas and increasing maximum steps allows environments to grow in complexity without becoming unsolvable.

Unlike gridworlds, these continuous PyGame arenas induce navigation behaviors closer to geometric planning: agents must balance global pathfinding with local collision checks, and later evolved environments present rich mazes with narrow corridors that mimic real-world navigation challenges.

F.3 CARLA Environment Mutations
In the CARLA Town01 driving task, the LLM mutates a fixed urban loop with signalized intersections, oncoming traffic, and pedestrians. Difficulty rises from light, compliant flows to dense, heterogeneous traffic with narrow-clearance segments, while remaining solvable by construction.
• Traffic scaling: vehicle counts increase from light to heavy urban load; speed variance and lane changes introduce realistic flow heterogeneity.
• Pedestrian pressure: higher crossing rates and tighter cadences create frequent curb-to-lane interactions requiring cautious approach and yielding.
• Intersection strictness: virtual "second gates" beyond stop lines mirror light states, penalizing early acceleration and forcing disciplined red/yellow behavior.
• Narrow-clearance segments: parked or frozen intrusions create lane squeezes that demand bounded lateral offsets and precise, short opposite-lane incursions when clear.
• Micro-perturbations: periodic brake-taps on leads and occasional temporary stoppers test following stability without causing deadlocks.
• Oncoming dynamics: faster opposite-lane bursts create brief no-pass windows, requiring agents to time passes and maintain centerline guards.
• Jam watchdog & solvability: stall detectors inject bounded flow perturbations to unstick traffic; obstacle placements and signal logic are constrained to ensure episodes remain completable.
• Observation compatibility: added features (e.g., lane-squeeze indicators, extended stop-line states) are exposed via backward-compatible fields to avoid policy breakage.

This progression turns a benign city loop into a dense, signal-rich scenario with tight margins and bursty interactions, pushing policies to coordinate cautious intersection handling, safe passing, and recovery from transient jams.
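The "solvable by construction" guarantees that recur in F.1–F.3 suggest a mutate-then-verify pattern. The gridworld sketch below uses hypothetical names, and a random wall stands in for the paper's LLM-proposed mutation; the verification step is a plain BFS reachability check:

```python
from collections import deque
import random

def reachable(grid, start, goal):
    """BFS solvability check: can the agent walk from start to goal? (0=free, 1=wall)"""
    rows, cols = len(grid), len(grid[0])
    seen, frontier = {start}, deque([start])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False

def mutate_solvable(grid, start, goal, rng, max_tries=50):
    """Propose one wall placement, rejecting any mutation that breaks solvability."""
    for _ in range(max_tries):
        candidate = [row[:] for row in grid]
        r = rng.randrange(len(grid))
        c = rng.randrange(len(grid[0]))
        if (r, c) in (start, goal) or candidate[r][c] == 1:
            continue
        candidate[r][c] = 1  # place a wall
        if reachable(candidate, start, goal):
            return candidate  # harder, but still solvable
    return grid  # no solvable mutation found; keep the previous environment
```

Iterating this accept/reject loop increases difficulty monotonically while every accepted environment remains completable, mirroring the protected-corridor and jam-watchdog constraints described above.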
