CosmoCore-Evo: Evolutionary Dream-Replay Reinforcement Learning for Adaptive Code Generation

Reading time: 5 minutes
...

📝 Original Info

  • Title: CosmoCore-Evo: Evolutionary Dream-Replay Reinforcement Learning for Adaptive Code Generation
  • ArXiv ID: 2512.21351
  • Date: 2025-12-20
  • Authors: Santhosh Kumar Ravindran

📝 Abstract

Building on the affective dream-replay reinforcement learning framework of CosmoCore, we introduce CosmoCore-Evo, an extension that incorporates evolutionary algorithms to enhance adaptability and novelty in code generation tasks. Inspired by anthropological aspects of human evolution, such as natural selection and adaptation in early hominids, CosmoCore-Evo treats RL trajectories as "genomes" that undergo mutation and selection during the nocturnal replay phase. This mechanism allows agents to break free from trained patterns, fostering emergent behaviors and improved performance in distribution-shifted environments, such as changing APIs or novel libraries. We augment the Dream Queue with evolutionary operations, including mutation of high-fitness trajectories and enterprise-tuned fitness functions that incorporate efficiency, compliance, and scalability metrics. Evaluated on extended benchmarks including HumanEval variants with shifts, BigCodeBench, and a custom PySpark pipeline simulation, CosmoCore-Evo achieves up to 35% higher novelty in solutions and 25% faster adaptation compared to the original CosmoCore and baselines like PPO and REAMER. Ablations confirm the role of evolutionary components in bridging the sentient gap for LLM agents. Code for replication, including a toy simulation, is provided.
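To make the replay-phase mechanics concrete, here is a minimal toy sketch of the evolutionary Dream Queue described in the abstract: trajectories are treated as genomes, scored by a fitness function that blends task reward with efficiency, compliance, and scalability terms, and the fittest are mutated into offspring during the nocturnal pass. The names (Trajectory, enterprise_fitness, nocturnal_replay), the toy action set, and the weights are illustrative assumptions, not the paper's released code.

```python
# Minimal toy sketch of an evolutionary Dream Queue pass (assumptions, not the paper's code).
import random
from dataclasses import dataclass

ACTION_SPACE = ["insert_line", "delete_line", "replace_token", "run_tests"]  # toy actions

@dataclass
class Trajectory:
    """One RL rollout treated as a 'genome': a list of (state, action, reward) steps."""
    steps: list
    efficiency: float = 0.0   # e.g. normalized runtime / token cost of the generated code
    compliance: float = 0.0   # e.g. fraction of policy or lint checks passed
    scalability: float = 0.0  # e.g. score from a scaling probe on the target pipeline

def enterprise_fitness(t: Trajectory, w=(0.5, 0.2, 0.2, 0.1)) -> float:
    """Assumed fitness: task reward blended with enterprise metrics via hypothetical weights."""
    reward = sum(r for _, _, r in t.steps)
    return w[0] * reward + w[1] * t.efficiency + w[2] * t.compliance + w[3] * t.scalability

def mutate(t: Trajectory, rate: float = 0.1) -> Trajectory:
    """Point-mutate a genome by resampling a fraction of its actions (placeholder operator)."""
    new_steps = [
        (s, random.choice(ACTION_SPACE) if random.random() < rate else a, r)
        for s, a, r in t.steps
    ]
    return Trajectory(new_steps, t.efficiency, t.compliance, t.scalability)

def nocturnal_replay(dream_queue: list, top_k: int = 8, offspring: int = 2) -> list:
    """One nocturnal pass: keep the fittest trajectories and add mutated offspring."""
    ranked = sorted(dream_queue, key=enterprise_fitness, reverse=True)
    survivors = ranked[:top_k]
    children = [mutate(t) for t in survivors for _ in range(offspring)]
    return survivors + children  # the agent replays / fine-tunes on this queue the next "day"

# Toy usage: evolve a small queue of dummy trajectories.
queue = [Trajectory([("s0", "run_tests", 1.0)], efficiency=0.8, compliance=1.0) for _ in range(4)]
queue = nocturnal_replay(queue, top_k=2)
```

The selection-plus-mutation step is what distinguishes this from plain replay: instead of only revisiting stored trajectories, the queue also injects perturbed variants, which is how the paper motivates improved novelty under distribution shift.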

💡 Deep Analysis

Figure 1

📄 Full Content

CosmoCore-Evo: Evolutionary Dream-Replay Reinforcement Learning for Adaptive Code Generation

Santhosh Kumar Ravindran
Microsoft Corporation, Redmond, WA, USA
santhosh.ravindran@microsoft.com
December 2025

1 Introduction

Large language models (LLMs) excel at code generation but often remain trapped in patterns from their training data, lacking the adaptability seen in human evolution. From Neanderthals adapting tools to environmental pressures through natural selection, human progress has relied on variation, selection, and inheritance to transcend limitations. In reinforcement learning (RL), similar mechanisms can address the "sentient gap": the inability of agents to innovate beyond fixed behaviors. In enterprise settings, static LLMs fail to adapt to evolving regulations or API changes, leading to significant rework costs.

CosmoCore [Ravindran, 2025] introduced affective tagging with valence and arousal to prioritize error correction via a Dream Queue and Prune Bin, drawing from neuroscience. However, it still relies on historical trajectories, limiting exploration of novel states. CosmoCore-Evo extends this by integrating evolutionary algorithms (EA), treating trajectories as evolvable genomes. During the nocturnal phase, high-fitness items are mutated and selected, promoting diversity and adaptation. This extension is particularly suited for enterprise applications, where code must adapt to dynamic requirements like API updates or compliance rules. We demonstrate gains in novelty and robustness, paving the way for more sentient code agents.

2 Related Work

Our work builds upon several key areas in reinforcement learning (RL), including affective RL, prioritized experience replay, evolutionary algorithms, and their applications to code generation.

2.1 Affective and Emotion-Inspired RL

Affective computing in RL draws from psychological and neuroscientific models to incorporate emotional signals for improved decision-making. Moerland et al. [2018] provide a comprehensive survey of emotion models in RL agents and robots, highlighting how intrinsic motivations like valence (positive/negative affect) and arousal (intensity) can modulate exploration and exploitation. Subsequent works, such as affect-driven RL for procedural content generation [Liapis and Yannakakis, 2022], demonstrate how emotional feedback enhances adaptability in creative tasks. In human-robot interaction, affective signals have been used to guide RL through social cues [Lee et al., 2014]. More recently, surveys on RL from human feedback (RLHF) [Christiano et al., 2017] and emotionally intelligent RL [Moerland et al., 2023] emphasize balancing short-term rewards with long-term well-being, addressing limitations in standard RL setups. CosmoCore [Ravindran, 2025] extends this by integrating affective tagging specifically for code generation, prioritizing error-prone trajectories in a dream-replay mechanism inspired by human sleep consolidation [Du et al., 2021].

2.2 Prioritized Experience Replay and Buffer Management

Prioritized experience replay (PER) [Schaul et al., 2015] addresses sample inefficiency in RL by replaying important transitions more frequently, based on temporal-difference (TD) errors. Extensions include distributed PER for scalability [Horgan et al., 2018] and combinations with other prioritization schemes. In code generation contexts, PER has been adapted to focus on syntactic or semantic errors, but often lacks emotional or motivational layers. Our approach augments PER with affective priorities, similar to how Du et al. [2021] use "lucid dreaming" to refresh stat...
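Section 2.2 contrasts TD-error-based PER with the affective priorities used by CosmoCore. As a rough, assumed illustration (not the paper's implementation), the buffer below boosts the standard |TD error| priority for high-arousal, negative-valence transitions before proportional sampling; the affect_weight, tag ranges, and eviction policy are hypothetical.

```python
# Rough illustration of affective prioritized replay: sampling probability mixes the
# standard |TD error| term from PER (Schaul et al., 2015) with valence/arousal tags.
# The affect_weight, tag conventions, and FIFO eviction are assumptions for this sketch.
import numpy as np

class AffectiveReplayBuffer:
    def __init__(self, capacity: int = 10_000, alpha: float = 0.6, affect_weight: float = 0.5):
        self.capacity = capacity
        self.alpha = alpha                  # PER exponent on priorities
        self.affect_weight = affect_weight  # how strongly affect modulates the TD priority
        self.items, self.priorities = [], []

    def add(self, transition, td_error: float, valence: float, arousal: float):
        """valence in [-1, 1] (negative = frustrating failure), arousal in [0, 1] (intensity)."""
        # Surprising failures (high arousal, negative valence) get a boost, mirroring how
        # the Dream Queue prioritizes error-prone trajectories for replay.
        affect_boost = 1.0 + self.affect_weight * arousal * max(0.0, -valence)
        priority = (abs(td_error) + 1e-6) * affect_boost
        if len(self.items) >= self.capacity:  # simple FIFO eviction for the sketch
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size: int = 32):
        """Proportional sampling over priorities, as in standard PER."""
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.items), size=batch_size, p=p, replace=True)
        return [self.items[i] for i in idx], idx
```

The only change relative to vanilla PER is the affect_boost factor; everything else (priority exponent, proportional sampling) follows the standard scheme, which is the layering the related-work discussion describes.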

📸 Image Gallery

toy_learning_curves.png

Reference

This content is AI-processed based on open access ArXiv data.
