Building on the affective dream-replay reinforcement learning framework of CosmoCore, we introduce CosmoCore-Evo, an extension that incorporates evolutionary algorithms to enhance adaptability and novelty in code generation tasks. Inspired by anthropological aspects of human evolution, such as natural selection and adaptation in early hominids, CosmoCore-Evo treats RL trajectories as "genomes" that undergo mutation and selection during the nocturnal replay phase. This mechanism allows agents to break free from trained patterns, fostering emergent behaviors and improved performance in distribution-shifted environments, such as changing APIs or novel libraries. We augment the Dream Queue with evolutionary operations, including mutation of high-fitness trajectories and enterprise-tuned fitness functions that incorporate efficiency, compliance, and scalability metrics. Evaluated on extended benchmarks including HumanEval variants with shifts, BigCodeBench, and a custom PySpark pipeline simulation, CosmoCore-Evo achieves up to 35% higher novelty in solutions and 25% faster adaptation compared to the original CosmoCore and baselines like PPO and REAMER. Ablations confirm the role of evolutionary components in bridging the sentient gap for LLM agents. Code for replication, including a toy simulation, is provided.
Large language models (LLMs) excel at code generation but often remain trapped in patterns from their training data, lacking the adaptability seen in human evolution. From Neanderthals adapting their tools to environmental pressures, human progress has relied on variation, selection, and inheritance to transcend prior limitations. In reinforcement learning (RL), similar mechanisms can address the "sentient gap": the inability of agents to innovate beyond fixed behaviors.
In enterprise settings, static LLMs fail to adapt to evolving regulations or API changes, leading to significant rework costs. CosmoCore [Ravindran, 2025] introduced affective tagging with valence and arousal to prioritize error correction via a Dream Queue and Prune Bin, drawing from neuroscience. However, it still relies on historical trajectories, limiting exploration of novel states. CosmoCore-Evo extends this by integrating evolutionary algorithms (EA), treating trajectories as evolvable genomes. During the nocturnal phase, high-fitness items are mutated and selected, promoting diversity and adaptation.
This extension is particularly suited for enterprise applications, where code must adapt to dynamic requirements like API updates or compliance rules. We demonstrate gains in novelty and robustness, paving the way for more sentient code agents.
Our work builds upon several key areas in reinforcement learning (RL), including affective RL, prioritized experience replay, evolutionary algorithms, and their applications to code generation.
Affective computing in RL draws from psychological and neuroscientific models to incorporate emotional signals for improved decision-making. Moerland et al. [2018] provide a comprehensive survey of emotion models in RL agents and robots, highlighting how intrinsic motivations like valence (positive/negative affect) and arousal (intensity) can modulate exploration and exploitation. Subsequent works, such as affect-driven RL for procedural content generation [Liapis and Yannakakis, 2022], demonstrate how emotional feedback enhances adaptability in creative tasks. In human-robot interaction, affective signals have been used to guide RL through social cues [Lee et al., 2014]. More recently, surveys on RL from human feedback (RLHF) [Christiano et al., 2017] and emotionally intelligent RL [Moerland et al., 2023] emphasize balancing short-term rewards with long-term well-being, addressing limitations in standard RL setups.
CosmoCore [Ravindran, 2025] extends this by integrating affective tagging specifically for code generation, prioritizing error-prone trajectories in a dream-replay mechanism inspired by human sleep consolidation [Du et al., 2021].
Prioritized experience replay (PER) [Schaul et al., 2015] addresses sample inefficiency in RL by replaying important transitions more frequently, based on temporal difference (TD) errors. Extensions include distributed PER for scalability [Horgan et al., 2018] and combinations with other prioritization schemes. In code generation contexts, PER has been adapted to focus on syntactic or semantic errors, but often lacks emotional or motivational layers.
Our approach augments PER with affective priorities, similar to how Du et al. [2021] use “lucid dreaming” to refresh states in ER, enabling more targeted replays.
Evolutionary algorithms (EA) offer gradient-free optimization, particularly effective for exploration in high-dimensional spaces. Salimans et al. [2017] introduced evolution strategies (ES) as a scalable alternative to RL for neuroevolution. Hybrid methods, such as evolutionary reinforcement learning (EvoRL), combine EA with RL to leverage population-based diversity [Zhao et al., 2023, Bodnar et al., 2018]. Surveys like Liu et al. [2024] highlight RL-assisted EA for optimization, while Conti et al. [2018] apply EA to RL policy search.
In exploration-heavy domains, EvoRL has shown promise [Such et al., 2017, Khadka and Tumer, 2018], but integrations with affective or replay mechanisms remain underexplored.
RL has been applied to code generation to refine LLM outputs beyond supervised fine-tuning. CodeRL [Le et al., 2022] uses actor-critic methods to improve program synthesis, while StepCoder [Tian et al., 2024] employs RL from process supervision for step-by-step code refinement. Other works address outcome vs. process rewards [Chen et al., 2024] and query enhancement via RL [Li et al., 2024].
Despite these advances, LLMs suffer from the “sentient gap,” including hallucinations, lack of long-term reasoning, and pattern entrapment [Ji et al., 2023, Wang et al., 2024]. Evolutionary methods in code tasks, like test case generation [Fraser and Arcuri, 2011], hint at potential, but deep integration with affective RL is novel.
CosmoCore-Evo bridges these by combining EvoRL with affective dream-replay, enabling adaptive, enterprise-ready code agents.
3 Proposed Method: CosmoCore-Evo

CosmoCore-Evo augments the original architecture with evolutionary mechanisms to foster adaptability. Figure 1 provides an overview of the integrated system, highlighting the original flow and Evo extensions.
Algorithm 1 CosmoCore-Evo Evolutionary Update
Require: Experience buffer B, evolution frequency T = 10
procedure EvolutionaryUpdate(step)
    if step mod 50 = 0 then
        Prune low-impact items (|v| < 0.2 ∧ a < 0.3)
    end if
    if step mod T = 0 then
        P ← top 50% of B by priority (parents)
        offspring ← ∅
        for each τ ∈ P do
            offspring ← offspring ∪ {Mutate(τ)}
        end for
        B ← P ∪ offspring
        Apply affective pruning to B
    end if
    Sample minibatch (80% high-priority + 20% random) for policy update
end procedure
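For concreteness, the following Python sketch implements the update in Algorithm 1 against a simple list-based buffer. The Trajectory fields, the token-level mutate helper, and the batch size of 32 mirror details stated elsewhere in the paper, but the data layout and function signatures are illustrative assumptions rather than the reference implementation.

import random
from dataclasses import dataclass, replace

# Illustrative trajectory record; field names are assumptions, not the exact schema.
@dataclass
class Trajectory:
    tokens: list        # generated code tokens
    reward: float       # unit-test / static-analysis reward
    valence: float      # v in [-1, 1]
    arousal: float      # a in [0, 1]
    priority: float     # affective prioritization score

def mutate(traj, vocab_size=32000, rate=0.2):
    """Token-level mutation: each token is resampled with probability `rate`."""
    new_tokens = [random.randrange(vocab_size) if random.random() < rate else t
                  for t in traj.tokens]
    return replace(traj, tokens=new_tokens)

def affective_prune(buffer):
    """Drop low-impact items (|v| < 0.2 and a < 0.3)."""
    return [t for t in buffer if abs(t.valence) >= 0.2 or t.arousal >= 0.3]

def evolutionary_update(buffer, step, T=10, batch_size=32):
    """One call of Algorithm 1 against a list-based experience buffer."""
    if step % 50 == 0:
        buffer = affective_prune(buffer)
    if step % T == 0:
        parents = sorted(buffer, key=lambda t: t.priority, reverse=True)[:len(buffer) // 2]
        offspring = [mutate(t) for t in parents]
        buffer = affective_prune(parents + offspring)
    # Sample minibatch: 80% high-priority + 20% random
    # (duplicates between the two segments are tolerated in this sketch).
    k_hi = int(0.8 * batch_size)
    by_priority = sorted(buffer, key=lambda t: t.priority, reverse=True)
    batch = by_priority[:k_hi] + random.sample(buffer, min(batch_size - k_hi, len(buffer)))
    return buffer, batch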
To contextualize our evolutionary extensions, we first briefly review the foundational CosmoCore framework [Ravindran, 2025], which introduces affective signals inspired by human emotion and sleep-based memory consolidation into reinforcement learning for code generation.
In CosmoCore, each trajectory is defined as τ = (prompt, generated code, execution feedback, reward),
where the reward is typically derived from unit test outcomes or static analysis scores. Unlike traditional RL setups that rely solely on scalar rewards or TD errors, CosmoCore enriches each trajectory with two affective dimensions computed by a lightweight multi-layer perceptron (MLP) tagger (3 layers, 256 hidden units):
• Valence v_i ∈ [−1, 1]: Captures the emotional "pleasantness" of the outcome. High positive valence is assigned to successful executions with clean feedback; strongly negative valence to critical failures (e.g., runtime errors, security vulnerabilities). Valence is derived from a normalized combination of reward and feedback sentiment.
• Arousal a_i ∈ [0, 1]: Measures the intensity or surprise of the experience. High arousal corresponds to large deviations from expected outcomes (e.g., unexpected failures on seemingly simple prompts or surprising successes on hard ones), approximated via the magnitude of TD error and feedback novelty.
The MLP tagger is jointly trained with the policy using a multi-task loss (policy loss + affective prediction loss on human-annotated or self-supervised labels during initialization).
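A minimal PyTorch sketch of such a tagger is shown below. Only the 3-layer, 256-unit architecture and the joint multi-task objective come from the description above; the input feature dimension, the tanh/sigmoid output heads, and the loss weighting are illustrative assumptions.

import torch
import torch.nn as nn

class AffectiveTagger(nn.Module):
    """3-layer MLP (256 hidden units) mapping trajectory features to (valence, arousal)."""
    def __init__(self, feat_dim=768):  # feature dimension is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, feats):
        v, a = self.net(feats).unbind(-1)
        return torch.tanh(v), torch.sigmoid(a)  # v in [-1, 1], a in [0, 1]

def joint_loss(policy_loss, v_pred, a_pred, v_label, a_label, w=0.5):
    """Multi-task objective: policy loss plus affective prediction loss on
    human-annotated or self-supervised labels (weight w is an assumption)."""
    affect_loss = (nn.functional.mse_loss(v_pred, v_label)
                   + nn.functional.mse_loss(a_pred, a_label))
    return policy_loss + w * affect_loss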
The prioritization score for each trajectory i in the experience buffer combines its TD error with its affective tags, weighted by λ = 0.6, a hyperparameter that balances standard TD-error prioritization with affective impact. Two structures then organize replay:

1. Dream Queue: High-priority trajectories are queued for intensive replay during the nocturnal consolidation phase, concentrating learning on informative failures and surprising successes.

2. Prune Bin: Low-impact trajectories (e.g., |v_i| < 0.2 and a_i < 0.3) are periodically pruned to maintain buffer efficiency and prevent dilution by neutral experiences.
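Since the exact functional form of the score is not reproduced here, the sketch below uses one plausible combination, a λ-weighted mix of the absolute TD error and an affective-impact term |v_i| · a_i, together with the stated Prune Bin rule; the mixing form is an assumption, not the paper's exact formula.

LAMBDA = 0.6  # balance between TD-error and affective impact

def priority(td_error, valence, arousal, lam=LAMBDA):
    """Plausible prioritization score: lambda-weighted mix of |TD error|
    and an affective-impact term |v| * a (the exact form is assumed)."""
    return lam * abs(td_error) + (1.0 - lam) * abs(valence) * arousal

def should_prune(valence, arousal):
    """Prune Bin rule: drop low-impact items with |v| < 0.2 and a < 0.3."""
    return abs(valence) < 0.2 and arousal < 0.3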
Empirically, this affective prioritization yields faster convergence and better error correction than pure TD-based PER, particularly on code tasks where failures are sparse but highly informative. However, as originally noted, CosmoCore remains constrained by the diversity of historically observed trajectories: replays can only remix existing experiences, limiting the agent's ability to escape local optima or adapt to substantial distribution shifts. This is precisely the limitation that CosmoCore-Evo addresses through evolutionary variation and selection, as detailed in the following subsections.
We empirically validate CosmoCore-Evo on both a controlled toy environment and standard code generation benchmarks with induced distribution shifts. All experiments are averaged over multiple seeds for statistical reliability.
We use Proximal Policy Optimization (PPO) [Schulman et al., 2017] with CodeT5-base (220M parameters) [Wang et al., 2021] as the policy network. Training runs for 1 million steps with batch size 32 and learning rate 1e-5. The evolutionary loop activates every 10 steps.
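A hedged configuration summary of this setup is given below; the numeric values follow the text above, while the dictionary layout and the Hugging Face checkpoint identifier are illustrative assumptions.

# Training configuration sketch; checkpoint id and key names are assumptions.
TRAIN_CONFIG = {
    "algorithm": "PPO",                        # Schulman et al. [2017]
    "policy_model": "Salesforce/codet5-base",  # CodeT5-base, 220M parameters
    "total_steps": 1_000_000,
    "batch_size": 32,
    "learning_rate": 1e-5,
    "evolution_frequency_steps": 10,           # evolutionary loop every 10 steps
}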
Baselines:
• Vanilla PPO with uniform experience replay.
• REAMER [Zhang et al., 2023], a strong execution-feedback RL method.
• Original CosmoCore (affective dream-replay without evolution).
We first evaluate in a simplified proxy for code generation: trajectories are sequences of 5 integers (actions ∈ [0, 5]), with reward 10 − |sum(actions) − 15|. The optimal reward is 10.0 (e.g., [3, 3, 3, 3, 3]), but random sampling yields 5-6 on average. This setup tests the ability to discover structured, high-reward sequences via replay and variation.
We collect 1000 trajectories, followed by 300 learning batches. Results are averaged over 20 random seeds.
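A minimal sketch of this toy environment follows; the action range, sequence length, and reward match the description above, while the sampling interface is an illustrative assumption.

import random

def sample_trajectory(length=5, low=0, high=5):
    """Random toy trajectory: a sequence of 5 integer actions in [0, 5]."""
    return [random.randint(low, high) for _ in range(length)]

def toy_reward(actions, target=15, max_reward=10):
    """Reward 10 - |sum(actions) - 15|; maximized (10.0) e.g. by [3, 3, 3, 3, 3]."""
    return max_reward - abs(sum(actions) - target)

# Random sampling averages roughly 5-6 reward, well below the optimum of 10.
trajs = [sample_trajectory() for _ in range(1000)]
print(sum(toy_reward(t) for t in trajs) / len(trajs))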
Figure 2 shows learning curves. CosmoCore-Evo converges rapidly to near-optimal performance, while CosmoCore improves modestly and the baseline plateaus.
Final performance:
• Vanilla PPO: 6.30 ± 0.15
• Original CosmoCore: 6.97 ± 0.12 (+10.6% over baseline)
• CosmoCore-Evo: 9.79 ± 0.03 (+55.4% over baseline, +40.5% over CosmoCore)
This demonstrates that while affective prioritization aids learning, evolutionary mutation is essential for escaping local optima and discovering superior solutions.
We evaluate on HumanEval [Chen et al., 2021] and BigCodeBench [Zhai et al., 2024], with distribution shifts simulating real-world API evolution (e.g., renaming pandas.read_csv → pd.load_data while preserving logic).
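One way to induce such a shift, sketched below, is a source-level rename applied to benchmark prompts and reference contexts so that the call name changes while the logic is preserved; the transform itself is an illustrative assumption, and the mapping follows the example above.

# Illustrative API-shift transform for benchmark construction.
API_SHIFTS = {
    "pandas.read_csv": "pandas.load_data",
    "pd.read_csv": "pd.load_data",
}

def apply_shift(source: str) -> str:
    """Rename known API calls in a prompt or reference snippet."""
    for old, new in API_SHIFTS.items():
        source = source.replace(old, new)
    return source

shifted = apply_shift("df = pd.read_csv('sales.csv')")
# -> "df = pd.load_data('sales.csv')"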
[Table: Pass@1 (%) on HumanEval-Shift]

These results strongly support our hypothesis: evolutionary mechanisms enable LLM agents to transcend training patterns, approaching more adaptive, “sentient-like” behavior in dynamic environments.
CosmoCore-Evo demonstrates that anthropological principles of evolution (variation, selection, and inheritance) can meaningfully enhance modern RL systems for code generation. By introducing a controlled evolutionary loop within an affective dream-replay framework, we achieve consistent gains in adaptability and solution novelty, addressing a core limitation of current LLM agents: their tendency to reproduce memorized patterns rather than innovate.
Ablation Insights: Removing mutation degrades adaptation speed by 18%, confirming its role in exploration. Disabling the enterprise fitness terms reduces novelty by 12-15%, showing the value of multi-objective guidance. Interestingly, increasing the mutation rate beyond 0.2 led to instability, suggesting a sweet spot where variation aids rather than disrupts learning.
Limitations: The primary trade-off is computational overhead (a 15-20% increase due to mutation and re-evaluation). This can be mitigated by parallelizing offspring evaluation or applying mutations only to high-valence trajectories. Current mutations are token-level; future work could explore AST-aware mutations for semantic preservation.
Ethical and Safety Considerations: While Evo promotes diversity, uncontrolled evolution risks amplifying biases present in the base model or reward function. We recommend monitoring fitness terms for fairness metrics and including “ethical guardrails” (e.g., negative fitness for insecure patterns) in production deployments.
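A hedged sketch of such a guarded, tunable fitness function is shown below; the term names (reward, efficiency, compliance, scalability) follow the enterprise metrics named in the abstract, while the weights and the insecure-pattern check are illustrative assumptions.

def enterprise_fitness(traj, weights=None):
    """Sketch of a multi-objective fitness with an ethical guardrail.
    `traj` is assumed to be a dict with per-trajectory metric scores."""
    w = weights or {"reward": 1.0, "efficiency": 0.3, "compliance": 0.3, "scalability": 0.2}
    score = (w["reward"] * traj["reward"]
             + w["efficiency"] * traj["efficiency"]
             + w["compliance"] * traj["compliance"]
             + w["scalability"] * traj["scalability"])
    # Ethical guardrail: strong negative fitness for insecure patterns (illustrative check).
    if "eval(" in traj["code"] or "os.system(" in traj["code"]:
        score -= 10.0
    return score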
Enterprise Impact: In real-world software pipelines, API deprecations and schema changes occur frequently. CosmoCore-Evo’s faster adaptation translates to reduced technical debt and lower maintenance costs. For large organizations, the tunable fitness framework enables alignment with internal standards (security, performance, accessibility), making it particularly valuable for regulated domains like finance and healthcare.
Future Directions: This work establishes evolutionary adaptation as the first pillar in a broader anthropological framework. Upcoming extensions will incorporate:
• Tribal dynamics via multi-agent debate and role specialization,
• Cultural accumulation through generational memory compression,
• Instinctual drives for self-directed exploration.
Together, these aim to push LLM agents closer to truly sentient, self-evolving intelligence.
CosmoCore-Evo successfully integrates evolutionary principles into affective dream-replay RL, yielding more adaptive and creative code generation agents. The observed gains in novelty, robustness, and enterprise alignment validate the anthropological inspiration and lay a strong foundation for subsequent extensions toward bridging the sentient gap.
Mutation is the most critical component for adaptation; fitness tuning drives practical, enterprise-relevant novelty.