ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Large language models are transforming systems research by automating the discovery of performance-critical algorithms for computer systems. Although LLMs generate plausible code, producing solutions that meet the stringent correctness and performance requirements of systems demands iterative optimization. Test-time reinforcement learning offers high search efficiency but requires parameter updates that are infeasible under API-only access, while existing training-free evolutionary methods suffer from inefficient context utilization and undirected search. We introduce ContextEvolve, a multi-agent framework that achieves RL-level search efficiency under strict parameter-blind constraints by decomposing the optimization context into three orthogonal dimensions: a Summarizer Agent condenses semantic state via code-to-language abstraction, a Navigator Agent distills optimization direction from trajectory analysis, and a Sampler Agent curates the experience distribution through prioritized exemplar retrieval. This orchestration forms a functional isomorphism with RL (mapping to state representation, policy gradient, and experience replay), enabling principled optimization in a textual latent space. On the ADRS benchmark, ContextEvolve outperforms state-of-the-art baselines by 33.3% while reducing token consumption by 29.0%. Code for our work is released at https://anonymous.4open.science/r/ContextEvolve-ACC


💡 Research Summary

ContextEvolve tackles the problem of optimizing systems code with large language models (LLMs) when the model parameters are immutable, as is the case with most commercial API‑only services. Existing test‑time reinforcement learning (RL) approaches achieve high sample efficiency but require weight updates, making them impractical for closed‑source LLMs. Conversely, training‑free evolutionary methods such as AlphaEvolve or CAMEL avoid weight updates but suffer from inefficient context usage: they either flood the prompt with raw history or compress it without preserving the semantic structure needed for directed search.

The core contribution of this paper is a multi‑agent framework that decomposes the optimization context into three orthogonal dimensions—semantic state, optimization direction, and experience distribution—and assigns each to a specialized agent. This decomposition yields a functional isomorphism with the three main components of RL: state representation, policy gradient, and prioritized experience replay, all operating purely in the textual latent space.
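The three-way decomposition can be pictured as a single record whose fields correspond to the RL components named above. This is a minimal illustrative sketch; the class and field names are our own, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class OptimizationContext:
    """One compact context, split along the paper's three orthogonal dimensions."""
    semantic_state: str   # z_p: Summarizer output, playing the role of the RL state
    guidance: str         # g_t: Navigator output, a textual "policy gradient"
    exemplars: list[str]  # E_ctx: Sampler output, a prioritized experience-replay set
```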

  1. Summarizer Agent: Receives the parent abstract (zₚ) and the newly generated child code (c_c) and produces a concise natural‑language summary (z_c). The summary captures both novel modifications and preserved functional blocks, thereby condensing high‑dimensional code differences into a dense textual description that fits within a limited prompt window.

  2. Navigator Agent: Analyzes historical trajectories of code‑score pairs. It weights trajectories by the performance delta Δs and categorizes them into consistent improvement, mixed fluctuation, or consistent decline. By sampling a small number of trajectories from each category and feeding them to the LLM with a “GradientAgent” prompt, it extracts a textual guidance vector (g_t) that serves the same purpose as a policy gradient estimate, steering the search toward promising regions while avoiding repeated futile attempts.

  3. Sampler Agent: Curates a few‑shot exemplar set (E_ctx) from the current population based on relevance, diversity, and proven value, conditioned on the parent abstract (zₚ) and the guidance (g_t). This prioritized sampling mirrors experience replay in RL, ensuring that the generator receives high‑utility references rather than a random dump of past solutions.
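The Navigator's trajectory analysis is the most algorithmic of the three steps. The sketch below shows one plausible way to bucket a trajectory of scored candidates into the three categories the paper names, using the sign of the score deltas Δs; the function name and the exact bucketing rule are our assumptions, not the paper's implementation.

```python
def categorize(trajectory):
    """Bucket a trajectory of (summary, score) pairs by its score deltas."""
    # Δs between consecutive candidates along the trajectory.
    deltas = [b[1] - a[1] for a, b in zip(trajectory, trajectory[1:])]
    if all(d > 0 for d in deltas):
        return "consistent_improvement"
    if all(d < 0 for d in deltas):
        return "consistent_decline"
    return "mixed_fluctuation"
```

The Navigator would then sample a few trajectories from each bucket and prompt the LLM to extract the textual guidance g_t from the contrast between them.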

The three agents produce (zₚ, g_t, E_ctx), which are composed into a compact context Φ_t. The LLM M_θ is then prompted to generate a new candidate code c_c given Φ_t. The candidate is immediately evaluated by an automated oracle E, yielding a scalar score s_c that is fed back into the Summarizer and stored in a replay buffer. Although the algorithm includes a formal policy‑gradient update step (Δθ ← ∇θ …), in practice this step is a virtual operation; the underlying model weights remain unchanged, preserving the API‑only constraint while still benefiting from RL‑style sample efficiency.
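The loop described above can be sketched as a single iteration step. All helper behavior here is illustrative: `llm` and `evaluate` are caller-supplied callables standing in for M_θ and the oracle E, the exemplar selection is a simple score-based top-k, and the prompt strings are placeholders rather than the paper's actual prompts.

```python
def evolve_step(parent_abstract, trajectories, population, llm, evaluate):
    """One ContextEvolve iteration: compose Phi_t, generate, score, summarize."""
    # Navigator: distill a textual guidance vector g_t from past trajectories.
    g_t = llm("navigate: " + repr(trajectories))
    # Sampler: a stand-in for prioritized retrieval -- top-k of (code, score) pairs.
    exemplars = sorted(population, key=lambda entry: entry[1], reverse=True)[:3]
    # Compose the compact context Phi_t from (z_p, g_t, E_ctx).
    phi_t = f"state: {parent_abstract}\nguide: {g_t}\nexamples: {exemplars}"
    # Generate the candidate c_c and score it with the automated oracle.
    child_code = llm("generate: " + phi_t)
    score = evaluate(child_code)
    # Summarizer: condense parent abstract + child code into the new abstract z_c.
    child_abstract = llm(f"summarize: {parent_abstract} -> {child_code}")
    # Replay-buffer update: the scored child joins the population.
    population.append((child_code, score))
    return child_code, child_abstract, score
```

Note that no model weights change anywhere in this loop, which is what the paper means by calling the policy-gradient step a virtual operation.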

Experiments on the ADRS benchmark—covering domains such as database indexing, network packet processing, and distributed scheduling—show that ContextEvolve outperforms state‑of‑the‑art training‑free baselines by an average of 33.3% in final solution quality. Moreover, the token consumption is reduced by 29.0% compared to the baselines, with each of the three agents contributing roughly a 10‑12% reduction through more informative compression.

The paper’s contributions are threefold: (1) a novel multi‑agent context‑compression framework that works under strict parameter‑blind constraints, (2) a concrete mapping of RL mechanisms to textual operations via specialized agents, and (3) empirical evidence of superior performance and efficiency on a realistic systems‑code optimization suite.

Limitations include the need for careful prompt engineering for each agent, which can be domain‑specific, and the reliance on a single LLM for all three agents, introducing latency due to sequential text‑based communication. Future work may explore multi‑modal memory stores, asynchronous agent communication, and integration with external toolchains to further improve scalability and real‑time responsiveness.

