MADE: A Multi-Agent Decomposed Evolution Framework That Enables Evolution with Subjective Evaluation Alone

Reading time: 5 minutes

📝 Abstract

The integration of Large Language Models (LLMs) with Evolutionary Computation (EC) has unlocked new frontiers in scientific discovery but remains shackled by a fundamental constraint: the reliance on an Oracle, i.e., an objective, machine-computable fitness function. This paper breaks this barrier by asking: Can evolution thrive in a purely subjective landscape governed solely by LLM judges? We introduce MADE (Multi-Agent Decomposed Evolution), a framework that tames the inherent noise of subjective evaluation through “Problem Specification.” By decomposing vague instructions into specific, verifiable sub-requirements, MADE transforms high-variance LLM feedback into stable, precise selection pressure. The results are transformative: across complex benchmarks like DevAI and InfoBench, MADE outperforms strong baselines by over 50% in software requirement satisfaction (from 39.9% to 61.9%) and achieves a 95% perfect pass rate on complex instruction following. This work validates a fundamental paradigm shift: moving from optimizing “computable metrics” to “describable qualities,” thereby unlocking evolutionary optimization for the vast open-ended domains where no ground truth exists.

📄 Content

The advent of Large Language Models (LLMs), with their remarkable capabilities in code and natural language generation, has provided a powerful new engine for automated problem-solving. Recently, the paradigm combining the generative power of LLMs with the search capabilities of Evolutionary Computation (EC) has achieved breakthrough progress in scientific and algorithmic discovery. Systems like FunSearch [8] and AlphaEvolve [7], for example, use LLMs as intelligent “mutation operators” to iteratively improve code, successfully discovering superior algorithms and mathematical constructs for problems that have remained unsolved for decades. The tremendous success of these works reveals that LLM-guided evolution is a highly promising engine for automated discovery.

However, a common and essential prerequisite underpins the success of these pioneering works: they all rely on an objective, machine-automatable fitness function. Whether it’s verifying the correctness of an algorithm, measuring its execution speed [6], or checking if a mathematical construct satisfies specific properties, the final “quality” judgment is always provided by a deterministic, programmatic “oracle.”
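To make the notion of a programmatic “oracle” concrete, here is a minimal, illustrative toy (not taken from FunSearch or AlphaEvolve): a deterministic fitness function for candidate sorting routines that gates on correctness and then rewards speed. The function name and scoring formula are assumptions for illustration only.

```python
import time

def oracle_fitness(candidate_sort, tests):
    """A deterministic, programmatic oracle: correctness gates the score,
    then faster implementations earn higher fitness."""
    start = time.perf_counter()
    for inp, expected in tests:
        if candidate_sort(list(inp)) != expected:
            return 0.0                       # any wrong answer: fitness 0
    elapsed = time.perf_counter() - start
    return 1.0 / (1.0 + elapsed)             # correct: reward speed

tests = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5])]
print(oracle_fitness(sorted, tests))
```

Such an oracle is objective and repeatable, which is precisely what subjective tasks lack.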

This dependence on objective evaluation metrics has become a fundamental bottleneck, preventing the broader application of this powerful paradigm. It confines us to a problem space that, while vast, is still limited, excluding the wider range of complex tasks centered on human subjective values.

This leads to a more profound and cutting-edge research question: when an objective, computable fitness function is absent, can we successfully apply the LLM-guided evolutionary paradigm to a domain governed entirely by subjective evaluation?

Using an LLM as the sole “judge” in the evolutionary process introduces a series of seemingly fatal theoretical challenges. The fitness landscape of traditional evolutionary algorithms is static and objective; a landscape defined by an LLM, in contrast, is dynamic, subjective, and noisy. This challenge is not entirely new: evolutionary computation has long studied optimization in uncertain environments with noisy fitness functions [2]. The noise from an LLM judge, however, stems from semantic ambiguity and inherent biases rather than mere statistical randomness, posing a novel challenge. LLM judgments exhibit non-determinism, context sensitivity, and inherent biases [11], making the search akin to hunting for the highest peak on a mountain range that is constantly shaken by “earthquakes” and “drift.” The evolutionary process could easily descend into chaos, or evolve “exploitative” solutions that merely cater to LLM biases rather than being genuinely superior. Whether meaningful, convergent evolution is possible under such subjective and unstable selection pressure has therefore remained a major open challenge.
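The statistical intuition behind decomposition can be sketched in a few lines: if each sub-requirement judgment carries independent noise, averaging n of them divides the variance by n, so the aggregate score is far more stable than a single holistic judgment. The simulation below is illustrative only; the noise model and numbers are assumptions, not measurements from the paper.

```python
import random

random.seed(0)

def noisy_judgment(true_quality, noise=0.3):
    """One subjective score: the true quality plus judge noise."""
    return true_quality + random.gauss(0, noise)

def decomposed_judgment(true_quality, n_subtasks=10, noise=0.3):
    """Average of n independent sub-requirement scores; averaging
    divides the variance by n, stabilizing the fitness signal."""
    scores = [noisy_judgment(true_quality, noise) for _ in range(n_subtasks)]
    return sum(scores) / n_subtasks

def spread(scores):
    """Standard deviation of a list of scores."""
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

holistic = [noisy_judgment(0.7) for _ in range(1000)]
decomposed = [decomposed_judgment(0.7) for _ in range(1000)]

print(f"holistic score std:   {spread(holistic):.3f}")
print(f"decomposed score std: {spread(decomposed):.3f}")
```

With ten sub-scores per evaluation, the aggregate's spread shrinks by roughly a factor of sqrt(10), which is what turns noisy feedback into usable selection pressure.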

To address this challenge head-on, we propose an evolutionary framework whose core idea is to tame the uncertainty of subjective evaluation through “Problem Specification”. Our system, MADE (Multi-Agent Decomposed Evolution), coordinates a group of AI agents through an evolutionary loop. In each generation, a “Creator” agent proposes solutions, which are then evaluated by a dedicated “Judge” agent. The key is that the Judge agent does not directly assess the vague overall task. Instead, it scores based on a set of pre-decomposed, concrete, and verifiable sub-requirements. This principle of decomposition, which has proven highly effective for improving reasoning in generation tasks via methods like Chain-of-Thought prompting [12], is here repurposed to stabilize the evaluation process.

By transforming a high-noise evaluation task into an aggregation of multiple low-noise sub-tasks, we impose effective and controllable subjective selection pressure. The Judge agent completely replaces the traditional fitness function, providing not only fitness scores but also structured semantic feedback to guide subsequent mutation operations, enabling directed evolution toward goals that align with nuanced human values.
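The loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `decompose`, `judge`, and `mutate` are hypothetical stand-ins for LLM calls (here faked with trivial heuristics), and the selection scheme is simplified to keeping the single best individual.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    fitness: float       # fraction of sub-requirements satisfied
    feedback: list[str]  # unmet sub-requirements, used to guide mutation

def decompose(task: str) -> list[str]:
    """Hypothetical stand-in: an LLM splits the vague task into
    concrete, verifiable sub-requirements."""
    return [f"{task}: sub-requirement {i}" for i in range(3)]

def judge(solution: str, sub_reqs: list[str]) -> Evaluation:
    """Hypothetical stand-in for the Judge agent: score each
    sub-requirement independently (faked here with a length check)."""
    passed = [r for r in sub_reqs if len(solution) > len(r)]
    failed = [r for r in sub_reqs if r not in passed]
    return Evaluation(fitness=len(passed) / len(sub_reqs), feedback=failed)

def mutate(solution: str, feedback: list[str]) -> str:
    """Hypothetical stand-in for the Creator agent: revise the
    solution toward the unmet sub-requirements."""
    return solution + " | addressed: " + "; ".join(feedback)

def evolve(task: str, seed_solutions: list[str], generations: int = 3) -> str:
    sub_reqs = decompose(task)                    # problem specification
    population = seed_solutions
    for _ in range(generations):
        scored = [(judge(s, sub_reqs), s) for s in population]
        scored.sort(key=lambda e: e[0].fitness, reverse=True)
        best_eval, best = scored[0]               # selection
        if not best_eval.feedback:                # all sub-requirements met
            return best
        population = [best, mutate(best, best_eval.feedback)]  # feedback-guided mutation
    return scored[0][1]

print(evolve("build a CLI todo app", ["v0"]))
```

The essential point is that the Judge returns structured feedback, not just a score, so mutation is directed at the specific sub-requirements that failed rather than blindly perturbing the solution.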

To provide strong empirical evidence, we conducted rigorous validation of our proposed framework across four diverse and challenging benchmarks: software development (DevAI), abstract reasoning (BigGen), complex instruction following (InfoBench), and multi-modal chart generation (MatPlotBench).

The experimental results provide decisive evidence: on the end-to-end software development task, our method increased the requirement satisfaction rate by over 50% compared to the strongest baseline (from 39.9% to 61.9%); on the complex instruction-following task, it boosted the “perfect pass rate” from 72% to 95%. These results demonstrate empirically for the first time that an evolutionary loop driven purely by subjective LLM evaluation is not only feasible but can also stably produce solutions of higher quality and greater complexity than single-shot generation.


This content is AI-processed based on ArXiv data.
