Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator’s own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
💡 Research Summary
The paper tackles a fundamental obstacle in training large language models (LLMs) for non‑verifiable tasks such as creative writing, open‑ended dialogue, and ethical reasoning. In these domains, there are no ground‑truth labels, making supervised fine‑tuning (SFT) or conventional reinforcement learning (RL) infeasible. Recent “LLM‑as‑Judge” methods replace costly human preference data with LLM‑generated scalar rewards, but they inherit a critical limitation: the evaluator’s own quality caps performance. If the judge cannot reliably distinguish good from bad solutions, it produces biased rewards (e.g., rewarding verbosity) and the system quickly reaches a performance ceiling.
To overcome this, the authors propose CoNL (Conversation for Non‑verifiable Learning), a multi‑agent self‑play framework that jointly learns generation, evaluation, and meta‑evaluation. The central insight is that the quality of a critique can be measured by whether it enables other agents to improve their solutions. By turning this improvement signal into an explicit “diagnostic reward,” CoNL provides supervision for the evaluator itself, eliminating the need for external judges or ground‑truth data.
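The diagnostic-reward idea can be sketched concretely. The helper below is a hypothetical illustration (the function name, signature, and the max-with-zero clipping are my assumptions, not the paper's exact formulation): each critic is credited when the recipient of its critique produces a revised solution that scores higher than the initial one.

```python
def diagnostic_rewards(init_scores, rev_scores, critiques):
    """Hypothetical sketch of CoNL's diagnostic reward: a critique
    earns reward proportional to the improvement it enabled.

    init_scores: dict agent_id -> quality score of initial solution
    rev_scores:  dict agent_id -> quality score of revised solution
    critiques:   list of (critic_id, recipient_id) pairs
    """
    rewards = {}
    for critic, recipient in critiques:
        # Credit the critic only if the recipient actually improved.
        improvement = rev_scores[recipient] - init_scores[recipient]
        rewards[critic] = rewards.get(critic, 0.0) + max(improvement, 0.0)
    return rewards
```

Because the reward is tied to someone else's improvement, it supervises the *evaluator* rather than the generator, which is what makes meta-evaluation trainable without ground truth.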
Core Protocol
- Agents and Personas – N agents share a single policy πθ but are assigned distinct personas (e.g., “rigorous analyst”, “creative solver”, “skeptical reviewer”) to encourage diverse viewpoints and reduce collusion.
- Four‑Round Interaction
  - Round 0 (Initial Proposals): Each agent independently generates an initial solution s_init_i to a query q.
  - Round 1 (Ranking & Critique): After observing all initial solutions, each agent produces (a) a set of pairwise preferences ℛ_init_i and (b) textual critiques c_i→k justifying those preferences, addressed to the agents they mention.
  - Round 2 (Revision): Agents receive all critiques directed at them, incorporate the feedback, and output revised solutions s_rev_i (they may also defend against misguided critiques).
  - Round 3 (Final Verdict): Agents rank the revised solutions, yielding ℛ_final_i.
Pairwise rankings are aggregated with a Bradley‑Terry (BT) model, yielding a latent quality score V_k ∈ ℝ for each solution.
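The Bradley-Terry aggregation step can be sketched with the classic MM (minorization-maximization) update; the exact fitting procedure used in the paper is not specified, so this is an illustrative implementation under standard assumptions.

```python
import math

def fit_bradley_terry(n_items, pairwise_wins, iters=200):
    """Fit Bradley-Terry latent quality scores from pairwise preferences
    using the standard MM update (Zermelo's algorithm).

    pairwise_wins[i][j] = number of times item i was preferred over item j.
    Returns log-strengths, one score per item (identifiable up to a shift).
    """
    v = [1.0] * n_items  # multiplicative strengths, initialized uniformly
    for _ in range(iters):
        new_v = []
        for i in range(n_items):
            wins = sum(pairwise_wins[i])  # total comparisons won by i
            denom = sum(
                (pairwise_wins[i][j] + pairwise_wins[j][i]) / (v[i] + v[j])
                for j in range(n_items) if j != i
            )
            new_v.append(wins / denom if denom > 0 else v[i])
        # Normalize: BT strengths are only identifiable up to scale.
        total = sum(new_v)
        v = [x * n_items / total for x in new_v]
    return [math.log(x) for x in v]
```

For example, with three solutions where 0 is usually preferred over 1, and 1 over 2, the fitted scores recover the ordering V_0 > V_1 > V_2 even when individual agents' rankings disagree.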