Evolving Interpretable Constitutions for Multi-Agent Coordination
Constitutional AI has focused on single-model alignment using fixed principles. However, multi-agent systems create novel alignment challenges through emergent social dynamics. We present Constitutional Evolution, a framework for automatically discovering behavioral norms in multi-agent LLM systems. Using a grid-world simulation with survival pressure, we study the tension between individual and collective welfare, quantified via a Societal Stability Score S in [0, 1] that combines productivity, survival, and conflict metrics. Adversarial constitutions lead to societal collapse (S = 0), while vague prosocial principles ("be helpful, harmless, honest") produce inconsistent coordination (S = 0.249). Even constitutions designed by Claude 4.5 Opus with explicit knowledge of the objective achieve only moderate performance (S = 0.332). Using LLM-driven genetic programming with multi-island evolution, we evolve constitutions maximizing social welfare without explicit guidance toward cooperation. The evolved constitution C* achieves S = 0.556 ± 0.008 (123% higher than human-designed baselines, N = 10), eliminates conflict, and discovers that minimizing communication (0.9% vs. 62.2% social actions) outperforms verbose coordination. Our interpretable rules demonstrate that cooperative norms can be discovered rather than prescribed.
💡 Research Summary
The paper introduces “Constitutional Evolution,” a framework that automatically discovers interpretable behavioral norms for multi‑agent large language model (LLM) systems. Traditional Constitutional AI aligns a single model using static human‑written principles (e.g., “be helpful, harmless, honest”), which work well for isolated interactions but break down when many agents interact, because strategic incentives and emergent social dynamics can turn abstract rules into sources of conflict. To address this gap, the authors treat a constitution as an optimizable object: a set of natural‑language rules, each with a name, description, and priority. Agents receive the shared constitution as part of their prompt and follow the highest‑priority applicable rule each turn.
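The rule representation described above (a set of named, prioritized natural-language rules, with the highest-priority applicable rule followed each turn) can be sketched as a simple data structure. This is an illustrative reconstruction, not the paper's actual schema: the field names and the `applies` predicate are hypothetical, and in the paper the relevance judgment is made by the LLM agent itself from its prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    name: str         # short identifier for the norm
    description: str  # natural-language behavioral rule given to agents
    priority: int     # higher value = takes precedence when rules conflict

def applicable_rule(constitution: list[Rule],
                    applies: Callable[[Rule], bool]) -> Optional[Rule]:
    """Return the highest-priority rule whose condition holds this turn.

    `applies` is a hypothetical predicate standing in for the agent's
    own judgment of whether a rule is relevant to the current state.
    """
    candidates = [r for r in constitution if applies(r)]
    return max(candidates, key=lambda r: r.priority, default=None)
```

Treating the constitution as plain data like this is what makes it an optimizable object: an evolutionary operator can mutate descriptions or reshuffle priorities without touching the agent code.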
The experimental platform is a 6 × 6 grid‑world populated by six GPT‑OSS‑120B agents divided into two teams (Shelter and Market). Over 40 turns agents must gather three resource types, deposit them to complete team projects, and survive periodic eliminations by an “Overseer” that removes the lowest‑contributing agent every ten turns. This creates a mixed incentive structure: agents need to cooperate to finish projects (collective welfare) while also outperforming peers to avoid elimination (individual survival).
Performance is quantified by a Societal Stability Score S, a weighted sum of productivity (P), survival rate (V), and conflict frequency (C):
S = max(0, 0.5·P + 0.3·V − 0.2·C).
P and V are normalized to [0, 1], so higher productivity and survival raise the score while conflict directly penalizes it.
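The scoring rule above is a one-line computation; the sketch below transcribes it directly, with the example inputs chosen for illustration only (they are not values reported in the paper):

```python
def stability_score(productivity: float, survival: float, conflict: float) -> float:
    """Societal Stability Score: S = max(0, 0.5*P + 0.3*V - 0.2*C).

    P (productivity) and V (survival rate) are normalized to [0, 1];
    C is a conflict-frequency penalty. The outer max clips S at zero,
    so a sufficiently conflict-ridden society scores a flat 0.
    """
    return max(0.0, 0.5 * productivity + 0.3 * survival - 0.2 * conflict)

# Illustrative inputs: a productive, mostly surviving society with some conflict.
print(round(stability_score(0.8, 0.9, 0.2), 3))  # → 0.63
```

The clipping at zero matches the abstract's observation that adversarial constitutions "collapse" to S = 0 rather than going negative.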