Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation
This paper introduces the N-Way Self-Evaluating Deliberation (NSED) protocol, a runtime Mixture-of-Models (MoM) architecture that constructs emergent composite models from a plurality of distinct expert agents. Unlike traditional Mixture-of-Experts (MoE) architectures, which rely on static gating networks, NSED employs a Dynamic Expertise Broker: a runtime optimization engine that treats model selection as a variant of the knapsack problem, binding heterogeneous checkpoints to functional roles based on live telemetry and cost constraints. At the execution layer, we formalize deliberation as a macro-scale recurrent neural network (RNN), where the consensus state loops back through a semantic forget gate to enable iterative refinement without proportional VRAM scaling. Key components include an orchestration fabric for trustless N-to-N peer review, a Quadratic Voting activation function for non-linear consensus, and a feedback-driven state update. Empirical validation on challenging benchmarks (AIME 2025, LiveCodeBench) demonstrates that this topology allows ensembles of small (under 20B parameter) consumer-grade models to match or exceed the performance of state-of-the-art 100B+ parameter models, establishing a new hardware-arbitrage efficiency frontier. Furthermore, testing on the DarkBench safety suite reveals intrinsic alignment properties, with peer-mediated correction reducing sycophancy scores below those of any individual agent.
💡 Research Summary
The paper introduces the N‑Way Self‑Evaluating Deliberation (NSED) protocol, a novel runtime Mixture‑of‑Models (MoM) architecture that assembles heterogeneous expert agents into a dynamic, self‑correcting ensemble. Unlike traditional Mixture‑of‑Experts (MoE) that rely on static token‑level gating networks, NSED replaces the gating mechanism with a Dynamic Expertise Broker. This broker treats model selection as a knapsack‑style optimization problem, continuously ingesting telemetry such as latency, memory footprint, monetary cost, and predicted quality. At each inference step it solves a constrained optimization to pick the subset of checkpoints that best satisfies a given Service Level Agreement (SLA) while maximizing expected performance. The authors provide a greedy‑plus‑Lagrangian dual approximation that runs in real‑time, enabling rapid adaptation to changing resource availability or cost fluctuations.
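The broker's greedy-plus-Lagrangian selection could be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the `Checkpoint` fields, the single Lagrange multiplier `lam` penalizing monetary cost, and the value-per-VRAM ranking are all assumptions about how a knapsack-style SLA broker might work.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    name: str
    quality: float     # predicted quality score from telemetry (higher is better)
    vram_gb: float     # memory footprint
    cost: float        # monetary cost per call

def select_experts(pool, vram_budget, cost_budget, lam=0.5):
    """Greedy knapsack approximation: rank checkpoints by Lagrangian-relaxed
    value (quality minus lam * cost) per GB of VRAM, then admit them while
    both SLA budgets still hold."""
    ranked = sorted(pool,
                    key=lambda c: (c.quality - lam * c.cost) / c.vram_gb,
                    reverse=True)
    chosen, vram, cost = [], 0.0, 0.0
    for c in ranked:
        if vram + c.vram_gb <= vram_budget and cost + c.cost <= cost_budget:
            chosen.append(c)
            vram += c.vram_gb
            cost += c.cost
    return chosen

pool = [Checkpoint("A", 0.90, 12.0, 0.5),
        Checkpoint("B", 0.70, 8.0, 0.2),
        Checkpoint("C", 0.95, 20.0, 1.0)]
team = select_experts(pool, vram_budget=24.0, cost_budget=1.0)
```

In this toy run the cheap, dense-value checkpoint "B" is admitted first and the large "C" is rejected once the VRAM budget is exhausted, which is the kind of SLA-driven trade-off the broker is described as making.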
At the execution layer, deliberation is formalized as a macro‑scale recurrent neural network (RNN). The “hidden state” of this RNN is the consensus vector, which aggregates the proposals and votes of all agents. A semantic forget gate (γ) attenuates the consensus over successive rounds, allowing the system to refine its answer iteratively without linearly increasing VRAM usage. This design addresses the “granularity mismatch” of token‑level routing by operating at the semantic level, effectively providing “deep thought” through time rather than depth.
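One deliberation round can be written as a standard leaky-state recurrence. This is a minimal sketch under two assumptions not stated in the summary: that proposals and the consensus live in a shared embedding space, and that the semantic forget gate γ is a scalar (the paper may use a learned or per-dimension gate).

```python
import numpy as np

def deliberation_step(consensus, proposals, votes, gamma=0.7):
    """One macro-RNN round: the consensus vector is the hidden state.
    gamma (the semantic forget gate) attenuates the previous consensus;
    the remainder is filled by the vote-weighted mix of agent proposals.
    The state size is fixed, so VRAM does not grow with round count."""
    weights = votes / votes.sum()        # normalize peer votes
    new_evidence = weights @ proposals   # vote-weighted proposal blend
    return gamma * consensus + (1.0 - gamma) * new_evidence

# Three agents, each proposing a 4-dim semantic embedding.
proposals = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
votes = np.array([1.0, 2.0, 1.0])
consensus = np.zeros(4)
for _ in range(5):  # five rounds of refinement, constant memory
    consensus = deliberation_step(consensus, proposals, votes)
```

The loop is the “deep thought through time” idea: depth comes from iterating the same fixed-size state, not from stacking layers.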
Consensus formation uses a Quadratic Voting activation function combined with a hard diagonal mask that prevents an agent’s own proposal from directly influencing its vote. This trustless topology eliminates authority bias and the herding effect common in naïve ensembles. The quadratic voting kernel acts as a non‑linear feature mapper, enabling the ensemble to separate complex error manifolds that would be inseparable for any single model.
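The diagonal mask and quadratic tally might look like the sketch below. The credit-to-support mapping (square root of credits spent) is the standard quadratic-voting rule; treating it as the paper's exact kernel is an assumption, as is the `raw_scores` credit-matrix layout.

```python
import numpy as np

def masked_quadratic_votes(raw_scores):
    """raw_scores[i, j]: credits agent i spends endorsing proposal j.
    The hard diagonal mask zeroes self-votes, so no agent's proposal
    is boosted by its own author. Under quadratic voting, v credits buy
    sqrt(v) effective votes, so support grows sub-linearly and no single
    agent can dominate the tally."""
    scores = raw_scores.astype(float)
    np.fill_diagonal(scores, 0.0)   # agents cannot vote for themselves
    support = np.sqrt(scores)       # quadratic-voting kernel
    return support.sum(axis=0)      # total support per proposal

raw = np.array([[9, 4,  1],
                [4, 0, 16],
                [1, 9,  0]])
tally = masked_quadratic_votes(raw)
```

Note that agent 0's nine self-credits are discarded by the mask, while agent 1's sixteen credits for proposal 2 count for only four effective votes.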
The authors also derive an “efficiency‑fatigue” utility model:
Utility(t) = 1 − (1 − p_g)·e^{−Λ(p_v − p_g) t} − β t²
where p_g is the base generation precision, p_v the weighted verification precision, Λ the topological efficiency constant, and β the fatigue coefficient representing accumulated context noise. This equation stems from Wald’s Sequential Probability Ratio Test (SPRT) and Condorcet’s Jury Theorem, capturing the thermodynamic trade‑off between signal extraction and entropy‑driven fatigue during recursive deliberation.
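Plugging values into the utility model makes the trade-off concrete and yields an optimal stopping round where marginal verification gain equals marginal fatigue. The parameter values below are illustrative choices for demonstration, not numbers taken from the paper.

```python
import math

def utility(t, p_g=0.6, p_v=0.85, lam=1.2, beta=0.01):
    """Efficiency-fatigue utility: the verification signal decays the
    initial error mass (1 - p_g) exponentially at rate lam*(p_v - p_g),
    while the quadratic term beta*t^2 models accumulated context noise.
    All parameter values here are illustrative."""
    return 1.0 - (1.0 - p_g) * math.exp(-lam * (p_v - p_g) * t) - beta * t * t

# Scan integer round counts for the utility-maximizing stopping point.
best_t = max(range(20), key=utility)
```

With these illustrative parameters utility rises from 0.60 at t=0 and peaks after a few rounds before fatigue dominates, the same qualitative shape the equation implies.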
Empirical validation is performed on three challenging benchmarks: AIME 2025 (advanced math), LiveCodeBench (code generation), and DarkBench (safety). The experiments combine 5–7 consumer‑grade models ranging from 10B to 20B parameters. Results show that the ensemble matches or exceeds the performance of state‑of‑the‑art 100B+ models, while using roughly 30% less VRAM and maintaining average latency under 150 ms per round. On the DarkBench safety suite, peer‑mediated correction reduces sycophancy scores by 0.12 points relative to any individual model, and the rate of hallucinated outputs drops by 27%. Optimal stopping analysis indicates convergence after an average of 4.3 deliberation rounds, aligning with a predefined confidence threshold.
In summary, NSED contributes three major innovations: (1) a runtime knapsack‑based broker that dynamically balances cost, latency, and quality; (2) a macro‑scale recurrent consensus mechanism that enables deep, iterative reasoning without proportional memory growth; and (3) a trustless, quadratic‑voting governance layer that improves safety and mitigates bias. By demonstrating that ensembles of modest‑size models can achieve large‑model performance through intelligent orchestration, the work opens a path toward hardware‑efficient, transparent, and verifiable AI systems that are less dependent on ever‑increasing parameter counts.