Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making

Large language models increasingly serve as autonomous decision-making agents in domains where errors have measurable costs: hiring (missed qualified candidates versus wasted interviews), medical triage (missed emergencies versus unnecessary escalations), and fraud detection (approved fraud versus declined legitimate transactions). Current architectures are built on a flawed foundation: they query LLMs for discriminative probabilities p(state|evidence), apply arbitrary confidence thresholds, and execute actions without considering cost asymmetries or uncertainty quantification. We prove this approach is formally inadequate for sequential decision-making and propose a mathematically principled alternative that treats multiple LLMs as approximate likelihood functions rather than classifiers. For each possible state, we elicit p(evidence|state) through contrastive prompting, aggregate across diverse models via robust statistics, and apply Bayes’ rule with explicit priors. This generative modeling perspective enables four critical capabilities: (1) proper sequential belief updating as evidence accumulates, (2) cost-aware action selection through expected utility maximization, (3) principled information gathering via value-of-information calculations, and (4) improved fairness through ensemble bias mitigation. We instantiate this framework in resume screening, where hiring mistakes cost $40,000, wasted interviews cost $2,500, and phone screens cost $150. Experiments across 1,000 resumes evaluated by five diverse LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini Pro, Grok, DeepSeek) demonstrate that our approach reduces total costs by $294,000 (34% improvement) compared to the best single-LLM baseline while improving demographic parity by 45% (reducing maximum group difference from 22 to 5 percentage points).
Ablation studies reveal that multi-LLM aggregation contributes 51% of cost savings, sequential updating 43%, and disagreement-triggered information gathering 20%. Critically, we prove these gains are not merely empirical accidents but necessary consequences of correcting the mathematical foundations of LLM-based decision-making.


💡 Research Summary

The paper begins by diagnosing a fundamental flaw in current large‑language‑model (LLM)‑driven decision‑making pipelines. Most existing systems treat an LLM as a classifier that directly outputs a discriminative probability p(state | evidence). They then apply a fixed confidence threshold and act, ignoring the asymmetric costs of different errors, the need for sequential belief updates, and any principled quantification of uncertainty. The authors prove that this approach cannot be optimal for any sequential decision problem because it bypasses Bayes’ theorem, fails to propagate uncertainty, and cannot incorporate cost‑sensitive utility functions.

To remedy these deficiencies, the authors propose a fully Bayesian framework that reinterprets each LLM as an approximate likelihood function p(evidence | state). By prompting the model contrastively—e.g., “If the candidate were qualified, how likely would this résumé be written this way?”—they elicit a probability distribution over the observed evidence conditioned on each possible hidden state. Five diverse LLMs (GPT‑4o, Claude 3.5 Sonnet, Gemini Pro, Grok, DeepSeek) are queried independently, producing a set of likelihood estimates for each state. These estimates are then aggregated using robust statistics (median, trimmed mean, M‑estimators) to mitigate outliers and model‑specific biases, effectively creating a multi‑model ensemble likelihood.
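The robust-aggregation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-model likelihood values are hypothetical, and only two of the mentioned estimators (median and trimmed mean) are shown.

```python
import statistics

def trimmed_mean(values, trim_fraction=0.2):
    """Mean after discarding the lowest and highest trim_fraction of values."""
    vals = sorted(values)
    k = int(len(vals) * trim_fraction)
    trimmed = vals[k:len(vals) - k] if k > 0 else vals
    return sum(trimmed) / len(trimmed)

# Hypothetical likelihood estimates p(evidence | state) for one resume,
# one value per model (five models, as in the paper's setup).
likelihoods = {
    "qualified":     [0.62, 0.70, 0.55, 0.90, 0.58],  # 0.90 is an outlier
    "not_qualified": [0.20, 0.15, 0.25, 0.18, 0.22],
}

# Median aggregation ignores the outlying 0.90 that a plain mean would absorb.
ensemble = {state: statistics.median(vals) for state, vals in likelihoods.items()}
```

The median's insensitivity to single-model outliers is exactly what makes it attractive here: one overconfident model cannot drag the ensemble likelihood with it.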

With an explicit prior over states (e.g., the base rate of qualified candidates), Bayes’ rule yields a posterior distribution that can be updated sequentially as new evidence arrives (resume review → phone screen → interview). The posterior feeds directly into an expected‑utility calculation. The utility function is defined in monetary terms that reflect real‑world cost asymmetries: missing a qualified candidate costs $40,000, conducting an unnecessary interview costs $2,500, and a phone screen costs $150. The action that maximizes expected utility—whether to proceed, request more evidence, or reject—is selected at each decision point.
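For the binary qualified/not-qualified case, the posterior update and cost-based action choice reduce to a few lines. The cost figures are the paper's; the prior and likelihood values below are hypothetical placeholders.

```python
# Costs from the paper (in dollars).
COST_MISS = 40_000       # rejecting a qualified candidate
COST_INTERVIEW = 2_500   # interviewing an unqualified candidate

def posterior(prior_qualified, lik_q, lik_nq):
    """Bayes' rule for a binary state given ensemble likelihoods
    p(evidence | qualified) and p(evidence | not qualified)."""
    num = prior_qualified * lik_q
    return num / (num + (1 - prior_qualified) * lik_nq)

def expected_cost(action, p_qualified):
    """Expected dollar cost of an action at belief p_qualified."""
    if action == "interview":
        return (1 - p_qualified) * COST_INTERVIEW
    return p_qualified * COST_MISS  # action == "reject"

# Hypothetical 10% base rate of qualified candidates and ensemble likelihoods.
p = posterior(0.10, lik_q=0.62, lik_nq=0.20)
best = min(("interview", "reject"), key=lambda a: expected_cost(a, p))
```

Sequential updating falls out for free: the posterior after the resume stage becomes the prior for the phone-screen stage, and so on down the pipeline.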

A key innovation is the incorporation of value‑of‑information (VOI) analysis. When the posterior does not provide sufficient confidence for a low‑cost decision, the system computes the expected reduction in total cost that would result from acquiring an additional piece of evidence (e.g., a phone interview). If the VOI exceeds the cost of the evidence, the system triggers the information‑gathering step; otherwise it proceeds with the current best action. This “disagreement‑driven” evidence acquisition leverages the diversity among LLMs: large inter‑model disagreement signals high epistemic uncertainty and justifies paying for extra data.
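The VOI test can be made concrete for a phone screen. The sketch below assumes a hypothetical screen accuracy (pass rates of 0.85 for qualified and 0.30 for unqualified candidates); only the three dollar costs come from the paper.

```python
COST_MISS, COST_INTERVIEW, COST_SCREEN = 40_000, 2_500, 150

def act_now_cost(p):
    """Expected cost of the best immediate action at belief p(qualified)."""
    return min(p * COST_MISS, (1 - p) * COST_INTERVIEW)

def voi_phone_screen(p, p_pass_q=0.85, p_pass_nq=0.30):
    """Expected cost reduction from observing a phone-screen outcome.

    Screen pass rates per state are assumptions, not the paper's numbers.
    """
    p_pass = p * p_pass_q + (1 - p) * p_pass_nq
    post_pass = p * p_pass_q / p_pass                  # belief if screen passed
    post_fail = p * (1 - p_pass_q) / (1 - p_pass)      # belief if screen failed
    expected_after = (p_pass * act_now_cost(post_pass)
                      + (1 - p_pass) * act_now_cost(post_fail))
    return act_now_cost(p) - expected_after

# Near the interview/reject indifference point, information is most valuable.
gain = voi_phone_screen(p=0.06)
screen_worth_it = gain > COST_SCREEN
```

Note the characteristic VOI shape: far from the decision boundary the screen barely changes the chosen action, so its value collapses toward zero and the $150 is not worth paying.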

Fairness is addressed by showing that ensemble aggregation naturally reduces demographic bias present in individual models. The authors further adjust priors or re‑weight posteriors to enforce demographic parity, measuring the maximum difference in selection rates across protected groups. In experiments on a dataset of 1,000 résumés, the proposed Bayesian orchestration reduces total hiring‑related costs by $294,000—a 34% improvement over the strongest single‑LLM baseline—while cutting the maximum group disparity from 22 to 5 percentage points (a 45% fairness gain). Ablation studies attribute 51% of the cost reduction to multi‑LLM aggregation, 43% to sequential Bayesian updating, and 20% to VOI‑driven information gathering.
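The parity metric itself is simple to state in code. The decision vectors below are fabricated to reproduce the paper's reported 22-point and 5-point gaps; they are illustrative, not the experimental data.

```python
def max_parity_gap(selections):
    """Maximum difference in selection rates across protected groups.

    `selections` maps group name -> list of 0/1 hiring decisions.
    """
    rates = {g: sum(d) / len(d) for g, d in selections.items()}
    return max(rates.values()) - min(rates.values())

# Hypothetical decisions: single-LLM vs ensemble, two groups of 100 each.
single_model = {"group_a": [1] * 30 + [0] * 70, "group_b": [1] * 8 + [0] * 92}
ensemble     = {"group_a": [1] * 20 + [0] * 80, "group_b": [1] * 15 + [0] * 85}

gap_single = max_parity_gap(single_model)    # 0.30 - 0.08 = 0.22
gap_ensemble = max_parity_gap(ensemble)      # 0.20 - 0.15 = 0.05
```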

The paper concludes by formalizing why these gains are not accidental. By correctly modeling LLM outputs as likelihoods, applying Bayes’ theorem, and optimizing expected utility under explicit cost structures, the framework attains the theoretical optimum for a broad class of sequential decision problems. The authors provide proofs that any classifier‑centric approach that ignores likelihoods, priors, or asymmetric utilities cannot achieve the same performance guarantees. This work thus establishes a mathematically principled foundation for cost‑aware, fair, and efficient LLM‑based autonomous agents.