Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses
Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structure. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI-Yonsei/MACI
💡 Research Summary
The paper introduces Multi‑LLM Adaptive Conformal Inference (MACI), a novel method for guaranteeing the factuality of large language model (LLM) outputs in high‑stakes domains such as medicine and law. Existing conformal inference approaches either apply a single global threshold (basic conformal inference, BCI) – which yields marginal coverage but discards many true claims, especially in heterogeneous sub‑populations – or use adaptive, sample‑specific thresholds (conditional conformal inference) that rely on simple linear models and adaptive error rates, making them unsuitable for fixed‑risk applications.
MACI addresses these shortcomings by (1) reformulating the conformity score as a multiplicative filter: the factuality of a document is modeled as the product of claim-level probabilities, rather than the worst-case score. This product-based score aggregates evidence across claims, reducing sensitivity to estimation error of any single claim. (2) Leveraging an ensemble of multiple LLMs (e.g., GPT-4, Claude, LLaMA) to compute claim-level probabilities, then combining them (weighted average or Bayesian model averaging) to obtain a higher-quality estimator $\hat p$. The ensemble improves the accuracy of the factuality scores, which the authors prove directly translates into higher retention (the proportion of true claims kept) under a fixed coverage target.
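As a rough sketch of the two ingredients above (the helper names and the uniform-weight default are ours, not the paper's; the exact combination scheme may differ), the ensemble estimate is a weighted average of per-model claim probabilities, and the document score is their product:

```python
import numpy as np

def ensemble_probs(model_scores, weights=None):
    """Combine claim-level factuality probabilities from several LLMs.

    model_scores: array of shape (n_models, n_claims), one row per LLM.
    weights: optional per-model weights; uniform averaging if omitted.
    (Hypothetical sketch of the weighted-average variant.)"""
    scores = np.asarray(model_scores, dtype=float)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.asarray(weights) @ scores  # shape (n_claims,)

def document_score(claim_probs):
    """Multiplicative score: product of claim-level probabilities."""
    return float(np.prod(claim_probs))
```

Unlike a worst-case (minimum) score, the product lets strong evidence on most claims partially offset a noisy estimate on one claim, which is the robustness property the summary describes.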
The theoretical contribution consists of two parts. First, the authors define an oracle filtering rule that assumes access to the true conditional probability $p^*(P,c)=\Pr(y=1\mid P,c)$. By ordering claims by decreasing oracle scores and selecting the largest prefix whose cumulative product exceeds a threshold $\tau$, they obtain exact coverage. Randomization at the boundary index (parameter $\gamma$) yields an unbiased, exact-coverage rule. Second, they replace the oracle with the estimated $\hat p$ and introduce a document-level conformity score $E_i = \inf\{\tau : F(\hat p,\tau,U_i)\subseteq A_i\}$, where $A_i$ is the set of true claims in document $i$. This scalar compresses the filtering requirement and enables standard conformal quantile calibration.
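A minimal sketch of the prefix filter and the scalar conformity score (function names and the `(gamma, u)` boundary-randomization interface are our simplification; the paper derives $\gamma$ so the rule attains exact rather than conservative coverage):

```python
import numpy as np

def prefix_filter(claim_probs, tau, gamma=0.0, u=1.0):
    """Keep the largest prefix of claims, ordered by decreasing score,
    whose cumulative product stays >= tau; optionally include the
    boundary claim with probability gamma (sketched via u < gamma)."""
    p = np.asarray(claim_probs, dtype=float)
    order = np.argsort(-p)               # indices sorted by decreasing score
    cum = np.cumprod(p[order])
    k = int(np.sum(cum >= tau))          # largest prefix with product >= tau
    if u < gamma and k < len(p):         # randomized boundary inclusion
        k += 1
    return set(order[:k].tolist())

def conformity_score(claim_probs, true_claims):
    """E_i: infimum over tau such that the filtered set is contained in
    true_claims (the set A_i of true claims in document i)."""
    p = np.asarray(claim_probs, dtype=float)
    order = np.argsort(-p)
    cum = np.cumprod(p[order])
    for pos, idx in enumerate(order):
        if idx not in true_claims:
            return float(cum[pos])       # tau must exceed this product to drop it
    return 0.0                           # all claims true: any tau is admissible
```

The scan in `conformity_score` finds the first false claim in score order; the threshold must exceed the cumulative product at that position to exclude it, which is exactly the infimum the definition asks for.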
Crucially, MACI extends the calibration to a group-conditional (Mondrian) setting. A grouping function $g$ maps each prompt-claim pair to one of $K$ pre-defined subpopulations (e.g., medical sub-domains, question types, demographic groups). For each group $k$, a separate quantile $\hat Q^{(k)}_{1-\alpha}$ is computed from the conformity scores of calibration examples belonging to that group. The resulting test-time filter uses the group-specific threshold, guaranteeing finite-sample, distribution-free coverage within each group (Theorem 2). This avoids the coverage disparities observed with marginal guarantees.
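The Mondrian calibration step can be sketched as follows (a simplification: the finite-sample adjustment and `method='higher'` follow standard split-conformal practice and assume NumPy >= 1.22, not necessarily the paper's exact implementation):

```python
import numpy as np

def group_quantiles(conformity_scores, groups, alpha):
    """Per-group (Mondrian) conformal quantiles.

    conformity_scores: scalar scores E_i from calibration documents.
    groups: group label g(i) for each calibration document.
    alpha: target miscoverage (e.g. 0.05 for 95% coverage)."""
    quantiles = {}
    for k in set(groups):
        s = np.sort([e for e, g in zip(conformity_scores, groups) if g == k])
        n = len(s)
        # finite-sample-adjusted empirical quantile, capped at the max score
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        quantiles[k] = float(np.quantile(s, level, method='higher'))
    return quantiles
```

At test time, a document in group $k$ would be filtered with threshold `quantiles[k]`; small groups push `level` toward 1.0, which is the conservatism in sparse groups the authors flag as a limitation.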
From an efficiency standpoint, the authors prove that the retention ratio depends on the L1 deviation between the oracle scores and the estimator. Hence, improving $\hat p$ via an ensemble directly improves retention. Empirically, MACI is evaluated on four public datasets covering medical QA, legal document summarization, and general knowledge QA. Across all settings, MACI meets the user-specified coverage level (e.g., 95%) while retaining 10–30 percentage points more true claims than BCI or conditional conformal baselines. The ensemble reduces mean absolute error of the factuality scores by 15–25%, and the multiplicative filter yields tighter calibration, allowing fewer claims to be discarded.
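Schematically, the retention result has the shape of the following bound (our paraphrase of the L1 dependence; the constant $C$ and the exact conditions are in the paper):

```latex
\mathrm{Retention}(\hat p) \;\ge\; \mathrm{Retention}(p^*) \;-\; C \,\mathbb{E}\bigl[\,\lvert \hat p(P,c) - p^*(P,c) \rvert\,\bigr]
```

Any estimator that shrinks the expected absolute error, such as the multi-LLM ensemble, therefore tightens the gap to the oracle's retention.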
Computationally, MACI is lightweight: claim‑level scores from each LLM can be computed in parallel, and the calibration step requires only sorting the scalar conformity scores (O(n log n)). Compared with sampling‑based consistency checks that require multiple forward passes per claim, MACI’s runtime is roughly 30 % lower, making it suitable for real‑time applications.
Limitations noted by the authors include (i) the need for predefined groups, which may require domain expertise; (ii) the cost and licensing considerations of running multiple large LLMs; and (iii) potential conservatism in groups with very few calibration examples, where the quantile estimate becomes noisy. Future work is suggested on automatic group discovery, adaptive sampling to bolster small groups, and distillation techniques to approximate the ensemble with a single, cheaper model.
In summary, MACI advances the state of the art in factuality‑aware LLM deployment by delivering group‑conditional coverage guarantees while maximizing retention and reducing computational overhead through a principled multiplicative filtering framework and multi‑LLM ensembles.