Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection


Large language models (LLMs) are often ensembled to improve overall reliability and robustness, but in practice the models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models with a Gaussian copula and derive an information-theoretic error floor for ensemble performance. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We evaluate our approach on two question-answering datasets and one binary sentiment-classification dataset: MedMCQA, MMLU, and IMDB movie reviews. Across all datasets, our method consistently outperforms strong baselines under the same query budget.


💡 Research Summary

Large language models (LLMs) are increasingly deployed as decision‑making engines, but their outputs are often unstable. A common remedy is to ensemble several LLMs, yet real‑world constraints (latency, cost) prevent querying every available model for each request. Moreover, models that share training data, architecture, or pre‑training objectives tend to make correlated mistakes, so simply picking the top‑k most accurate models can be sub‑optimal or even harmful.

The authors formalize the budgeted ensemble selection problem: given a pool of m LLMs and a per‑query budget k < m, choose a subset S of size k that minimizes the error of the optimal MAP aggregator. They adopt an information‑theoretic viewpoint, defining the objective as maximizing the mutual information I(Y; X_S) between the true binary label Y and the predictions X_S of the selected models.
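
In symbols (a compact restatement using standard information-theory notation; h_b denotes the binary entropy function, and the last inequality is Fano's inequality for a binary label, the usual bridge between mutual information and MAP error):

```latex
S^{\star} \;=\; \arg\max_{S \subseteq [m],\; |S| = k} I(Y; X_S),
\qquad
I(Y; X_S) \;=\; H(Y) - H(Y \mid X_S),
\qquad
H(Y \mid X_S) \;\le\; h_b\!\bigl(P_e^{\mathrm{MAP}}(S)\bigr).
```

Maximizing I(Y; X_S) therefore minimizes the residual uncertainty H(Y | X_S), which in turn tightens the bound on the error of the MAP aggregator.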

First, they show that when model errors are independent, the problem reduces to the classic “pick the k most accurate models” rule. Theorem 4.1 proves that under independence the subset that maximizes mutual information also minimizes the error probability, thereby justifying the Top‑k baseline in that special case.

Real LLM ensembles, however, exhibit strong error correlations. To capture this, the paper introduces a Gaussian‑copula model for the latent error variables. A latent Gaussian vector Z ∼ N(0, Σ) is thresholded to produce binary error indicators E_j. The marginal thresholds encode each model’s individual error rate, while the correlation matrix Σ captures pairwise error dependence. In the equicorrelated setting (all off‑diagonal entries equal to ρ), the model simplifies to a one‑factor representation Z_j = √ρ U + √(1‑ρ) ξ_j, where U is a shared latent factor.
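
The one-factor construction is easy to simulate. The following is an illustrative sketch, not the paper's code: it assumes NumPy, uses the stdlib `NormalDist` for the Gaussian quantile, and the function name is mine.

```python
import numpy as np
from statistics import NormalDist

def sample_correlated_errors(error_rates, rho, n, seed=0):
    """Sample binary error indicators E (n x m) from an equicorrelated
    Gaussian copula: Z_j = sqrt(rho)*U + sqrt(1-rho)*xi_j, E_j = 1{Z_j < t_j}."""
    rng = np.random.default_rng(seed)
    m = len(error_rates)
    U = rng.standard_normal(n)                      # shared latent factor
    xi = rng.standard_normal((n, m))                # idiosyncratic noise
    Z = np.sqrt(rho) * U[:, None] + np.sqrt(1 - rho) * xi
    # Marginal thresholds t_j = Phi^{-1}(error rate of model j)
    t = np.array([NormalDist().inv_cdf(p) for p in error_rates])
    return (Z < t).astype(int)                      # error indicators E_j

E = sample_correlated_errors([0.2, 0.3], rho=0.5, n=200_000)
```

Because the thresholds only set the marginals while ρ only sets the dependence, the simulated errors hit each model's individual error rate while co-occurring far more often than independence would predict.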

Using this framework, the authors derive a decomposition of the conditional mutual information gain when adding a new model j to an existing subset S (Theorem 4.3): Δ(j | S) = I(Y; X_j) − I(X_j; X_S) + I(E_j; E_S) + Λ_j(S). The first term measures the model’s standalone accuracy, the second penalizes redundancy with already‑selected models, the third captures structured error‑pattern overlap, and the final term corrects for label‑dependence. This mirrors the maximum relevance‑minimum redundancy (mRMR) principle from feature selection, providing a solid theoretical justification for a greedy selection strategy that repeatedly picks the model with the largest Δ(j | S).

The paper also establishes an information‑theoretic error floor for uniformly correlated ensembles (Theorem 4.4). Even as k grows without bound, the error probability converges to a non‑zero limit determined by the shared latent factor ρ; thus, adding more models cannot eliminate uncertainty that originates from common failure modes. This explains the empirically observed saturation of performance when many highly correlated LLMs are ensembled.
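
The saturation is visible numerically. In the equicorrelated model above, the error of simple majority voting over k models converges to Φ(t/√ρ) > 0 as k grows, where t = Φ⁻¹(err); the paper's Theorem 4.4 concerns the optimal aggregator, but the qualitative floor is the same. A Monte-Carlo sketch (assuming NumPy; function and variable names are mine):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n, rho, err = 100_000, 0.5, 0.3
t = NormalDist().inv_cdf(err)             # per-model error threshold

def majority_vote_error(k):
    """Monte-Carlo majority-vote error over k equicorrelated models."""
    U = rng.standard_normal(n)
    xi = rng.standard_normal((n, k))
    Z = np.sqrt(rho) * U[:, None] + np.sqrt(1 - rho) * xi
    E = Z < t                             # E_j = 1 means model j errs
    return (E.sum(axis=1) > k / 2).mean() # majority of models wrong

floor = NormalDist().cdf(t / np.sqrt(rho))
for k in (1, 5, 25, 125):
    print(f"k={k:3d}  error={majority_vote_error(k):.3f}  floor={floor:.3f}")
```

With err = 0.3 and ρ = 0.5 the floor is about 0.23: going from 1 to 125 models only buys roughly seven points of accuracy, because the shared factor U makes all models fail together on a fixed fraction of inputs.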

Algorithmically, the authors propose a greedy mutual‑information selection procedure. Mutual information terms are estimated directly from a validation set using empirical entropy estimators, and at each iteration the model that yields the greatest Δ(j | S) is added to the ensemble. The MAP decision rule is then applied to the selected subset. The computational cost scales as O(m k), making it practical for realistic model pools.
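
The greedy loop can be sketched with plug-in entropy estimates. This is an illustrative toy, not the authors' implementation: it maximizes the joint empirical I(Y; X_S) directly, which selects the same model at each step as maximizing the gain Δ(j | S), and all names and the synthetic model pool are mine.

```python
import numpy as np
from collections import Counter

def entropy(rows):
    """Plug-in entropy (bits) of the empirical distribution of row tuples."""
    counts = np.array(list(Counter(map(tuple, rows)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_info(X, y):
    """Empirical I(Y; X) for a block of discrete prediction columns X."""
    return entropy(X) + entropy(y.reshape(-1, 1)) - entropy(np.column_stack([X, y]))

def greedy_mi_select(preds, y, k):
    """Greedily add the model whose inclusion most increases the
    estimated I(Y; X_S), up to the query budget k."""
    selected = []
    for _ in range(k):
        rest = [j for j in range(preds.shape[1]) if j not in selected]
        selected.append(max(rest, key=lambda j: mutual_info(preds[:, selected + [j]], y)))
    return selected

# Toy pool: model 1 duplicates model 0; model 2 is weaker but independent
rng = np.random.default_rng(1)
n = 5000
y = rng.integers(0, 2, n)
m0 = np.where(rng.random(n) < 0.1, 1 - y, y)   # 90% accurate
m1 = m0.copy()                                 # fully redundant copy of m0
m2 = np.where(rng.random(n) < 0.3, 1 - y, y)   # 70% accurate, independent errors
preds = np.column_stack([m0, m1, m2])
print(greedy_mi_select(preds, y, 2))           # -> [0, 2]
```

The toy pool reproduces the paper's core point: under a budget of k = 2, greedy MI skips the redundant duplicate (model 1) in favor of the less accurate but independent model 2, since the duplicate contributes zero conditional information.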

Experiments on three benchmarks—MedMCQA and MMLU (multiple‑choice question answering) and IMDB (binary sentiment classification)—demonstrate that the proposed method consistently outperforms strong baselines (Top‑k by accuracy, weighted voting, LLM‑TOPLA, MUSE) under identical query budgets. Notably, in scenarios where a group of GPT‑family models share the same error pattern, the greedy MI selector prefers a more diverse mix (e.g., Gemini, Claude, Llama), achieving higher ensemble accuracy despite a lower average individual accuracy. The results validate both the theoretical claims and the practical utility of the approach.

In summary, the paper contributes:

  1. A Gaussian‑copula based model that cleanly separates individual accuracies from inter‑model error correlations.
  2. A proof that Top‑k selection is optimal only under independence, and a mutual‑information based decomposition that quantifies the penalty of redundancy.
  3. An information‑theoretic saturation theorem explaining why ensembles of correlated LLMs eventually stop improving.
  4. A simple, data‑driven greedy algorithm that maximizes mutual information under a query budget.
  5. Empirical evidence across diverse tasks that the method yields consistent gains over existing ensemble selection techniques.

Future directions include extending the framework to multi‑class or multi‑label settings, handling continuous textual outputs via differentiable mutual‑information estimators, and integrating dynamic budget allocation strategies for real‑time LLM services.

