Making Foundation Models Probabilistic via Singular Value Ensembles

Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining. However, these models often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size, making it impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method built on a simple but powerful core assumption: the singular vectors of the weight matrices constitute meaningful subspaces of the model’s knowledge. Pretrained foundation models encode rich, transferable information in their weight matrices; if the singular vectors are indeed meaningful (orthogonal) “knowledge directions”, an ensemble can be obtained by modulating only how strongly each direction contributes to the output. Rather than learning entirely new parameters, we freeze the singular vectors and train only per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally, as stochastic initialization and random mini-batch sampling during joint training cause different members to converge to different combinations of the same underlying knowledge. SVE achieves uncertainty quantification comparable to explicit deep ensembles while increasing the parameter count of the base model by less than 1%, making principled uncertainty estimation accessible in resource-constrained settings. We validate SVE on NLP and vision tasks with various backbones and show that it improves calibration while maintaining predictive accuracy.


💡 Research Summary

The paper tackles the pressing problem of epistemic uncertainty estimation for large‑scale foundation models, whose predictions are often over‑confident despite impressive accuracy. Traditional deep ensembles provide reliable uncertainty by training multiple independent models, but the linear scaling of memory and compute makes this approach infeasible for billion‑parameter networks. To bridge this gap, the authors introduce Singular Value Ensemble (SVE), an implicit ensemble technique that leverages the singular‑value decomposition (SVD) of each linear weight matrix.

In SVE, a pretrained model’s weight matrix $W$ is factorised as $W = U\Sigma V^\top$. The orthogonal matrices $U$ and $V$ are interpreted as a “knowledge basis” encoding semantic directions learned during pre-training. These bases are frozen for all ensemble members. For each member $m$, a separate copy of the singular-value vector $\Sigma^{(m)}$ is introduced and trained while all other parameters remain unchanged. During inference, member $m$ uses $W^{(m)} = U\Sigma^{(m)}V^\top$. Diversity arises from (i) a small multiplicative Gaussian perturbation applied to each member’s initial singular values, and (ii) stochastic mini-batch sampling during joint optimisation, which drives members toward distinct rescalings of the shared subspaces.
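The per-member construction above can be sketched in a few lines of NumPy. This is a minimal illustration; the function names, the noise scale, and the ensemble size are assumptions for the sketch, not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sve_init(W, n_members=4, noise_std=0.05):
    """Decompose a pretrained weight matrix and create per-member
    singular values with small multiplicative Gaussian perturbations.
    (Hypothetical sketch; noise_std and n_members are illustrative.)"""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Each member independently rescales the shared, frozen knowledge basis.
    members = [s * (1.0 + noise_std * rng.standard_normal(s.shape))
               for _ in range(n_members)]
    return U, Vt, members

def sve_forward(x, U, Vt, member_s):
    """Forward pass of one member: x @ W^(m), with W^(m) = U diag(s^(m)) V^T."""
    return x @ (U * member_s) @ Vt   # U * s scales the columns of U

# Example: a 16x8 "pretrained" weight matrix and a batch of 4 inputs.
W = rng.standard_normal((16, 8))
U, Vt, members = sve_init(W)
x = rng.standard_normal((4, 16))
# Ensemble prediction: average the member outputs.
outputs = np.stack([sve_forward(x, U, Vt, s) for s in members])
mean_out = outputs.mean(axis=0)
```

With `noise_std=0` all members coincide and each member's weight reconstructs the original $W$ exactly, which is what makes the frozen-basis interpretation convenient: only the scalars along each direction are ever trained.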

Because only one scalar per singular direction is learned, the parameter overhead per layer is $\min(m,n)$ values for an $m \times n$ matrix, typically less than 1% of the original model size. Compared to low-rank adaptation methods such as LoRA, which add factor matrices of size $d \times r$, SVE reduces trainable parameters by orders of magnitude while preserving the expressive power of the original basis. The method is compatible with any transformer-style architecture; the authors apply it to all projection matrices (query, key, value, feed-forward) in vision and language models.
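The parameter-count comparison can be made concrete with a quick back-of-the-envelope calculation. The layer dimensions and LoRA rank below are illustrative choices, not figures from the paper:

```python
def sve_params(d_out, d_in, n_members):
    """Trainable scalars per layer for SVE: one singular value per
    direction per member, i.e. n_members * min(d_out, d_in)."""
    return n_members * min(d_out, d_in)

def lora_params(d_out, d_in, rank, n_members):
    """A LoRA-style ensemble adds factors A (d_out x r) and B (r x d_in)
    per member: n_members * r * (d_out + d_in)."""
    return n_members * rank * (d_out + d_in)

# Illustrative numbers for a 4096x4096 projection (a LLaMA-2-7B-scale layer),
# with a 4-member ensemble and a hypothetical LoRA rank of 8.
d = 4096
print(sve_params(d, d, 4))       # 16384 trainable scalars
print(lora_params(d, d, 8, 4))   # 262144 trainable parameters
```

Even at a modest rank, the per-layer overhead of the low-rank adapters exceeds the SVE scalars by more than an order of magnitude at this scale, and the gap widens as the rank grows.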

The experimental suite is extensive. In vision, SVE is evaluated on DINOv2, CLIP, and ViT backbones across Flowers102, CIFAR‑100, DTD, Oxford Pets, CIFAR‑100‑C corruptions, and out‑of‑distribution (OOD) detection benchmarks. In NLP, BERT‑base and LLaMA‑2‑7B are fine‑tuned on ARC‑Easy, SST‑2, and OOD text sets. Baselines include a single model, a conventional deep ensemble (4–8 members), MC‑Dropout, LoRA‑Ensemble, and FiLM‑Ensemble. Metrics cover accuracy, Expected Calibration Error (ECE), Brier score, Negative Log‑Likelihood, and AUROC for OOD detection.
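For reference, the Expected Calibration Error used throughout these evaluations is the standard binned formulation: the confidence-weighted gap between mean confidence and accuracy inside each confidence bin. A minimal sketch, with a toy input whose values are purely illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |mean confidence
    in bin - accuracy in bin|. Inputs are per-sample top-class
    confidences and 0/1 correctness indicators."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: four predictions, three of them correct.
conf = np.array([0.9, 0.8, 0.6, 0.95])
correct = np.array([1.0, 1.0, 0.0, 1.0])
ece = expected_calibration_error(conf, correct)
```

A perfectly calibrated model (confidence equals accuracy in every bin) yields an ECE of zero; an overconfident model accumulates positive gaps.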

Results show that SVE matches or exceeds deep ensembles in calibration (ECE reductions of 30–45 %) while incurring virtually no loss in accuracy (≤ 0.3 % drop). OOD detection improves by 5–12 % AUROC over MC‑Dropout and outperforms LoRA‑Ensemble despite using far fewer additional parameters. Ablation studies confirm that the multiplicative noise at initialization is crucial for diversity; removing it collapses the ensemble to a near‑deterministic predictor. Moreover, limiting training to the top 10–20 singular values retains most of the performance gains, indicating that the most important knowledge directions are concentrated in a low‑dimensional subspace.
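The top-10-to-20 ablation can be sketched as follows, under the assumption that the remaining directions simply keep their frozen pretrained singular values. This is a hypothetical reading of the ablation, not released code:

```python
import numpy as np

rng = np.random.default_rng(1)

def topk_sve_weight(W, s_member, k):
    """Rebuild one member's weight using its trained values only for the
    top-k singular directions; all other directions keep the frozen
    pretrained singular values. (Illustrative sketch.)"""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_mixed = s.copy()
    s_mixed[:k] = s_member[:k]   # np.linalg.svd returns values in descending order
    return (U * s_mixed) @ Vt

W = rng.standard_normal((12, 6))
# A member whose singular values were (hypothetically) rescaled in training:
_, s_pre, _ = np.linalg.svd(W, full_matrices=False)
s_member = s_pre * 1.1
W_m = topk_sve_weight(W, s_member, k=3)
```

Because the singular values are sorted by magnitude, restricting training to the top k directions concentrates the ensemble's variability in the highest-energy subspace, which is consistent with the finding that most of the gains survive the restriction.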

The authors acknowledge limitations: (1) scaling SVD to very large matrices can be computationally expensive, suggesting the need for block‑wise or approximate decompositions; (2) rescaling singular values alone may not capture all non‑linear interactions downstream of activation functions; (3) diversity currently relies on stochastic training dynamics rather than explicit regularisation. Future work could explore dynamic rank selection, partial fine‑tuning of singular vectors, or hybrid schemes that combine SVE with low‑rank adapters.

In conclusion, SVE provides a highly parameter‑efficient pathway to equip foundation models with trustworthy uncertainty estimates. By freezing the semantic basis and only learning per‑member scalings, it delivers deep‑ensemble‑level calibration with less than 1 % extra parameters, making uncertainty quantification accessible even on modest hardware. This opens the door for broader deployment of large language and vision models in safety‑critical domains where knowing “when not to trust” is as important as raw predictive performance.
