Are foundation models for computer vision good conformal predictors?
Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has received little attention. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
💡 Research Summary
This paper investigates how modern vision and vision‑language foundation models perform under the Conformal Prediction (CP) framework, which offers finite‑sample marginal coverage guarantees for the true class while producing prediction sets rather than single labels. The authors evaluate 17 state‑of‑the‑art foundation models—including self‑supervised Vision Transformers (DINO, DINOv2), contrastive models (CLIP, MetaCLIP, LLaVA, Phi) and ResNet‑based VICReg variants—across three canonical image classification benchmarks (CIFAR‑10, CIFAR‑100, ImageNet) and several distribution‑shifted variants of ImageNet. They pair each model with three widely used CP methods: Least Ambiguous Classifier (LAC), Adaptive Prediction Sets (APS), and Regularized Adaptive Prediction Sets (RAPS).
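The three conformal scores named above have standard formulations that can be sketched in a few lines of NumPy. The sketch below uses the common non-randomized variants; the function names and the RAPS defaults `lam`/`k_reg` are illustrative choices, not the paper's code.

```python
import numpy as np

def lac_scores(probs, labels):
    # LAC non-conformity: one minus the softmax probability of the true class.
    return 1.0 - probs[np.arange(len(labels)), labels]

def aps_scores(probs, labels):
    # APS non-conformity: cumulative probability mass of all classes ranked
    # at least as high as the true class (non-randomized variant).
    order = np.argsort(-probs, axis=1)                 # classes by prob, desc.
    sorted_probs = np.take_along_axis(probs, order, axis=1)
    cumsum = np.cumsum(sorted_probs, axis=1)
    ranks = np.argmax(order == labels[:, None], axis=1)
    return cumsum[np.arange(len(labels)), ranks]

def raps_scores(probs, labels, lam=0.01, k_reg=5):
    # RAPS adds a rank penalty lam * max(0, rank + 1 - k_reg) to the APS
    # score, discouraging very deep (large) prediction sets.
    ranks = np.argmax(np.argsort(-probs, axis=1) == labels[:, None], axis=1)
    return aps_scores(probs, labels) + lam * np.maximum(0, ranks + 1 - k_reg)

def conformal_threshold(scores, alpha=0.1):
    # Finite-sample-corrected (1 - alpha) quantile of calibration scores,
    # which yields the marginal coverage guarantee on exchangeable test data.
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")
```

At test time, a sample's prediction set contains every class whose score falls at or below the calibrated threshold.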
Key experimental dimensions are: (i) standard in‑distribution (ID) evaluation with a large calibration set, (ii) robustness under domain shift (ImageNet‑C, ImageNet‑A, ImageNet‑V2), (iii) the effect of post‑hoc confidence calibration (e.g., temperature scaling), and (iv) few‑shot adaptation of a Vision‑Language Model (CLIP) to ten fine‑grained downstream tasks via linear probing. The authors measure average set size (efficiency), empirical marginal coverage, class‑conditional coverage gap, and minimum class‑conditional coverage (MCCC).
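The evaluation metrics listed above admit a compact implementation. The sketch below represents prediction sets as a boolean matrix and uses one plausible definition of the class-conditional coverage gap (mean absolute deviation of per-class coverage from marginal coverage); the paper's exact definition may differ in detail.

```python
import numpy as np

def cp_metrics(pred_sets, labels, n_classes):
    """Summarize conformal prediction sets.

    pred_sets: boolean array (n_samples, n_classes), True where a class
    is included in that sample's prediction set.
    """
    hits = pred_sets[np.arange(len(labels)), labels]
    coverage = hits.mean()                       # empirical marginal coverage
    avg_size = pred_sets.sum(axis=1).mean()      # efficiency (avg. set size)
    per_class = np.array([hits[labels == c].mean()
                          for c in range(n_classes) if (labels == c).any()])
    return {
        "coverage": coverage,
        "avg_set_size": avg_size,
        "cov_gap": np.abs(per_class - coverage).mean(),  # class-cond. gap
        "min_class_coverage": per_class.min(),           # MCCC
    }
```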
Findings:
- Transformer‑based foundation models excel in CP – Vision Transformers consistently yield smaller average prediction sets and higher MCCC than ResNet‑based counterparts. Their softmax distributions are more sharply separated, which benefits non‑conformity scoring.
- APS provides the most reliable coverage – Across all settings, APS attains coverage closest to the target 1‑α, even under severe distribution shift, albeit with larger set sizes. RAPS, by adding a regularization term, reduces set size by roughly 10‑15 % while preserving coverage above 90 % in most OOD scenarios. LAC produces the smallest sets but suffers dramatic coverage drops on imbalanced or shifted data.
- Calibration harms CP efficiency – Applying temperature scaling or other post‑hoc calibration improves probability calibration but inflates non‑conformity scores, leading to larger prediction sets (8‑15 % increase). Coverage gaps shrink slightly, indicating a trade‑off: better‑calibrated probabilities do not translate into more efficient conformal sets.
- Few‑shot adaptation improves CP metrics – Adapting CLIP to downstream tasks with 1‑5‑shot linear probing reduces average set size by 8‑12 % and narrows the coverage gap by 2‑4 % compared with zero‑shot inference on ID data. Gains on OOD data are modest, suggesting that adaptation mainly refines the model's confidence on familiar domains.
- Robustness to domain shift – Under synthetic corruptions and style changes, APS maintains the highest coverage retention, while RAPS offers a favorable balance between robustness and efficiency. ConvNet‑based models experience sharp coverage degradation and set‑size explosion as corruption severity rises.
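To make the set-construction side of these findings concrete, here is a minimal sketch of non-randomized APS prediction sets built from a calibrated threshold; the `T` parameter of `softmax` marks where post-hoc temperature scaling would enter the pipeline. Both helpers are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; T > 1 flattens the distribution, which is
    # how post-hoc calibration alters the non-conformity score distribution.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def aps_sets(probs, threshold):
    # Non-randomized APS: include classes in decreasing-probability order
    # until the cumulative mass reaches the calibrated threshold.
    order = np.argsort(-probs, axis=1)
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    # The class at rank r is kept iff the cumulative mass through rank r-1
    # is still below the threshold (the top-ranked class is always kept).
    keep_sorted = np.concatenate(
        [np.ones((len(probs), 1), bool), cumsum[:, :-1] < threshold], axis=1)
    sets = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(sets, order, keep_sorted, axis=1)
    return sets
```

Note that in practice the threshold is recalibrated after temperature scaling; the paper's finding is that, even then, calibrated probabilities tend to yield less efficient (larger) adaptive sets.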
Practical implications: For high‑stakes applications (medical imaging, autonomous driving, security), deploying foundation models together with APS (or RAPS when set size is critical) is advisable to guarantee marginal coverage without sacrificing too much efficiency. Post‑hoc calibration should be used cautiously, as it can undermine CP’s set‑size benefits. Vision Transformers should be preferred as the backbone for conformalized systems due to their superior conditional coverage and resilience to distribution shift.
In summary, the study demonstrates that vision foundation models, especially those built on Vision Transformers, are highly compatible with conformal prediction methods. APS emerges as the most robust CP technique, while RAPS offers a practical efficiency boost. Calibration, while improving raw confidence scores, can degrade conformal set efficiency. These insights provide concrete guidance for safely integrating foundation models into risk‑sensitive, real‑world computer‑vision pipelines.