Classifier Calibration at Scale: An Empirical Study of Model-Agnostic Post-Hoc Methods

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We study model-agnostic post-hoc calibration methods intended to improve probabilistic predictions in supervised binary classification on real i.i.d. tabular data, with particular emphasis on conformal and Venn-based approaches that provide distribution-free validity guarantees under exchangeability. We benchmark 21 widely used classifiers, including linear models, SVMs, tree ensembles (CatBoost, XGBoost, LightGBM), and modern tabular neural and foundation models, on binary tasks from the TabArena-v0.1 suite using randomized, stratified five-fold cross-validation with a held-out test fold. Five calibrators (isotonic regression, Platt scaling, Beta calibration, Venn-Abers predictors, and Pearsonify) are trained on a separate calibration split and applied to test predictions. Calibration is evaluated using proper scoring rules (log-loss and Brier score) and diagnostic measures (Spiegelhalter's Z, ECE, and ECI), alongside discrimination (AUC-ROC) and standard classification metrics. Across tasks and architectures, Venn-Abers predictors achieve the largest average reductions in log-loss, followed closely by Beta calibration, while Platt scaling exhibits weaker and less consistent effects. Beta calibration improves log-loss most frequently across tasks, whereas Venn-Abers displays fewer instances of extreme degradation and slightly more instances of extreme improvement. Importantly, we find that commonly used calibration procedures, most notably Platt scaling and isotonic regression, can systematically degrade proper scoring performance for strong modern tabular models. Overall classification performance is often preserved, but calibration effects vary substantially across datasets and architectures, and no method dominates uniformly. In expectation, all methods except Pearsonify slightly increase accuracy, but the effect is marginal, with the largest expected gain about 0.008%.


💡 Research Summary

This paper presents a large‑scale empirical evaluation of model‑agnostic post‑hoc calibration techniques for binary classification on real i.i.d. tabular data. The authors benchmark 21 widely used classifiers—including linear models, support vector machines, tree ensembles (CatBoost, XGBoost, LightGBM, Gradient Boosting, Random Forest, Explainable Boosting Machine), and modern tabular neural and foundation models (TabTransformer, TabICL, TabPFN, TabM, ModernNCA, REMLP)—across the binary tasks of the TabArena‑v0.1 suite. For each classifier, five post‑hoc calibrators are trained on a held‑out calibration split: isotonic regression, Platt scaling, Beta calibration, Venn‑Abers predictors, and Pearsonify.

Calibration performance is assessed with proper scoring rules (log‑loss and Brier score), diagnostic measures (Expected Calibration Error, Expected Calibration Index, Spiegelhalter’s Z‑statistic), and standard classification metrics (AUC‑ROC, accuracy, precision, recall, F1). The experimental protocol uses randomized, stratified five‑fold cross‑validation, with four folds for training and one fold for out‑of‑sample testing; the calibration set is taken from the training folds.
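The protocol above can be sketched in scikit-learn: stratified five-fold cross-validation, with a calibration split carved out of the training folds and a calibrator fit on held-out scores. This is a minimal sketch on synthetic data with an assumed 75/25 fit/calibration split and isotonic regression as the example calibrator; it is not the paper's code, and the real benchmark uses TabArena tasks and 21 classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import log_loss

# Synthetic stand-in for a TabArena binary task (the real suite is not used here).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
raw_losses, cal_losses = [], []
for train_idx, test_idx in cv.split(X, y):
    # Carve a calibration split out of the four training folds (assumed 25%).
    X_fit, X_cal, y_fit, y_cal = train_test_split(
        X[train_idx], y[train_idx], test_size=0.25,
        stratify=y[train_idx], random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    # Fit the calibrator on held-out calibration-set scores only.
    iso = IsotonicRegression(out_of_bounds="clip").fit(
        clf.predict_proba(X_cal)[:, 1], y_cal)
    # Evaluate raw vs. calibrated probabilities on the out-of-sample test fold.
    p_test = clf.predict_proba(X[test_idx])[:, 1]
    raw_losses.append(log_loss(y[test_idx], p_test))
    cal_losses.append(log_loss(
        y[test_idx], np.clip(iso.predict(p_test), 1e-6, 1 - 1e-6)))

print(f"raw log-loss: {np.mean(raw_losses):.4f}, "
      f"calibrated: {np.mean(cal_losses):.4f}")
```

Clipping the isotonic output away from exact 0 and 1 keeps log-loss finite, since isotonic regression can emit degenerate probabilities on sparse bins.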

Key findings:

  1. Venn‑Abers predictors achieve the largest average reduction in log‑loss (‑14.17 %) and are the most consistent in avoiding extreme degradation.
  2. Beta calibration follows closely (‑13.70 % log‑loss reduction) and improves log‑loss most frequently (67.1 % of instances).
  3. Platt scaling shows mixed results, improving log‑loss in only about half of the cases (≈49.8 %) and sometimes worsening it, especially for strong modern models.
  4. Isotonic regression tends to slightly increase log‑loss on average, though it can improve Brier score modestly.
  5. Pearsonify consistently degrades proper scoring performance and does not improve accuracy.
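Platt scaling (finding 3) fits a single logistic sigmoid, sigma(a*s + b), to the model's scores, which is why it can be too rigid for already well-calibrated modern models. A minimal sketch of the technique, using logistic regression on the raw score as a single feature with synthetic data; this illustrates the method generically, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(scores_cal, y_cal, scores_test):
    """Fit sigma(a*s + b) on calibration scores, apply to test scores."""
    lr = LogisticRegression()
    lr.fit(scores_cal.reshape(-1, 1), y_cal)
    return lr.predict_proba(scores_test.reshape(-1, 1))[:, 1]

# Toy example: a model whose scores separate the classes but are miscalibrated.
rng = np.random.default_rng(0)
y_cal = rng.integers(0, 2, 500)
scores_cal = np.clip(y_cal * 0.6 + rng.normal(0.2, 0.15, 500), 0.0, 1.0)

p = platt_scale(scores_cal, y_cal, np.array([0.2, 0.5, 0.8]))
# The fitted sigmoid is strictly monotone, so ranking (and thus AUC) is preserved.
assert np.all((p > 0) & (p < 1)) and np.all(np.diff(p) > 0)
```

Because the map has only two parameters, Platt scaling cannot fix asymmetric or non-sigmoidal miscalibration, which is one plausible reason for its mixed results on strong models.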

For Brier score, Venn‑Abers leads with a 4.14 % reduction, followed by Beta calibration (3.91 %). In terms of discrimination, only Beta calibration yields a marginal AUC‑ROC gain (+0.062 %); all other methods tend to degrade AUC‑ROC, though Venn‑Abers shows fewer extreme drops. Accuracy gains are negligible across the board (maximum expected increase ≈0.008 %).
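Of the diagnostics used, ECE is the simplest to compute: bin the predicted probabilities and take the sample-weighted mean gap between empirical frequency and mean prediction per bin. The sketch below implements one common equal-width-bin variant for binary problems; the paper's exact binning scheme (bin count, equal-width vs. equal-mass, confidence- vs. probability-based) may differ.

```python
import numpy as np

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Equal-width-bin ECE: weighted mean |empirical rate - mean prediction| per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # First bin is closed on the left so p = 0 is not dropped.
        mask = ((p_pred >= lo) if lo == 0.0 else (p_pred > lo)) & (p_pred <= hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_pred[mask].mean())
    return ece

# A constant prediction matching the base rate is perfectly calibrated: ECE = 0.
y = np.array([0, 1, 0, 1])
assert expected_calibration_error(y, np.full(4, 0.5)) == 0.0
```

Note that ECE is not a proper scoring rule: it rewards the constant base-rate predictor above, which is why the paper pairs it with log-loss and Brier score.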

Computationally, all calibrators add overhead during training because they require a second inference pass on the calibration set. Venn‑Abers and Platt scaling increase inference CPU time by 139.5 % and 47.1 % respectively, while Beta, isotonic, and Pearsonify slightly reduce inference cost. Memory usage roughly doubles during training for all calibrators, but peak RAM during inference is only modestly affected.

The study highlights that widely used calibrators such as Platt scaling and isotonic regression can systematically harm proper‑scoring performance for high‑capacity modern tabular models, contradicting the common practice of applying them by default. In contrast, distribution‑free methods like Venn‑Abers, which provide validity guarantees under exchangeability, and flexible parametric approaches like Beta calibration, offer more reliable improvements without sacrificing discrimination or accuracy.

Practical implications: practitioners should prioritize Beta calibration or Venn‑Abers when calibrating probabilities for modern tabular classifiers, especially when preserving overall classification performance is critical. Simple linear calibrators may still be useful for weaker models but must be applied with caution for state‑of‑the‑art ensembles and neural architectures. The authors also release code and artifacts to facilitate reproducibility and further exploration of calibration in tabular domains.
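For readers who want to try the recommended Beta calibration, the core idea (Kull et al., 2017) is a logistic regression on the transformed features [ln p, -ln(1-p)], giving the three-parameter map sigma(a·ln p - b·ln(1-p) + c). The sketch below is a simplified illustration: it omits the non-negativity constraint on a and b that the full method enforces, and is not the implementation used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibrator(p_cal, y_cal, eps=1e-6):
    """Simplified beta calibration: logistic regression on [ln p, -ln(1-p)].
    Unconstrained coefficients (the full method requires a, b >= 0)."""
    p = np.clip(p_cal, eps, 1 - eps)
    feats = np.column_stack([np.log(p), -np.log(1 - p)])
    lr = LogisticRegression().fit(feats, y_cal)

    def calibrate(p_new):
        q = np.clip(np.asarray(p_new), eps, 1 - eps)
        f = np.column_stack([np.log(q), -np.log(1 - q)])
        return lr.predict_proba(f)[:, 1]

    return calibrate

# Toy usage: calibrate an overconfident synthetic model.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)
p_raw = np.clip(np.where(y == 1, 0.9, 0.1) + rng.normal(0, 0.05, 1000),
                1e-3, 1 - 1e-3)
cal = fit_beta_calibrator(p_raw, y)
out = cal([0.1, 0.5, 0.9])
assert out.shape == (3,) and np.all((out > 0) & (out < 1))
```

Unlike Platt scaling, the two log-features let the map be asymmetric around 0.5 and push probabilities toward the extremes as well as away from them, which is consistent with Beta calibration's stronger results in the study.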

