Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet the benchmarks used to evaluate them (TabArena, TALENT, and others) still rely almost exclusively on point-estimate metrics (RMSE, $R^2$). This mismatch implicitly rewards models that estimate the conditional mean well while ignoring the quality of the predicted distribution. We make two contributions. First, we propose supplementing standard point metrics with proper scoring rules (CRPS, CRLS, and the Interval Score) and provide a head-to-head comparison of realTabPFNv2.5 and TabICLv2 under these scoring rules across 20 OpenML regression datasets. Second, we show analytically and empirically that different proper scoring rules induce different model rankings and different inductive biases during training, even though each rule is individually minimized by the true distribution. Fine-tuning realTabPFNv2.5 with scoring rules not seen during pretraining (CRLS, $\beta = 1.8$ energy score) yields consistent improvements on the corresponding metrics, confirming that the training loss shapes the model beyond what propriety alone guarantees. Together, these findings argue for (i) reporting distributional metrics in tabular regression benchmarks and (ii) making the training objective of foundation models adaptable (via fine-tuning or task-token conditioning) to the scoring rule relevant to the downstream decision problem.


💡 Research Summary

The paper addresses a critical mismatch between the capabilities of modern tabular foundation models—such as TabPFN and TabICL—and the way they are currently evaluated. While these models output full predictive distributions (often as discretized histograms), the dominant benchmarks (TabArena, TALENT, etc.) rely almost exclusively on point‑estimate metrics like RMSE and R². This focus rewards models that approximate the conditional mean well but completely ignores the quality of the entire distribution, which is essential for downstream decision‑making that depends on uncertainty, multimodality, or asymmetric risk.

To remedy this, the authors propose augmenting standard regression benchmarks with strictly proper scoring rules (PSRs), namely the Continuous Ranked Probability Score (CRPS), the Continuous Ranked Logarithmic Score (CRLS), a β‑energy score, and the Interval Score (IS). By definition, a PSR is minimized in expectation only by the true predictive distribution, guaranteeing that a model trained to minimize a given PSR will, asymptotically, recover the correct distribution. However, the paper emphasizes that at finite sample sizes the choice of PSR matters because each induces a distinct gradient structure, sample‑efficiency profile, and inductive bias. For example, the logarithmic score’s gradient blows up in low‑density regions, making training unstable, whereas CRPS yields bounded gradients that treat all quantiles equally.
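As an illustrative sketch (not the paper's implementation), the CRPS of a sample-based predictive distribution — such as draws from a model's predicted histogram — can be estimated via the energy-form identity $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$, where $X, X'$ are independent draws from $F$:

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Monte Carlo CRPS estimate from i.i.d. draws of the predictive distribution F.

    Uses the identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    # Pairwise absolute differences between all sample pairs (O(n^2) memory).
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)

# Sanity check: for N(0, 1) evaluated at y = 0, the closed form gives about 0.2337.
rng = np.random.default_rng(0)
draws = rng.normal(size=2000)
print(round(crps_from_samples(draws, 0.0), 3))
```

Note the bounded-gradient property mentioned above: the per-sample contribution $|x - y|$ has gradient magnitude at most 1 in $y$, in contrast to the logarithmic score.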

The authors conduct a systematic empirical study on 20 OpenML regression datasets. They compare two state‑of‑the‑art tabular foundation models—realTabPFN v2.5 and TabICL v2—using the original pre‑training loss (CRPS) as a baseline, and then fine‑tune each model with alternative losses: CRLS and a β‑energy score with β = 1.8. Evaluation is performed with all four PSRs as well as traditional point‑estimate metrics. The results reveal three key findings:
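The β-energy fine-tuning loss can be sketched in its univariate sample-based form (the function name and setup below are illustrative, not the authors' code): $\mathrm{ES}_\beta(F, y) = \mathbb{E}|X - y|^\beta - \tfrac{1}{2}\mathbb{E}|X - X'|^\beta$, which is proper for $\beta \in (0, 2)$ and recovers CRPS at $\beta = 1$:

```python
import numpy as np

def energy_score_from_samples(samples: np.ndarray, y: float, beta: float = 1.8) -> float:
    """Univariate beta-energy score estimated from samples.

    ES_beta(F, y) = E|X - y|^beta - 0.5 * E|X - X'|^beta; beta = 1 gives CRPS.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y) ** beta)
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]) ** beta)
    return float(term1 - term2)

rng = np.random.default_rng(1)
draws = rng.normal(size=3000)
print(energy_score_from_samples(draws, 0.0, beta=1.0))  # CRPS special case
print(energy_score_from_samples(draws, 0.0, beta=1.8))  # the paper's fine-tuning setting
```

Raising β toward 2 weights large deviations more heavily, which is one concrete way the choice of scoring rule changes the inductive bias of training.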

  1. Ranking Sensitivity – Model rankings differ markedly across PSRs. A model that tops the CRPS leaderboard may fall near the bottom under the β‑energy score, demonstrating that the choice of evaluation metric can fundamentally alter which model is deemed superior.

  2. Fine‑tuning Benefits – Fine‑tuning with a loss that matches the evaluation PSR yields consistent improvements on that metric (e.g., CRLS‑fine‑tuned TabPFN improves CRLS by ~5‑7%). Moreover, these gains often transfer modestly to other PSRs, indicating that the loss function shapes the learned distribution beyond the theoretical propriety guarantee.

  3. Point‑Estimate Pitfalls – In datasets with multimodal or heavy‑tailed responses, RMSE/R² can be misleading: a model may achieve a low RMSE while placing its mean prediction in a region with no observed data. Proper scoring rules correctly penalize such miscalibration, highlighting the inadequacy of point‑estimate‑only leaderboards.
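The Interval Score makes the miscalibration penalty in finding 3 concrete. A minimal sketch of the standard (Winkler) form for a central $(1 - \alpha)$ prediction interval, with illustrative numbers assuming a standard-normal predictive distribution:

```python
def interval_score(lower: float, upper: float, y: float, alpha: float = 0.1) -> float:
    """Winkler interval score for a central (1 - alpha) prediction interval [lower, upper].

    Rewards narrow intervals via the width term and penalizes observations
    falling outside the interval, scaled by 2 / alpha.
    """
    width = upper - lower
    below = (2.0 / alpha) * max(lower - y, 0.0)
    above = (2.0 / alpha) * max(y - upper, 0.0)
    return width + below + above

# Illustrative 90% interval for a standard normal predictive distribution.
print(interval_score(-1.645, 1.645, 0.0))  # covered: score equals the interval width
print(interval_score(-1.645, 1.645, 2.0))  # missed: width + (2 / 0.1) * (2.0 - 1.645)
```

A model whose mean lands between two modes can look fine under RMSE, but its intervals will be either implausibly wide or frequently violated, and the Interval Score penalizes both.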

The paper also discusses practical implications. Different downstream tasks may prioritize different aspects of the predictive distribution (e.g., asymmetric cost, tight prediction intervals). Accordingly, practitioners should select a PSR that reflects their risk structure and, ideally, fine‑tune the foundation model with a matching loss. The authors suggest a “task‑token” conditioning mechanism that would allow a single pre‑trained model to switch its loss function at inference time, avoiding costly retraining for each new decision context.

Finally, the authors call for a paradigm shift in tabular regression benchmarking: (i) include distributional metrics (CRPS, CRLS, β‑energy, IS) alongside traditional point metrics; (ii) make the training objective adaptable, either through fine‑tuning or dynamic loss conditioning, to align with the scoring rule most relevant to the downstream problem. By doing so, the community can fully exploit the probabilistic outputs of tabular foundation models and avoid the misleading conclusions that arise from point‑estimate‑centric evaluation.

