Conformal Prediction for Compositional Data
Dirichlet regression models are suitable for compositional data, in which the response variable represents proportions that sum to one. However, there are still no well-established methods for constructing valid prediction sets in this context, especially considering the geometry of the compositional space. In this work, we investigate conformal prediction-based strategies for constructing valid predictive regions in Dirichlet regression models. We evaluate three distinct approaches: a method based on quantile residuals, an approximate construction of highest density regions (HDR), and an adaptation of the approximate HDR using grid-based discretization over the simplex. The performance of the methods was analyzed through simulation studies under different scenarios, varying the model complexity, response dimensionality, and covariate structure. The results indicated that the HDR approximation approach exhibits good robustness in terms of coverage, while the grid discretization proved effective in reducing overcoverage and the area of the prediction region compared to the original method. The quantile method provided larger prediction regions compared to the grid method, while maintaining adequate coverage. The methodologies were also applied to two real datasets: one concerning sleep stages and another on biomass allocation in plants. In both cases, the proposed methods demonstrated practical feasibility and produced coherent interpretations within the compositional space. Finally, we discuss possible extensions of this work.
💡 Research Summary
This paper addresses the lack of reliable predictive‑interval methods for compositional data, that is, vectors of positive components that sum to one, by developing conformal prediction (CP) procedures tailored to Dirichlet regression models. Compositional data appear in many scientific fields (ecology, medicine, geology, etc.) and are naturally modeled on the (D‑1)‑simplex Δ⁽ᴰ⁾. While Dirichlet regression provides a flexible parametric framework (mean vector μ and precision φ), existing approaches for constructing prediction regions rely on asymptotic normality or bootstrap resampling, or ignore the geometry of the simplex, leading to inaccurate coverage or overly conservative intervals.
The authors adopt the split‑conformal prediction (SCP) paradigm, which splits the data into a training set (used to fit the Dirichlet regression) and a calibration set (used to compute non‑conformity scores). SCP guarantees marginal coverage ≥ 1 − α in finite samples under the weak exchangeability assumption, making it attractive for Dirichlet models where heteroscedasticity is intrinsic (variance depends on μ and φ).
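The calibration step of split conformal prediction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `conformal_threshold` and `in_prediction_region`, the toy score, and the calibration values are all hypothetical, and the fitted Dirichlet regression is replaced by a fixed predicted mean.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample conformal quantile of the calibration scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    or infinity when k exceeds n (too few calibration points).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def in_prediction_region(score_fn, x, y, threshold):
    """A candidate y is in the region when its score is at most the threshold."""
    return score_fn(x, y) <= threshold


# Hypothetical stand-in for a fitted model: a fixed predicted mean composition.
predicted_mean = (0.2, 0.3, 0.5)


def toy_score(x, y):
    """Toy non-conformity score: largest absolute deviation from the mean."""
    return max(abs(yj - mj) for yj, mj in zip(y, predicted_mean))


# Made-up calibration responses (x is unused by the toy score).
calibration_y = [
    (0.25, 0.30, 0.45), (0.10, 0.35, 0.55), (0.20, 0.25, 0.55),
    (0.30, 0.30, 0.40), (0.15, 0.30, 0.55), (0.22, 0.28, 0.50),
    (0.18, 0.40, 0.42), (0.05, 0.30, 0.65), (0.20, 0.33, 0.47),
]
cal_scores = [toy_score(None, y) for y in calibration_y]
threshold = conformal_threshold(cal_scores, alpha=0.2)

# Membership check for one candidate composition.
print(in_prediction_region(toy_score, None, (0.21, 0.29, 0.50), threshold))
```

The prediction region for a new x is the set of compositions whose score does not exceed the threshold. Taking the ceil((n + 1)(1 − α))-th order statistic, rather than the plain empirical quantile, is the usual finite-sample correction behind the marginal coverage guarantee.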
Three distinct non‑conformity scores are proposed:
- Quantile‑Residual Score – For each component j, the conditional marginal distribution under Dirichlet regression is Beta(μⱼ·φ, (1‑μⱼ)·φ). Using the fitted parameters, the authors compute the standard‑normal quantile residual r_qij = Φ⁻¹(F_Beta(y_ij; μ̂_ij, φ̂_i)) for observation i and component j. The overall score is the maximum absolute residual across components, s(xᵢ, yᵢ) = max_j |r_qij|. This yields component‑wise intervals I_j =
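The quantile-residual score can be illustrated with a standard-library-only Python sketch. It is a rough illustration under assumptions, not the paper's code: the Beta CDF is evaluated with a textbook continued-fraction expansion of the regularized incomplete beta function, the names `beta_cdf` and `quantile_residual_score` are hypothetical, and the fitted values μ̂ and φ̂ are passed in directly rather than estimated.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the inverse CDF


def _beta_continued_fraction(a, b, x, max_iter=200, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break  # converged; otherwise the last iterate is returned
    return h


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) at x via the regularized incomplete beta function."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def quantile_residual_score(y, mu, phi):
    """max_j |Phi^{-1}(F_Beta(y_j; mu_j * phi, (1 - mu_j) * phi))|."""
    residuals = []
    for yj, mj in zip(y, mu):
        p = beta_cdf(yj, mj * phi, (1.0 - mj) * phi)
        # Clip so the normal inverse CDF stays finite (an added safeguard).
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        residuals.append(_STD_NORMAL.inv_cdf(p))
    return max(abs(r) for r in residuals)


# Hypothetical fitted mean and precision for one observation.
mu_hat, phi_hat = (0.2, 0.3, 0.5), 20.0
near = quantile_residual_score((0.2, 0.3, 0.5), mu_hat, phi_hat)
far = quantile_residual_score((0.6, 0.3, 0.1), mu_hat, phi_hat)
print(near < far)
```

A composition close to the fitted mean receives a small score, while one far from it receives a large score on at least one component. That ordering is what the conformal calibration step relies on.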