Aggregate Models, Not Explanations: Improving Feature Importance Estimation


Feature-importance methods show promise in transforming machine learning models from predictive engines into tools for scientific discovery. However, due to data sampling and algorithmic stochasticity, expressive models can be unstable, leading to inaccurate variable importance estimates and undermining their utility in critical biomedical applications. Although ensembling offers a solution, deciding whether to explain a single ensemble model or aggregate individual model explanations is difficult due to the nonlinearity of importance measures and remains largely understudied. Our theoretical analysis, developed under assumptions accommodating complex state-of-the-art ML models, reveals that this choice is primarily driven by the model’s excess risk. In contrast to prior literature, we show that ensembling at the model level provides more accurate variable-importance estimates, particularly for expressive models, by reducing this leading error term. We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.


💡 Research Summary

The paper addresses a critical obstacle in using machine‑learning models for scientific discovery: the instability of feature‑importance estimates caused by data sampling variability and algorithmic stochasticity, especially in highly expressive models such as deep neural networks and random forests. While ensembling is a well‑known remedy for predictive performance, it is unclear whether one should (i) compute importance scores for each constituent model and then average them (the “sub‑models” approach) or (ii) first aggregate the predictions into a single ensemble predictor and then compute importance on that predictor (the “ensemble” approach). Because most importance measures (LOCO, Conditional Feature Importance, SHAP, etc.) are defined as differences of risks that involve non‑linear loss functions, the two strategies are not mathematically equivalent.
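The two strategies can be made concrete with a minimal toy sketch of LOCO under squared loss. The linear learner, bootstrap resampling, and zero-noise setup below are illustrative stand-ins (not the paper's models or data); the point is only that averaging scores and scoring the average are different computations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 400, 10
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.1 * rng.normal(size=n)            # only feature 0 matters
X_test = rng.normal(size=(n, 2))
y_test = X_test[:, 0] + 0.1 * rng.normal(size=n)

def fit(X, y, seed):
    """Stand-in for a stochastic learner: least squares on a bootstrap resample."""
    idx = np.random.default_rng(seed).integers(0, len(y), len(y))
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta

risk = lambda preds: np.mean((y_test - preds) ** 2)  # squared-loss test risk

# K sub-models trained on all features, and K retrained without feature 0.
full = [X_test @ fit(X, y, s) for s in range(K)]
drop = [X_test[:, 1:] @ fit(X[:, 1:], y, s) for s in range(K)]

# (i) sub-models approach: compute LOCO per model, then average the scores.
loco_sub = np.mean([risk(d) - risk(f) for f, d in zip(full, drop)])

# (ii) ensemble approach: average predictions first, then compute one LOCO score.
loco_ens = risk(np.mean(drop, axis=0)) - risk(np.mean(full, axis=0))

print(loco_sub, loco_ens)
```

Because the risk is a nonlinear (quadratic) function of the predictions, `loco_sub` and `loco_ens` generally differ, which is exactly the non-equivalence the paper analyzes.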

The authors develop a theoretical framework that decomposes the total estimation error of risk‑based importance measures into (A) test‑set variance and (E) excess risk. Excess risk is further split into approximation, estimation, and optimization components (E_app, E_est, E_opt). Prior work assumed that excess risk is asymptotically negligible, focusing only on term A; this assumption fails for modern over‑parameterized models where convergence rates are slow and optimization error dominates. By introducing two mild assumptions—loss consistency (the learned model’s loss converges to the Bayes loss) and finite variance of the Bayes loss—the authors prove that the risk‑estimation error satisfies

R_n(f̂) − R(f*) = E + O_p(n^(−1/2)).

Thus, when E is large, the dominant source of error is the bias introduced by the model itself, not sampling noise.

The key insight is that the ensemble approach directly reduces the excess risk of the predictor because averaging predictions mitigates both estimation and optimization errors. Consequently, the risk difference used by importance measures becomes more accurate, leading to a substantial bias reduction. In contrast, the sub‑models approach only reduces variance by averaging importance scores but cannot eliminate the bias inherent in each individual model.
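For convex losses such as squared error, the benefit of averaging predictions is a consequence of Jensen's inequality: the risk of the averaged predictor never exceeds the average risk of the members. A small synthetic illustration (the noisy predictors below stand in for independently trained sub-models; none of this is the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=1000)                          # targets
# K = 10 noisy predictors of y, each carrying independent 'optimization' noise
preds = y + rng.normal(scale=0.5, size=(10, 1000))

mse = lambda p: np.mean((y - p) ** 2)
avg_member_risk = np.mean([mse(p) for p in preds])  # mean risk of the sub-models
ensemble_risk = mse(preds.mean(axis=0))             # risk of the averaged predictor

print(avg_member_risk, ensemble_risk)               # ensemble risk is lower
```

Averaging cancels the independent noise components, so the excess risk of the ensemble predictor shrinks roughly by the ensemble size, while the per-model bias left in each sub-model's importance score is untouched by averaging the scores themselves.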

Empirical validation is performed on classical benchmark datasets and a large‑scale proteomic cohort from the UK Biobank (tens of thousands of samples, thousands of features). The authors train expressive models (deep nets, random forests, gradient‑boosted trees) with multiple random initializations and bootstrap samples, forming ensembles of 5‑20 members. They evaluate LOCO, Conditional Feature Importance, and SHAP under both aggregation strategies, measuring (i) mean‑squared error of risk differences, (ii) stability of feature rankings (Kendall τ), and (iii) concordance with known biological markers.
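The ranking-stability metric used above can be computed with a standard Kendall τ routine. A sketch on synthetic importance scores (the ground-truth scores and noise scale here are hypothetical, not taken from the paper's experiments):

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(2)
true_imp = np.linspace(1.0, 0.0, 20)       # hypothetical ground-truth importances

# Importance scores from two independent training runs, perturbed by noise.
run_a = true_imp + rng.normal(scale=0.1, size=20)
run_b = true_imp + rng.normal(scale=0.1, size=20)

tau, _ = kendalltau(run_a, run_b)          # rank agreement between the two runs
print(f"ranking stability (Kendall tau): {tau:.2f}")
```

A τ near 1 means the two runs order the features almost identically; noisier or more biased importance estimates push τ toward 0.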

Across all settings, the ensemble‑based importance consistently outperforms the sub‑models average. Bias reductions of 15–35% are observed, and ranking stability improves markedly. Notably, for deep networks, where optimization error is large, an ensemble of fewer than ten models cuts this error by more than 40%, bringing the importance estimates close to the Bayes‑optimal values. In the UK Biobank analysis, ensemble‑based SHAP correctly prioritizes established cardiovascular risk proteins, whereas the sub‑models approach frequently inflates the importance of noisy features.

The paper also critiques earlier theoretical work that required Donsker‑class assumptions or convergence rates of O(n^(−1/4)), which are unrealistic for modern high‑dimensional learners. By relying on weaker, more realistic assumptions, the authors’ analysis applies broadly to contemporary ML pipelines.

In conclusion, the study demonstrates that for risk‑based feature‑importance methods, aggregating predictions at the model level (model‑level ensembling) is fundamentally superior to aggregating individual importance scores. This strategy reduces the dominant excess‑risk bias, yielding more reliable and biologically meaningful explanations, and should become the default practice when interpreting expressive machine‑learning models in biomedical research.

