CLE-SH: Comprehensive Literal Explanation package for SHapley values by statistical validity


Recently, SHapley Additive exPlanations (SHAP) has been widely utilized in various research domains. This is particularly evident in application fields, where SHAP analysis serves as a crucial tool for identifying biomarkers and assisting in result validation. However, despite its frequent usage, SHAP is often not applied in a manner that maximizes its potential contributions. A review of recent papers employing SHAP reveals that many studies subjectively select a limited number of features as ‘important’ and analyze SHAP values by approximately observing plots without assessing statistical significance. Such superficial application may hinder meaningful contributions to the applied fields. To address this, we propose a library package designed to simplify the interpretation of SHAP values. By simply inputting the original data and SHAP values, our library provides: 1) the number of important features to analyze, 2) the pattern of each feature via univariate analysis, and 3) the interaction between features. All information is extracted based on its statistical significance and presented in simple, comprehensible sentences, enabling users of all levels to understand the interpretations. We hope this library fosters a comprehensive understanding of statistically valid SHAP results.


💡 Research Summary

The paper addresses three pervasive shortcomings in the current use of SHapley Additive exPlanations (SHAP) within biomedical research: (1) the lack of statistical validation for SHAP‑derived feature importance, (2) the difficulty for clinicians and other non‑technical stakeholders to interpret SHAP visualisations, and (3) the under‑utilisation of SHAP’s capacity to reveal non‑linear relationships and feature interactions. To remedy these issues, the authors introduce CLE‑SH (Comprehensive Literal Explanation package for SHAP values), an open‑source Python library that automatically performs statistically rigorous analyses of SHAP values and generates human‑readable textual reports.

Motivation and Background
A rapid literature survey of 38 PubMed articles (April 1–7, 2024) that combined machine learning with SHAP revealed that 84.8% of papers presented only SHAP summary plots, while 81.8% omitted SHAP dependence plots entirely. Feature selection was often arbitrary, with authors reporting a fixed number of top features (e.g., 10, 15, 20) without statistical justification. This practice limits reproducibility and undermines trust among domain experts who are not versed in machine-learning nuances.

Related Work
Existing efforts to bring statistical rigour to SHAP include BorutaSHAP, Shapicant, PowerSHAP, and SHAP‑XR‑T, which compute p‑values or employ conditional randomisation tests. Other studies have examined SHAP stability, uncertainty bounds, or used large language models (LLMs) to generate textual explanations of SHAP plots. However, none of these works provide an end‑to‑end pipeline that (i) determines the number of important features, (ii) classifies feature types, (iii) conducts appropriate univariate statistical tests, (iv) extracts significant interactions, and (v) outputs a concise natural‑language report.

CLE‑SH Workflow
The library follows a four‑stage pipeline:

  1. Automatic Determination of Important Features – SHAP absolute values are examined against a user‑defined significance threshold (default p < 0.05). Features whose SHAP contributions are statistically distinguishable from zero are retained, eliminating the need for arbitrary “top‑k” selections.

  2. Feature‑Type Classification – Each retained feature is automatically labelled as binary, discrete, or continuous. This classification drives the choice of statistical tests in subsequent steps.

  3. Univariate Statistical Analysis

    • Sign Tests: One‑sample t‑test (parametric) or Wilcoxon signed‑rank test (non‑parametric) assess whether SHAP values for a given feature are predominantly positive or negative.
    • Two‑Group Comparisons: For binary features, two‑sample t‑test or Mann‑Whitney U test evaluate differences in SHAP distributions (e.g., male vs. female).
    • Multi‑Group Comparisons: One‑way ANOVA or Kruskal‑Wallis test, followed by post‑hoc Tukey HSD or Dunn’s test, detect differences across more than two categories.
    • Regression on Continuous Features: SHAP values are regressed against the raw feature values using a suite of candidate functions (linear, quadratic, exponential, etc.). The function with a significant coefficient (p < 0.05) and the lowest RMSE is selected, revealing whether the relationship is linear, monotonic, or more complex.
  4. Interaction (Dependence) Analysis – CLE‑SH generates SHAP dependence plots for all pairwise feature combinations. For each plot, the library repeats the appropriate univariate tests on the grouped SHAP values, retaining only those interactions that achieve statistical significance.
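The test-selection logic of stages 2–3 can be sketched as follows. This is an illustrative sketch, not the library's actual API: the function name, the category threshold for "discrete", and the default choice of non-parametric tests are all assumptions.

```python
import numpy as np
from scipy import stats

def univariate_shap_test(feature_values, shap_values, alpha=0.05):
    """Pick a test by feature type and return (test_name, p_value).

    Hypothetical helper: CLE-SH's real interface may differ.
    """
    feature_values = np.asarray(feature_values)
    shap_values = np.asarray(shap_values)
    levels = np.unique(feature_values)

    if len(levels) == 2:
        # Binary feature: compare SHAP distributions between the two groups.
        g0 = shap_values[feature_values == levels[0]]
        g1 = shap_values[feature_values == levels[1]]
        _, p = stats.mannwhitneyu(g0, g1, alternative="two-sided")
        return "Mann-Whitney U", p
    elif len(levels) <= 10:  # assumed cut-off for "discrete"
        # Multi-group comparison across categories.
        groups = [shap_values[feature_values == lv] for lv in levels]
        _, p = stats.kruskal(*groups)
        return "Kruskal-Wallis", p
    else:
        # Continuous feature: sign test on whether SHAP values
        # are predominantly positive or negative (differ from zero).
        _, p = stats.wilcoxon(shap_values)
        return "Wilcoxon signed-rank", p
```

In practice the library also runs the parametric counterparts (t-tests, ANOVA) when normality assumptions hold; the sketch above shows only the non-parametric branch for brevity.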

Report Generation
All statistically significant findings are translated into plain‑language sentences, e.g., “Feature X shows a positive contribution to the model prediction for values greater than 5 (p = 0.003).” This textual summary is bundled with concise visualisations, producing a report that is accessible to clinicians, biologists, and other non‑technical stakeholders.
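A minimal sketch of how a significant test result might be rendered as a sentence. The function name and wording template are hypothetical illustrations, not CLE-SH's actual output format:

```python
def describe_finding(feature, direction, p_value, alpha=0.05):
    """Turn one univariate test result into a plain-language sentence.

    Hypothetical helper for illustration; the real report wording differs.
    """
    if p_value >= alpha:
        return f"{feature}: no statistically significant effect (p = {p_value:.3f})."
    return (f"{feature} shows a {direction} contribution to the model "
            f"prediction (p = {p_value:.3f}).")
```

Mapping structured test outputs onto fixed sentence templates like this is what makes the report deterministic and auditable, in contrast to free-form LLM-generated explanations.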

Experimental Validation
The authors evaluated CLE‑SH on five heterogeneous biomedical tabular datasets (including breast cancer, heart failure, and inflammatory bowel disease) using XGBoost models and 5‑fold cross‑validation. Each dataset achieved high predictive performance (AUC ranging from 0.72 to 0.99). CLE‑SH successfully identified statistically significant features and interactions across all datasets. Notable examples include:

  • In the heart‑failure cohort, the continuous feature “heart rate” displayed a linear positive SHAP trend (p = 0.001).
  • The binary feature “diabetes status” showed a significant SHAP difference between groups (Mann‑Whitney U, p = 0.02).
  • A non‑linear interaction between “age” and “blood pressure” was detected (Kruskal‑Wallis, p = 0.004).

Strengths

  • Statistical Rigor – By embedding a comprehensive suite of parametric and non‑parametric tests, CLE‑SH ensures that reported SHAP insights are reproducible and defensible.
  • Automation and Reproducibility – The entire pipeline—from feature selection to report generation—is fully automated, reducing analyst bias and saving time.
  • Accessibility – Natural‑language summaries democratise SHAP interpretation for domain experts lacking machine‑learning expertise.

Limitations

  • The library is currently demonstrated only with XGBoost; its behaviour with other model families (e.g., neural networks, random forests) remains to be validated.
  • SHAP values themselves can be unstable across different training seeds; while CLE‑SH aggregates SHAP across cross‑validation folds, it does not explicitly model this variability.
  • Multiple‑testing correction (e.g., FDR, Bonferroni) is not integrated, which may inflate false‑positive rates when many features or interactions are examined.
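To illustrate the missing correction, a Benjamini–Hochberg FDR procedure over a set of per-feature p-values could look like this standalone sketch (not part of CLE-SH):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of discoveries under Benjamini-Hochberg FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # BH step-up thresholds: alpha * i / m for the i-th smallest p-value.
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    discoveries = np.zeros(m, dtype=bool)
    if below.any():
        # Reject all hypotheses up to the largest i with p_(i) <= alpha*i/m.
        k = np.max(np.nonzero(below)[0])
        discoveries[order[: k + 1]] = True
    return discoveries
```

Applying such a correction to the per-feature and per-interaction p-values would keep the expected proportion of false discoveries below alpha, at the cost of reduced sensitivity when many features are tested.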

Future Directions
The authors propose extending CLE‑SH to support a broader range of models, incorporating formal multiple‑testing corrections, and coupling the statistical engine with LLM‑based narrative generation for richer, context‑aware explanations.

Conclusion
CLE‑SH represents a novel, end‑to‑end solution that brings statistical validation, automated feature‑type handling, univariate and interaction analyses, and human‑readable reporting to SHAP interpretation. By addressing the key shortcomings of current SHAP practice, it has the potential to become a standard tool for trustworthy AI explanation in biomedical research and beyond.

