Detection of Multiple Influential Observations on Model Selection
Outlying observations are frequently encountered across a wide spectrum of scientific domains, posing notable challenges to the generalizability of statistical models and the reproducibility of downstream analyses. They are identified through influence diagnostics, which aim to capture observations that unduly bias model estimation. To date, methods for identifying observations that influence the selection of a stochastically chosen submodel have been underdeveloped, especially in the high-dimensional setting where the number of predictors $p$ exceeds the sample size $n$. We recently proposed an improved diagnostic measure to handle this setting; however, its distributional properties and approximations have not yet been explored. To address this shortcoming, we revisit the notion of exchangeability to determine the exact asymptotic distribution of our assessment measure. This foundation enables the introduction of theoretically supported parametric and nonparametric approaches for distributional approximation and the derivation of thresholds for outlier identification. The resulting framework is further extended to logistic regression models and evaluated in comprehensive simulation studies comparing the performance of various detection methods. Finally, the framework is applied to data from a task-based fMRI study of thermal pain, with the goal of identifying outliers that distort the formulation of the statistical model using functional brain activity to predict physical pain ratings. Both linear and logistic models are used to demonstrate the benefits of detection and to compare the performance of different detection procedures. In particular, we identify two influential observations that were not detected in prior studies.
💡 Research Summary
The paper addresses the problem of detecting observations that exert undue influence on the stochastic selection of sub‑models in high‑dimensional regression and logistic regression settings, where the number of predictors p exceeds the sample size n. Building on earlier work that introduced the high‑dimensional influence measure (HIM), multiple influential points (MIP) procedure, DF(LASSO) and its generalization GDF, the authors first revisit the exchangeability property of the diagnostic statistics. They show that each GDF (or DF(LASSO)) value can be written as a sum of p exchangeable Bernoulli variables, ξij, and invoke de Finetti’s representation theorem to prove that, as p → ∞, the distribution of these sums converges to a finite mixture of binomial distributions. This result (Theorem 1) provides a rigorous asymptotic distributional foundation that was previously missing; the earlier reliance on a central‑limit‑theorem approximation required min(n,p) → ∞ and did not capture the discrete nature of the statistics.
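The mixture-of-binomials limit in Theorem 1 can be illustrated with a small simulation. The sketch below (not the paper's code; the mixture weights and success probabilities are illustrative) generates sums of exchangeable Bernoulli indicators via the two-step de Finetti construction, where a latent success probability is drawn first and the binomial count second, and checks that the empirical distribution of the sums matches the corresponding finite mixture of binomial pmfs:

```python
import numpy as np
from math import comb

# Hedged sketch of the structure behind Theorem 1: each tau_i is a sum of
# p exchangeable Bernoulli indicators xi_ij. By de Finetti's theorem, such a
# sum is binomial with a *random* success probability -- here a two-point
# mixture (weights w and probabilities thetas are hypothetical choices).
rng = np.random.default_rng(0)
p = 50                   # number of indicators per sum
n = 200_000              # number of simulated tau values
w = [0.7, 0.3]           # mixture weights (illustrative)
thetas = [0.05, 0.40]    # component success probabilities (illustrative)

# Two-step generation: draw a latent component, then the binomial count.
comp = rng.choice(len(w), size=n, p=w)
tau = rng.binomial(p, np.take(thetas, comp))

# Exact mixture-of-binomials pmf for comparison.
def mixture_pmf(k):
    return sum(wi * comb(p, k) * t**k * (1 - t)**(p - k)
               for wi, t in zip(w, thetas))

empirical = np.bincount(tau, minlength=p + 1) / n
exact = np.array([mixture_pmf(k) for k in range(p + 1)])
max_abs_err = np.abs(empirical - exact).max()
print(f"max |empirical - exact| pmf gap: {max_abs_err:.4f}")
```

Note that the indicators generated this way are exchangeable but not independent, which is exactly why a single binomial (or a plain CLT normal) cannot serve as the limiting law.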
Armed with this theoretical insight, the authors develop two families of distributional approximations for the common distribution Fτ of the GDF values. The parametric approach fits six candidate families: Conway‑Maxwell‑Binomial (CMB), Conway‑Maxwell‑Poisson (CMP), beta‑binomial (BB), generalized Poisson (GP), mixtures of binomials (MB), and mixtures of Poissons (MP). Theorem 2 establishes that, despite the lack of independence among τi, maximum‑likelihood estimation remains consistent: exchangeability guarantees that the empirical distribution converges to Fτ, and the MLE minimizes the Kullback–Leibler divergence between the true distribution and the chosen parametric family. The authors also discuss the practical need for a “mid‑quantile” rather than a conventional quantile, because the τi are integer‑valued; the mid‑quantile smooths the discrete jumps and yields asymptotically normal estimators.
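The mid-quantile idea can be made concrete with a short sketch (a generic implementation in the spirit of the mid-distribution function, not the paper's code). The mid-cdf G(x) = F(x) − 0.5·P(X = x) removes half of each discrete jump, and the sample mid-quantile inverts G by linear interpolation between the distinct observed values:

```python
import numpy as np

# Hedged sketch of a sample mid-quantile for integer-valued statistics.
# G(x) = F(x) - 0.5 * P(X = x) is the mid-cdf; we invert it by linear
# interpolation over the distinct sample values.
def mid_quantile(x, q):
    x = np.asarray(x)
    vals, counts = np.unique(x, return_counts=True)
    pmf = counts / counts.sum()
    cdf = np.cumsum(pmf)
    mid_cdf = cdf - 0.5 * pmf          # G evaluated at each distinct value
    if q <= mid_cdf[0]:
        return float(vals[0])
    if q >= mid_cdf[-1]:
        return float(vals[-1])
    return float(np.interp(q, mid_cdf, vals))

# Example: a right-skewed integer sample; a high mid-quantile could serve
# as a detection threshold for unusually large diagnostic values.
rng = np.random.default_rng(1)
tau = rng.binomial(50, 0.08, size=1000)  # synthetic integer-valued diagnostics
threshold = mid_quantile(tau, 0.95)
print(threshold)
```

Unlike the conventional empirical quantile, which jumps between integers, this estimator varies smoothly in q, which is what makes asymptotic normality attainable for discrete data.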
The non‑parametric route relies on bootstrap schemes designed for exchangeable data. Three bootstrap variants are proposed: simple random resampling, block‑wise exchangeable resampling, and a deletion‑based scheme that mimics the random‑group‑deletion strategy of MIP. By repeatedly resampling the τi values and computing their mid‑quantiles, an empirical threshold is obtained without imposing any parametric form. The authors provide guidance on when to prefer each method: parametric mixtures are recommended when over‑dispersion or mixture structure is suspected, while the bootstrap is reserved for very large n or when the parametric assumptions are doubtful.
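The simplest of the three schemes, simple random resampling, can be sketched as follows (illustrative only: the replication count B, the 0.95 level, the averaging rule, and the use of a plain quantile in place of the mid-quantile are simplifying assumptions, not the paper's exact procedure):

```python
import numpy as np

# Hedged sketch of a bootstrap detection threshold for exchangeable tau
# values: resample with replacement, recompute an upper quantile in each
# replicate, and average the replicates into an empirical threshold.
def bootstrap_threshold(tau, level=0.95, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    tau = np.asarray(tau)
    reps = np.empty(B)
    for b in range(B):
        resample = rng.choice(tau, size=tau.size, replace=True)
        reps[b] = np.quantile(resample, level)
    return reps.mean()

rng = np.random.default_rng(2)
tau = rng.binomial(50, 0.08, size=300)   # synthetic integer-valued diagnostics
thr = bootstrap_threshold(tau)
flagged = np.flatnonzero(tau > thr)      # observations exceeding the threshold
print(thr, flagged.size)
```

The block-wise and deletion-based variants would differ only in the resampling step, drawing exchangeable blocks or mimicking MIP's random group deletions rather than resampling individual values.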
Methodologically, the new approximations are incorporated into the ClusMIP algorithm, which originally combined clustering with the MIP detection strategy. The revised algorithm now supports both linear and logistic regression models, and the authors release an R package (ClusMIP) that lets users select among the six parametric families or the three bootstrap options, and to specify the desired significance level via the mid‑quantile.
A comprehensive simulation study evaluates the performance across twelve scenarios varying p/n ratios, signal‑to‑noise levels, sparsity, and the proportion of influential observations. Results show that the CMP and beta‑binomial families most accurately capture the true distribution when over‑dispersion is present, achieving higher true‑positive rates and lower false‑positive rates than the original HIM/MIP methods. The bootstrap approaches perform comparably when n is large, but at a higher computational cost.
The methodology is applied to a task‑based fMRI dataset collected to predict physical pain ratings from brain activation patterns (≈ 3000 voxels, 84 subjects). Both linear and logistic models are fitted, and the enhanced ClusMIP procedure identifies two additional influential subjects that were missed by earlier analyses. Removing these subjects improves model fit (lower RMSE for linear regression) and predictive discrimination (higher AUC for logistic regression), and stabilizes variable selection across resamples.
In conclusion, the paper delivers a solid probabilistic foundation for high‑dimensional model‑selection influence diagnostics, proposes practical parametric and non‑parametric thresholding techniques, and demonstrates their superiority through simulations and a real neuroimaging application. The work broadens the toolkit for researchers dealing with outlier‑driven heterogeneity in high‑dimensional predictive modeling, and suggests future extensions to non‑Gaussian errors, multi‑response settings, and online outlier detection.