Sensitivity of principal Hessian direction analysis
We provide sensitivity comparisons for two competing versions of the dimension reduction method principal Hessian directions (pHd). These comparisons consider the effects of small perturbations on the estimation of the dimension reduction subspace via the influence function. We show that the two versions of pHd can behave completely differently in the presence of certain observational types. Our results also provide evidence that outliers in the traditional sense may or may not be highly influential in practice. Since influential observations may lurk within otherwise typical data, we consider the influence function in the empirical setting for the efficient detection of influential observations in practice.
💡 Research Summary
The paper investigates the robustness of the principal Hessian directions (pHd) method, a popular dimension‑reduction technique for regression, by comparing two commonly used implementations: the original pHd, which works directly with the raw predictor covariance matrix, and a standardized version (pHd‑std), which first scales the predictors to have zero mean and identity covariance. The authors adopt the influence function (IF) framework to quantify how a single observation (x, y) perturbs the estimated reduction subspace. Deriving explicit IF expressions for both versions reveals that the IF of the original pHd contains the leverage term x′Σ_X⁻¹x, making it highly sensitive to high‑leverage points in predictor space, whereas the standardized version replaces this term with the squared Euclidean norm ‖x‖², which substantially dampens leverage effects.
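The contrast between the two terms can be made concrete with a small numerical sketch (illustrative only, not the paper's computation; the covariance matrix and test point below are hypothetical). When the predictor covariance is ill‑conditioned, a point that is extreme along a low‑variance direction has a large leverage term x′Σ_X⁻¹x but only a modest squared norm ‖x‖²:

```python
import numpy as np

# Hypothetical ill-conditioned predictor covariance: variances 25, 1, 0.04.
Sigma = np.diag([25.0, 1.0, 0.04])
Sigma_inv = np.linalg.inv(Sigma)

# A point that is extreme only along the low-variance third coordinate.
x = np.array([0.0, 0.0, 2.0])

# Leverage term appearing in the original pHd influence function.
leverage = x @ Sigma_inv @ x     # 2^2 / 0.04 = 100.0

# Plain squared norm, the analogous quantity for the standardized version.
norm_sq = x @ x                  # 4.0

print(leverage, norm_sq)         # the leverage term is 25x larger here
```

The same observation thus contributes a 25‑fold larger term to the original pHd's influence function than to the standardized version's, which is the dampening effect the summary describes.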
Two canonical outlier scenarios are examined through extensive simulations. In the first scenario, an observation has an extreme predictor value but a typical response; the original pHd’s subspace direction rotates dramatically, leading to large estimation error, while pHd‑std remains relatively stable because the standardization neutralizes the leverage. In the second scenario, the predictor is typical but the response is an outlier; both methods are affected through the residual (y − μ̂), yet the original pHd is slightly less perturbed because its leverage component is small, whereas pHd‑std reflects the response outlier more directly.
The study also contrasts traditional outlier diagnostics (Mahalanobis distance, Cook’s distance) with IF‑based diagnostics. Results show that points flagged by conventional metrics do not necessarily have large IF values, and vice versa. Removing observations with the highest empirical influence function (EIF) values reduces the subspace estimation error by more than 30 %, whereas discarding the top Mahalanobis‑distance points yields negligible improvement. This demonstrates that “hidden” influential observations can be missed by standard outlier screens.
To make the theory actionable, the authors propose a practical EIF‑based detection algorithm. For each data point they compute the EIF, then flag points whose absolute EIF exceeds a multiple (typically 2–3) of the average EIF. The computation scales as O(n p²) and can be efficiently implemented with matrix‑algebra tricks, making it feasible for high‑dimensional settings (p≈100). Empirical applications to a genomics data set (thousands of genes) and an economic indicator data set illustrate that EIF‑based filtering improves downstream pHd‑based regression and classification performance by 9–12 % relative to analyses without filtering.
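The flagging rule can be sketched as follows. The paper's exact EIF formula is not reproduced in this summary, so the sketch below uses a simplified plug‑in proxy: the Frobenius distance between each observation's term in the pHd kernel matrix and the matrix itself. The cost is O(n p²), matching the order reported above; the function name and the cutoff c are illustrative choices.

```python
import numpy as np

def eif_flags(X, y, c=2.0):
    """Flag observations whose empirical influence exceeds c times the
    average. Simplified proxy for the paper's EIF, not its exact formula:
    influence of point i is measured as the Frobenius distance between its
    term (y_i - ybar) xc_i xc_i' and the pHd kernel matrix M."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    # pHd kernel matrix M = mean[(y - ybar) xc xc'], O(n p^2).
    M = (Xc * yc[:, None]).T @ Xc / n
    # Per-observation terms, shape (n, p, p), also O(n p^2).
    terms = yc[:, None, None] * np.einsum("ij,ik->ijk", Xc, Xc)
    eif = np.linalg.norm(terms - M, axis=(1, 2))
    return eif, eif > c * eif.mean()
```

With the default cutoff c = 2, a planted response outlier in otherwise Gaussian data is reliably flagged, while typical points fall well below the threshold.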
The paper concludes that the two pHd variants have fundamentally different sensitivity profiles: the original method is vulnerable to high‑leverage predictor outliers, while the standardized version is more robust to such points but can be more responsive to response outliers. Consequently, practitioners should select the variant that aligns with the expected contamination pattern and complement it with IF‑based influence diagnostics. The authors suggest future work on extending the IF analysis to multivariate responses, developing robust theory for non‑Gaussian predictors, and creating online EIF computation for streaming data environments.