Inference for Spiked Eigenstructure under Generalized Covariance and Correlation Models
In high-dimensional principal component analysis, important inferential targets include both leading spikes and the associated principal eigenspaces. Such problems arise naturally in high-dimensional factor models, where leading principal directions are interpreted as dominant loading directions and spike magnitudes reflect the strength of the corresponding common factors. We study inference based on the sample covariance matrix $\mathbf{S}$ and the sample correlation matrix $\widehat{\mathbf{R}}$ under generalized spiked models with arbitrary bulk spectrum. We establish almost sure limits and central limit theorems for spiked sample eigenvalues, and derive asymptotic distributions for functionals of sample spiked eigenspaces. Building on this theory, we develop procedures for one-sample inference for benchmark principal directions and for two-sample comparison of leading spike strengths across populations. Even in the covariance setting, our results substantially extend the existing literature by allowing a non-identity bulk structure. A real-data analysis on stock returns further illustrates the practical relevance of the proposed procedures, showing that covariance-based and correlation-based PCA can lead to markedly different conclusions.
💡 Research Summary
This paper develops a unified asymptotic theory for the spiked eigenstructure of both the sample covariance matrix $\mathbf{S}$ and the sample correlation matrix $\widehat{\mathbf{R}}$ in high-dimensional settings where the number of variables $p$ and the sample size $n$ grow proportionally. Unlike most existing work, which assumes an "identity bulk" (i.e., the non-spiked eigenvalues are all equal to one), the authors allow an arbitrary, non-degenerate bulk spectral distribution. The population model is written as $Y = \Gamma X$ (covariance case) or $Y = G X$ (correlation case), where $X$ has i.i.d. entries with zero mean, unit variance, and finite fourth moment. The population covariance (or correlation) matrix is decomposed into a low-rank spiked part and a bulk part.
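To make the setup concrete, the following is a minimal simulation sketch of the covariance-case model $Y = \Gamma X$ with a non-identity bulk. Everything specific here is an assumption for illustration and is not taken from the paper: Gaussian entries for $X$, a two-atom bulk, a single spike, a random orthogonal eigenbasis, row centering, and the $1/(n-1)$ normalization. The function names are hypothetical.

```python
import numpy as np

def simulate_spiked_sample(n, p, spikes, bulk_values, rng):
    """Draw a p x n data matrix Y = Gamma X whose population covariance
    Gamma Gamma^T has the given spikes plus a non-identity bulk."""
    k = len(spikes)
    # Bulk eigenvalues cycle through bulk_values (illustrative choice).
    bulk = np.resize(np.asarray(bulk_values, dtype=float), p - k)
    pop_eigs = np.concatenate([np.asarray(spikes, dtype=float), bulk])
    # Random orthogonal eigenbasis, so the covariance is not diagonal.
    u, _ = np.linalg.qr(rng.standard_normal((p, p)))
    gamma = u * np.sqrt(pop_eigs)          # equals U @ diag(sqrt(pop_eigs))
    x = rng.standard_normal((p, n))        # i.i.d. entries, mean 0, variance 1
    return gamma @ x

def sample_cov_and_corr(y):
    """Sample covariance S and sample correlation R-hat of a p x n matrix."""
    n = y.shape[1]
    yc = y - y.mean(axis=1, keepdims=True)  # centering is an assumption here
    s = yc @ yc.T / (n - 1)
    d = 1.0 / np.sqrt(np.diag(s))
    r = s * np.outer(d, d)                  # rescale to unit diagonal
    return s, r

rng = np.random.default_rng(0)
Y = simulate_spiked_sample(n=200, p=100, spikes=[20.0],
                           bulk_values=[1.0, 2.0], rng=rng)
S, R = sample_cov_and_corr(Y)
top_S = np.linalg.eigvalsh(S)[-1]  # largest eigenvalue of S
top_R = np.linalg.eigvalsh(R)[-1]  # largest eigenvalue of R-hat
print(top_S, top_R)
```

The two leading eigenvalues live on different scales because the correlation matrix divides out the per-variable variances, which is the normalization effect the summary discusses later.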
Four technical assumptions are imposed: (A1) finite fourth moment, (A2) weak convergence of the empirical bulk spectral distribution to a deterministic limit $H_R$ (or $H_\Sigma$), (A3) a fixed aspect ratio $y = p/n$, and (A4) a separability condition that each spike $\alpha_k$ lies outside the support of the bulk and satisfies $\phi_y'(\alpha_k) > 0$. Under these conditions the authors first prove almost-sure limits for the sample spiked eigenvalues: each sample eigenvalue associated with a population spike $\alpha_k$ converges to $\phi_{y,n}(\alpha_k)$, where $\phi_{y,n}$ is a deterministic transform defined via the Stieltjes transform of the bulk law.
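The summary does not spell out the transform, so the sketch below assumes the form that is standard in the generalized spiked covariance literature, $\phi_y(\alpha) = \alpha\bigl(1 + y \int \frac{t}{\alpha - t}\, dH(t)\bigr)$, with derivative $\phi_y'(\alpha) = 1 - y \int \frac{t^2}{(\alpha - t)^2}\, dH(t)$. The paper's exact $\phi_{y,n}$ may differ. The bulk law $H$ is taken to be discrete, and the function names are hypothetical.

```python
import numpy as np

def phi(alpha, y, atoms, weights):
    """Assumed spike transform: alpha * (1 + y * sum_j w_j t_j / (alpha - t_j))."""
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return alpha * (1.0 + y * np.sum(weights * atoms / (alpha - atoms)))

def phi_prime(alpha, y, atoms, weights):
    """Derivative of the assumed transform; positivity is the (A4)-type check."""
    atoms = np.asarray(atoms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return 1.0 - y * np.sum(weights * atoms**2 / (alpha - atoms) ** 2)

# Identity bulk (H = point mass at 1): reduces to alpha + y * alpha / (alpha - 1).
limit_identity = phi(3.0, 0.5, atoms=[1.0], weights=[1.0])

# Two-atom bulk, to illustrate a non-identity bulk spectrum.
limit_mixture = phi(20.0, 0.5, atoms=[1.0, 2.0], weights=[0.5, 0.5])
separated = phi_prime(20.0, 0.5, atoms=[1.0, 2.0], weights=[0.5, 0.5]) > 0
print(limit_identity, limit_mixture, separated)
```

Under the identity bulk, the derivative condition reduces to the familiar threshold $\alpha > 1 + \sqrt{y}$, which gives a quick sanity check on the assumed formula.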
The second major contribution is a set of central limit theorems (CLTs) for the fluctuations of the spiked eigenvalues. For a spike of multiplicity $m_k$, the vector of centered, $\sqrt{n}$-scaled eigenvalues converges in distribution to the eigenvalues of an $m_k \times m_k$ Gaussian random matrix $G_k$. When all spikes are simple, the joint limit is a $K$-dimensional zero-mean Gaussian vector with an explicit covariance matrix that depends on the spike strengths, the bulk distribution, the fourth-moment parameter $\nu_4$, and the left/right singular vectors of the population matrix. Analogous results hold for the covariance case by replacing $G, R, H_R$ with $\Gamma, \Sigma, H_\Sigma$.
Beyond eigenvalues, the paper focuses on projection-type functionals that are invariant to the sign ambiguity of eigenvectors. For a given unit vector $P$, the statistic $T_{R,k}(P) = P^\top \widehat{\Pi}_k P$ (or $T_{\Sigma,k}(P)$) measures the contribution of the $k$-th spiked eigenspace to the direction $P$, where $\widehat{\Pi}_k$ denotes the projector onto that sample eigenspace. Using the eigenvalue CLTs, the authors derive the asymptotic distribution of these projection statistics, which is again Gaussian with a variance that incorporates the same model ingredients.
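As a small illustration of why such functionals sidestep the sign ambiguity, the sketch below computes $P^\top \widehat{\Pi} P$ from the eigenvectors of a symmetric matrix. The function name and the indexing convention (index 0 is the largest eigenvalue) are assumptions made for this example, not the paper's notation.

```python
import numpy as np

def projection_statistic(mat, direction, idx):
    """Compute P^T Pi_hat P, where Pi_hat projects onto the eigenvectors of
    the symmetric matrix `mat` listed in `idx` (0 = largest eigenvalue)."""
    p_vec = np.asarray(direction, dtype=float)
    p_vec = p_vec / np.linalg.norm(p_vec)   # enforce unit length
    _, vecs = np.linalg.eigh(mat)           # eigenvalues in ascending order
    vecs = vecs[:, ::-1]                    # reorder to descending
    u = vecs[:, list(idx)]                  # p x m basis of the eigenspace
    coords = u.T @ p_vec                    # coordinates of P in that basis
    # Sum of squares, so flipping the sign of any eigenvector changes nothing.
    return float(coords @ coords)

M = np.diag([5.0, 2.0, 1.0])
t_aligned = projection_statistic(M, [1.0, 0.0, 0.0], idx=[0])
t_half = projection_statistic(M, [1.0, 1.0, 0.0], idx=[0])
t_plane = projection_statistic(M, [1.0, 1.0, 0.0], idx=[0, 1])
print(t_aligned, t_half, t_plane)
```

The statistic always lies in $[0, 1]$: it equals 1 when $P$ lies inside the eigenspace and 0 when $P$ is orthogonal to it.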
The theoretical results are then turned into concrete inferential procedures. First, a one-sample test for a benchmark principal direction is constructed by comparing $T_{R,k}(P)$ (or $T_{\Sigma,k}(P)$) to its Gaussian limit. Second, a two-sample test for equality of leading spike strengths across two populations is proposed. The test statistic is based on the difference of the normalized sample eigenvalues from the two groups; its asymptotic variance follows from the joint CLT for the two sets of eigenvalues. Both procedures are fully data-driven because the limiting variance can be consistently estimated from the observed data.
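The general shape of such a two-sample comparison can be sketched as a plain z-test. This is not the paper's statistic: the normalization of the eigenvalues and the variance estimators are not given in the summary, so the sketch takes the normalized estimates and their plug-in asymptotic variances as hypothetical inputs and additionally assumes the two samples are independent.

```python
import math

def two_sample_spike_z_test(est1, est2, avar1, avar2, n1, n2):
    """Two-sided z-test for equality of two spike-strength estimates.

    est1, est2   : normalized spike estimates from each sample (inputs)
    avar1, avar2 : plug-in asymptotic variances of sqrt(n) * (estimate - target)
    n1, n2       : sample sizes; the samples are assumed independent
    """
    se = math.sqrt(avar1 / n1 + avar2 / n2)
    z = (est1 - est2) / se
    # Two-sided Gaussian p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Made-up numbers, purely to exercise the function.
z_stat, p_val = two_sample_spike_z_test(2.0, 0.0, 1.0, 1.0, 2, 2)
print(z_stat, p_val)
```

In practice the variance inputs would come from consistent estimators of the limiting variance, which is what makes the procedure data-driven.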
A particularly insightful part of the work is the analysis of the effect of normalization (i.e., moving from covariance-based PCA to correlation-based PCA). The authors show that the first-order limits (the deterministic spike locations) are unchanged by normalization, but the second-order fluctuations differ substantially. The normalization introduces additional terms involving the fourth-moment parameter $\nu_4$ and the bulk spectral shape, leading to different asymptotic variances for eigenvalues and projection statistics. Consequently, correlation-based PCA is not merely a preprocessing step; it fundamentally alters the inferential properties of the principal components.
Simulation studies confirm the theoretical predictions across a range of bulk spectra (including Marčenko–Pastur and mixtures) and spike strengths (weak, moderate, strong). The empirical distributions of the normalized eigenvalues and projection statistics match the Gaussian limits even for moderate sample sizes.
The methodology is illustrated on a real‑world dataset of daily returns for 500 S&P 500 stocks over 1,000 trading days. Covariance‑based PCA identifies a few dominant components driven by large‑cap stocks, whereas correlation‑based PCA highlights sector‑level patterns after removing scale differences. The two analyses lead to markedly different interpretations of the underlying risk factors, underscoring the practical relevance of the theoretical findings.
In summary, the paper makes four major contributions: (1) a unified random‑matrix framework for spiked eigenstructure under arbitrary bulk spectra, (2) almost‑sure limits and CLTs for both eigenvalues and projection‑type eigenspace functionals, (3) feasible one‑sample and two‑sample inference procedures that work for both covariance and correlation PCA, and (4) a clear exposition of how normalization changes second‑order behavior, with concrete implications for high‑dimensional data analysis. The results substantially broaden the scope of high‑dimensional PCA inference and open avenues for future work on heavy‑tailed data, dynamic spiked models, and integration with high‑dimensional regression.