Discriminant Analysis with Adaptively Pooled Covariance
Linear and quadratic discriminant analysis (LDA/QDA) are common tools for classification problems. For these methods we assume observations are normally distributed within each group. We estimate a mean and covariance matrix for each group and classify using Bayes' theorem. With LDA, we estimate a single, pooled covariance matrix, while for QDA we estimate a separate covariance matrix for each group. Rarely do we believe in a homogeneous covariance structure between groups, but often there is insufficient data to separately estimate covariance matrices. We propose L1-PDA, a regularized model which adaptively pools elements of the precision matrices. Adaptively pooling these matrices decreases the variance of our estimates (as in LDA), without overly biasing them. In this paper, we propose and discuss this method, give an efficient algorithm to fit it for moderate-sized problems, and show its efficacy on real and simulated datasets.
💡 Research Summary
The paper introduces a novel regularized discriminant analysis method called L1‑Pooled Discriminant Analysis (L1‑PDA), which bridges the gap between Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). Under the standard two‑class Gaussian assumption, LDA pools the covariance matrices of the two classes, yielding low‑variance estimates but potentially large bias when the true covariances differ. QDA estimates a separate covariance for each class, reducing bias but suffering from high variance when sample sizes are limited.
L1‑PDA is built on the observation that, for many variable pairs, the corresponding entries of the precision matrices (the inverses of the covariances) are nearly equal across classes. Formally, let Δ = (Σ₁⁻¹ – Σ₂⁻¹)/2. The method imposes an ℓ₁‑norm penalty on Δ, encouraging element‑wise sparsity in the difference of the two precision matrices. The resulting convex optimization problem can be written as
min_{Σ₁,Σ₂} –ℓ₁(μ₁,Σ₁) – ℓ₂(μ₂,Σ₂) + λ‖Σ₁⁻¹ – Σ₂⁻¹‖₁
subject to Σ₁, Σ₂ ≽ 0,
where ℓ_k denotes the Gaussian log‑likelihood for class k and μ_k are the sample means (which can be fixed without affecting the covariance solution). When λ = 0 the solution coincides with QDA; when λ is sufficiently large (λ ≥ λ_max = n₁n₂‖S₁ – S₂‖_∞/(n₁+n₂)) the solution collapses to the LDA estimate. Thus, by varying λ, one obtains a continuous path from fully quadratic to fully linear discriminant rules.
The authors derive the Karush‑Kuhn‑Tucker (KKT) conditions, revealing two key properties: (1) the pooled covariance average S_pool = (n₁S₁ + n₂S₂)/(n₁+n₂) remains unchanged for any λ, and (2) the sparsity of Δ is directly controlled by λ through a sub‑gradient term. These conditions also provide closed‑form expressions for the extreme solutions (λ = 0 and λ ≥ λ_max).
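The two boundary quantities above are easy to compute directly from sample statistics. The sketch below forms S_pool and λ_max from simulated data; note that it assumes ‖·‖_∞ denotes the element-wise maximum absolute entry (the summary does not pin down the norm) and uses MLE (divide-by-n) covariances, so treat it as an illustration rather than the paper's exact convention:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 4, 60, 40

# Simulated samples for two classes with different covariance structure
X1 = rng.standard_normal((n1, p))
X2 = rng.standard_normal((n2, p)) @ np.diag([1.0, 2.0, 0.5, 1.5])

S1 = np.cov(X1, rowvar=False, bias=True)  # MLE covariance (divide by n)
S2 = np.cov(X2, rowvar=False, bias=True)

# Pooled average: per the KKT conditions, invariant along the lambda path
S_pool = (n1 * S1 + n2 * S2) / (n1 + n2)

# Threshold at which the solution collapses to LDA
# (element-wise max-abs norm assumed for ||.||_inf)
lam_max = n1 * n2 * np.max(np.abs(S1 - S2)) / (n1 + n2)
print(lam_max > 0.0)
```

Any λ ≥ lam_max would then return the single pooled estimate S_pool for both classes.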
To solve the problem efficiently, the paper adopts the Alternating Direction Method of Multipliers (ADMM). Introducing auxiliary variables A = Σ₁⁻¹, B = Σ₂⁻¹, C = A – B, and a dual matrix Γ, each ADMM iteration consists of: (i) updating A and B via eigen‑decomposition and a simple scalar shrinkage formula, (ii) updating C by element‑wise soft‑thresholding with parameter λ/ρ, and (iii) updating the dual variable Γ. The dominant computational cost is O(p³) per iteration due to the eigen‑decompositions, which is practical for problems with a few hundred features.
A particularly insightful contribution is the reinterpretation of the discriminant model as a forward (predictive) logistic regression with interaction terms. By applying Bayes’ theorem, the posterior log‑odds can be expressed as
logit P(y=1|x) = β₀ + βᵀx + ½ xᵀΓx,
where β₀ = log(π₁/π₂) + ½ log(|Σ₂|/|Σ₁|) + ½(μ₂ᵀΣ₂⁻¹μ₂ – μ₁ᵀΣ₁⁻¹μ₁), β = Σ₁⁻¹μ₁ – Σ₂⁻¹μ₂, and Γ = Σ₂⁻¹ – Σ₁⁻¹ (the log‑determinant ratio in the intercept comes from the normalizing constants of the two Gaussian densities). In LDA, Γ = 0, yielding a linear decision boundary; in QDA, Γ is dense, giving a fully quadratic boundary. L1‑PDA forces many off‑diagonal entries of Γ to zero, thereby selecting a sparse set of pairwise interactions while retaining all main effects. This connects the method to a growing literature on sparse interaction estimation in high‑dimensional logistic models.
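This identity can be verified numerically: the closed form β₀ + βᵀx + ½xᵀΓx must equal the log posterior-odds computed directly from the two Gaussian densities. The parameter values below are purely illustrative; note that the intercept needs the ½ log(|Σ₂|/|Σ₁|) term contributed by the density normalizing constants.

```python
import numpy as np

def gauss_logpdf(x, mu, Sigma):
    """Multivariate normal log-density, including the log-determinant term."""
    p = len(mu)
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

# Hypothetical class parameters (illustrative values only)
mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
S1 = np.array([[1.0, 0.3], [0.3, 2.0]])
S2 = np.array([[2.0, -0.2], [-0.2, 1.0]])
pi1, pi2 = 0.6, 0.4

P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
beta0 = (np.log(pi1 / pi2)
         + 0.5 * (np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1])
         + 0.5 * (mu2 @ P2 @ mu2 - mu1 @ P1 @ mu1))
beta = P1 @ mu1 - P2 @ mu2
Gamma = P2 - P1

x = np.array([0.7, -1.2])
logit_closed_form = beta0 + beta @ x + 0.5 * x @ Gamma @ x
logit_direct = (np.log(pi1) + gauss_logpdf(x, mu1, S1)
                - np.log(pi2) - gauss_logpdf(x, mu2, S2))
print(np.isclose(logit_closed_form, logit_direct))  # True: the two agree
```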
The paper compares L1‑PDA with two related approaches: (a) Regularized Discriminant Analysis (RDA) by Friedman, which blends LDA and QDA covariances via a convex combination of eigenvalues and is basis‑invariant; and (b) Sparse LDA methods, which either assume diagonal covariances or impose sparsity on Σ⁻¹(μ₁ – μ₂) to obtain a linear rule that uses only a subset of variables. L1‑PDA differs by targeting sparsity in the interaction matrix Γ rather than in the variables themselves, making it more appropriate when the goal is to simplify the decision surface rather than to perform variable selection. Empirical studies on simulated and real data sets demonstrate that, for moderate dimensions (p < n₁ + n₂), L1‑PDA achieves higher classification accuracy and more interpretable interaction structures than both RDA and Sparse LDA.
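For contrast with L1-PDA's element-wise pooling, RDA's blending is a single global convex combination. A common (α, γ) parameterization is sketched below; the exact convention varies between references, so this is illustrative rather than Friedman's precise formulation:

```python
import numpy as np

def rda_covariance(S_k, S_pool, alpha, gamma=0.0):
    """RDA-style class covariance: blend the class estimate with the pooled
    one, optionally shrinking toward a scaled identity (eigenvalue shrinkage).

    alpha = 0 recovers the QDA covariance, alpha = 1 the LDA covariance;
    gamma > 0 adds ridge-like regularization toward (tr(S)/p) * I.
    """
    p = S_k.shape[0]
    S_blend = (1.0 - alpha) * S_k + alpha * S_pool
    return (1.0 - gamma) * S_blend + gamma * (np.trace(S_blend) / p) * np.eye(p)
```

Every entry of the RDA covariance moves toward the pooled value at the same rate α, whereas L1-PDA lets λ pool some precision entries exactly while leaving others class-specific.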
The authors also discuss the well‑posedness of the problem. If either class covariance estimate is singular, the QDA solution is undefined; however, as long as the pooled covariance S_pool is full rank, the L1‑PDA solution exists for any λ > 0, providing robustness in low‑sample regimes.
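The low-sample scenario is easy to reproduce: when one class has fewer observations than features, its sample covariance is singular and the QDA precision does not exist, yet the pooled matrix can still be full rank. A minimal illustration, with hypothetical sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 10, 5, 40   # class 1 has fewer samples than features

X1 = rng.standard_normal((n1, p))
X2 = rng.standard_normal((n2, p))
S1 = np.cov(X1, rowvar=False, bias=True)   # rank < p: not invertible
S2 = np.cov(X2, rowvar=False, bias=True)
S_pool = (n1 * S1 + n2 * S2) / (n1 + n2)

print(np.linalg.matrix_rank(S1) < p)       # True: QDA for class 1 is undefined
print(np.linalg.matrix_rank(S_pool) == p)  # True: the pooled matrix is full rank
```

In this regime any λ > 0 borrows enough strength across classes for the penalized solution to remain well defined.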
In summary, L1‑PDA offers a principled, convex framework for adaptively pooling precision matrices, delivering a tunable bias‑variance trade‑off, a clear interpretation in terms of sparse interaction selection, and an efficient ADMM implementation suitable for moderate‑scale discriminant analysis tasks.