Structured, sparse regression with application to HIV drug resistance
We introduce a new version of forward stepwise regression. Our modification finds solutions to regression problems where the selected predictors appear in a structured pattern, with respect to a predefined distance measure over the candidate predictors. Our method is motivated by the problem of predicting HIV-1 drug resistance from protein sequences. We find that our method improves the interpretability of drug resistance while producing comparable predictive accuracy to standard methods. We also demonstrate our method in a simulation study and present some theoretical results and connections.
💡 Research Summary
The paper introduces a novel variant of forward stepwise regression that explicitly incorporates a predefined distance measure among candidate predictors, thereby enforcing a structured sparsity pattern on the selected variables. The authors motivate this approach with the problem of predicting HIV‑1 drug resistance from protein sequence data, where mutations that are close to each other in the amino‑acid chain or three‑dimensional structure often act jointly.
Methodologically, the authors define a distance function d(i,j) on the set of p predictors. At each step of the forward selection, a weight wj is assigned to every candidate j based on its distance to the already selected set S (e.g., the average or minimum distance). The regression objective becomes
L(β) = ‖y – Xβ‖²₂ + λ ∑_{j=1}^{p} wj |βj|,
where λ controls overall sparsity and the distance‑based weights bias the algorithm toward adding predictors that are close to those already chosen. The algorithm proceeds exactly as ordinary forward stepwise regression, but after each inclusion the weights are recomputed and the coefficients are refitted. This yields a “distance‑weighted L1 penalty” that encourages the selected variables to form clusters in the predefined metric space.
The authors provide theoretical results: (1) a bound on the probability that the selected set satisfies the distance constraints, (2) a convergence analysis showing that the distance‑aware procedure can converge faster than the standard greedy method, and (3) an argument that the additional structure reduces the effective model complexity, thereby limiting over‑fitting.
Empirical evaluation uses HIV‑1 protease and reverse transcriptase sequences. Each sequence is encoded into a high‑dimensional binary/one‑hot matrix (≈500 predictors) and the response is the measured IC₅₀ for several FDA‑approved drugs. The proposed method is compared against ordinary forward stepwise regression, Lasso, and Elastic Net, with λ and the distance‑scale τ tuned by cross‑validation.
Results show two key advantages. First, predictive performance (mean‑squared error and R²) is on par with, and in a few cases modestly better than, the benchmark methods (5–8 % MSE reduction for certain drugs). Second, the selected predictors exhibit a clear spatial pattern: they cluster around known functional regions of the enzymes (e.g., active‑site residues), whereas standard sparsity methods pick isolated positions scattered across the sequence. This clustering greatly improves interpretability for virologists and drug designers, who can now see contiguous mutation “hot‑spots” associated with resistance.
A simulation study with synthetic data that embeds a known cluster structure confirms that the distance‑aware forward stepwise recovers the true clusters with an 85 % success rate, compared to 60 % for the unstructured version.
The paper also discusses limitations. The choice of distance metric is critical and must be guided by domain expertise; combining multiple distances (sequence, structural, functional) is a promising extension. Moreover, the current framework is linear; integrating the same structured penalty into non‑linear models such as kernel regressors or deep neural networks is left for future work.
In summary, the authors present a principled, computationally efficient method for imposing structured sparsity in regression problems. By leveraging a distance‑based weighting scheme within a forward stepwise framework, they achieve models that are both predictive and biologically interpretable. The HIV‑1 drug‑resistance case study demonstrates the practical value of the approach, and the underlying ideas are applicable to any domain where predictors possess a meaningful relational geometry (e.g., genomics, proteomics, spatial epidemiology, environmental monitoring).
Comments & Academic Discussion
Loading comments...
Leave a Comment