LMM-Lasso: A Lasso Multi-Marker Mixed Model for Association Mapping with Population Structure Correction

Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In simple cases, single polymorphic loci explain a significant fraction of the phenotype variability. However, many traits of interest appear to be subject to multifactorial control by groups of genetic loci instead. Accurate detection of such multivariate associations is nontrivial and often hindered by limited power. At the same time, confounding influences such as population structure cause spurious association signals that result in false positive findings if they are not accounted for in the model. Here, we propose LMM-Lasso, a mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters, effectively controls for population structure and scales to genome-wide datasets. We show practical use in genome-wide association studies and linkage mapping through retrospective analyses. In data from Arabidopsis thaliana and mouse, our method is able to find a genetic cause for significantly greater fractions of phenotype variation in 91% of the phenotypes considered. At the same time, our model dissects this variability into components that result from individual SNP effects and population structure. In addition to this increase of genetic heritability, enrichment of known candidate genes suggests that the associations retrieved by LMM-Lasso are more likely to be genuine.


💡 Research Summary

The paper introduces LMM‑Lasso, a novel statistical framework that simultaneously addresses two major challenges in genome‑wide association studies (GWAS): the need to model multiple causal variants jointly and the necessity to correct for confounding population structure. Traditional GWAS methods either use single‑marker linear mixed models (LMMs) to account for kinship, or apply sparse regression techniques such as the Lasso to select a subset of markers. Applied separately, however, each approach has limitations: single‑marker LMMs ignore the joint effect of many loci, while the Lasso without a random‑effect component is vulnerable to spurious associations caused by hidden relatedness.

LMM‑Lasso integrates these ideas by modeling the phenotype vector y as the sum of a sparse fixed‑effect term, a random‑effect term u, and residual noise. The fixed effects are subject to an L1 penalty identical to that of the classical Lasso, encouraging sparsity among the SNP coefficients β. The random effect u follows a multivariate normal distribution with covariance σ²_g K, where K is a genome‑wide kinship matrix derived from all available SNPs; together with independent noise of variance σ²_e, this gives the phenotype a total covariance Σ = σ²_g K + σ²_e I. The kinship term captures population structure, cryptic relatedness, and other unobserved confounders, allowing the model to separate true genetic signals from background correlation.
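Written out, the model described above takes roughly the following form. This is a sketch under assumptions: the symbols X (genotype matrix), ε (residual noise), and λ (penalty weight) are introduced here for illustration and may differ from the paper's notation.

```latex
\begin{aligned}
y &= X\beta + u + \varepsilon,
  \qquad u \sim \mathcal{N}(0,\ \sigma_g^2 K),
  \qquad \varepsilon \sim \mathcal{N}(0,\ \sigma_e^2 I), \\
y \mid \beta &\sim \mathcal{N}(X\beta,\ \Sigma),
  \qquad \Sigma = \sigma_g^2 K + \sigma_e^2 I, \\
\hat{\beta} &= \arg\min_{\beta}\ \tfrac{1}{2}\,(y - X\beta)^{\top} \Sigma^{-1} (y - X\beta) + \lambda \lVert \beta \rVert_1 .
\end{aligned}
```

The last line is the usual penalized form of a Gaussian likelihood combined with an L1 penalty; setting K to zero recovers the ordinary Lasso up to a rescaling of λ.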

A key methodological contribution is that the L1 penalty is incorporated in a Bayesian fashion as a Laplace prior, eliminating the need for an external tuning parameter λ. Instead, λ is effectively learned during the maximum‑a‑posteriori (MAP) estimation. The authors employ an iterative algorithm that alternates between updating the sparse fixed‑effect coefficients (using a coordinate‑descent‑like step) and estimating the variance components σ²_g and σ²_e (via restricted maximum likelihood or a variational approximation). By pre‑computing the eigen‑decomposition of K, the algorithm reduces the computational cost to O(N p) per iteration, where N is the number of individuals and p the number of SNPs, making it feasible for genome‑wide datasets containing hundreds of thousands of markers.
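The role of the eigen‑decomposition can be illustrated with a short NumPy sketch. It is a simplified two‑step variant, not the authors' implementation: the variance ratio δ = σ²_e / σ²_g is fitted once on a null model by a maximum‑likelihood grid search (rather than REML or alternating updates), the data are rotated and rescaled so that the residual covariance becomes diagonal, and a plain coordinate‑descent Lasso is run on the transformed data. The function names (`fit_delta`, `lasso_cd`, `lmm_lasso_sketch`) and the explicit `lam` argument are hypothetical and chosen only for this example.

```python
import numpy as np

def fit_delta(s, y_rot, grid=np.logspace(-5, 5, 101)):
    """Grid-search ML estimate of delta = sigma_e^2 / sigma_g^2 on the null model."""
    n = y_rot.size

    def neg_log_lik(delta):
        d = s + delta
        sigma_g2 = np.mean(y_rot ** 2 / d)  # genetic variance profiled out in closed form
        return 0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d)) + n + n * np.log(sigma_g2))

    return min(grid, key=neg_log_lik)

def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal operator of the L1 penalty)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()  # residual y - X @ beta, kept up to date incrementally
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def lmm_lasso_sketch(K, X, y, lam):
    """Rotate by the eigenvectors of K, rescale, then fit a Lasso on the transformed data."""
    s, U = np.linalg.eigh(K)          # K = U diag(s) U^T, computed once
    s = np.clip(s, 0.0, None)         # guard against tiny negative eigenvalues
    y_rot = U.T @ y
    delta = fit_delta(s, y_rot)
    w = 1.0 / np.sqrt(s + delta)      # rescaling that makes the covariance proportional to I
    X_w = (U.T @ X) * w[:, None]
    y_w = y_rot * w
    return lasso_cd(X_w, y_w, lam), delta
```

After the rotation, each coordinate update touches one column of the transformed genotype matrix, so a full sweep costs on the order of N·p operations, which is consistent with the per‑iteration cost quoted above. Note that the rescaling changes the magnitude of the data, so a given `lam` is not directly comparable with the penalty of an unwhitened Lasso.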

The method was evaluated on two publicly available datasets: Arabidopsis thaliana (a plant model with relatively homogeneous population structure) and Mus musculus (laboratory mouse strains with complex breeding histories). In both cases LMM‑Lasso was benchmarked against three alternatives: (1) a standard single‑marker LMM, (2) a plain Lasso regression ignoring kinship, and (3) a multi‑marker mixed model (MLMM) that iteratively adds SNPs without an explicit sparsity penalty. Across 91% of the phenotypes examined, LMM‑Lasso explained a larger proportion of phenotypic variance (heritability) than any comparator, with an average increase of about 12 percentage points. Importantly, the SNPs selected by LMM‑Lasso were enriched for known candidate genes, indicating that the method not only boosts statistical power but also improves biological relevance.

The authors also dissect the variance components estimated by the model. The random‑effect term often captured a substantial fraction of the total variance (up to 45 % in the mouse data), reflecting the strong influence of population structure. The sparse fixed‑effect term, however, identified a concise set of loci whose individual effects together accounted for a meaningful additional portion of heritability. This decomposition provides researchers with interpretable insights: which signals are likely driven by true causal variants and which are attributable to shared ancestry or environmental similarity.
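One way to make such a decomposition concrete is sketched below. It follows a common convention (variance of the fitted SNP predictions, average diagonal of the scaled kinship matrix, and noise variance) and is an assumption of this example rather than the estimator used in the paper; the function name `variance_fractions` and its dictionary keys are hypothetical.

```python
import numpy as np

def variance_fractions(X, beta, K, sigma_g2, sigma_e2):
    """Split the modeled phenotypic variance into SNP, population-structure, and noise shares."""
    var_fixed = np.var(X @ beta)                       # variance of the sparse SNP predictions
    var_random = sigma_g2 * np.trace(K) / K.shape[0]   # average variance of u ~ N(0, sigma_g2 * K)
    var_noise = sigma_e2                               # variance of the independent residual
    total = var_fixed + var_random + var_noise
    return {
        "snp_effects": var_fixed / total,
        "population_structure": var_random / total,
        "noise": var_noise / total,
    }
```

The three shares sum to one by construction, so they can be read directly as fractions of the variance the fitted model accounts for.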

Limitations discussed include reduced sensitivity to very rare variants, as the L1 penalty tends to shrink small effect sizes toward zero, and the reliance on a single kinship matrix, which may not fully capture complex, non‑linear gene‑by‑environment interactions. The authors suggest future extensions such as kernel‑based random effects, hierarchical Bayesian priors for improved rare‑variant detection, and integration with functional annotation to guide sparsity patterns.

In summary, LMM‑Lasso offers a tuning‑free, computationally scalable solution that unifies multi‑locus association mapping with robust correction for population structure. Its ability to increase explained heritability while delivering biologically plausible candidate loci makes it a valuable addition to the GWAS toolbox, especially for studies where both polygenic architecture and cryptic relatedness are prominent.