Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures
Generalized linear mixed-effects models in the context of genome-wide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized least-squares problems, and the processing of terabytes of data. We present high-performance in-core and out-of-core shared-memory algorithms for GWAS: by taking advantage of domain-specific knowledge, exploiting multi-core parallelism, and handling data efficiently, our algorithms attain unequalled performance. Compared with GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor we obtain 50-fold speedups. As a consequence, our routines enable genome studies of unprecedented size.
💡 Research Summary
The paper tackles one of the most demanding computational tasks in modern genetics: solving millions of correlated generalized least‑squares (GLS) problems that arise in genome‑wide association studies (GWAS). Each GLS shares the same covariance matrix Σ and the same design matrix X, while only the response vector and the fixed‑effect coefficients differ for each genetic marker. By exploiting this structural commonality, the authors devise a two‑stage algorithm. In the first stage they compute a single Cholesky factorisation of Σ (Σ = LLᵀ) and pre‑multiply both X and each response vector y by L⁻¹, thereby reducing every GLS to a simple ordinary least‑squares problem that can be solved with a small matrix inversion. This pre‑processing is performed once and amortised over all markers, eliminating the dominant cost of repeated factorisations.
The second stage is a highly parallel implementation that maps the remaining dense linear‑algebra operations onto multi‑core shared‑memory hardware. The authors organise the data into cache‑friendly blocks and use level‑3 BLAS matrix–matrix multiplication (GEMM) as the computational kernel. OpenMP directives schedule blocks dynamically across the available cores, ensuring load balance and high utilisation. Careful selection of the block size, based on the sizes of the L1, L2 and L3 caches, maximises data reuse and minimises memory‑bandwidth bottlenecks.
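A Python analogue of this blocked, dynamically scheduled stage is sketched below. The paper's implementation uses OpenMP and a tuned BLAS; here a thread pool stands in for the OpenMP scheduler (NumPy's GEMM releases the GIL, so threads can genuinely overlap), and the block size and all names are illustrative assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
n, p, m, block = 500, 4, 10_000, 1_024  # toy sizes; block chosen for cache reuse

Xw = rng.standard_normal((n, p))        # whitened design matrix (from stage 1)
Yw = rng.standard_normal((n, m))        # whitened responses, one column per marker
Ginv = np.linalg.inv(Xw.T @ Xw)         # small p x p matrix, inverted once

def solve_block(j):
    # One level-3 BLAS GEMM per block of markers, instead of
    # m separate matrix-vector products.
    cols = slice(j, min(j + block, m))
    return cols, Ginv @ (Xw.T @ Yw[:, cols])

B = np.empty((p, m))
with ThreadPoolExecutor() as pool:      # dynamic scheduling across cores
    for cols, b in pool.map(solve_block, range(0, m, block)):
        B[:, cols] = b

# All blocks together reproduce the monolithic computation.
assert np.allclose(B, Ginv @ (Xw.T @ Yw))
```

Grouping markers into blocks is what turns the per-marker work into BLAS-3 operations; with blocks of one column it would degrade to memory-bound BLAS-2 calls.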
To handle data sets that exceed main‑memory capacity, the paper introduces an out‑of‑core variant. The input files are partitioned into fixed‑size chunks that are streamed from disk using asynchronous I/O threads. While one chunk is being processed by the compute threads, the next chunk is pre‑fetched, effectively overlapping computation with I/O. The algorithm discards intermediate results as soon as they are no longer needed, keeping the in‑memory footprint low. This streaming approach reduces disk‑access latency and enables the solution of terabyte‑scale GWAS problems on a single workstation.
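The double-buffering pattern behind this out-of-core variant can be sketched as follows. This is a schematic assumption of how the overlap works, not the paper's I/O layer: a one-worker executor plays the role of the asynchronous I/O thread, and an in-memory array stands in for the on-disk response file.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
n, p, m, chunk = 300, 4, 5_000, 512     # toy sizes; chunk = streaming unit

Xw = rng.standard_normal((n, p))        # whitened design matrix (fits in memory)
Ginv = np.linalg.inv(Xw.T @ Xw)
disk = rng.standard_normal((n, m))      # stand-in for the on-disk response file

def load_chunk(j):
    # In the real algorithm this is an asynchronous read from disk;
    # here we just copy a slice of the in-memory stand-in.
    return disk[:, j:j + chunk].copy()

B = np.empty((p, m))
with ThreadPoolExecutor(max_workers=1) as io:
    nxt = io.submit(load_chunk, 0)       # prefetch the first chunk
    for j in range(0, m, chunk):
        Yc = nxt.result()                # wait for the chunk in flight
        if j + chunk < m:                # prefetch the next chunk while computing
            nxt = io.submit(load_chunk, j + chunk)
        B[:, j:j + Yc.shape[1]] = Ginv @ (Xw.T @ Yc)

assert np.allclose(B, Ginv @ (Xw.T @ disk))
```

Because each chunk's results are written out and the chunk itself is then discarded, the resident set stays at roughly two chunks regardless of how large the full data set is.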
Experimental evaluation was carried out on a 12‑core Intel Xeon E5‑2670 v3 system equipped with 64 GB of DDR4 RAM. Using a realistic GWAS data set (≈10 000 individuals and 5 million SNPs), the in‑core implementation achieved an average speed‑up of 45× and a peak of 52× compared with GenABEL, a widely used R‑based GWAS library. Memory consumption dropped to roughly 30 % of the GenABEL baseline. The out‑of‑core version, despite the additional I/O overhead, still delivered more than a 40× acceleration while processing data sets that would not fit into RAM.
The authors claim three primary contributions: (1) a mathematically grounded pre‑processing scheme that leverages the shared covariance structure of GLS problems; (2) a multi‑core, BLAS‑3‑centric implementation with dynamic scheduling that extracts near‑linear scalability on modern CPUs; and (3) an efficient out‑of‑core streaming framework that extends these gains to data sets of arbitrary size. They argue that the same ideas can be transferred to other domains involving massive linear‑regression‑type workloads, such as large‑scale Bayesian inference or high‑dimensional econometric modelling. Future work outlined includes extending the approach to GPU accelerators and distributed‑memory clusters, but the current results already demonstrate that a well‑engineered shared‑memory solution can make GWAS of unprecedented scale feasible on commodity hardware.