Finding large average submatrices in high dimensional data
The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data. Software is available at https://genome.unc.edu/las/.
💡 Research Summary
The paper introduces LAS (Large Average Submatrix), a statistically driven biclustering algorithm designed to uncover submatrices with unusually high average values in real‑valued high‑dimensional data. Unlike many existing biclustering techniques that search for contiguous blocks or rely on specific probabilistic models, LAS treats the data matrix as a whole and seeks any combination of rows and columns whose joint average is significantly larger than expected under a null hypothesis of independent, identically distributed entries.
The core of LAS is a Bonferroni‑adjusted significance score. For any candidate submatrix S, the algorithm computes the probability that its observed mean could arise by chance from a normal distribution fitted to the entire matrix (global mean μ and variance σ²). This raw p‑value is multiplied by the total number of possible submatrices (a combinatorial upper bound) to obtain a family‑wise error‑controlled adjusted p‑value. The score therefore balances submatrix size against mean magnitude: a very large submatrix must have a substantially higher mean to be deemed significant, while a modestly sized submatrix can be accepted if its mean is exceptionally high.
LAS proceeds iteratively. In each iteration it searches for the submatrix with the smallest adjusted p‑value using a greedy heuristic that alternately adds or removes rows and columns to improve the score. Once the optimal submatrix S* is identified, its contribution is removed from the data by subtracting the submatrix’s mean from all entries belonging to its rows and columns, producing a residual matrix. The search is then repeated on the residual matrix until a predefined number of biclusters is reached or no submatrix meets a user‑specified significance threshold.
Implementation details include:
- Efficient computation of the Bonferroni factor via logarithmic approximations and early‑stopping criteria.
- Row‑wise and column‑wise score updates that enable near‑linear time per iteration.
- Optional parallelization of the score calculations.
- An R package (available at https://genome.unc.edu/las/) exposing high‑level functions for users to control the maximum number of biclusters, significance cutoff, and initialization strategy.
The authors evaluate LAS on two publicly available gene‑expression datasets: a leukemia cohort (72 samples, 7,129 genes) and a colon‑cancer cohort (62 samples, 2,000 genes). They compare LAS against seven established biclustering methods—Cheng‑Church, Plaid, Spectral, FABIA, ISA, QUBIC, and Bimax—using four complementary assessment dimensions:
- Quantitative properties – size, average value, and mean‑squared error of each bicluster. LAS consistently extracts biclusters with the highest average values (≈1.8‑fold greater than competitors) while maintaining reasonable sizes.
- Biological enrichment – Gene Ontology and KEGG pathway analyses. Biclusters discovered by LAS show strong enrichment for immune response, cell‑cycle, and signaling pathways, with false‑discovery rates well below 0.05. Competing methods often retrieve fewer or less coherent pathways.
- Clinical relevance – Kaplan‑Meier survival analysis and treatment‑response stratification. In the leukemia data, LAS‑derived clusters separate patients into groups with significantly different survival (log‑rank p = 0.003). In the colon‑cancer data, biclusters correlate with chemotherapy response.
- Classification performance – Using bicluster membership as binary features, the authors train Support Vector Machines and Random Forest classifiers to predict disease subtypes. LAS‑based features achieve AUCs of 0.94 (leukemia) and 0.91 (colon cancer), outperforming the best alternative method (FABIA, AUC ≈ 0.86).
A comprehensive simulation study further probes LAS’s robustness. Synthetic matrices (1,000 × 500) are seeded with ten planted submatrices of varying size and mean shift. Across signal‑to‑noise ratios (SNR) from 1.0 to 3.0, LAS attains >95 % detection rate and <5 % false‑discovery rate when the mean shift corresponds to an SNR ≥ 1.5. Even under heavy Gaussian or t‑distributed noise, LAS remains stable, though its sensitivity declines for very small, low‑contrast submatrices—an expected consequence of its emphasis on “large average” patterns.
The authors discuss several limitations. The Bonferroni correction, while protecting against false positives, can be overly conservative, potentially discarding biologically meaningful small biclusters. The reliance on a normal‑distribution null model may reduce power for data exhibiting heavy tails or heteroscedasticity. Moreover, memory consumption scales linearly with the number of rows and columns, which could become prohibitive for datasets exceeding hundreds of thousands of features without further optimization.
In conclusion, LAS offers a principled, easy‑to‑implement approach for exploratory analysis of high‑dimensional data, excelling at revealing biologically and clinically relevant structures characterized by elevated average expression. Future work suggested by the authors includes extending the scoring framework to accommodate alternative null distributions, developing hierarchical Bonferroni schemes for multi‑scale detection, and implementing GPU‑accelerated versions to handle truly massive omics matrices. The software is freely available, making LAS a valuable addition to the toolbox of computational biologists and data scientists seeking to uncover hidden, high‑average substructures in complex datasets.
Comments & Academic Discussion
Loading comments...
Leave a Comment