MCA: Multiresolution Correlation Analysis, a graphical tool for subpopulation identification in single-cell gene expression data

Background: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. Results: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopulations based on the local pairwise correlation between covariates, without needing to define an a priori interaction scale. We demonstrate that MCA facilitates the identification of differentially regulated subpopulations in simulated data from a small gene regulatory network, followed by application to previously published single-cell qPCR data from mouse embryonic stem cells. We show that MCA recovers previously identified subpopulations, provides additional insight into the underlying correlation structure, reveals potentially spurious compartmentalizations, and provides insight into novel subpopulations. Conclusions: MCA is a useful method for the identification of subpopulations in low-dimensional expression data, as emerging from qPCR or FACS measurements. With MCA it is possible to investigate the robustness of covariate correlations with respect to subpopulations, graphically identify outliers, and identify factors contributing to differential regulation between pairs of covariates. MCA thus provides a framework for investigation of expression correlations for genes of interest and biological hypothesis generation.

Research Summary

The paper introduces Multiresolution Correlation Analysis (MCA), a novel visual‑analytic framework designed to uncover subpopulations in low‑dimensional single‑cell gene expression data without requiring a predefined interaction scale. Traditional clustering methods (e.g., k‑means, hierarchical clustering) rely on distance or density and often fail when cell states overlap or when the biologically relevant signal is encoded in changes of pairwise gene correlations rather than absolute expression levels. MCA addresses this gap by systematically evaluating local Pearson correlations across a continuum of window sizes (resolutions) defined on the two‑dimensional space of any selected gene pair.

Methodology
For a chosen gene pair (X, Y), the algorithm sweeps a grid of window centers μ = (μ_X, μ_Y) and, for each center, constructs a rectangular sub‑sample S(μ, α) that includes all cells whose X and Y values lie within a fraction α of the total range around μ. The parameter α (0 < α ≤ 1) controls the resolution: small α yields highly localized windows, large α approximates a global view. If the sub‑sample contains at least N_min cells, MCA computes the Pearson correlation r(μ, α) and its statistical significance (p‑value). Mapping r to color intensity and the number of points to opacity produces a “correlation landscape” in which each point on the μ‑grid shows the local correlation at the chosen resolution α. Users can explore this landscape interactively, identifying regions where the sign or magnitude of the correlation changes abruptly. These transition zones are hypothesized to correspond to boundaries between distinct cellular subpopulations.
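The windowing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`pearson`, `local_correlation`, `correlation_landscape`) are hypothetical, the window is assumed to span α times the data range in each dimension (half on either side of the center), and the p‑value computation is omitted.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # undefined when one variable is constant in the window
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def local_correlation(xs, ys, center, alpha, n_min=10):
    """Return (r, n) for the cells inside the window S(center, alpha).

    Assumption: the window covers alpha * range in each dimension,
    centred on `center` (half-width = alpha * range / 2).
    """
    half_x = alpha * (max(xs) - min(xs)) / 2
    half_y = alpha * (max(ys) - min(ys)) / 2
    cx, cy = center
    inside = [(x, y) for x, y in zip(xs, ys)
              if abs(x - cx) <= half_x and abs(y - cy) <= half_y]
    n = len(inside)
    if n < n_min:
        return None, n  # too few cells: no correlation is reported
    wx, wy = zip(*inside)
    return pearson(wx, wy), n

def correlation_landscape(xs, ys, alpha, grid=5, n_min=10):
    """Evaluate local_correlation on a grid x grid lattice of centres (grid >= 2)."""
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    landscape = {}
    for i in range(grid):
        for j in range(grid):
            cx = x0 + (x1 - x0) * i / (grid - 1)
            cy = y0 + (y1 - y0) * j / (grid - 1)
            landscape[(i, j)] = local_correlation(xs, ys, (cx, cy), alpha, n_min)
    return landscape
```

Calling `correlation_landscape` once per α value gives one layer of the landscape per resolution; windows with fewer than `n_min` cells come back as `(None, n)` so that they can be drawn as empty.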

Simulation Study
The authors first validated MCA on synthetic data derived from a small gene‑regulatory network comprising three genes and two regulatory modules (one activating, one repressing). They sampled 500 cells per module, deliberately allowing the two modules to overlap in expression space. Traditional PCA and k‑means could not separate the modules, but MCA revealed distinct zones where the correlation between, for example, Nanog and Oct4 switched from strongly positive (module 1) to strongly negative (module 2). This demonstrated that MCA can detect subpopulation structure encoded purely in correlation patterns, even when marginal distributions are indistinguishable.
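The core idea, that two groups can share the same marginal distributions yet differ in correlation, can be reproduced with a toy example. The sketch below is an assumption‑laden stand‑in and does not use the authors' regulatory network: it draws two groups of cells from bivariate normal distributions with correlations of +0.8 and −0.8 (values chosen only for illustration) and compares per‑group and pooled Pearson correlations.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_module(n, rho, rng):
    """Draw n (x, y) cells from a standard bivariate normal with correlation rho."""
    cells = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # y mixes z1 and independent noise so that corr(x, y) = rho
        cells.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return cells

rng = random.Random(0)                       # fixed seed for reproducibility
module_1 = simulate_module(500, +0.8, rng)   # illustrative "activating" group
module_2 = simulate_module(500, -0.8, rng)   # illustrative "repressing" group
mixed = module_1 + module_2

r1 = pearson(*zip(*module_1))    # strongly positive
r2 = pearson(*zip(*module_2))    # strongly negative
r_all = pearson(*zip(*mixed))    # close to zero: the two signals cancel
```

Both groups have standard normal marginals for x and y, so thresholds on either gene alone cannot separate them; only the sign of the correlation distinguishes the groups, which is the situation a correlation‑based view is meant to expose.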

Application to Real Data
The method was then applied to a published single‑cell qPCR dataset of mouse embryonic stem cells (mESCs) measuring 48 genes across 96 cells. Prior analyses had identified two major subpopulations—“fluorescent” and “plasma”—based on expression thresholds of a few marker genes. MCA reproduced these groups in the correlation landscape and, importantly, uncovered an additional intermediate region where the correlation between Sox2 and Rex1 was moderate, suggesting a possible transitional state not captured by earlier clustering. Moreover, MCA highlighted outlier cells that exhibited extreme local correlations but were supported by very few data points, allowing researchers to flag potentially spurious compartments for further validation.

Advantages

Scale‑free exploration – By varying α, MCA simultaneously probes global and local correlation structures without committing to a single window size.
Visual intuition – The correlation landscape provides an immediate, interpretable map where changes in color or opacity signal shifts in regulatory relationships.
Robustness to density – Incorporating the number of points per window as an opacity channel helps distinguish genuine correlation changes from noise in sparsely populated regions.
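The color and opacity encoding mentioned above might look like the following sketch. The blue‑white‑red scale, the function name `landscape_rgba`, and the saturation threshold `n_full` are all assumptions made for illustration; the summary does not specify the actual color map.

```python
def landscape_rgba(r, n, n_full=50):
    """Map a correlation r in [-1, 1] and a window size n to (R, G, B, A).

    Assumptions: white at r = 0, red for positive and blue for negative
    correlations; opacity rises linearly with n and saturates at n_full.
    """
    r = max(-1.0, min(1.0, r))
    if r >= 0:
        rgb = (1.0, 1.0 - r, 1.0 - r)   # white -> red
    else:
        rgb = (1.0 + r, 1.0 + r, 1.0)   # white -> blue
    opacity = max(0.0, min(1.0, n / n_full))
    return rgb + (opacity,)
```

With this encoding, a strong correlation supported by only a handful of cells appears faint, which is what lets sparsely supported extremes be told apart from well‑populated regions.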

Limitations and Future Work
The primary limitation is computational: the number of windows grows quadratically with the resolution of the μ‑grid and linearly with the number of α values, leading to long runtimes for larger datasets or higher‑dimensional gene sets. The authors suggest a parallel GPU implementation and adaptive window selection (e.g., density‑aware α) to mitigate this. Additionally, while the current implementation uses Pearson correlation, extending MCA to Spearman rank correlation or mutual information could capture monotonic or otherwise non‑linear relationships, and partial correlations could account for the influence of other genes. Finally, applying MCA after an initial dimensionality reduction (e.g., selecting the most variable genes) may make the approach tractable for modern single‑cell RNA‑seq data.
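The scaling argument can be made concrete with a back‑of‑the‑envelope count. The helper names below are hypothetical, and the cost model assumes a naive implementation that scans every cell for every window.

```python
def window_count(grid_size, n_alphas):
    """Windows evaluated for one gene pair: a grid_size x grid_size lattice per alpha."""
    return grid_size ** 2 * n_alphas

def naive_cost(grid_size, n_alphas, n_cells):
    """Cell-membership checks if every window scans all cells (assumed naive scheme)."""
    return window_count(grid_size, n_alphas) * n_cells

# Doubling the grid resolution quadruples the work; doubling the alphas doubles it.
base = window_count(50, 10)
finer_grid = window_count(100, 10)
more_alphas = window_count(50, 20)
```

Under these assumptions the cost applies per gene pair, so screening all pairs of a gene panel multiplies it by the number of pairs, which is why the summary's suggestions focus on parallelism and on evaluating fewer windows.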

Conclusion
MCA offers a hypothesis‑generating tool for dissecting subpopulation structure in low‑dimensional single‑cell measurements such as qPCR or flow cytometry. By focusing on how pairwise gene correlations vary across resolutions, it reveals compartments that conventional clustering can miss, flags outliers, and provides a framework for assessing the robustness of inferred regulatory relationships. The authors anticipate that integrating MCA with emerging high‑throughput single‑cell technologies will enable systematic, multiscale interrogation of cellular heterogeneity and the underlying gene‑regulatory networks.