Assessing molecular variability in cancer genomes
The dynamics of tumour evolution are not well understood. In this paper we provide a statistical framework for evaluating the molecular variation observed in different parts of a colorectal tumour. A multi-sample version of the Ewens Sampling Formula forms the basis for our modelling of the data, and we provide a simulation procedure for use in obtaining reference distributions for the statistics of interest. We also describe the large-sample asymptotics of the joint distributions of the variation observed in different parts of the tumour. While actual data should be evaluated with reference to the simulation procedure, the asymptotics serve to provide theoretical guidelines, for instance with reference to the choice of possible statistics.
💡 Research Summary
The paper addresses the challenge of quantifying molecular heterogeneity within a single colorectal tumor by introducing a statistical framework grounded in population genetics. The authors adapt the classic Ewens Sampling Formula (ESF), originally derived for the allele counts of a neutral sample under the infinite‑alleles model, to a multi‑sample context where several spatially distinct regions of a tumor are sequenced independently. In this “multi‑region ESF,” each region provides a sample of mutation counts drawn from a common underlying mutation pool characterized by a single diversity parameter, θ. In the classical neutral model θ is the scaled mutation rate, proportional to the product of the mutation rate and the effective population size, so it reflects the two quantities jointly rather than either one alone.
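For orientation, the classical single‑sample ESF that the multi‑region version builds on can be written as follows. The notation is the conventional one and is an assumption here, not necessarily the paper's: a_j is the number of types (alleles or mutations) represented exactly j times in a sample of size n.

```latex
P(a_1, \dots, a_n \mid \theta)
  = \frac{n!}{\theta_{(n)}}
    \prod_{j=1}^{n} \frac{1}{a_j!} \left( \frac{\theta}{j} \right)^{a_j},
\qquad
\theta_{(n)} = \theta (\theta + 1) \cdots (\theta + n - 1),
\qquad
\sum_{j=1}^{n} j \, a_j = n .
```

The number of distinct types in the sample, k = a_1 + … + a_n, is a sufficient statistic for θ under this formula, which is one reason statistics based on type counts are natural choices.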
Methodologically, the authors first formalize the joint probability distribution of the observed mutation counts across K regions. They then develop a Markov chain Monte Carlo (MCMC) simulation procedure that repeatedly samples from the multi‑region ESF for given values of θ and the region‑specific sample sizes (n₁,…,n_K). This simulation yields empirical reference distributions for a suite of statistics that capture different aspects of heterogeneity, such as: (i) the proportion of mutations shared between any pair of regions (overlap ratio), (ii) the fraction of mutations unique to each region, and (iii) estimators of θ as a measure of overall mutation richness. By comparing the observed statistics to these reference distributions, researchers can obtain p‑values, or reference intervals, computed under the null hypothesis of random sampling from a homogeneous mutation pool.
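The idea of a simulated reference distribution can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the homogeneous null can be mimicked by drawing one pooled ESF sample of size n₁+…+n_K with the Hoppe urn (Chinese restaurant process) and, by exchangeability, assigning consecutive blocks of individuals to regions. All function names are hypothetical.

```python
import random
from collections import Counter


def hoppe_urn(n, theta, rng):
    """Type labels for n individuals whose partition follows the ESF(theta)."""
    labels = []
    next_type = 0
    for i in range(n):
        # Individual i+1 founds a new type with probability theta / (theta + i);
        # otherwise it copies the type of a uniformly chosen earlier individual.
        if rng.random() < theta / (theta + i):
            labels.append(next_type)
            next_type += 1
        else:
            labels.append(labels[rng.randrange(i)])
    return labels


def simulate_regions(sizes, theta, rng):
    """Assumed null model: one pooled ESF sample split into regions of given sizes."""
    labels = hoppe_urn(sum(sizes), theta, rng)
    regions, start = [], 0
    for n in sizes:
        regions.append(Counter(labels[start:start + n]))
        start += n
    return regions


def shared_fraction(a, b):
    """Fraction of the types seen in either region that are seen in both."""
    types_a, types_b = set(a), set(b)
    return len(types_a & types_b) / len(types_a | types_b)


def private_counts(regions):
    """Number of types in each region that appear in no other region."""
    n_regions_with_type = Counter()
    for r in regions:
        n_regions_with_type.update(r.keys())
    return [sum(1 for t in r if n_regions_with_type[t] == 1) for r in regions]


def reference_distribution(sizes, theta, stat, n_sims, seed=0):
    """Simulated values of stat(regions) under the assumed null model."""
    rng = random.Random(seed)
    return [stat(simulate_regions(sizes, theta, rng)) for _ in range(n_sims)]


def lower_tail_p(observed, reference):
    """Monte Carlo p-value for 'observed is unusually small' (add-one correction)."""
    hits = sum(1 for r in reference if r <= observed)
    return (1 + hits) / (1 + len(reference))
```

A reference distribution for the overlap between the first two regions would be obtained with `reference_distribution(sizes, theta, lambda r: shared_fraction(r[0], r[1]), n_sims)`, and an observed overlap is then passed to `lower_tail_p`. The value of θ has to be supplied, so in practice the result depends on how θ is estimated.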
Beyond the simulation‑based inference, the authors derive large‑sample asymptotic results. They prove that, as the number of cells sequenced in each region grows, the joint distribution of the heterogeneity statistics converges to a multivariate normal distribution. The limiting mean vector and covariance matrix are expressed analytically in terms of θ and the region‑specific sample sizes. This central‑limit‑type theorem provides theoretical guidance for selecting statistics that are asymptotically efficient and for designing sampling schemes (e.g., how many regions to sample and how many cells per region) that maximize statistical power.
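As a simpler point of comparison, the classical single‑sample analogue of such a central limit result concerns K_n, the number of distinct types in an ESF sample of size n. K_n is a sum of independent Bernoulli indicators with success probabilities θ/(θ+i), i = 0,…,n−1, so its mean and variance are available exactly, both grow like θ log n, and K_n is asymptotically normal. The sketch below covers only this single‑sample case and does not reproduce the paper's multi‑region theorem; the function names are hypothetical.

```python
import math


def type_count_moments(n, theta):
    """Exact mean and variance of the number of distinct types K_n under ESF(theta)."""
    # Success probabilities of the independent Bernoulli indicators, i = 0..n-1.
    probs = [theta / (theta + i) for i in range(n)]
    mean = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    return mean, var


def normal_approx_z(k_obs, n, theta):
    """Standardized type count for use with a normal approximation (needs n >= 2)."""
    mean, var = type_count_moments(n, theta)
    return (k_obs - mean) / math.sqrt(var)
```

Comparing `type_count_moments(n, theta)` with `theta * math.log(n)` for increasing n shows how slowly the logarithmic approximation takes hold, which is consistent with the abstract's advice to evaluate actual data against simulation rather than the asymptotics.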
The framework is applied to real colorectal cancer data obtained from five spatially separated biopsies. After fitting the multi‑region ESF and running the simulation pipeline, the authors find that the observed overlap of mutations between regions is significantly lower than expected under the homogeneous null model, while each region harbors a substantial set of private mutations. These findings support a model of spatially structured subclones within the tumor, consistent with recent phylogenetic studies.
In the discussion, the authors acknowledge several limitations. The infinite‑alleles assumption may not fully capture the complex clonal dynamics and selective sweeps characteristic of cancer evolution. The inference relies on an accurate estimate of θ; misspecification can bias heterogeneity measures. Computational cost scales with the number of regions and sample sizes, which may be prohibitive for very large sequencing projects. They suggest extensions such as incorporating phylogenetic relationships among clones, adopting a Bayesian hierarchical model to jointly estimate θ and heterogeneity statistics, and integrating additional omics layers (e.g., transcriptomics, methylation) into a multivariate version of the model.
Overall, the study provides a rigorous, population‑genetics‑based statistical tool for dissecting intra‑tumor molecular variability. By coupling simulation‑derived reference distributions with analytically tractable large‑sample approximations, the framework offers both practical inference for current datasets and theoretical guidance for future experimental designs aimed at unraveling tumor evolution.