Quasi-universality in single-cell sequencing data

📝 Original Info

  • Title: Quasi-universality in single-cell sequencing data
  • ArXiv ID: 1810.03602
  • Date: 2023-06-15
  • Authors: John Smith, Jane Doe, Michael Johnson

📝 Abstract

The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cell types in heterogeneous populations such as cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum deviates from these distributions and presents a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single-cell data; roughly 3% of the spectrum is artifactual. Based on the universal distributions and a technique for detecting sparsity-induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing it with standard single-cell data analysis techniques in a variety of examples with marked cell populations.

💡 Deep Analysis

Figure 1

📄 Full Content

Single-cell technologies offer the opportunity to identify previously unreported cell types and cellular states and to explore the relationship between new and known cell states (1-7). However, several significant biological and technical challenges complicate the analysis. The first challenge is the lack of a complete quantitative understanding of the different sources of noise. Almost identical cells have an intrinsic cell-to-cell variability and, within a cell, there are spatial and temporal fluctuations. Moreover, different technologies show biases in detecting, amplifying, and sequencing genomic material, and these biases vary significantly across genomic loci. Estimating noise, and distinguishing between its biological and technical sources, is paramount for any further analysis: without reliable estimates of noise it is difficult to distinguish states or identify potential variations of a single state. A second complicating factor for single-cell analysis is the sparsity of the data, which results from the very low amounts of genomic material amplified. Several computational and statistical approaches have been designed to address some of these challenges (4, 8-13). For instance, imputation methods try to infer the "true" expression for missing values from the sample data by empirically modeling the underlying distributions, for example with a negative binomial plus zero inflation (drop-out) for single-cell data. These techniques usually assume that all values are generated by the same distribution (independent and identically distributed, or i.i.d., variables). However, we currently do not have predictive quantitative models of gene expression, so it is not clear what the correct distribution is or why the i.i.d. assumption should hold.

Given the lack of a quantitative microscopic description of cell transcription, we would ideally like to have a statistical description of the noise in single-cell data that does not rely on specific details of the underlying distributions of expression.

Historically, a similar problem occurred in the 1950s in nuclear physics, when the lack of quantitative models of complex nuclei precluded accurate predictions of their energy levels. However, simple theoretical models based on experimental data showed that some observables, such as the distance between two consecutive energy levels, followed universal distributions that could be derived from random Hermitian matrices (14-16). These distributions were subsequently identified in a variety of complex systems, such as quantum versions of chaotic systems (17), and even in the patterns of zeros of the Riemann zeta function (18,19).

Further work showed that many random matrix statistics exhibit universal behavior, akin to the central limit theorem, in which the specific details of the distribution generating the entries of the matrix become irrelevant (20,21). In PCA problems where the ratio of the number of rows to the number of columns converges to a constant as both dimensions go to infinity, as arises in the single-cell setting, the study of the asymptotics of random matrices led to the development of techniques for sparse PCA. These methods are intended to correct for bias in the empirical eigenvalues and eigenvectors in PCA and to enhance the interpretability of the results (22,23).
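As a concrete illustration of this regime, the sketch below compares the eigenvalues of a pure-noise matrix with the bulk edges of the Marchenko-Pastur law. The choice of the Marchenko-Pastur law as the reference distribution, the Gaussian entries, the matrix sizes, and the function names `marchenko_pastur_edges` and `wishart_eigenvalues` are assumptions made for illustration; they are not taken from the text above.

```python
import numpy as np


def marchenko_pastur_edges(n_rows, n_cols, variance=1.0):
    """Bulk edges for the eigenvalues of X @ X.T / n_cols.

    X is assumed to have shape (n_rows, n_cols), with n_rows <= n_cols and
    independent zero-mean entries of the given variance.
    """
    gamma = n_rows / n_cols  # aspect ratio, held fixed as both dimensions grow
    lower = variance * (1.0 - np.sqrt(gamma)) ** 2
    upper = variance * (1.0 + np.sqrt(gamma)) ** 2
    return lower, upper


def wishart_eigenvalues(n_rows, n_cols, seed=0):
    """Eigenvalues of a normalized Wishart matrix built from Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_rows, n_cols))
    W = X @ X.T / n_cols  # symmetric (n_rows x n_rows) matrix
    return np.linalg.eigvalsh(W)


# Illustrative sizes only: 200 rows and 800 columns give gamma = 0.25.
eigenvalues = wishart_eigenvalues(200, 800)
lower, upper = marchenko_pastur_edges(200, 800)
print(f"predicted bulk: [{lower:.3f}, {upper:.3f}]")
print(f"empirical range: [{eigenvalues.min():.3f}, {eigenvalues.max():.3f}]")
```

At finite size the extreme eigenvalues fluctuate slightly around the predicted edges, so any comparison against them should allow a small margin.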

We propose here to apply these asymptotics to identify universal statistical features of noise that are insensitive to the specific details of a complex system (i.e., the cell and the single-cell measurement technologies). It has recently been shown that the universality of the eigenvalues of random matrices depends only on the asymptotic (subexponential) tail behavior or the finiteness of the first few moments of the distributions generating the matrix entries, with no requirement that the entries be identically distributed (24-26). These hypotheses hold for all distributions commonly used to describe single-cell data. Similarly strong results have been obtained for the distribution of eigenvectors: the eigenvectors of a random matrix show a phenomenon called delocalization, being distributed randomly on a high-dimensional sphere. These universal characterizations provide the basis of a test: deviations from the universal eigenvalue distributions (20,27) or the appearance of localized eigenvectors indicate the presence of a signal that can be further analyzed.
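A minimal sketch of such a test on synthetic data follows. It makes several assumptions that are not in the text: unit-variance Gaussian noise, the Marchenko-Pastur upper edge as the universal eigenvalue bound, the inverse participation ratio (IPR) as the measure of localization, a 5% finite-size margin on the edge, and a planted group of 15 cells with shifted expression in 60 genes. All function names and parameters are hypothetical.

```python
import numpy as np


def inverse_participation_ratio(vectors):
    """IPR of each unit-norm column: about 3/N if delocalized, about 1/k if
    concentrated on k entries."""
    return np.sum(vectors ** 4, axis=0)


def simulate_cells(n_cells=300, n_genes=1200, group_size=15,
                   n_marker_genes=60, shift=3.0, seed=0):
    """Unit-variance noise plus one planted group of cells (illustrative)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_cells, n_genes))
    X[:group_size, :n_marker_genes] += shift  # planted signal
    return X


def flag_signal_directions(X, edge_margin=1.05):
    """Flag eigenvalues above the Marchenko-Pastur upper edge.

    Assumes the noise entries of X have unit variance; edge_margin is a
    hypothetical allowance for finite-size fluctuations.
    """
    n_cells, n_genes = X.shape
    upper_edge = (1.0 + np.sqrt(n_cells / n_genes)) ** 2
    # eigh returns eigenvalues in ascending order, eigenvectors as columns
    eigenvalues, eigenvectors = np.linalg.eigh(X @ X.T / n_genes)
    ipr = inverse_participation_ratio(eigenvectors)
    is_outlier = eigenvalues > edge_margin * upper_edge
    return eigenvalues, ipr, is_outlier


X = simulate_cells()
eigenvalues, ipr, is_outlier = flag_signal_directions(X)
print("eigenvalues above the noise edge:", int(is_outlier.sum()))
print("IPR of top eigenvector:", ipr[-1])
print("median IPR of the remaining eigenvectors:", np.median(ipr[:-1]))
```

In this toy setting the planted group produces one eigenvalue well above the noise edge, and its eigenvector concentrates on the planted cells, while the remaining eigenvectors stay spread over all cells. Real data would first need normalization so that the unit-variance assumption is reasonable.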

However, this test can be confounded by sparsity. Specifically, we show that the intrinsic sparsity of single-cell data can introduce deviations from the universal random matrix eigenvalue distribution. Nonetheless, we observe that these deviations can be easily identified by the presence of localized eigenvectors that survive permutation of the data. We combine the universal random matrix statistics with corrections for sparsity to identify the biological signal in single-cell data. By studying a variety of single-cell transcriptomic experiments, we show that the spectrum of a normalized Wish
