scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks.


💡 Research Summary

Training deep learning models on single‑cell omics datasets that contain tens or hundreds of millions of cells poses a severe I/O bottleneck: the data cannot fit into RAM and must be read from disk. Traditional random sampling, which draws each cell independently, requires a separate random disk access per sample and therefore scales poorly; on the 100‑million‑cell Tahoe‑100M dataset it yields only ~20 cells per second, translating to roughly 58 days for a single epoch. Loading the entire dataset into memory eliminates I/O but expands the sparse AnnData representation (~300 GB) to a dense matrix exceeding 1 TB, which is infeasible on typical workstations. Sequential streaming, on the other hand, offers high throughput because disks are optimized for contiguous reads, but it introduces strong sampling bias: cells are stored by experimental plates (≈7 M cells per plate), so a sequential pass feeds the model long stretches of highly correlated samples, leading to catastrophic forgetting and degraded generalization.
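The epoch-time figure follows directly from the throughput: at roughly 20 cells per second, one pass over 100 million cells takes close to two months. A quick back-of-the-envelope check:

```python
n_cells = 100_000_000               # Tahoe-100M
rate = 20                           # cells/s under true random sampling
seconds_per_epoch = n_cells / rate
days_per_epoch = seconds_per_epoch / 86_400
print(f"{days_per_epoch:.1f} days per epoch")  # ≈ 57.9 days
```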

scDataset addresses these conflicting requirements by introducing a quasi‑random sampling scheme that combines block sampling with batched fetching. The dataset is partitioned into contiguous blocks of size b. To assemble a minibatch of size m, scDataset randomly selects ⌈m/b⌉ blocks and reads each block contiguously, reducing the number of random disk seeks from m to ⌈m/b⌉. This alone trades some intra‑batch diversity for I/O efficiency. To recover diversity, a fetch factor f is introduced: instead of fetching exactly m cells per iteration, scDataset retrieves m × f cells in a single I/O operation, stores them in an in‑memory buffer, shuffles them, and then splits the buffer into f minibatches. Consequently each minibatch draws from many more blocks than the block‑sampling‑only baseline, dramatically increasing the expected entropy of the sampled labels.
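The scheme above can be sketched in plain NumPy as a generator that yields minibatch index arrays rather than actual cell data. This is an illustrative reconstruction of the strategy described in the paper, not the scDataset API; the function name and structure are our own:

```python
import numpy as np

def quasi_random_batches(n_cells, batch_size=64, block_size=16,
                         fetch_factor=10, seed=0):
    """Yield minibatch index arrays via block sampling + batched fetching."""
    rng = np.random.default_rng(seed)
    n_blocks = n_cells // block_size
    blocks_per_fetch = (batch_size * fetch_factor) // block_size
    order = rng.permutation(n_blocks)  # shuffle at block granularity
    for start in range(0, n_blocks - blocks_per_fetch + 1, blocks_per_fetch):
        chosen = order[start:start + blocks_per_fetch]
        # One batched fetch: each chosen block is read contiguously from disk
        buffer = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                                 for b in chosen])
        rng.shuffle(buffer)  # in-memory shuffle restores minibatch diversity
        for i in range(fetch_factor):
            yield buffer[i * batch_size:(i + 1) * batch_size]
```

With the defaults, each fetch touches (m·f)/b = 40 blocks, so after shuffling, a minibatch can mix cells from up to 40 blocks instead of only ⌈m/b⌉ = 4.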

The authors provide a theoretical analysis that derives explicit bounds on the expected minibatch entropy as a function of block size b and fetch factor f. The bound shows that when the product b·f is sufficiently large, the entropy approaches that of true random sampling, guaranteeing that stochastic gradient descent retains its convergence properties. Empirically, the paper uses m = 64, b = 16, f = 10 as a default configuration, which amortizes to four random block accesses per minibatch and a single batched fetch of 640 cells (≈10 MB) per iteration.
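The effect of the fetch factor on label diversity can be seen in a toy simulation where labels are stored in long contiguous runs, mimicking the plate layout described above. The plate and label counts here are assumed toy values, not the paper's experiment:

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (bits) of the label distribution in one minibatch."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 1_000)   # 10 "plates", stored contiguously
m, b, f = 64, 16, 10
n_blocks = len(labels) // b

# Sequential streaming: one contiguous slice, almost no label diversity
seq = labels[:m]

# Block sampling only: ceil(m/b) = 4 random blocks per minibatch
blk_ids = rng.choice(n_blocks, size=m // b, replace=False)
blk = np.concatenate([labels[i * b:(i + 1) * b] for i in blk_ids])

# Block sampling + fetch factor: 40 blocks buffered, shuffled, then split
buf_ids = rng.choice(n_blocks, size=(m * f) // b, replace=False)
buf = np.concatenate([labels[i * b:(i + 1) * b] for i in buf_ids])
rng.shuffle(buf)
quasi = buf[:m]

print(entropy_bits(seq), entropy_bits(blk), entropy_bits(quasi))
```

The sequential slice has near-zero entropy, block-only sampling is capped by the handful of blocks it touches, and the fetch-factor variant approaches the log2(10) ≈ 3.3 bits of a fully random draw.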

Performance benchmarks on Tahoe‑100M demonstrate that scDataset achieves more than two orders of magnitude higher throughput than pure random sampling (≈2,500 cells/s vs. ≈20 cells/s). Memory consumption remains modest because only the fetch buffer is resident in RAM. Compared with existing loaders—AnnLoader, scDataLoader, AnnBatch, HuggingFace Datasets, BioNeMo‑SCDL, and TileDB‑SOMA—scDataset requires no data conversion, works directly on standard .h5ad files, supports multiprocessing, and integrates seamlessly with the scverse ecosystem.

Downstream evaluation on three classification tasks (cell‑type annotation, drug‑response prediction, and multi‑condition labeling) shows that models trained with scDataset achieve virtually identical accuracy to those trained with true random sampling (differences < 0.1%). However, total training time is reduced by a factor of 30–40 because the data pipeline no longer stalls the GPU. The paper also highlights that scDataset’s modular callback architecture allows it to wrap arbitrary backends (NumPy arrays, Zarr, HuggingFace Datasets, TileDB‑SOMA, etc.), making it a drop‑in dataset for any standard PyTorch DataLoader pipeline.
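The callback idea can be illustrated with a small backend-agnostic loader: the sampler only produces index arrays, and a user-supplied fetch callback turns them into rows, so any indexable store (a NumPy array, Zarr group, AnnData object, etc.) can serve as the backend. This is a hypothetical sketch of the pattern, not scDataset's actual interface:

```python
import numpy as np

def iter_batches(n_rows, fetch_fn, batch_size=64, block_size=16,
                 fetch_factor=10, seed=0):
    """Backend-agnostic loader: `fetch_fn(indices) -> rows` hides the storage."""
    rng = np.random.default_rng(seed)
    n_blocks = n_rows // block_size
    per_fetch = (batch_size * fetch_factor) // block_size
    order = rng.permutation(n_blocks)
    for start in range(0, n_blocks - per_fetch + 1, per_fetch):
        idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                              for b in order[start:start + per_fetch]])
        rows = fetch_fn(np.sort(idx))   # one batched, mostly-contiguous read
        rng.shuffle(rows)               # shuffles along the first axis
        for i in range(fetch_factor):
            yield rows[i * batch_size:(i + 1) * batch_size]

# Any indexable backend works: here a NumPy array stands in for AnnData/Zarr
data = np.arange(10_000 * 4, dtype=np.float32).reshape(10_000, 4)
batch = next(iter_batches(len(data), lambda idx: data[idx]))
```

Swapping the backend is then a one-line change to the callback, with no modification to the sampling logic.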

In summary, scDataset introduces a principled, high‑performance data loading strategy for atlas‑scale single‑cell omics. By replacing costly per‑sample random I/O with block‑level random access and batched prefetching, it preserves the stochasticity required for effective deep learning while exploiting the sequential read strengths of modern storage hardware. The theoretical guarantees, extensive empirical validation, and broad compatibility position scDataset as a practical solution for researchers who wish to train deep models on massive single‑cell datasets without investing in specialized hardware or costly data conversion pipelines. Future work may explore adaptive block sizing, dynamic fetch factors, or integration with cloud‑native storage systems to further scale beyond the current hundred‑million‑cell regime.

