Scalable Batch Correction for Cell Painting via Batch-Dependent Kernels and Adaptive Sampling
Cell Painting is a microscopy-based, high-content imaging assay that produces rich morphological profiles of cells and can support drug discovery by quantifying cellular responses to chemical perturbations. At scale, however, Cell Painting data is strongly affected by batch effects arising from differences in laboratories, instruments, and protocols, which can obscure biological signal. We present BALANS (Batch Alignment via Local Affinities and Subsampling), a scalable batch-correction method that aligns samples across batches by constructing a smoothed affinity matrix from pairwise distances. Given $n$ data points, BALANS builds a sparse affinity matrix $A \in \mathbb{R}^{n \times n}$ using two ideas. (i) For points $i$ and $j$, it sets a local scale using the distance from $i$ to its $k$-th nearest neighbor within the batch of $j$, then computes $A_{ij}$ via a Gaussian kernel calibrated by these batch-aware local scales. (ii) Rather than forming all $n^2$ entries, BALANS uses an adaptive sampling procedure that prioritizes rows with low cumulative neighbor coverage and retains only the strongest affinities per row, yielding a sparse but informative approximation of $A$. We prove that this sampling strategy is order-optimal in sample complexity and provides an approximation guarantee, and we show that BALANS runs in nearly linear time in $n$. Experiments on diverse real-world Cell Painting datasets and controlled large-scale synthetic benchmarks demonstrate that BALANS scales to large collections while improving runtime over native implementations of widely used batch-correction methods, without sacrificing correction quality.
💡 Research Summary
Cell Painting is a high-content microscopy assay that generates rich, single-cell morphological profiles for drug discovery. However, large-scale experiments inevitably suffer from batch effects caused by differences in laboratories, instruments, protocols, and other technical factors, which can mask the biological signal of interest. Existing batch-correction methods such as Harmony, Scanorama, and BBKNN either rely on global transformations that ignore batch-specific density variations or construct batch-aware graphs without providing corrected feature vectors, and they become computationally prohibitive when the number of cells reaches hundreds of thousands or millions.
The authors introduce BALANS (Batch Alignment via Local Affinities and Subsampling), a novel, scalable batch‑correction framework designed specifically for Cell Painting data. BALANS consists of two key innovations. First, it defines a batch‑dependent local scale for each pair of points (i, j). For a point i and a target point j belonging to batch bⱼ, the algorithm computes σᵢⱼ as the distance from i to its k‑th nearest neighbor within batch bⱼ. This scale is then used in a Gaussian kernel:
Aᵢⱼ = exp(−‖xᵢ − xⱼ‖² / σᵢⱼ²).
By adapting the kernel bandwidth to the density of the target batch, BALANS amplifies cross‑batch similarities that are biologically meaningful while suppressing spurious affinities that arise from densely clustered or noisy batches.
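The kernel above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `batch_aware_affinity`, the brute-force O(n²) distance computation, and the choice to exclude a point from its own neighbor list are all assumptions made for this example.

```python
import numpy as np

def batch_aware_affinity(X, batches, k=5):
    """Dense A[i, j] = exp(-||x_i - x_j||^2 / sigma_ij^2).

    sigma_ij is the distance from point i to its k-th nearest neighbor
    among the points of batch(j). Excluding i itself from its own batch
    is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    n = X.shape[0]

    # All pairwise squared distances: O(n^2) memory, so small inputs only.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    dist = np.sqrt(d2)

    sigma = np.empty((n, n))
    for b in np.unique(batches):
        cols = np.flatnonzero(batches == b)
        if len(cols) <= k:
            raise ValueError(f"batch {b!r} needs more than k={k} points")
        d_b = dist[:, cols].copy()
        # Mask each point's zero distance to itself inside its own batch.
        d_b[cols, np.arange(len(cols))] = np.inf
        # k-th smallest distance from every point i into batch b.
        kth = np.partition(d_b, k - 1, axis=1)[:, k - 1]
        sigma[:, cols] = kth[:, None]

    sigma = np.maximum(sigma, 1e-12)  # guard against duplicated points
    return np.exp(-d2 / sigma ** 2)
```

For example, with one-dimensional points 0, 1, 2 in one batch and 10, 20, 30 in another (k = 1), the point at 0 receives the same affinity to its nearest neighbor in each batch, even though the second batch is ten times farther away. This is the effect of scaling by the target batch's local density.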
Second, BALANS avoids constructing the full n × n affinity matrix (which would require O(n²) distance computations). Instead, it performs adaptive landmark sampling of rows. At each iteration, a point i is selected with probability proportional to the inverse of its cumulative affinity to already sampled points. This “coverage‑based” strategy preferentially selects points whose neighborhoods are under‑represented, guaranteeing that each biological cluster contributes roughly O(log K) rows with high probability (where K is the number of true clusters). For each sampled row, the full set of affinities to all n points is computed, then an elbow‑detection heuristic retains only the strongest entries, producing a sparse row. Collecting |S| ≈ O(K log K) such rows yields a matrix A_S of size |S| × n.
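A rough sketch of the coverage-based sampling loop follows. It rests on several assumptions: `row_fn` is a hypothetical callback returning one affinity row, the constant `eps` keeps the inverse-coverage weights finite, and a fixed top-`m` cutoff stands in for the elbow-detection heuristic, whose details are not given here.

```python
import numpy as np

def adaptive_row_sampling(row_fn, n, n_samples, top_m=10, eps=1e-8, seed=0):
    """Sample rows with probability proportional to 1 / (coverage + eps).

    row_fn(i) must return the length-n affinity row of point i.
    Returns the sampled indices S and the sparsified rows A_S (|S| x n).
    """
    rng = np.random.default_rng(seed)
    coverage = np.zeros(n)              # cumulative affinity to sampled points
    available = np.ones(n, dtype=bool)  # points not yet sampled
    m = max(1, min(top_m, n))
    sampled, rows = [], []

    for _ in range(min(n_samples, n)):
        # Poorly covered points get large weights; sampled points get zero.
        weights = np.where(available, 1.0 / (coverage + eps), 0.0)
        i = int(rng.choice(n, p=weights / weights.sum()))

        row = np.asarray(row_fn(i), dtype=float)
        coverage += row                 # treats A[i, j] as j's affinity to i
        available[i] = False

        # Keep only the m strongest entries (stand-in for elbow detection).
        keep = np.argpartition(row, -m)[-m:]
        sparse_row = np.zeros(n)
        sparse_row[keep] = row[keep]

        sampled.append(i)
        rows.append(sparse_row)

    return np.array(sampled), np.vstack(rows)
```

On an idealized block-structured affinity with three clusters, three draws from this loop land in three different clusters with overwhelming probability, because once a cluster is covered its points carry almost no sampling weight.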
To reconstruct a full approximation of the affinity matrix, BALANS applies a low-rank Nyström-type approximation (Williams & Seeger, 2000) using the sampled rows A_S as landmarks. The authors prove that under a block-diagonal plus noise model, where the true affinity matrix consists of K rank-one blocks (one per biological cluster) plus independent exponential noise, the reconstruction error ‖Â − A‖_op is bounded with high probability, and the sampling scheme is order-optimal in terms of sample complexity.
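For orientation, the textbook Nyström construction can be written as Â = C W⁺ Cᵀ, where W is the landmark-by-landmark block and C holds the landmark columns. The sketch below assumes a symmetric affinity, so that the columns are the transposed sampled rows. That is a simplification, since the batch-dependent scales need not be symmetric, and the function names are illustrative only.

```python
import numpy as np

def nystrom_factors(A_S, S, rcond=1e-10):
    """Return (C, W_pinv) such that A_hat = C @ W_pinv @ C.T.

    A_S : (|S|, n) sampled rows of an affinity assumed symmetric.
    S   : indices of the sampled (landmark) points.
    """
    A_S = np.asarray(A_S, dtype=float)
    C = A_S.T                    # (n, |S|): landmark columns, by symmetry
    W = A_S[:, S]                # (|S|, |S|): landmark-to-landmark block
    W = 0.5 * (W + W.T)          # symmetrize against round-off
    W_pinv = np.linalg.pinv(W, rcond=rcond, hermitian=True)
    return C, W_pinv

def nystrom_matvec(C, W_pinv, V):
    """Apply A_hat to V without forming the n x n matrix."""
    return C @ (W_pinv @ (C.T @ V))
```

In the noise-free case of K rank-one blocks, the reconstruction is exact as soon as every block contains at least one landmark, which gives some intuition for why covering each cluster matters. Applying the factors through `nystrom_matvec` costs O(n |S|) per column of `V`, and the dense matrix is never built.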
Algorithmically, BALANS runs in near-linear time in n: k-NN searches for local scales can be performed with efficient spatial indexes (e.g., ball trees); filling the |S| sampled rows requires |S| · n affinity evaluations, with adaptive selection and row sparsification adding O(|S| log |S|) overhead; and the final low-rank multiplication is O(n K). Crucially, the full dense affinity matrix is never materialized, keeping memory usage linear in n.
Empirical evaluation spans real Cell Painting datasets from the JUMP-CP consortium (four datasets, 13 sources, three microscopes) and the BBBC benchmark, using both handcrafted CellProfiler features and deep-learning-based DeepProfiler embeddings. Across eight experimental scenarios, BALANS matches or exceeds the performance of state-of-the-art methods on several metrics: (i) batch mixing (Adjusted Rand Index, silhouette scores), (ii) alignment of clusters with compound labels, (iii) average correlation structure, and (iv) downstream tasks such as phenotype classification. In the largest setting (≈5 million cells), BALANS improves average evaluation scores by more than 30% relative to the uncorrected baseline and outperforms competing methods while completing in under one hour on a single commodity server. Runtime comparisons show that BALANS is faster than the native implementations of widely used methods such as Harmony and Scanorama, and its memory footprint remains modest because only |S| rows are stored.
The paper also contrasts BALANS with BBKNN, noting that BBKNN builds a batch‑aware KNN graph but does not produce corrected feature vectors; consequently, BBKNN can only be evaluated with graph‑based metrics. BALANS, by constructing an explicit smoothing operator from A_S, directly transforms the original data matrix X into a batch‑corrected representation, enabling immediate use in downstream analyses.
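One common way to turn an affinity matrix into a smoothing operator is to normalize each row to sum to one and replace every profile with the affinity-weighted average of its neighbors. The sketch below shows that generic construction on a small dense matrix. Whether BALANS uses exactly this normalization is an assumption, and `smooth_features` is a hypothetical name.

```python
import numpy as np

def smooth_features(A, X, eps=1e-12):
    """Return P @ X, where P is the row-normalized affinity matrix.

    A : (n, n) non-negative affinities.
    X : (n, d) original feature matrix.
    """
    A = np.asarray(A, dtype=float)
    X = np.asarray(X, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(row_sums, eps)   # each row of P sums to 1
    return P @ X                        # affinity-weighted neighbor average

# Toy case: two clusters, each observed in two batches offset by +1.
clusters = np.array([0, 0, 1, 1])
X = np.array([[0.0], [1.0], [10.0], [11.0]])
A = (clusters[:, None] == clusters[None, :]).astype(float)
X_corrected = smooth_features(A, X)
```

In the toy case, profiles from the same cluster collapse onto their shared mean, so the batch offset disappears while the separation between the two clusters is preserved.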
In summary, BALANS offers a theoretically grounded, computationally efficient solution for batch correction in high‑content imaging. By integrating batch‑specific local scaling with adaptive landmark sampling and low‑rank matrix approximation, it achieves near‑linear scalability, high correction quality, and practical usability for datasets ranging from tens of thousands to millions of single‑cell profiles. This makes it a valuable tool for large‑scale drug‑screening pipelines and other biomedical imaging studies where batch effects have previously limited interpretability.