FlashSketch: Sketch-Kernel Co-Design for Fast Sparse Sketching on GPUs


Sparse sketches such as the sparse Johnson-Lindenstrauss transform are a core primitive in randomized numerical linear algebra because they leverage random sparsity to reduce the arithmetic cost of sketching while still offering strong approximation guarantees. Their random sparsity, however, is at odds with efficient implementation on modern GPUs, since it leads to irregular memory access patterns that degrade memory bandwidth utilization. Motivated by this tension, we pursue a sketch-kernel co-design approach: we design a new family of sparse sketches, BlockPerm-SJLT, whose sparsity structure is chosen to enable FlashSketch, a corresponding optimized CUDA kernel that implements these sketches efficiently. The design of BlockPerm-SJLT introduces a tunable parameter that explicitly trades off GPU efficiency against sketching robustness. We provide theoretical guarantees for BlockPerm-SJLT under the oblivious subspace embedding (OSE) framework, and also analyze the effect of the tunable parameter on sketching quality. We empirically evaluate FlashSketch on standard RandNLA benchmarks, as well as an end-to-end ML data attribution pipeline called GraSS. FlashSketch pushes the Pareto frontier of sketching quality versus speed across a range of regimes and tasks, and achieves a global geomean speedup of roughly 1.7x over the prior state-of-the-art GPU sketches.


💡 Research Summary

The paper tackles a fundamental performance bottleneck that arises when applying sparse Johnson‑Lindenstrauss transforms (SJLT) on modern GPUs. While SJLT and related constructions such as OSNAP dramatically reduce arithmetic cost by enforcing a small, fixed number of non‑zeros per column, their completely random sparsity pattern leads to highly irregular memory accesses. On GPUs, where bandwidth‑limited memory traffic dominates execution time, this irregularity forces the use of global atomic operations and prevents effective reuse of fast shared memory, resulting in poor utilization of the device’s capabilities.

To resolve this tension, the authors adopt a sketch‑kernel co‑design methodology. They first define a new family of sparse sketches, BlockPerm‑SJLT, whose sparsity is structured at the block level. The d‑dimensional input space and the k‑dimensional sketch space are each partitioned into M contiguous blocks of size Bc = d/M and Br = k/M respectively. At the block level, the sketch matrix is built as a union of κ edge‑disjoint permutations {π₁,…,π_κ} of the block indices. Each output block g is connected only to κ distinct input blocks N(g) = {π_ℓ(g)}; the edge‑disjoint property guarantees that the resulting bipartite block graph is κ‑regular on both sides. Inside every non‑zero block (g, h) with h ∈ N(g), an independent sparse JL matrix Φ_{g,h} with exactly s non‑zeros per column (Rademacher entries scaled by 1/√s) is sampled. The full sketch S is therefore a block‑sparse matrix where each non‑zero block retains the fine‑grained mixing of a classic SJLT while the block‑level wiring enforces regularity.
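The block-level construction above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the edge-disjoint permutations are realized here as cyclic shifts composed with one shared random permutation sigma (one simple way to guarantee edge-disjointness; the paper may sample them differently), and the global scale is assumed to be 1/√(κs) so that every column of S has exactly unit norm.

```python
import numpy as np

def blockperm_sjlt(d, k, M, kappa, s, rng):
    """Assemble a dense copy of a BlockPerm-SJLT-style matrix S (k x d).

    Illustrative assumptions not taken from the paper: the kappa
    edge-disjoint permutations are cyclic shifts composed with a shared
    random permutation sigma, and the scale is 1/sqrt(kappa*s) so that
    every column of S has unit norm.
    """
    assert d % M == 0 and k % M == 0 and kappa <= M
    Bc, Br = d // M, k // M            # input / output block sizes
    assert s <= Br
    sigma = rng.permutation(M)         # shared random block permutation
    scale = 1.0 / np.sqrt(kappa * s)
    S = np.zeros((k, d))
    for ell in range(kappa):           # the kappa block permutations
        for g in range(M):             # output block index
            h = sigma[(g + ell) % M]   # input block pi_ell(g)
            # independent SJLT block Phi_{g,h}: s signed nonzeros per column
            for c in range(Bc):
                rows = rng.choice(Br, size=s, replace=False)
                signs = rng.choice([-1.0, 1.0], size=s)
                S[g * Br + rows, h * Bc + c] = signs * scale
    return S

rng = np.random.default_rng(0)
S = blockperm_sjlt(d=64, k=32, M=4, kappa=2, s=3, rng=rng)
```

Because the κ permutations are edge-disjoint, each input column lands in κ distinct output blocks, so every column of S has exactly κ·s nonzeros and unit Euclidean norm under this normalization.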

The second contribution is the FlashSketch CUDA kernel, which exploits the block‑regular structure. The kernel streams tiles of the input matrix A into shared memory, performs per‑block accumulation of the sketch output using shared‑memory atomics, and writes each output tile back to global memory exactly once. Because each output block is assigned to a unique thread block (thanks to the edge‑disjoint permutations), there are no write conflicts in global memory, eliminating the costly global atomics that dominate prior GPU SJLT implementations. The design also includes on‑the‑fly generation of the Rademacher signs, avoiding the need to materialize the sparse sketch matrix.
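The kernel's schedule can be mimicked in NumPy to make the conflict-free write pattern concrete. This is a serial analogue of the CUDA logic, not the actual kernel: block shapes, the cyclic neighbor construction, and the dense per-edge blocks are assumptions chosen for clarity (the real kernel uses sparse blocks with on-the-fly sign generation).

```python
import numpy as np

# Serial NumPy analogue of the FlashSketch schedule: each output block g
# is owned by one worker, reads only its kappa neighboring input blocks,
# accumulates locally, and writes its output tile exactly once -- so no
# global atomics are needed.
M, Br, Bc, kappa = 4, 8, 16, 2
k, d, n = M * Br, M * Bc, 10
rng = np.random.default_rng(1)

# neighbor lists N(g) from kappa edge-disjoint cyclic permutations
neighbors = [[(g + ell) % M for ell in range(kappa)] for g in range(M)]
# one independent block Phi_{g,h} per edge (dense here for clarity)
Phi = {(g, h): rng.standard_normal((Br, Bc))
       for g in range(M) for h in neighbors[g]}

A = rng.standard_normal((d, n))
Y = np.empty((k, n))
for g in range(M):                       # one thread block per output tile
    acc = np.zeros((Br, n))              # shared-memory accumulator analogue
    for h in neighbors[g]:               # stream the kappa input tiles
        acc += Phi[(g, h)] @ A[h * Bc:(h + 1) * Bc]
    Y[g * Br:(g + 1) * Br] = acc         # single conflict-free global write

# assemble the dense sketch for comparison
S = np.zeros((k, d))
for (g, h), blk in Phi.items():
    S[g * Br:(g + 1) * Br, h * Bc:(h + 1) * Bc] = blk
```

The blocked result Y matches the dense product S @ A exactly; the point of the schedule is that each output tile depends on only κ input tiles and is written once.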

From a theoretical standpoint, the authors analyze BlockPerm‑SJLT within the oblivious subspace embedding (OSE) framework. They introduce a “neighborhood‑coherence” quantity that captures how well the κ permutations mix input blocks. Under a simplified model where the permutations are independent random derangements, they prove that increasing κ reduces the block coherence μ_blk of any subspace, thereby strengthening the OSE guarantee. Conversely, larger κ also enlarges the amount of data each thread block must handle, slightly increasing memory traffic. Thus κ serves as an explicit tunable parameter that trades sketch quality for GPU efficiency.
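For reference, the OSE guarantee invoked here is the standard one (this statement is the textbook definition, not a result specific to the paper):

```latex
% S \in \mathbb{R}^{k \times d} is an (\varepsilon, \delta) oblivious subspace
% embedding for dimension-r subspaces if, for every fixed
% U \in \mathbb{R}^{d \times r} with orthonormal columns,
\Pr\bigl[\, \|(SU)^{\top}(SU) - I_r\|_2 \le \varepsilon \,\bigr] \ge 1 - \delta,
% equivalently, with probability at least 1 - \delta,
(1 - \varepsilon)\,\|x\|_2^2 \;\le\; \|Sx\|_2^2 \;\le\; (1 + \varepsilon)\,\|x\|_2^2
\quad \text{for all } x \in \operatorname{range}(U).
```

The analysis described above shows that larger κ drives down the block coherence μ_blk, which in turn shrinks the sketch dimension k needed to achieve a given (ε, δ) under this definition.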

Empirically, FlashSketch is evaluated on an NVIDIA RTX 4090 across a suite of RandNLA tasks (least‑squares, low‑rank approximation, regression) and an end‑to‑end machine‑learning pipeline for data attribution called GraSS. Baselines include dense Gaussian sketches (cuBLAS), generic sparse SJLT via cuSPARSE, and a recent CountSketch implementation. Across all experiments FlashSketch achieves a geometric‑mean speedup of roughly 1.7× over the best prior GPU sketch, with the most pronounced gains at modest sketch dimensions (k = 64–512). Accuracy, measured by the error in Gram‑matrix approximation, remains on par with or slightly better than the baselines, confirming that the OSE guarantees hold in practice.

In summary, the paper makes four key contributions: (1) a hardware‑aware block‑permuted sparse JL transform that preserves mixing while enabling regular memory access patterns; (2) a specialized CUDA kernel that leverages shared‑memory atomics and eliminates global atomics; (3) a theoretical analysis linking the number of block permutations κ to both subspace coherence and OSE quality; and (4) a thorough experimental validation showing that the co‑designed sketch and kernel push the quality‑speed Pareto frontier for GPU‑based sketching. The block‑permutation concept is generic enough to be adapted to other random projections (e.g., subsampled Hadamard) and to emerging accelerator architectures, opening a promising direction for future research.

