WebSplatter: Enabling Cross-Device Efficient Gaussian Splatting in Web Browsers via WebGPU

We present WebSplatter, an end-to-end GPU rendering pipeline for the heterogeneous web ecosystem. Unlike naive ports, WebSplatter introduces a wait-free hierarchical radix sort that circumvents the lack of global atomics in WebGPU, ensuring deterministic execution across diverse hardware. Furthermore, we propose an opacity-aware geometry culling stage that dynamically prunes splats before rasterization, significantly reducing overdraw and peak memory footprint. Evaluation demonstrates that WebSplatter consistently achieves 1.2× to 4.5× speedups over state-of-the-art web viewers.

💡 Research Summary

WebSplatter introduces a complete GPU‑centric pipeline for real‑time 3D Gaussian Splatting (3DGS) that runs natively in modern web browsers via WebGPU. The authors identify two fundamental obstacles that have prevented high‑performance 3DGS on the web: (1) the lack of global atomic operations and deterministic work‑group scheduling in the WebGPU specification, which makes direct ports of CUDA/Vulkan radix‑sort algorithms unreliable, and (2) the excessive overdraw and memory traffic caused by naïvely rasterizing millions of splats without any view‑dependent culling. To overcome these challenges, WebSplatter contributes (i) a wait‑free hierarchical radix sort and (ii) an opacity‑aware screen‑space culling stage, both designed to operate entirely within the constraints of WebGPU.

The wait‑free sort replaces the classic inter‑work‑group spin‑wait pattern with a two‑level Blelloch scan. First, each work‑group builds a local histogram of the current 8‑bit digit of the 32‑bit depth key. Then a hierarchical prefix‑sum is performed: an intra‑tile exclusive scan produces per‑tile sums, followed by a second level scan over the tile‑level array to compute global offsets (base_d) and inter‑work‑group prefixes (pre‑sum_wg). Because the algorithm never polls a global flag and uses only intra‑group barriers, it runs deterministically on any device, from high‑end discrete GPUs to mobile SoCs, while preserving O(N) work and O(N/wg) auxiliary storage. Four passes of 8‑bit radix sorting fully sort the depth values, guaranteeing stable back‑to‑front ordering required for correct alpha blending.
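To make the data flow of this sort concrete, here is a minimal CPU reference sketch in Python. It is not the authors' WebGPU implementation: all function and parameter names (`radix_pass`, `two_level_exclusive_scan`, `wg_size`, `tile`) are hypothetical, serial loops stand in for parallel work‑groups and for the Blelloch scan, and the float‑to‑key mapping in `depth_to_key` is a common bit trick assumed here rather than taken from the paper.

```python
import struct
from typing import List, Sequence, Tuple

RADIX_BITS = 8
RADIX = 1 << RADIX_BITS  # 256 buckets per 8-bit digit


def depth_to_key(depth: float) -> int:
    """Map a float32 depth to a uint32 whose integer order matches float order.

    Assumption: a widely used bit-flip trick, not a detail from the paper.
    """
    bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (~bits & 0xFFFFFFFF) if bits & 0x80000000 else (bits | 0x80000000)


def exclusive_scan(values: Sequence[int]) -> List[int]:
    """out[i] = sum(values[:i])."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def two_level_exclusive_scan(values: Sequence[int], tile: int) -> List[int]:
    """Scan each tile on its own, scan the tile sums, then add tile offsets back."""
    tiles = [values[i:i + tile] for i in range(0, len(values), tile)]
    tile_offsets = exclusive_scan([sum(t) for t in tiles])  # second-level scan
    out: List[int] = []
    for t, offset in zip(tiles, tile_offsets):
        out.extend(offset + x for x in exclusive_scan(t))  # first-level scan
    return out


def radix_pass(keys: List[int], vals: list, shift: int,
               wg_size: int, tile: int) -> Tuple[List[int], list]:
    """One stable 8-bit pass with no cross-work-group waiting."""
    n = len(keys)
    num_wg = (n + wg_size - 1) // wg_size

    def slot(i: int) -> int:
        # Digit-major layout: all work-groups of digit 0, then digit 1, ...
        digit = (keys[i] >> shift) & (RADIX - 1)
        return digit * num_wg + i // wg_size

    # 1. Each simulated work-group counts the digits in its own slice.
    hist = [0] * (RADIX * num_wg)
    for i in range(n):
        hist[slot(i)] += 1

    # 2. One hierarchical scan over the digit-major histogram yields, per slot,
    #    the global digit offset plus the prefix over preceding work-groups.
    offsets = two_level_exclusive_scan(hist, tile)

    # 3. Scatter in input order, which keeps equal digits in their original order.
    out_keys, out_vals = [0] * n, list(vals)
    for i in range(n):
        s = slot(i)
        out_keys[offsets[s]] = keys[i]
        out_vals[offsets[s]] = vals[i]
        offsets[s] += 1
    return out_keys, out_vals


def radix_sort_by_key(keys: List[int], vals: list,
                      wg_size: int = 64, tile: int = 16) -> Tuple[List[int], list]:
    """Four 8-bit passes sort 32-bit keys in ascending order."""
    for shift in range(0, 32, RADIX_BITS):
        keys, vals = radix_pass(keys, vals, shift, wg_size, tile)
    return keys, vals
```

The sketch sorts in ascending key order. A renderer that needs back‑to‑front order could, for example, invert the key bits before sorting so that the largest depths come first.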

The opacity‑aware culling stage runs as a compute pass before sorting. Each Gaussian’s 3D center and covariance are projected into screen space, yielding an ellipse. An axis‑aligned bounding box (AABB) is derived from that ellipse, and the splat is discarded if its AABB lies entirely outside the viewport. For surviving splats, the major and minor axes of the projected ellipse are stored as two 16‑bit floats packed into a single 32‑bit integer, and the depth is stored as a 32‑bit float in a separate buffer. View‑dependent color is computed by evaluating the spherical‑harmonics coefficients along the current view direction, then packed together with opacity into a 32‑bit RGBA8 value. This compact encoding reduces memory bandwidth in the final rasterization pass, where the GPU draws each splat as a tight quad sized from the stored axes. Fragments with α < 1/255 are discarded, which further cuts overdraw.
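The culling and packing steps can be sketched on the CPU as follows. This is one plausible reading, not the paper's code: the function names are hypothetical, the axes are assumed to be stored as lengths in pixels, and the opacity‑dependent extent assumes the usual Gaussian falloff α = opacity · exp(−d²/2), so that α drops below 1/255 once d² exceeds 2·ln(255 · opacity).

```python
import math
import struct
from typing import Optional, Tuple

ALPHA_CUTOFF = 1.0 / 255.0
HALF_MAX = 65504.0  # largest finite IEEE half-precision value


def pack_half2(a: float, b: float) -> int:
    """Pack two floats as 16-bit halves into one 32-bit integer (a in the low bits)."""
    a, b = min(a, HALF_MAX), min(b, HALF_MAX)
    return struct.unpack("<I", struct.pack("<ee", a, b))[0]


def unpack_half2(word: int) -> Tuple[float, float]:
    return struct.unpack("<ee", struct.pack("<I", word))


def pack_rgba8(r: float, g: float, b: float, a: float) -> int:
    """Quantize four [0, 1] channels to 8 bits each, red in the low byte."""
    def to_u8(x: float) -> int:
        return int(max(0.0, min(1.0, x)) * 255.0 + 0.5)
    return to_u8(r) | (to_u8(g) << 8) | (to_u8(b) << 16) | (to_u8(a) << 24)


def cull_and_pack(center_px: Tuple[float, float],
                  cov2d: Tuple[float, float, float],
                  opacity: float,
                  rgb: Tuple[float, float, float],
                  viewport: Tuple[float, float]) -> Optional[Tuple[int, int]]:
    """Return (packed_axes, packed_rgba), or None if the splat is culled.

    cov2d holds (a, b, c) for the screen-space covariance [[a, b], [b, c]].
    """
    if opacity < ALPHA_CUTOFF:
        return None  # every fragment would fall below the alpha cutoff

    # Assumed opacity-dependent extent: distance (in sigmas) where alpha hits 1/255.
    k = math.sqrt(2.0 * math.log(opacity / ALPHA_CUTOFF))

    a, b, c = cov2d
    half_w, half_h = k * math.sqrt(a), k * math.sqrt(c)  # AABB half-extents
    x, y = center_px
    width, height = viewport
    if x + half_w < 0 or x - half_w > width or y + half_h < 0 or y - half_h > height:
        return None  # AABB entirely outside the viewport

    # Eigenvalues of the 2x2 covariance give the squared axis lengths.
    mid = 0.5 * (a + c)
    root = math.sqrt(max(0.0, mid * mid - (a * c - b * b)))
    major = k * math.sqrt(mid + root)
    minor = k * math.sqrt(max(0.0, mid - root))
    return pack_half2(major, minor), pack_rgba8(rgb[0], rgb[1], rgb[2], opacity)
```

With this layout each surviving splat needs two 32‑bit words for shape and color, plus the separate 32‑bit depth used as the sort key.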

WebSplatter’s overall architecture follows a hybrid compute‑render pipeline: (1) Pre‑processing (culling, projection, color evaluation), (2) Sorting (wait‑free radix sort), and (3) Rasterization (GPU‑native draw of compressed splats). All stages are expressed as separate WebGPU compute or render passes that are wired together with bind groups and recorded into command buffers, with no CPU‑GPU round‑trips after the initial data upload.
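The ordering of the three stages, and why the sort must come before rasterization, can be illustrated with a toy single‑pixel model in Python. Everything here is a simplified stand‑in: the `Splat` type, the `visible` flag that replaces real culling, and the built‑in `sorted` that replaces the GPU radix sort are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Color = Tuple[float, float, float]


@dataclass
class Splat:
    depth: float          # view-space depth, larger means farther away
    color: Color
    alpha: float
    visible: bool = True  # stand-in for the result of viewport culling


def preprocess(splats: Iterable[Splat]) -> List[Splat]:
    """Stage 1: drop splats that are off-screen or effectively transparent."""
    return [s for s in splats if s.visible and s.alpha >= 1.0 / 255.0]


def sort_back_to_front(splats: List[Splat]) -> List[Splat]:
    """Stage 2: farthest first, so nearer splats are blended on top."""
    return sorted(splats, key=lambda s: s.depth, reverse=True)


def rasterize_pixel(splats: List[Splat], background: Color = (0.0, 0.0, 0.0)) -> Color:
    """Stage 3: standard 'over' blending, out = src * alpha + dst * (1 - alpha)."""
    r, g, b = background
    for s in splats:
        r = s.color[0] * s.alpha + r * (1.0 - s.alpha)
        g = s.color[1] * s.alpha + g * (1.0 - s.alpha)
        b = s.color[2] * s.alpha + b * (1.0 - s.alpha)
    return (r, g, b)


def render_pixel(splats: Iterable[Splat]) -> Color:
    """Run the three stages in order without returning to the caller in between."""
    return rasterize_pixel(sort_back_to_front(preprocess(splats)))
```

Because blending is order dependent, feeding the same splats to `rasterize_pixel` without the sort would generally produce a different color, which is why a deterministic sort matters for correctness and not only for speed.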

The authors evaluate the system on a range of hardware: a desktop NVIDIA RTX 3080, an Apple M1, and a Qualcomm Adreno 660 mobile GPU. Compared to prior WebGPU ports of 3DGS, WebSplatter achieves speed‑ups from 1.18× on the low‑end mobile device up to 4.5× on the high‑end desktop, while reducing peak memory usage by roughly 30 %. The method also prevents out‑of‑memory crashes on memory‑constrained devices, enabling interactive rendering of scenes with over one million splats in the browser. Visual quality metrics confirm that the deterministic sort preserves exact back‑to‑front ordering, and the opacity‑driven culling does not noticeably affect rendered appearance.

In summary, WebSplatter demonstrates that the limitations of the web graphics stack can be surmounted with algorithmic redesign. By introducing a wait‑free hierarchical radix sort and an opacity‑aware culling scheme, the framework delivers deterministic, high‑performance 3D Gaussian Splatting across heterogeneous browsers and devices. This work paves the way for real‑time neural rendering, text‑to‑3D generation, and interactive content creation directly in web applications, removing the need for native installations while retaining near‑native performance.
