High-Performance Pseudo-Random Number Generation on Graphics Processing Units
This work considers the deployment of pseudo-random number generators (PRNGs) on graphics processing units (GPUs), developing an approach based on the xorgens generator to rapidly produce pseudo-random numbers of high statistical quality. The chosen algorithm has configurable state size and period, making it ideal for tuning to the GPU architecture. We present a comparison of both speed and statistical quality with other common parallel, GPU-based PRNGs, demonstrating favourable performance of the xorgens-based approach.
💡 Research Summary
The paper addresses a critical bottleneck in GPU‑accelerated Monte Carlo simulations: the generation of high‑quality pseudo‑random numbers. While GPUs excel at parallel arithmetic, many existing GPU‑oriented PRNGs either consume excessive memory (e.g., MTGP, a GPU‑adapted Mersenne Twister) or provide insufficient statistical robustness (e.g., CURAND's default XORWOW). The authors propose a GPU‑friendly implementation of the xorgens family, originally introduced by Brent, which combines an xorshift core with a Weyl sequence to break the linearity inherent in GF(2) shift‑register generators.
Key contributions include:
- **Configurable State and Period** – xorgens lets the user select any power‑of‑two state size up to 4096 bits, yielding periods of the form 2ⁿ−1. This flexibility lets developers match the generator's memory footprint to the limited per‑thread or per‑block shared memory available on modern GPUs.
- **Parallelism Analysis** – By examining the recurrence x_i = x_{i−r}(I+L^a)(I+R^b) + x_{i−s}(I+L^c)(I+R^d), where addition is over GF(2) (i.e., XOR) and L and R denote left‑ and right‑shift operators, the authors show that the maximum number of concurrently computable terms is min(s, r−s). Choosing s ≈ r/2 (subject to gcd(r, s) = 1) maximizes the inherent parallelism, allowing up to ~64 independent updates per block for the configuration (r=128, s=65).
- **CUDA Implementation Strategy** – Each CUDA block receives its own copy of the generator's state, stored in fast shared memory. Threads within a block use compile‑time constants for the parameters {r, s, a, b, c, d}, enabling aggressive compiler optimization and minimal register pressure. The block‑level state is advanced as a circular buffer, guaranteeing that each thread works on a distinct point in the period and thus produces statistically independent sub‑streams without costly synchronization.
- **Empirical Evaluation** – Benchmarks were performed on an NVIDIA GeForce GTX 480 and a dual‑GPU GTX 295 using CUDA 3.2. The xorgens‑GP implementation achieved 1.5–2× higher throughput (random numbers per second) than CURAND's XORWOW and 1.2–1.4× higher than MTGP, while consuming less memory than MTGP and only modestly more than CURAND.
- **Statistical Validation** – The generator was subjected to the full TestU01 suite (SmallCrush, Crush, BigCrush). It passed all tests, including those that expose linear dependencies (e.g., LinearComp, MatrixRank). The included Weyl sequence (w_k = w_{k−1} + ω mod 2^w) supplies an operation that is non‑linear over GF(2), eliminating the systematic failures observed in pure xorshift generators.
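Putting the pieces above together, the following is a minimal, single‑threaded C sketch of a xorgens‑style step for a 32‑bit generator with r = 128, s = 65. The shift constants (a, b, c, d), the Weyl increment, and the seeding routine are illustrative placeholders rather than Brent's published parameters; the point is only to show the circular‑buffer state update and the Weyl term being added on output, outside the GF(2) shift‑register state.

```c
#include <stdint.h>

/* Sketch of a xorgens-style generator: r = 128 words of state,
 * lag s = 65, 32-bit words. Shift constants and the Weyl increment
 * below are placeholders chosen for illustration only. */
#define R 128
#define S 65

static uint32_t x[R];   /* circular buffer: slot (i mod R) holds x_i */
static uint32_t weyl;   /* Weyl sequence state w_k                   */
static int idx;         /* slot of the most recent word x_{i-1}      */

/* Simplistic seeding for the sketch: fill the state with a
 * SplitMix-style scramble of the seed (placeholder only). */
static void xorgens_seed(uint32_t seed) {
    uint32_t z = seed;
    for (int k = 0; k < R; k++) {
        z += 0x9E3779B9u;
        uint32_t h = z;
        h ^= h >> 16; h *= 0x85EBCA6Bu;
        h ^= h >> 13; h *= 0xC2B2AE35u;
        h ^= h >> 16;
        x[k] = h;
    }
    weyl = seed * 0x61C88647u;
    idx = 0;
}

/* One step of the recurrence
 *   x_i = x_{i-r}(I+L^a)(I+R^b) + x_{i-s}(I+L^c)(I+R^d),
 * where + over GF(2) is XOR, L^a is a left shift by a bits and
 * R^b a right shift by b bits. Since r = R, the slot holding
 * x_{i-r} is exactly the slot that x_i will overwrite. */
static uint32_t xorgens_next(void) {
    idx = (idx + 1) % R;
    uint32_t t = x[idx];                /* x_{i-r} */
    uint32_t v = x[(idx + R - S) % R];  /* x_{i-s} */
    t ^= t << 17;  t ^= t >> 12;        /* (I+L^a)(I+R^b), a=17, b=12 */
    v ^= v << 13;  v ^= v >> 15;        /* (I+L^c)(I+R^d), c=13, d=15 */
    x[idx] = t ^ v;                     /* new state word x_i */
    weyl += 0x61C88647u;                /* w_k = w_{k-1} + omega mod 2^32 */
    return x[idx] + weyl;               /* Weyl combined on output only */
}
```

Because each new word depends only on lags r and s, up to min(s, r−s) of these updates are independent and can run concurrently, which is what the block‑parallel CUDA version exploits; a real implementation should take its parameters from the xorgens distribution rather than the placeholders above.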
The authors conclude that xorgens‑GP offers a balanced solution for GPU‑based Monte Carlo workloads: it delivers high statistical quality, tunable memory usage, and excellent parallel scalability. The design is portable across GPU generations, and the paper suggests future work on multi‑GPU seed management, dynamic parameter selection, and SIMD‑level optimizations to further boost performance. Overall, the study provides a clear blueprint for integrating robust, high‑throughput random number generation into GPU‑centric scientific computing pipelines.