arXiv:1004.3114v1 [cs.DS] 19 Apr 2010
A FAST VECTORISED IMPLEMENTATION OF
WALLACE’S NORMAL RANDOM NUMBER GENERATOR
RICHARD P. BRENT
Abstract. Wallace has proposed a new class of pseudo-random gen-
erators for normal variates. These generators do not require a stream
of uniform pseudo-random numbers, except for initialisation. The inner
loops are essentially matrix-vector multiplications and are very suitable
for implementation on vector processors or vector/parallel processors
such as the Fujitsu VPP300. In this report we outline Wallace’s idea,
consider some variations on it, and describe a vectorised implementa-
tion RANN4 which is more than three times faster than its best competi-
tors (the Polar and Box-Muller methods) on the Fujitsu VP2200 and
VPP300.
1. Introduction
Several recent papers [3, 5, 18, 19] have considered the generation of
uniformly distributed pseudo-random numbers on vector and parallel com-
puters. In many applications, random numbers from specified non-uniform
distributions are required. A common requirement is for the normal dis-
tribution, which is what we consider here. In principle it is sufficient to
consider methods for generating normally distributed numbers with mean 0
and variance 1, since translation and scaling can easily be performed to give
numbers with mean µ and variance σ² (usually referred to as numbers with
the N(µ, σ²) distribution).
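The translation and scaling step mentioned above can be sketched as follows. This is a minimal illustration, not part of RANN4: the helper name `scale_normal`, the seed, and the use of Python's `random.gauss` as a stand-in source of N(0, 1) variates are all assumptions made for the example.

```python
import random

def scale_normal(z, mu, sigma):
    """Map a standard normal variate z ~ N(0, 1) to N(mu, sigma**2)."""
    return mu + sigma * z

# Stand-in source of N(0, 1) variates (an assumption for this sketch).
rng = random.Random(12345)  # arbitrary seed, for reproducibility only
standard = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Translate and scale to mean 10 and variance 4 (sigma = 2).
scaled = [scale_normal(z, mu=10.0, sigma=2.0) for z in standard]

mean = sum(scaled) / len(scaled)
var = sum((x - mean) ** 2 for x in scaled) / len(scaled)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

The sample mean and variance should come out close to 10 and 4, which is why the rest of the discussion can restrict attention to the N(0, 1) case.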
Date: 14 April 1997.
1991 Mathematics Subject Classification. Primary 65C10, Secondary 54C70, 60G15,
65Y10, 68U20.
Key words and phrases. Gaussian random numbers, maximum entropy, normal
distribution, normal random numbers, pseudo-random numbers, random number
generators, random numbers, simulation, vector processors, Wallace’s method.
Copyright © 1997, R. P. Brent
rpb170tr typeset using AMS-LaTeX.
The most efficient methods for generating normally distributed random
numbers on sequential machines [2, 4, 9, 10, 11, 12, 14, 20] involve the use of
different approximations on different intervals, and/or the use of “rejection”
methods, so they do not vectorise well. Simple, “old-fashioned” methods
may be preferable on vector processors. In [6] we described two such
methods, the Box-Muller [16] and Polar methods [12]. The Polar method was
implemented as RANN3 and was the fastest vectorised method for normally
distributed numbers known at the time [17, 19], although much slower than
the best uniform random number generators. For example, on the Fujitsu
VP2200/10 a normal random number using RANN3 requires an average of
21.9 cycles, but a good generalised Fibonacci uniform random number
generator requires only 2.21 cycles. (A cycle on the VP2200/10 is 3.2 nsec.
Since four floating-point operations can be performed per cycle, the
theoretical peak performance of the VP2200/10 is 1250 Mflop. The cycle time
of the VPP300 is 7 nsec but the pipelines are wider, so the theoretical peak
performance is 2285 Mflop.)
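For readers unfamiliar with the Polar method named above, a plain scalar sketch is given here. It is the textbook form of the method, not the vectorised RANN3 code; the function name `polar_pair` and the use of Python's `random.Random` as the uniform source are assumptions made for illustration. Each pair of normal variates costs at least two uniform numbers plus a logarithm and a square root.

```python
import math
import random

def polar_pair(rng):
    """Return two independent N(0, 1) variates using the Polar method."""
    while True:
        # Pick a point uniformly in the square [-1, 1) x [-1, 1).
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        # Keep the point only if it lies inside the unit disc (and is not the origin).
        if 0.0 < s < 1.0:
            break
    factor = math.sqrt(-2.0 * math.log(s) / s)
    return u * factor, v * factor

rng = random.Random(1997)  # arbitrary seed
samples = []
for _ in range(50_000):
    a, b = polar_pair(rng)
    samples.append(a)
    samples.append(b)

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

The dependence on a uniform stream and on `log` and `sqrt` is the cost that the generators discussed next are designed to avoid.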
Recently Wallace [21] proposed a new class of pseudo-random generators
for normal variates. These generators do not require a stream of uniform
pseudo-random numbers (except for initialisation) or the evaluation of ele-
mentary functions such as log, sqrt, sin or cos (needed by the Box-Muller
and Polar methods). The crucial observation is that, if x is an n-vector of
independent normally distributed random numbers with mean 0 and variance 1,
and A is an n×n orthogonal matrix, then y = Ax is another n-vector of
independent normally distributed numbers with the same mean and variance. Thus,
given a pool of nN normally distributed numbers, we can generate another
pool of nN normally distributed numbers by performing N matrix-vector
multiplications. The inner loops are very suitable for implementation on
vector processors such as the VP2200 or vector/parallel processors such as
the VPP300. The vector lengths are proportional to N, and the number of
arithmetic operations per normally distributed number is proportional to n.
Typically we choose n to be small, say 2 ≤ n ≤ 4, and N to be large.
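The orthogonal transformation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the RANN4 code: the matrix `A` (a scaled 4×4 Hadamard matrix), the function name `transform_pool`, the pool layout, and the use of `random.gauss` for initialisation are all choices made for the example.

```python
import random

# Assumption: a scaled 4x4 Hadamard matrix, chosen only because it is easy
# to verify as orthogonal. It is not claimed to be the matrix used in RANN4.
A = [
    [0.5,  0.5,  0.5,  0.5],
    [0.5, -0.5,  0.5, -0.5],
    [0.5,  0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5,  0.5],
]

def transform_pool(pool, a):
    """Apply the n x n matrix `a` to each of the N column vectors of the pool.

    The pool is viewed as n rows of length N stored one after another, so
    column j is (pool[j], pool[N + j], ..., pool[(n - 1) * N + j]).
    """
    n = len(a)
    big_n = len(pool) // n
    new_pool = [0.0] * len(pool)
    for i in range(n):
        for k in range(n):
            coeff = a[i][k]
            # This loop of length N is the one that would run as a vector operation.
            for j in range(big_n):
                new_pool[i * big_n + j] += coeff * pool[k * big_n + j]
    return new_pool

rng = random.Random(2024)  # arbitrary seed
N = 1000
pool = [rng.gauss(0.0, 1.0) for _ in range(4 * N)]  # initialisation only
new_pool = transform_pool(pool, A)

old_ss = sum(x * x for x in pool)
new_ss = sum(y * y for y in new_pool)
print(f"sum of squares before = {old_ss:.6f}, after = {new_ss:.6f}")
```

Each output value costs n multiply-adds and the innermost loop has length N, which matches the operation counts stated above. Because A is orthogonal, the sum of squares of the pool is unchanged by the transformation, up to rounding.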
Wallace implemented variants of his new method on a scalar RISC work-
station, and found that its speed was comparable to that of a fast uniform
generator. The same performance relative to a fast uniform generator is
achievable on a vector processor, although some care has to be taken with
the implementation (see §7).
In §2 we describe Wallace’s new methods in more detail. Some statistical
questions are considered in §§3–6. Aspects of implementation on a
vector processor are discussed in §7, and details of an implementation on
the VP2200 and VPP300 are given in §8. Some conclusions are drawn in §9.
2. Wallace’s Normal Generators
The idea of Wallace’s new generators is to keep a pool of nN normally
distributed pseudo-random variates. As numbers in the pool are used, new
normally distributed variates are generated by forming appropriate combi-
nations of the numbers which have been used. On a vector processor N can
be large and the whole pool can be regenerated with only a small number
of vector operations.
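The pool idea in the paragraph above can be sketched as a small generator object. This is a toy model under loud assumptions: the class name `PoolNormalGenerator`, the fixed 2×2 rotation with entries 0.6 and 0.8, the shifting rule used to pair pool entries, and the use of `random.gauss` for initialisation are all hypothetical choices for illustration. It is not Wallace's generator or RANN4, and it is not suitable for serious use.

```python
import random

class PoolNormalGenerator:
    """Toy pool-based normal generator (illustrative only)."""

    def __init__(self, n_pairs, seed=0):
        rng = random.Random(seed)
        # A conventional generator is used for initialisation only.
        self.pool = [rng.gauss(0.0, 1.0) for _ in range(2 * n_pairs)]
        self.n_pairs = n_pairs
        self.pos = 0      # index of the next unused pool entry
        self.passes = 0   # number of pool regenerations so far

    def _regenerate(self):
        """Replace the whole pool by rotating pairs of used values."""
        self.passes += 1
        c, s = 0.6, 0.8   # c*c + s*s == 1, so [[c, s], [-s, c]] is orthogonal
        half = self.n_pairs
        # Hypothetical rule: shift the pairing on each pass so that the same
        # two entries are not always combined.
        offset = (7 * self.passes) % half
        first = self.pool[:half]
        second = self.pool[half:]
        for j in range(half):
            k = (j + offset) % half
            x, y = first[j], second[k]
            self.pool[j] = c * x + s * y
            self.pool[half + k] = -s * x + c * y
        self.pos = 0

    def next(self):
        """Return the next variate, regenerating the pool when it is used up."""
        if self.pos == len(self.pool):
            self._regenerate()
        value = self.pool[self.pos]
        self.pos += 1
        return value

gen = PoolNormalGenerator(n_pairs=512, seed=1)
before = sum(x * x for x in gen.pool)
draws = [gen.next() for _ in range(3 * 1024)]
after = sum(x * x for x in gen.pool)
print(f"passes = {gen.passes}, sum of squares before = {before:.6f}, after = {after:.6f}")
```

After initialisation no uniform numbers and no elementary functions are needed; each regeneration pass is a loop of length N over independent pairs, which is the kind of loop that maps onto a few vector operations.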
As just outlined,
…(Full text truncated)…