Fast Splittable Pseudorandom Number Generators (gee.cs.oswego.edu/dl/papers/oopsla14.pdf)


Fast Splittable Pseudorandom Number Generators

Guy L. Steele Jr., Oracle Labs, [email protected]
Doug Lea, SUNY Oswego, [email protected]
Christine H. Flood, Red Hat Inc., [email protected]

Abstract

We describe a new algorithm SPLITMIX for an object-oriented and splittable pseudorandom number generator (PRNG) that is quite fast: 9 64-bit arithmetic/logical operations per 64 bits generated. A conventional linear PRNG object provides a generate method that returns one pseudorandom value and updates the state of the PRNG, but a splittable PRNG object also has a second operation, split, that replaces the original PRNG object with two (seemingly) independent PRNG objects, by creating and returning a new such object and updating the state of the original object. Splittable PRNG objects make it easy to organize the use of pseudorandom numbers in multithreaded programs structured using fork-join parallelism. No locking or synchronization is required (other than the usual memory fence immediately after object creation). Because the generate method has no loops or conditionals, it is suitable for SIMD or GPU implementation.

We derive SPLITMIX from the DOTMIX algorithm of Leiserson, Schardl, and Sukha by making a series of program transformations and engineering improvements. The end result is an object-oriented version of the purely functional API used in the Haskell library for over a decade, but SPLITMIX is faster and produces pseudorandom sequences of higher quality; it is also far superior in quality and speed to java.util.Random, and has been included in Java JDK8 as the class java.util.SplittableRandom.

We have tested the pseudorandom sequences produced by SPLITMIX using two standard statistical test suites (DieHarder and TestU01) and they appear to be adequate for “everyday” use, such as in Monte Carlo algorithms and randomized data structures where speed is important.

Categories and Subject Descriptors  G.3 [Mathematics of Computing]: Random number generation; D.1.3 [Software]: Programming techniques—Concurrent programming

General Terms  Algorithms, Performance

Keywords  collections, determinism, Java, multithreading, nondeterminism, object-oriented, parallel computing, pedigree, pseudorandom, random number generator, recursive splitting, Scala, spliterator, splittable data structures, streams

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
OOPSLA ’14, October 20–24, 2014, Portland, OR, USA.
Copyright © 2014 ACM 978-1-4503-2585-1/14/10 … $15.00.
http://dx.doi.org/10.1145/2660193.2660195

1. Introduction and Background

Many programming language environments provide a pseudorandom number generator (PRNG), a deterministic algorithm to generate a sequence of values that is likely to pass certain statistical tests for “randomness.” Such an algorithm is typically described as a finite-state machine defined by a transition function τ, an output function µ, and an initial state (or seed) s_0; the generated sequence is z_j where s_j = τ(s_{j−1}) and z_j = µ(s_j) for all j ≥ 1. (An alternate formulation is s_j = τ(s_{j−1}) and z_j = µ(s_{j−1}) for all j ≥ 1, which may be more efficient when implemented on a superscalar processor because it allows τ(s_{j−1}) and µ(s_{j−1}) to be computed in parallel.) Because the state is finite, the sequence of generated states, and therefore also the sequence of generated values, will necessarily repeat, falling into a cycle (possibly after some initial subsequence that is not repeated, but in practice PRNG algorithms are designed so as not to “waste state,” so that in fact any given initial state s_0 will recur). The length L of the cycle is called the period of the PRNG, and s_i = s_j iff i ≡ j (mod L).
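Both formulations can be sketched in Java (our illustration; the particular τ and µ below are placeholders, not any generator discussed in this paper):

```java
import java.util.function.LongUnaryOperator;

// A PRNG as a finite-state machine: a transition function tau and an
// output function mu. The constants here are illustrative only.
public class StateMachinePrng {
    static final LongUnaryOperator TAU =
        s -> s * 6364136223846793005L + 1442695040888963407L; // placeholder LCG step
    static final LongUnaryOperator MU =
        s -> s ^ (s >>> 32);                                  // placeholder output function

    private long state;
    public StateMachinePrng(long seed) { state = seed; }

    // First formulation: s_j = tau(s_{j-1}), then z_j = mu(s_j).
    public long nextA() {
        state = TAU.applyAsLong(state);
        return MU.applyAsLong(state);
    }

    // Second formulation: z_j = mu(s_{j-1}); tau and mu both read the same
    // old state, so a superscalar processor can compute them in parallel.
    public long nextB() {
        long z = MU.applyAsLong(state);
        state = TAU.applyAsLong(state);
        return z;
    }
}
```

The two sequences are identical except for an offset of one: the second formulation emits µ(s_0) before the first transition.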

Characteristics of PRNG algorithms and their implementations that are of practical interest include period, speed, quality (ability to pass statistical tests), size of code, size of data (some algorithms require large tables or arrays), ease of jumping forward (skipping over intermediate states), partitionability or splittability (for use by multiple threads operating in parallel), reproducibility (the ability to run a program twice from a specified starting state and get exactly the same results, even if parallel computation is involved), and unpredictability (how difficult it is to predict the next output given preceding outputs). These characteristics trade off, and so different applications typically require different algorithms.

A linear congruential generator (LCG) [13, §3.2.1] has as its state a nonnegative integer less than some modulus M (which is typically either a prime number or a power of 2); the transition function is τ(s) = (a · s + c) mod M and the output function is just the identity function µ(s) = s. The Unix library function rand48 uses this algorithm with a = 25214903917 = 0x5DEECE66D, c = 11 = 0xB, and M = 2^48. Ever since the original Java™ Language Specification [10] in 1995, the class java.util.Random has been specified to use this same algorithm (see Figure 1). This code is simple, concise, and adequate for its originally intended purpose: to supply pseudorandomly generated values to be used relatively infrequently by a smallish number of concurrent threads running within a set-top box or web browser.

public class Random {
  protected long seed;

  public Random() {
    this(System.currentTimeMillis()); }

  public Random(long seed) { setSeed(seed); }

  synchronized public void setSeed(long seed) {
    this.seed = (seed ^ 0x5DEECE66DL)
                & ((1L << 48) - 1); }

  synchronized protected int next(int bits) {
    seed = (seed * 0x5DEECE66DL + 0xBL)
           & ((1L << 48) - 1);
    return (int)(seed >>> (48 - bits)); }

  public int nextInt() { return next(32); }

  public long nextLong() {
    return ((long)next(32) << 32) + next(32); }

  public double nextDouble() {
    return (((long)next(26) << 27) + next(27))
           / (double)(1L << 53); }
}

Figure 1. The essence of java.util.Random

It has a number of drawbacks, however, that make it inappropriate for use in “serious” applications: (1) Its short period (consider that 2^48 nanoseconds is less than 80 hours). (2) While the rand48 algorithm was considered to be of relatively high quality when introduced in 1982 [28], more recent tests [17] have uncovered significant flaws in its statistical behavior. (3) While it is thread-safe, thanks to its use of synchronization locks, it is not thread-efficient: if many threads share a single instance of Random, then contention for the lock may become a performance bottleneck. (4) On the other hand, if many instances of class Random are created, one for each thread, there is no guarantee that the values collectively generated by those many instances will be as statistically pseudorandom as if they had been generated by a single instance of class Random. (If 256 threads each have a Random object with its initial seed chosen at random, and then each thread generates 2^32 values, it is more likely than not, thanks to the Birthday Paradox, that two of the generated sequences will have some overlap.)
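The Birthday-Paradox claim in point (4) can be checked with a back-of-envelope estimate (ours, not the paper's): treat each thread's output as a randomly placed interval of 2^32 consecutive states on the cycle of 2^48 states, and apply a Poisson approximation to the number of overlapping pairs.

```java
// Back-of-envelope estimate: expected number of overlapping pairs among
// `threads` randomly placed intervals of length `span` on a cycle of
// length `period`, and the resulting Poisson estimate of P(any overlap).
public class BirthdayOverlap {
    public static double overlapProbability(int threads, double span, double period) {
        double pairs = threads * (threads - 1) / 2.0;            // number of thread pairs
        double pairOverlap = Math.min(1.0, 2.0 * span / period); // P(two given intervals overlap)
        double expected = pairs * pairOverlap;                   // expected overlapping pairs
        return 1.0 - Math.exp(-expected);                        // Poisson approximation
    }
    public static void main(String[] args) {
        double p = overlapProbability(256, Math.pow(2, 32), Math.pow(2, 48));
        System.out.println(p);  // about 0.63: more likely than not
    }
}
```

With 256 threads the expected number of overlapping pairs is 32640 · 2^−15 ≈ 0.996, giving an overlap probability of about 0.63, consistent with "more likely than not."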

Another kind of finite-state machine used for this purpose is the linear feedback shift register (LFSR), in which the state is a vector of bits contained in a shift register; the transition function shifts the register by one position, shifting in a new bit value that is computed as a fixed linear function of the current vector (typically the exclusive OR of specific bits of the vector), and the output function provides the bit that was shifted out. This arrangement provides a stream of bits, which may then be chunked into groups of, say, 32 or 64 consecutive bits to produce a sequence of integers. In software, it may be more convenient for code to execute multiple steps of the LFSR algorithm in one shot to produce a multibit output all at once. Marsaglia’s XORWOW algorithm [22] is a hybrid that uses both an Xorshift method (which is provably equivalent to an LFSR [4]) and an LCG with a = 1:

unsigned long xorwow() {
  static unsigned long
    x=123456789, y=362436069, z=521288629,
    w=88675123, v=5783321, d=6615241;
  unsigned long t=(x^(x>>2));
  x=y; y=z; z=w; w=v; v=(v^(v<<4))^(t^(t<<1));
  return (d+=362437)+v; }

The period of XORWOW is pretty good (2^32(2^160 − 1) = 2^192 − 2^32), and generation of one 32-bit value requires just 9 arithmetic operations: 3 shifts, 4 XORs, and 2 adds. (There are also a number of assignment operations, which could be optimized away in the context of a suitably unrolled loop.) Unfortunately, while XORWOW is also fairly fast, very recent testing [29] has also revealed statistical flaws.
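For concreteness, a standalone Xorshift step of the kind XORWOW builds on can be sketched in Java. This is our illustration using Marsaglia's common 64-bit shift triple (13, 7, 17), not the XORWOW constants above:

```java
// Marsaglia-style 64-bit xorshift generator: three shifts and three XORs
// per output. The state must be nonzero (zero is a fixed point).
public class Xorshift64 {
    private long x;

    public Xorshift64(long seed) {
        x = (seed == 0) ? 1 : seed;  // avoid the all-zero state
    }

    public long next() {
        x ^= x << 13;
        x ^= x >>> 7;
        x ^= x << 17;
        return x;
    }
}
```

Because each step is an invertible linear transformation of the bit vector, such a generator is equivalent to an LFSR, as the text notes.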

There are other PRNG algorithms with much longer period that produce sequences of much higher quality, but they are slower. These include the Mersenne Twister [23], which is related to LFSR algorithms in relying on bit-shifting and bitwise operations, and MRG32k3a [9], which is related to LCG algorithms in relying on computation of linear formulae modulo prime numbers. There are also many PRNG algorithms, such as those in java.security.SecureRandom, that produce pseudorandom sequences of superb quality, suitable for use in cryptographic and security applications, but their algorithms are substantially slower.

The algorithms we have described so far are sequential (single-threaded), but there is a need for algorithms that are effective in a parallel (SIMD or multithreaded) computing environment. One approach is to partition the cycle of an otherwise sequential PRNG algorithm. This is simple if there is a cheap way to “jump forward” n steps from any given state s by using some computational shortcut to compute τ^n(s). (For example, if τ(s) = (a · s) mod M then τ^n(s) = (a^n mod M) · s mod M, and it may be possible to compute (or precompute) a^n mod M cheaply.) This approach works well when the number of threads N to be used is known at the start of a computation: if a PRNG has period L, then a global initial state s_0 is chosen and then the PRNG for thread i is given initial state τ^(i·⌊L/N⌋)(s_0). The MRG32k3a algorithm does have a good shortcut for jumping ahead (it requires precomputation of two 3×3 matrices), and there is a software package RngStream that uses such jumping ahead for partitioning its cycle into streams and substreams [19].
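The jump-ahead shortcut for the purely multiplicative case described above can be sketched with fast modular exponentiation. This is our illustration; we borrow the rand48 multiplier and modulus mentioned earlier, but drop its additive constant c so that τ is purely multiplicative:

```java
import java.math.BigInteger;

// Jump-ahead for a purely multiplicative LCG tau(s) = a*s mod M:
// tau^n(s) = (a^n mod M) * s mod M, with a^n mod M computed in
// O(log n) multiplications by modular exponentiation.
public class JumpAhead {
    static final BigInteger A = BigInteger.valueOf(25214903917L); // rand48 multiplier
    static final BigInteger M = BigInteger.ONE.shiftLeft(48);     // modulus 2^48

    // One ordinary step of the generator.
    public static BigInteger step(BigInteger s) {
        return A.multiply(s).mod(M);
    }

    // Advance the state by n steps without visiting the intermediate states.
    public static BigInteger jump(BigInteger s, long n) {
        return A.modPow(BigInteger.valueOf(n), M).multiply(s).mod(M);
    }
}
```

Precomputing a^n mod M once lets every thread obtain its starting state in constant time.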

On the other hand, some applications are organized in such a way that new threads may be created dynamically, where any thread may choose to spawn one or more new threads at any time. To avoid synchronization overhead, it is desirable for an existing thread to be able to initialize the PRNG state for a new thread without having to communicate with any other thread and without having to update any shared global data structure; ideally it should need to update only the state of its own PRNG instance. In effect, any thread can turn one PRNG instance into two; a PRNG algorithm that can accommodate this requirement is said to be splittable.


One can imagine constructing a splittable PRNG from a partitionable PRNG as follows: when a thread is created, it is given a tuple (s, l) to remember—this defines a portion of the PRNG cycle of length l starting from state s—and then uses s as the initial state for its own PRNG. (For the initial thread, l = L, the period of the PRNG.) When an existing thread t having tuple (s, l) spawns a new thread u, then it cuts its allotted portion in half and gives the other half to u: thread t computes h = ⌊l/2⌋, replaces its own tuple (s, l) with (s, h), and then gives new thread u the tuple (τ^h(s), l − h). This idea could be applied to the RngStream package: while its API allows only sequential generation of its multiple streams, it could easily be extended to allow a set of multiple streams to be split in half instead (this would require the precomputation of a table of 192 pairs of 3 × 3 matrices, one for each power of 2 smaller than its period).
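The halving scheme just described can be sketched as follows (our illustration; jump stands in for the τ^n shortcut and is a trivial placeholder here, so only the bookkeeping is meaningful):

```java
// A thread's share of the PRNG cycle: a tuple (s, l) meaning "the segment
// of length l starting at state s." split() keeps the first half and
// hands the second half to the child.
public class CyclePortion {
    long state;   // s: start of this thread's segment
    long length;  // l: number of states in the segment

    public CyclePortion(long s, long l) { state = s; length = l; }

    public CyclePortion split() {
        long h = length / 2;  // h = floor(l / 2)
        CyclePortion child = new CyclePortion(jump(state, h), length - h);
        length = h;           // replace our tuple (s, l) with (s, h)
        return child;         // child receives (tau^h(s), l - h)
    }

    // Placeholder for the jump-ahead shortcut tau^n(s); a real PRNG would
    // supply its own. Here, a trivial additive walk for testing only.
    static long jump(long s, long n) { return s + n; }
}
```

As the next paragraph observes, this resource is finite: each split halves l, so the scheme supports only about log_2 L splits along any chain.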

This works well if the tree of tasks induced by task spawning is balanced (more precisely, if it does not get too deep). But some applications may, depending on the data being processed, produce task trees that have long left spines or are otherwise very deep. A strategy that repeatedly splits a discrete resource of size L in half will run out of steam after log_2 L splits; for MRG32k3a, that’s only 192 splits.

Another idea is to take a small risk: instead of guaranteeing that different threads use different parts of the cycle, one can settle for making it highly likely that different threads use different parts of the cycle, by choosing the initial state for the PRNG of a new thread “randomly” [6]; the pitfall is that if cycle portions used by two threads do accidentally overlap, then their sequences become highly correlated.

The Haskell standard library System.Random provides an API and implementation for a splittable, pure PRNG:

data StdGen = StdGen Int32 Int32

stdNext  :: StdGen -> (Int, StdGen)
stdSplit :: StdGen -> (StdGen, StdGen)

The generator state StdGen is simply a pair of integers; the function stdNext takes a state and returns a pair of a generated value and a new state—in other words, given state s it returns (µ(s), τ(s))—while the function stdSplit takes a state and returns a pair of states that may thenceforward be used independently. The API is beautiful; the implementation turns out to have a severe flaw (see Section 7).
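For comparison, the same pure API shape can be rendered in Java (our sketch; the state-derivation formulas below are placeholders for illustrating the interface only, and are emphatically not a sound split — designing one is exactly the hard part this paper addresses):

```java
// A pure-functional rendering of the Haskell API above: stepping or
// splitting returns fresh generator objects rather than mutating state.
// The transition reuses the rand48-style step from earlier in the text;
// value() and split() use placeholder formulas.
public final class PureGen {
    final long s;  // generator state

    public PureGen(long s) { this.s = s; }

    // mu(s): the generated value for this state (placeholder output function).
    public long value() { return s ^ (s >>> 31); }

    // tau(s): the successor generator (rand48-style step, as in Figure 1).
    public PureGen next() { return new PureGen(s * 0x5DEECE66DL + 0xBL); }

    // stdSplit analogue: two states intended to be used independently
    // thereafter. These derivations are arbitrary placeholders.
    public PureGen[] split() {
        return new PureGen[] { new PureGen(s + 0x9E3779B97F4A7C15L),
                               new PureGen(~s) };
    }
}
```

The shape matters more than the formulas: next() returns (µ(s), τ(s)) in effect, and split() returns two generators without mutating the original.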

We are interested in splittable random number generators that are easy to use in multithreaded settings where the shape of the task tree is unpredictable, are quite fast, and have aggregate statistical properties of sufficiently good quality for “everyday use,” by which we mean not only randomized data structures but also Monte Carlo algorithms for machine learning and scientific simulation. We want to integrate these algorithms into the streams framework of Java JDK8, which supports the processing of collections of data in a functional map/reduce style. For example, we would like the expression

prng.doubles(n).parallel().map(x -> x*x).sum()

to compute the sum of the squares of n pseudorandomly chosen double values in the range [0, 1), and to do so efficiently, using parallel computational resources if possible. We care about reproducibility but not unpredictability.

We present two parallel PRNG algorithms. The first is a highly optimized version of the pedigree-based DOTMIX algorithm of Leiserson, Schardl, and Sukha [20] that we believe is interesting and useful in its own right and may serve as an alternate jumping-off point for further research. We present an early version of this algorithm in Scala, and further refined versions of it in Java. The second algorithm, SPLITMIX, was inspired by the first, but makes no use of pedigrees, dot products, or tables of coefficients.

A specific 64-bit instantiation of this novel SPLITMIX design (a principal contribution of this paper) is presented in Section 3, but it is far from obvious that such a simple approach to generating values and splitting objects is a good one. Therefore we first explain in Section 2 how we derived this design by a stepwise refinement of the DOTMIX algorithm that included one or two leaps of faith requiring after-the-fact analysis and testing; the details of this process are the other principal contribution of this paper. This includes an improved use of avalanche statistics to select good mixing functions, and an understanding of why the avalanche statistics cannot (or should not) be improved past a particular point; we discuss this in Section 4. We present quality measurements of SPLITMIX in Section 5 and speed measurements in Section 6, discuss related work in Section 7, and offer conclusions and future work in Section 8.

2. Derivation of the SPLITMIX Algorithm

We were inspired by the 2012 PPoPP paper of Leiserson, Schardl, and Sukha [20] (hereafter “LSS”). We summarize their idea here and discuss other related work in Section 7.

Assume a fork-join framework for parallel tasks, in which each task performs a sequence of actions. There are three kinds of actions: generate a pseudorandom value, spawn (that is, fork) a child task, and sync with (that is, join) a previously spawned child task. Except for the causality constraints implied by forking and joining, tasks execute freely in parallel. Initially there is a single root task.

Each task has some local state, namely an integer counter and a pointer to its parent task (the task that spawned it); the root task has a null parent pointer. The algorithm described by LSS takes certain steps to maintain this state at each generate, spawn, and sync action; this three-way division of labor was chosen so as to minimize overhead for tasks that do not generate random numbers. We present a simplified reformulation of their idea here that requires state maintenance only during generate and spawn actions; therefore in the remainder of this paper we need not discuss the detailed behavior of sync actions any further, and we speak generally of only two kinds of action, generate and spawn.

Figure 2. Tree of tasks and generated numbers

If we identify each task other than the root with the action that spawned it, we have a tree whose internal nodes are tasks; all generate actions are leaves of this tree. (If a spawned task happens to perform no actions, it will also be a leaf.) Figure 2 shows such a tree. Task A is the root task; it performs four actions generate, spawn, spawn, generate (in that order). Task B is the first task spawned by task A, and it performs three generate actions. Task C is the second task spawned by task A, and it performs a spawn and a generate. Task D is the task spawned by task C, and it performs three generate actions. In the figure, the edges connecting each task to its actions are numbered sequentially starting from 1.

LSS observe that every action in such a tree has a unique pedigree, the sequence of edge labels on the path within the tree from the root task down to the action. As examples, the second value “74” generated by task B has pedigree ⟨2, 2⟩, the first value “4D” generated by task D has pedigree ⟨3, 1, 1⟩, and the last value “81” generated by task A has pedigree ⟨4⟩. Their idea is that every generate action can produce a value by hashing its pedigree. They describe more than one way of doing this, but we focus on their “DOTMIX” technique, which hashes a pedigree in two steps: (1) compute the dot product of the pedigree (regarded as a vector) with a vector of random coefficients, then (2) apply a bijective “mixing function” µ to the result. We also adopt their suggestion of adding a seed value σ to the dot product before applying the mixing function.

The calculation of the dot product and the addition of σ to the result must be done modulo a prime p. For a computer with a word width of w bits, LSS choose p to be the largest prime less than m = 2^w. Assume that every element j_k of a pedigree satisfies 1 ≤ j_k < p and every random coefficient γ_k satisfies 1 ≤ γ_k < p. The sum of σ and the dot product (modulo p) lies in the range [0, p) and so can be treated as a w-bit binary value (because p < 2^w), from which the mixing function µ can produce a w-bit result. A generate action having pedigree ⟨j_1, j_2, …, j_D⟩ therefore computes µ((σ + Σ_{k=1}^{D} γ_k j_k) mod p). LSS present a proof that if the γ_k are chosen uniformly at random from the range [0, p) then the probability that two distinct pedigrees will produce the same dot-product value is 1/p. A specific version presented by LSS uses a fixed table of 64-bit coefficients γ_k (chosen by some random or pseudorandom process) and uses p = 2^64 − 59 (we shall call this prime number “Fred”).
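The hashing step just described can be sketched in Java. In this sketch (ours) the coefficients are caller-supplied placeholders rather than the LSS table, and the mixing function is the MurmurHash3-style finalizer that appears later in Figure 5, not the LSS RC6-based mixer:

```java
import java.math.BigInteger;

// DOTMIX-style pedigree hashing: start from the seed sigma, accumulate
// the dot product with the coefficients gamma modulo p = 2^64 - 59
// ("Fred"), then apply a bijective 64-bit mixing function.
public class DotMixSketch {
    static final BigInteger P =
        BigInteger.ONE.shiftLeft(64).subtract(BigInteger.valueOf(59));

    public static long hashPedigree(long[] pedigree, long[] gamma, long sigma) {
        BigInteger acc = unsigned(sigma).mod(P);  // seed initializes the sum
        for (int k = 0; k < pedigree.length; k++) {
            BigInteger term = unsigned(gamma[k]).multiply(unsigned(pedigree[k]));
            acc = acc.add(term).mod(P);           // dot product modulo p
        }
        return mix(acc.longValue());              // mix the w-bit result
    }

    // Treat a Java long as an unsigned 64-bit value.
    static BigInteger unsigned(long x) {
        return new BigInteger(Long.toUnsignedString(x));
    }

    // Placeholder bijective mixer (MurmurHash3-style finalizer, Figure 5).
    static long mix(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }
}
```

BigInteger keeps the mod-p arithmetic transparent here; a production version would use unsigned 64-bit tricks instead (compare Rand.update in Figure 5).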

LSS then note that while the dot-product formula effectively hashes pedigrees into the range [0, p) with a small probability of collision, nevertheless two similar pedigrees may produce “similar” hash values, whereas we would prefer them to be statistically “dissimilar”; the mixing function is intended to address this point. LSS use a mixing function based on the RC6 block cipher [27], namely µ(z) = f(f(f(f(z)))) where f(z) = φ((2z² + z) mod m) and φ(x) is the function that swaps the two halves of a w-bit word x (for even w); if w = 64, then a standard “rotate left 32” instruction does the job. Thus f is easily calculated on most computers using four operations (one of them an integer multiply), and so calculating µ requires 16 instructions.
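For w = 64 this mixing function is easily rendered in Java: arithmetic on long is automatically modulo m = 2^64, and φ (swapping the two 32-bit halves) is a rotate by 32. A sketch:

```java
// The RC6-based mixer described above, for w = 64:
// f(z) = phi((2z^2 + z) mod 2^64), with phi = rotate-left-by-32,
// and mu applies f four times (16 operations in all).
public class Rc6Mix {
    static long f(long z) {
        return Long.rotateLeft(2 * z * z + z, 32);  // overflow gives mod 2^64 for free
    }
    public static long mu(long z) {
        return f(f(f(f(z))));
    }
}
```

Note that f maps 0 to 0, so µ has a fixed point at 0; this is one reason mixing functions need careful evaluation, a topic taken up in Section 4.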

LSS describe an implementation of the DOTMIX algorithm as part of MIT Cilk 5.4.6, a programming language and runtime that support fork-join parallelism, and present and measure a sample Cilk program that uses pseudorandom sequences generated by this algorithm. They also describe a technique for using “scoped” pedigrees so as to allow a subcomputation that uses pseudorandom values to be executed more than once in such a way that the multiple runs generate exactly the same random values, even though corresponding actions in the subcomputations would of course have distinct pedigrees globally. Such a facility allows for controlled and repeatable testing of the code for such subcomputations.

Computing a dot product involves following the chain of parent pointers from the current task to the root task (or to the limit of the pedigree scope); the counter values found along the way are the pedigree. LSS remark that dot products can be computed incrementally and report that they tried “memoizing” portions of the dot-product calculations to allow generate operations to be performed in time O(1) rather than O(d) (where d is the length of the pedigree), but did not see any speed benefit in practice [20, §8].

We derive a series of algorithms from DOTMIX by stepwise refinement. Our final SPLITMIX algorithm is discussed in detail in Section 2.4; before that, Sections 2.1, 2.2, and 2.3 present three intermediate points in the refinement process (in part because these algorithms are plausible alternate points of departure for future work).

2.1 An Implementation in Scala

We set out to implement the DOTMIX algorithm within the framework of the Scala programming language [24], both to obtain the stylistic benefits of the DOTMIX approach and to see whether we could identify any useful improvements. (Our earliest prototype was actually done within the runtime system for Fortress [1], for which language the LSS design seemed ideal.) Our idea was to extend the Scala App class (which conveniently provides access to certain programming facilities in a way that Java does not), so that a main program need only extend our new library class ParallelApp in order to have available (1) a parallel construct that would evaluate two expressions in parallel, and (2) operations such as nextLong() and nextDouble() for obtaining


import ParallelRNG.ParallelApp

object ScalaPiCalc extends ParallelApp {
  def tries(n: Int): Int = {
    var count: Int = 0
    for (k <- 0 until n) {
      val x = nextDouble()
      val y = nextDouble()
      if (x*x + y*y < 1.0) count += 1
    }
    count
  }
  val iters = 1000000000
  var xtime: Long = 0
  for (j <- 0 to 9) {
    val basetime = System.nanoTime()
    val result = (tries(iters) * 4.0) / iters
    xtime += (System.nanoTime() - basetime)
    println("Result: " + result) }
  println("Total time: " + xtime)
}

Figure 3. Monte Carlo approximation of π in Scala

def tries(n: Int): Int = {
  if (n < 1000000) {
    var count: Int = 0
    for (k <- 0 until n) {
      val x = nextDouble()
      val y = nextDouble()
      if (x*x + y*y < 1.0) count += 1
    }
    count
  } else {
    val h = n >>> 1
    val (p, q) = parallel(tries(h),
                          tries(n-h))
    p+q
  }
}

Figure 4. Parallel-sequential approximation of π in Scala

pseudorandomly generated values. Figure 3 shows a typical simple application: Monte Carlo calculation of approximate values for π by sequentially choosing random points uniformly within a unit square and counting how many fall within a quarter-circle of unit radius. The code also contains timing instrumentation and reporting. Figure 4 shows an alternate definition for the tries method of Figure 3 that recursively splits each request into two parallel tasks until the number of requested iterations is less than 1000000. Calling nextDouble() generates a pseudorandom Double value, chosen uniformly from the range [0.0, 1.0), using implicitly maintained local task state; all that is needed to execute two expressions p and q in parallel is to say parallel(p, q).

The code for our library is shown in Figures 5–10. We made ten changes to the DOTMIX algorithm:

object Rand {
  val signbit = 0x8000000000000000L
  def unsignedGE(x: Long, y: Long) =
    ((x ^ signbit) >= (y ^ signbit))
  def update(dotp: Long, gamma: Long) = {
    val result = dotp + gamma
    if (unsignedGE(result, dotp)) result
    else (if (unsignedGE(result, 13L)) result
          else result + gamma) - 13L }
  def mix64(x: Long): Long = {
    var z = x
    z = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L
    z ^ (z >>> 33) }
  val longs = Array( // Random 64-bit values
    0x14CBB762934FA47FL, 0xA7224BC18A26FD10L,
    0xF2281E2DBA6606F3L, 0xBD24B73A95FB84D9L,
    ... // 1024 entries in all
    0x91673D71CFCAA9A1L, 0xBBD6771B9BA86215L )
}

Figure 5. Utility routines in object Rand

trait AnyTask {
  val depth: Int
  val gamma: Long
  var dotProd: Long
  def nextDotProd(): Long
  def newChild[T](x: =>T): Task[T]
  def next64Bits(): Long }

Figure 6. Scala task trait containing PRNG state

[1] Every task maintains a count of its actions (LSS call this the rank); we increment this counter at the start of every action (LSS do this incrementation at other points in the computation, which causes sync operations to get involved).

[2] Instead of adding the seed σ after calculating the dot product, we use σ to initialize the accumulator for the dot product calculation; this enables inclusion of σ in common subexpressions. From now on, when we speak of the “dot product” we will mean a value that already incorporates σ.

[3] Rather than memoizing dot products in a cache, we apply common-subexpression elimination and strength reduction. Each task retains in its local state the most recent dot product σ + Σ_{k=1}^{D} γ_k j_k it computed; to compute the next dot product after incrementing j_D, the task need only add γ_D to the previous dot product value. This improves speed greatly.

[4] When a task spawns a child task, it first computes a new dot product exactly as if it were going to perform a generate action. It then sets the child's counter to 0 and its dot product equal to its own new dot product value. A routine inductive proof shows that when the child then proceeds to calculate a new dot product of its own, the correct value is produced.

[5] Each task uses exactly one of the random coefficients, namely γ_D where D is the depth of the task in the tree. Instead of repeatedly fetching γ_D from the table, we made each task cache γ_D as part of its local state when it is created.
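The strength reduction of change [3] can be illustrated in a few lines of Java. In the sketch below (class name, constants, and pedigree are arbitrary example values of ours, and plain wrapping long arithmetic stands in for arithmetic mod p), incrementing j_D and adding γ_D to a cached dot product agrees with recomputing σ + Σ γ_k j_k from scratch:

```java
// Sketch: strength reduction of the dot-product computation.
public class StrengthReduction {
    // Recompute sigma + sum(gamma[k] * j[k]) from scratch.
    static long dotProduct(long sigma, long[] gamma, long[] j) {
        long d = sigma;
        for (int k = 0; k < j.length; k++) d += gamma[k] * j[k];
        return d;
    }

    public static void main(String[] args) {
        long sigma = 0x9E3779B97F4A7C15L;           // arbitrary seed
        long[] gamma = { 0x14CBB762934FA47FL,       // arbitrary coefficients
                         0xA7224BC18A26FD10L };
        long[] j = { 3, 0 };                        // pedigree of a task at depth 2
        int D = 1;                                  // index of the deepest coordinate
        long cached = dotProduct(sigma, gamma, j);
        for (int step = 0; step < 4; step++) {
            j[D]++;                                 // one more action by this task
            cached += gamma[D];                     // strength-reduced update
            assert cached == dotProduct(sigma, gamma, j);
        }
        System.out.println("ok");
    }
}
```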


[6] The consequence of previous changes is that no action actually uses the local counters! Therefore the counters can be eliminated. Rather than maintaining the pedigree explicitly, the algorithm maintains dot products directly.

[7] At this point, the only use of the parent pointer is to calculate the depth D of the task. We eliminate the parent pointer and introduce a local variable in each task to track its depth. When a task spawns a child task, it sets the child's depth to 1 more than its own depth.

[8] After all these changes, measurements showed that the mixing function was now taking a substantial fraction of the total time for a generate action. We searched for cheaper mixing functions. LSS reported that µ(z) = f(f(f(f(z)))) is an adequate mixing function but µ(z) = f(f(z)) is not (as determined by using the DieHarder [5] test suite to test the resulting pseudorandom sequences). Curiously, LSS did not report also trying µ(z) = f(f(f(z))), so we tested it and found it adequate, using 12 instructions instead of 16. Then we searched the literature for other bit-mixing functions and found the MurmurHash3 64-bit finalizer [2], identified by Austin Appleby using a simulated-annealing algorithm designed to find mixing functions with good avalanche statistics. (We discuss this idea in detail in Section 4.) The MurmurHash3 finalizer needs only 8 arithmetic/logical operations and seems to be adequate as a mixing function in the DOTMIX algorithm. The 14 variants identified by David Stafford [33] also seem to work well for this purpose.

[9] We noticed an aesthetic flaw in DOTMIX: Because arithmetic for the dot product is performed modulo p and p < m, there are some 64-bit values that are never generated, namely µ(z) for p ≤ z < m. For m = 2^64 and p = Fred, 59 of the possible 2^64 64-bit values cannot occur. This is a relatively tiny nonuniformity, unlikely to be picked up by statistical tests, yet it bothered us. Then we realized that there were two principal reasons for choosing p < m: (i) if p > m then arithmetic calculations mod p, especially multiplication, are much more difficult to implement using w-bit arithmetic; and (ii) if p > m then not all arithmetic results can be represented in w bits. But we can overcome both of these reasons: (i) thanks to the strength reduction optimization, we no longer perform any multiplications mod p, and addition mod p is not so difficult even if p > m; and (ii) we don't really care about generating all p possible values (we just want dot product values likely to be different), so if we ever calculate a dot product that is not representable, we simply discard it and try again. If we calculate p dot products successively, necessarily generating all p values in the range [0, p), then p − m values (those not less than m) will be discarded and m values will be accepted, and so we will have generated every value in the range [0, m) exactly once. Therefore our Scala implementation performs its dot-product arithmetic modulo George = 2^64 + 13.¹ Moreover, if we ensure that no γ_i is smaller than p − m (which is almost surely true anyway if the γ values are chosen randomly) then it is never necessary to “try again” more than once, so we need only a conditional, not a loop, in the code that computes new dot products. We believe that this use of an unrepresentable prime modulus in a PRNG is a theoretical novelty.

[10] We found a simpler way to provide the “scoped pedigree” functionality described by LSS: a construct to spawn a new task such that the user, rather than the spawning task, provides the initial value for the dot product. The Scala construct withNewRNG(seed) { body } executes the body in a new task whose dot product has been initialized to seed and then immediately does a join with that new task. Unlike the LSS scoped pedigree approach, withNewRNG does not allow any given task to access more than one generator at a time, but an advantage is that there is no need for user code to explicitly manipulate data structures describing the scope. We believe that this technique is also novel.

¹ At least one of us likes to call two primes “magical twins” if there is no other prime number between them but there is a power of 2 between them.
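The criterion behind change [8], good avalanche statistics for the mixing function, is easy to probe empirically. The sketch below (class and method names are our own illustrative choices) checks that flipping any single input bit of the MurmurHash3 64-bit finalizer flips close to half of the 64 output bits on average:

```java
// Sketch: empirical avalanche check for the MurmurHash3 64-bit
// finalizer used as the mixing function mix64.
public class AvalancheCheck {
    static long mix64(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    // Average number of output bits flipped when one input bit flips.
    static double avgAvalanche(int trials) {
        java.util.Random r = new java.util.Random(12345);
        long flips = 0, pairs = 0;
        for (int t = 0; t < trials; t++) {
            long x = r.nextLong();
            for (int b = 0; b < 64; b++) {
                flips += Long.bitCount(mix64(x) ^ mix64(x ^ (1L << b)));
                pairs++;
            }
        }
        return (double) flips / pairs;
    }

    public static void main(String[] args) {
        double avg = avgAvalanche(1000);
        System.out.println("average output bits flipped: " + avg);
        assert avg > 30.0 && avg < 34.0;   // ideal avalanche is 32
    }
}
```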

In Figure 5, the method unsignedGE compares Long values as if they were bit representations of unsigned 64-bit values (an extremely clever compiler would compile this to a single unsigned-compare machine instruction). The method update uses arithmetic modulo George to add gamma to dotp once, or twice if necessary, to produce a new value representable in 64 bits. The method mix64 implements the MurmurHash3 64-bit finalizer function [2]. The array longs is a table (shown here only in part) of 1024 “truly random” values obtained from HotBits [35].
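As a sanity check, Figure 5's update can be transcribed into Java (the class name is ours, and JDK8's Long.compareUnsigned stands in for the sign-bit trick) and exercised on the overflow cases of arithmetic modulo George = 2^64 + 13:

```java
// Sketch: addition modulo George = 2^64 + 13, with results kept
// in the representable range [0, 2^64).
public class GeorgeUpdate {
    static long update(long s, long g) {
        long result = s + g;                        // wraps mod 2^64
        if (Long.compareUnsigned(result, s) >= 0)   // no overflow: true sum < 2^64
            return result;
        // Overflow: true sum = result + 2^64 = result - 13 (mod George).
        if (Long.compareUnsigned(result, 13L) >= 0)
            return result - 13L;
        // result - 13 would land in [2^64, George): discard, add g once more.
        return (result + g) - 13L;
    }

    public static void main(String[] args) {
        // -1L is 2^64 - 1 unsigned; (2^64 - 1 + 20) mod (2^64 + 13) = 6.
        assert update(-1L, 20L) == 6L;
        // (2^64 - 1 + 13) mod George is 2^64 + 12, unrepresentable,
        // so gamma is added a second time, giving 12.
        assert update(-1L, 13L) == 12L;
        assert update(5L, 7L) == 12L;               // no-overflow case
        System.out.println("ok");
    }
}
```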

Figure 6 declares a Scala trait AnyTask, which abstractly defines the local state of a task (fields depth, gamma, and dotProd) and methods for calculating the next dot product, spawning a child task, and generating 64 pseudorandom bits.

Figures 7 and 8 contain necessary plumbing to extend Scala's existing task-pool framework so that every worker thread will keep track of which task it is currently running. The last two methods of the Par object in Figure 8 define the parallel and withNewRNG constructs. Each is defined using parametric types, so that the result type of parallel is a tuple of the types of the two argument expressions, and the result type of withNewRNG is the type of its body.

Figure 9 declares a type-parametric Scala class Task[T] (a task that computes a result of type T) that extends the existing Scala class RecursiveTask[T] as well as the trait AnyTask declared in Figure 6. The method compute is invoked by the Scala task-pool infrastructure when the task is executed by a worker thread; it computes the body of the task after using Par.setCurrentTask to save the current task in the local state of the thread. The method runUsingForkJoinThread is called by the Par.parallel method declared in Figure 8 to cause this task to be executed by some worker thread for the task pool. The method next64Bits is used to compute 64 new pseudorandom bits; it simply computes the next dot product and gives the result to method Rand.mix64 of Figure 5. The method nextDotProd computes a new dot product from its local dotProd and gamma state variables by using the method


import scala.concurrent.forkjoin._
import scala.concurrent.forkjoin.ForkJoinPool._
class TaskForkJoinWorkerThread(group: ForkJoinPool)
    extends ForkJoinWorkerThread(group: ForkJoinPool) {
  var task: AnyTask = null }
class TaskForkJoinPool(parallelism: Int,
                       factory: TaskForkJoinWorkerThreadFactory)
    extends ForkJoinPool(parallelism: Int,
                         factory: ForkJoinWorkerThreadFactory) {}
class TaskForkJoinWorkerThreadFactory
    extends ForkJoinWorkerThreadFactory {
  def newThread(group: ForkJoinPool) =
    new TaskForkJoinWorkerThread(group) }

Figure 7. Scala threads and thread pools extended to hold tasks that contain PRNG state

object Par {
  def nprocs =
    Runtime.getRuntime().availableProcessors()
  val taskPool: TaskForkJoinPool =
    new TaskForkJoinPool(nprocs,
      new TaskForkJoinWorkerThreadFactory())
  def getCurrentTask(): AnyTask = {
    Thread.currentThread() match {
      case cur: TaskForkJoinWorkerThread =>
        cur.task
      case _ => throw new ClassCastException } }
  def setCurrentTask(t: AnyTask) = {
    Thread.currentThread() match {
      case cur: TaskForkJoinWorkerThread =>
        cur.task = t
      case _ => } }
  def parallel[T,U](f: =>T, g: =>U): (T,U) = {
    val t: Task[T] =
      Thread.currentThread() match {
        case cur: TaskForkJoinWorkerThread =>
          cur.task.newChild(f)
        case _ => new Task(f, 0, 0L, 0L) }
    t.runUsingForkJoinThread()
    val v: U = g
    (t.join(), v) }
  def withNewRNG[T](seed: Long)(f: =>T): T = {
    val newThread =
      new Task[T](f, 1, Rand.longs(1), seed)
    newThread.runUsingForkJoinThread()
    newThread.join() }
}

Figure 8. Scala PRNG task utility routines

Rand.update of Figure 5; it stores the result back in dotProd and then returns it. The method newChild makes a new child task, using a depth one greater than its own, the gamma value from the Rand.longs table corresponding to that new depth, and a newly computed dot product. Problem: what to do if the depth exceeds the length of the table? In this implementation, we just let it wrap around modulo the length of the table, which in principle disrupts the proof that collisions will be unlikely, but in practice may be acceptable, especially if the table is fairly large (1024 should be plenty for applications that do relatively balanced task splitting).

import scala.concurrent.forkjoin.RecursiveTask
class Task[T](body: =>T,
              val depth: Int,
              val gamma: Long,
              var dotProd: Long)
    extends RecursiveTask[T] with AnyTask {
  def compute(): T = {
    Par.setCurrentTask(this)
    body }
  def runUsingForkJoinThread() = {
    if (Thread.currentThread
          .isInstanceOf[ForkJoinWorkerThread])
      this.fork()
    else Par.taskPool.execute(this) }
  override def next64Bits(): Long =
    Rand.mix64(nextDotProd())
  override def nextDotProd(): Long = {
    dotProd = Rand.update(dotProd, gamma)
    dotProd }
  override def newChild[T](x: =>T): Task[T] = {
    val d = (depth + 1) % Rand.longs.length
    new Task(x, d, Rand.longs(d), nextDotProd())
  }
}

Figure 9. New task type for maintaining PRNG state

Figure 10 declares a Scala class ParallelApp that extends the existing Scala class App. Instances of App simply execute their bodies; ParallelApp arranges to make a few extra facilities available during such execution. The methods parallel and withNewRNG simply forward their arguments to the corresponding methods of object Par in Figure 8. The method nextLong() uses Par.getCurrentTask() to retrieve the current task from the current worker thread and calls its next64Bits method; the methods nextInt() and nextDouble() make use of nextLong(). Method main is the interface that causes the body of an instance of App to be run; the override in class ParallelApp causes a new random number generator task to be established first. Method initialSeed chooses the initial seed for this root task; it attempts to use environmental information so as to make it likely that distinct processes will compute distinct values for this initialization value. (The details of this rather difficult process do not concern us here.)


class ParallelApp extends App {
  def parallel[T,U](x: => T, y: => U): (T,U) =
    Par.parallel(x, y)
  def withNewRNG[T](seed: Long)(f: => T) =
    Par.withNewRNG(seed)(f)
  def nextLong(): Long =
    Par.getCurrentTask()
       .asInstanceOf[AnyTask].next64Bits()
  def nextInt(): Int = nextLong().toInt
  val DOUBLE_ULP: Double = (1.0 / (1L << 53))
  def nextDouble(): Double =
    (nextLong() >>> 11) * DOUBLE_ULP
  override def main(args: Array[String]) =
    withNewRNG(initialSeed())(super.main(args))
  def initialSeed(): Long = ...
}

Figure 10. Class for declaring parallel Scala applications

2.2 A 64-bit Implementation in Java

Next we constructed a version of the algorithm suitable for the Java™ programming language environment [11]. The code for this class, Splittable64, is shown in Figures 11 and 12. We made six changes relative to the Scala version:

[1] A drawback of tying the PRNG state to tasks is that programs that use parallel tasks must pay the overhead of maintaining the PRNG state even in parts of the program that are not generating pseudorandom values. Also, we hoped to develop an algorithm that might supersede the existing java.util.Random facility. With the elimination of parent pointers and explicit pedigrees, there is no real need to keep the task structure involved if we are willing to pass around explicit references to PRNG objects, as is already done in Java with java.util.Random. Therefore our Java implementation is object-oriented rather than task-oriented.

[2] We then realized that if the spawn operation became a method called split, PRNG objects could act as collections in the JDK8 library framework, and a PRNG object could easily produce a spliterator [25] capable of delivering a stream (of pseudorandom values) that could provide methods such as map and reduce, and interact with streams produced by other sorts of collections such as lists and hashmaps.

[3] We eliminated the use of unsigned compare operations by adopting a change of representation: we regard the dot product (but not the random coefficient) as being offset in its nominal value by 0x8000000000000000. This allows us to replace calls to the unsigned comparison unsignedGE with uses of the Java signed comparison operator >=. (Java JDK8 includes a library of unsigned comparison operations, but we do not yet know whether, or how soon, compilers will inline these as single unsigned-compare machine instructions.)

[4] A drawback of our Scala implementation is that the table of random coefficients has fixed size. We decided to use pseudorandom coefficients, and to generate them on the fly using a copy of the DOTMIX algorithm that always uses pedigrees of length 1, and therefore needs only one “random coefficient” γγ. This gets rid of the big table. (We must still take care to ensure that such generated coefficients are not smaller than 13, so that the update method that performs arithmetic mod George will not require a loop.)

[5] At this point, speed measurements showed that the update method was a large fraction of the time for a generate action. We further restricted the range of pseudorandomly generated coefficients so that they are substantially smaller than 2^64: instead of using arithmetic mod George, this calculation is performed modulo Percy = 2^56 − 5 to produce a 56-bit value that is then mixed by a separate mixing method mix56. Adding 13 to this result produces a coefficient that is still much smaller than 2^64 but not smaller than 13. The point of using a small coefficient is that the addition in the first line of method update will overflow only rarely and therefore the conditional expression nearly always takes the first choice, avoiding further arithmetic. Moreover, it makes that conditional test highly predictable, further increasing speed on processors that do branch prediction.

[6] Like java.util.Random, Splittable64 provides two public constructors. One takes a seed argument, and the other takes no argument and provides a pseudorandomly generated seed value. For the purpose of generating such default seeds, a third copy of the algorithm is used, again using arithmetic mod George.
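The unsigned-comparison issue behind change [3] rests on a simple identity: XORing both operands with 2^63 turns an unsigned comparison into a signed one. A small sketch (class name ours) checks this against JDK8's Long.compareUnsigned:

```java
// Sketch: the sign-bit XOR trick used by unsignedGE in Figure 5,
// verified against the JDK8 library method Long.compareUnsigned.
public class OffsetCompare {
    static boolean unsignedGE(long x, long y) {
        return (x ^ Long.MIN_VALUE) >= (y ^ Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        assert unsignedGE(-1L, 1L);     // 2^64 - 1 >= 1 as unsigned values
        assert !unsignedGE(1L, -1L);
        java.util.Random r = new java.util.Random(9);
        for (int i = 0; i < 100_000; i++) {
            long x = r.nextLong(), y = r.nextLong();
            assert unsignedGE(x, y) == (Long.compareUnsigned(x, y) >= 0);
        }
        System.out.println("ok");
    }
}
```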

Figure 11 declares a Java class Splittable64. It uses an AtomicLong value called defaultGen for generating default seeds for the parameterless constructor. The update and mix64 methods are identical in behavior to the Scala versions in Figure 5, except that the Java version of update assumes the offset representation for argument s and therefore uses signed comparisons. We have given nextDotProd the new name nextRaw64, but its purpose is the same: to produce 64 “raw” bits for input to the mixing function. The (private) two-argument constructor takes the usual seed argument and also a seed for generating γ coefficients; the constructor adds γγ to this second seed (modulo Percy), applies mix56 to the result, then adds 13 to produce the γ coefficient for the new PRNG object. It also saves the updated γ seed in its nextSplit field for use by its split method. Method nextDefaultSeed similarly updates the defaultGen seed (modulo George), using an atomic compareAndSet operation, then applies mix64 and returns that result.
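A quick check (class name ours) confirms the range claim behind this construction: mix56 always produces a 56-bit value, so γ = mix56(s) + 13 lies in [13, 2^56 + 13), meaning γ is never smaller than 13 and update needs no loop:

```java
// Sketch: verifying the range of gamma coefficients produced by
// mix56 (Figure 11) plus 13.
public class GammaRange {
    static long mix56(long z) {
        z = ((z ^ (z >>> 33)) * 0xff51afd7ed558ccdL) & 0x00FFFFFFFFFFFFFFL;
        z = ((z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L) & 0x00FFFFFFFFFFFFFFL;
        return z ^ (z >>> 33);
    }

    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(1);
        for (int i = 0; i < 100_000; i++) {
            long s = r.nextLong() & 0x00FFFFFFFFFFFFFFL;  // 0 <= s < 2^56
            long m = mix56(s);
            assert 0 <= m && m < (1L << 56);              // 56-bit result
            long gamma = m + 13;
            assert gamma >= 13 && gamma < (1L << 56) + 13;
        }
        System.out.println("ok");
    }
}
```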

Figure 12 shows the public methods of Splittable64. The constructors are straightforward; each calls the private constructor with appropriate arguments. The split method calls nextRaw64 to produce a new dot product, then creates a new Splittable64 object with that value and the saved value in nextSplit. The methods nextLong, nextInt, and nextDouble are similar to their Scala counterparts in Figure 10. Finally, the method longs returns a stream of pseudorandom values of the specified length; it does this by calling the Java JDK8 library method StreamSupport.longStream and giving it a spliterator (an instance of class RandomLongs) as its first argument.


public class Splittable64 {
  private static final long
    GAMMA_PRIME = (1L << 56) - 5, // "Percy"
    GAMMA_GAMMA = 0x00281E2DBA6606F3L,
    DEFAULT_SEED_GAMMA = 0xBD24B73A95FB84D9L;
  private static final double DOUBLE_ULP =
    1.0 / (1L << 53);
  private static final AtomicLong defaultGen =
    new AtomicLong(initialSeed());
  private long seed;
  private final long gamma, nextSplit;
  private static long update(long s, long g) {
    // Add g to s modulo George.
    long p = s + g;
    return (p >= s) ? p
      : (p >= 0x800000000000000DL) ? p - 13L
      : (p - 13L) + g; }
  private static long mix64(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
    return z ^ (z >>> 33); }
  private static long mix56(long z) {
    z = ((z ^ (z >>> 33)) * 0xff51afd7ed558ccdL)
        & 0x00FFFFFFFFFFFFFFL;
    z = ((z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L)
        & 0x00FFFFFFFFFFFFFFL;
    return z ^ (z >>> 33); }
  private long nextRaw64() {
    return (seed = update(seed, gamma)); }
  private Splittable64(long seed, long s) {
    // We require 0 <= s < Percy
    this.seed = seed;
    s += GAMMA_GAMMA;
    if (s >= GAMMA_PRIME) s -= GAMMA_PRIME;
    this.gamma = mix56(s) + 13;
    this.nextSplit = s; }
  private static long nextDefaultSeed() {
    long p, q;
    do { p = defaultGen.get();
         q = update(p, DEFAULT_SEED_GAMMA);
    } while (!defaultGen.compareAndSet(p, q));
    return mix64(q); }
  // Public methods go here
}

Figure 11. Class Splittable64 and its private methods

The method longs consists of fairly standard “boilerplate” code for creating a new spliterator-based stream [25] of pseudorandomly chosen long values that can potentially be processed in parallel; similarly the method ints produces a stream of pseudorandomly chosen int values, and doubles produces a stream of pseudorandomly chosen double values. As a simple example, if prng refers to an instance of SplittableRandom, then the expression

prng.doubles(n).map(x->x*x).sum()

public Splittable64(long seed) {
  this(seed, 0); }
public Splittable64() {
  this(nextDefaultSeed(), GAMMA_GAMMA); }
public Splittable64 split() {
  return new Splittable64(nextRaw64(),
                          nextSplit); }
public long nextLong() {
  return mix64(nextRaw64()); }
public int nextInt() {
  return (int)nextLong(); }
public double nextDouble() {
  return (nextLong() >>> 11) * DOUBLE_ULP; }
public LongStream longs(long size) {
  if (size < 0L)
    throw new IllegalArgumentException();
  return StreamSupport.longStream
    (new RandomLongs(this, 0L, size),
     false); }
// The next two methods are just like "longs"
public IntStream ints(long size) { ... }
public DoubleStream doubles(...) { ... }

Figure 12. Public methods of class Splittable64

will compute the sum of the squares of n pseudorandomly chosen double values in the range [0, 1), and the expression

prng.doubles(n).parallel().map(x->x*x).sum()

will do so using parallel threads if available. The instance of RandomLongs created by the longs method can supply (to a given consumer of long values) a sequence of pseudorandomly chosen long values.

For completeness, we exhibit in Figure 13 a slightly simplified version of the code for the spliterator RandomLongs; its job is to apply a given consumer of long values to a sequence of pseudorandomly chosen long values (methods tryAdvance and forEachRemaining). The RandomLongs spliterator, like any other spliterator, must also be prepared to (try to) split itself into two such streams (method trySplit) that can then be processed in parallel; when it does so, it also splits the underlying instance of SplittableRandom so that each substream will have its own PRNG rather than trying to share the same one. As a result, if the two streams are then processed in parallel, they do not contend for the use of a single PRNG object; each has its own, and they may be used independently. This eliminates the need for any locking or other synchronization in the parallel execution of stream expressions that involve SplittableRandom objects. This is a hallmark of the spliterator paradigm: for effective parallel processing, objects are designed not to be shared among tasks, but split across tasks. (The actual details of parallel execution, thread management and so on, are handled by the Java library code that implements streams and need not concern us here.)
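As a usage sketch against the released JDK8 class (the seed and element count are our own illustrative choices), a stream drawn from a SplittableRandom behaves like any other stream source; here the mean of x² over a million uniform deviates should be close to E[x²] = 1/3:

```java
// Sketch: stream-based use of java.util.SplittableRandom, in the
// style of the prng.doubles(n) examples above.
import java.util.SplittableRandom;

public class StreamDemo {
    public static void main(String[] args) {
        SplittableRandom prng = new SplittableRandom(7);
        int n = 1_000_000;
        // Sum of squares of n pseudorandom doubles in [0, 1);
        // parallel() lets the spliterator split the PRNG across tasks.
        double sum = prng.doubles(n).parallel().map(x -> x * x).sum();
        double mean = sum / n;
        System.out.println("mean of x^2: " + mean);
        assert Math.abs(mean - 1.0 / 3.0) < 0.01;   // E[x^2] = 1/3
    }
}
```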


package java.util;
import java.util.function.LongConsumer;
import java.util.stream.StreamSupport;
import java.util.stream.LongStream;
static final class RandomLongs
    implements Spliterator.OfLong {
  final SplittableRandom prng;
  long index; final long fence;
  RandomLongs(SplittableRandom prng,
              long index, long fence) {
    this.prng = prng; this.index = index;
    this.fence = fence; }
  public RandomLongs trySplit() {
    long i = index, m = (i + fence) >>> 1;
    return (m <= i) ? null :
      new RandomLongs(prng.split(), i,
                      index = m); }
  public long estimateSize() {
    return (fence - index); }
  public int characteristics() {
    return (Spliterator.SIZED |
            Spliterator.SUBSIZED |
            Spliterator.NONNULL |
            Spliterator.IMMUTABLE); }
  public boolean tryAdvance(LongConsumer c) {
    if (c == null)
      throw new NullPointerException();
    long i = index, f = fence;
    if (i < f) { c.accept(prng.nextLong());
                 index = i + 1; return true; }
    return false; }
  public void
  forEachRemaining(LongConsumer c) {
    if (c == null)
      throw new NullPointerException();
    long i = index, f = fence;
    if (i < f) { index = f;
      do { c.accept(prng.nextLong());
      } while (++i < f); } }
}

Figure 13. Spliterator for generating pseudorandom 64-bit long values (slightly simplified from actual JDK8 code)

We also remark on a subtle aspect of the semantic contract for the split method: We do not guarantee that the set of pseudorandom values produced by a pair of PRNG objects, one split from the other, will be exactly the same set of values that would have been produced by a single unsplit PRNG object; we promise only that the two sets will have similar desirable statistical properties (thanks to the LSS proof about the probability of dot-product collisions, backed up by empirical testing). On the other hand, we have this guarantee of reproducibility: if two PRNG objects are created with the same initial state, and then the same series of method calls is performed on those two PRNG objects, and an identical series of method calls is performed on every pair of corresponding PRNG objects recursively produced by calls to the split method, then exactly the same pseudorandom values will be produced.
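This reproducibility guarantee can be demonstrated directly with the JDK8 class java.util.SplittableRandom (the seed below is chosen arbitrarily):

```java
// Sketch: reproducibility of SplittableRandom. Two generators
// created with the same seed, subjected to the same series of
// calls (including split()), produce identical values.
import java.util.SplittableRandom;

public class Reproducibility {
    public static void main(String[] args) {
        SplittableRandom a = new SplittableRandom(2014L);
        SplittableRandom b = new SplittableRandom(2014L);
        for (int i = 0; i < 10; i++)
            assert a.nextLong() == b.nextLong();
        SplittableRandom a2 = a.split(), b2 = b.split();
        for (int i = 0; i < 10; i++) {
            assert a2.nextLong() == b2.nextLong();  // children agree
            assert a.nextLong() == b.nextLong();    // parents still agree
        }
        System.out.println("ok");
    }
}
```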

2.3 A 128-bit Implementation in Java

The Splittable64 algorithm has at least two remaining drawbacks. First, the period for any single PRNG object is equal to the number of distinct values it can produce (2^64), and therefore adjacent values are never, ever the same; indeed, if you generate up to 2^64 values successively, there will be no duplicates, so the behavior is not quite the same as “truly random.” (On the other hand, many applications use only a portion of each generated 64-bit value; in particular, the generation of a 64-bit floating-point value uses only 53 bits; and so this drawback is often not serious in practice.) Second, the fact that we reduced the range of random coefficients from [13, 2^64) to [13, 2^56 + 13) (in order to gain some speed) undesirably increases the probability of dot-product collisions. One way out is to increase the dot-product size from 64 bits to 128 bits; then we can afford to decrease the γ coefficient size from 128 bits to, say, 120 bits and still have plenty of headroom on the probability of dot-product collisions. The idea is straightforward, but the details are a bit messy because Java does not have a primitive 128-bit integer type. This, of course, provides ample opportunity for additional engineering cleverness.

The 128-bit dot-product arithmetic is performed modulo Arthur = 2^128 + 51. The main new idea here is to choose 128-bit γ values that have 114 random bits represented in a very specific way: as two 64-bit words, with the high-order word having a value in the range [0, 2^54) and the low-order word having a value in the range [51, 2^60 + 51).

To generate 114 random bits, we use a generator of 57-bit results by using arithmetic modulo Ginny = 2^57 − 13 and a mixing function mix57, then use two successive such values to construct one γ value.
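The bit layout just described can be checked mechanically. The sketch below (class name ours) packs two mix57 outputs the way the Figure 14 constructor does and verifies the claimed ranges for the two halves of γ:

```java
// Sketch: the 128-bit gamma layout. Two 57-bit mix57 outputs yield
// gammaHi in [0, 2^54) and gammaLo in [51, 2^60 + 51).
public class Gamma128Layout {
    static long mix57(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z &= 0x01FFFFFFFFFFFFFFL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        z &= 0x01FFFFFFFFFFFFFFL;
        return z ^ (z >>> 33);
    }

    public static void main(String[] args) {
        for (long s = 1; s < 100_000; s++) {
            long b = mix57(s);
            long gammaHi = b >>> 3;             // top 54 of the 57 bits
            long extraBits = (b << 61) >>> 4;   // low 3 bits moved to bits 57..59
            long gammaLo = (extraBits | mix57(s + 1)) + 51;
            assert 0 <= gammaHi && gammaHi < (1L << 54);
            assert 51 <= gammaLo && gammaLo < (1L << 60) + 51;
        }
        System.out.println("ok");
    }
}
```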

Figure 14 declares a Java class Splittable128, similar in spirit and overall organization to Splittable64. The value GAMMA_PRIME is now Ginny rather than Percy. While we could have chosen to use 128-bit arithmetic for generating default seeds, there really is not much point since their generation is serialized by the atomic synchronization and therefore it is highly unlikely that more than 2^64 or even 2^57 will be needed, so in order to share code we just use arithmetic modulo Ginny for that as well. Therefore arithmetic modulo George is not used at all. The field seed is replaced by fields seedHi and seedLo; similarly the field gamma is replaced by fields gammaHi and gammaLo. Method mix64 is as in Figure 11. The method nextRaw64 performs arithmetic modulo Arthur. The point of our choice of γ values is that the sum exceeds Arthur only very rarely (about one time in a thousand), and therefore method seedFixup (which handles this overflow case) is very rarely called. Furthermore, the fact that at least 3, and very probably 4, of


import java.util.concurrent.atomic.AtomicLong;
public class Splittable128 {
  private static final long
    GAMMA_PRIME = (1L << 57) - 13, // "Ginny"
    GAMMA_GAMMA = 0x00281E2DBA6606F3L,
    DEFAULT_SEED_GAMMA = 0x0124B73A95FB84D9L;
  private static final double DOUBLE_ULP =
    1.0 / (1L << 53);
  private static final AtomicLong defaultGen =
    new AtomicLong(initialSeed());
  private long seedHi, seedLo;
  private final long gammaHi, gammaLo;
  private final long nextSplit;
  private static long mix64(long z) { ... }
  private static long mix57(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
    z &= 0x01FFFFFFFFFFFFFFL;
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
    z &= 0x01FFFFFFFFFFFFFFL;
    return z ^ (z >>> 33); }
  private long nextRaw64() {
    // Add gamma to seed modulo Arthur.
    long s = seedLo, h = seedHi, gh = gammaHi;
    seedLo = s + gammaLo;
    if (seedLo < s) ++gh; // Rare
    seedHi = h + gh;
    if (seedHi < h) seedFixup(); // Very rare
    return (seedHi ^ seedLo); }
  private void seedFixup() {
    if (seedLo >= (0x8000000000000000L + 51)) {
      seedLo -= 51;
    } else if (seedHi != 0x8000000000000000L) {
      --seedHi; seedLo -= 51;
    } else {
      long s = seedLo - 51L;
      seedLo = s + gammaLo;
      if (seedLo < s) ++seedHi;
      seedHi += (gammaHi - 1); } }
  private Splittable128(
      long seedHi, long seedLo, long s) {
    // We require 0 <= s < Ginny
    this.seedHi = seedHi; this.seedLo = seedLo;
    s += GAMMA_GAMMA;
    if (s >= GAMMA_PRIME) s -= GAMMA_PRIME;
    long b = mix57(s);
    this.gammaHi = b >>> 3;
    long extraBits = (b << 61) >>> 4;
    s += GAMMA_GAMMA;
    if (s >= GAMMA_PRIME) s -= GAMMA_PRIME;
    this.gammaLo = (extraBits | mix57(s)) + 51;
    nextSplit = s; }
  // Other methods go here
}

Figure 14. Class Splittable128 and its private methods

private static long nextDefaultSeed() {
  long p, q;
  do { p = defaultGen.get();
       q = p + GAMMA_GAMMA;
       if (q >= GAMMA_PRIME) q -= GAMMA_PRIME;
  } while (!defaultGen.compareAndSet(p, q));
  return mix57(q); }
public Splittable128(long seedLo) {
  this(0, seedLo, 0); }
public Splittable128(long hi, long lo) {
  this(hi, lo, 0); }
public Splittable128() {
  this(nextDefaultSeed(), 0, GAMMA_GAMMA); }
public Splittable128 split() {
  nextRaw64();
  return new Splittable128(
    seedHi, seedLo, nextSplit); }
// Methods nextLong, nextInt, nextDouble,
// longs, ints, and doubles are the same
// as in the 64-bit Java implementation.

Figure 15. Other methods of class Splittable128

the high-order bits of gammaLo are 0-bits means that the incrementation of gh in method nextRaw64 (reflecting a carry from the low half to the high half) is rare (about one time in 16), which makes the conditional governing it highly predictable. Note that the method nextRaw64 updates a 128-bit seed, but then returns a 64-bit value by returning the bitwise XOR of the two halves of the seed. The rationale for this is that because mix64 has good first-order avalanche statistics, mix64 of this XOR will also have good first-order avalanche statistics (though not second-order). The (private) three-argument constructor of Splittable128 is analogous to the two-argument constructor of Splittable64 in Figure 11 but performs calculations modulo Ginny to generate two 57-bit values and then uses them to construct one γ value as described above.

Figure 15 shows method nextDefaultSeed and the public methods of Splittable128. The constructors are straightforward; each calls the private constructor with appropriate arguments. Note that a two-argument public constructor is provided to allow the user to specify a 128-bit seed if desired. The split method calls nextRaw64 to produce a new dot product (discarding the 64 bits it returns), then creates a new Splittable128 object with the newly updated 128-bit seed value in seedHi and seedLo and the saved nextSplit value (we think this implementation semi-elegant in its semi-simplicity).

We explored the alternative of using 127-bit arithmetic (modulo Molly = 2^127 + 29), with a dot-product representation that does not use the high-order bit of the low-order word, and γ values with 117 random bits (54 in the high-order word and 63 in the low-order word). Carry propagation from the low-order half to the high-order half can be performed without branch instructions by adding the two low-order 63-bit halves, then shifting their 64-bit sum rightward 63 positions to produce the necessary carry bit. In our experiments this failed to produce a speed improvement.

Of course, if this algorithm were to be coded directly in assembly language, one might code a 128-bit integer addition as a 64-bit integer add instruction followed by a 64-bit integer add-with-carry instruction, avoiding the need for these tricks. Nevertheless, method nextRaw64 is an interesting example of a situation where code can be made substantially and usefully faster not by rewriting the code but by adjusting the data that is supplied.

2.4 The SPLITMIX Algorithm

One doubt remained about Splittable128: while the probability of a dot-product collision is small, the consequences are extremely undesirable: if two PRNG objects happen to produce the same dot-product value and also have the same γ coefficient, then from that point they will produce duplicate streams of values, and if new PRNG objects are then produced by balanced recursive splitting to some uniform depth, all the PRNG objects will have the same γ coefficient.

But what if every PRNG object could have a different γ coefficient? It would be as if the length of every pedigree were equal to the total number of PRNG objects, and each object corresponds to one element of the pedigree (and increments only that element). Even if two PRNG objects happened to produce the same dot-product value (and therefore emit the same pseudorandom value), the next values they generated would be different, and so the problem of duplicate subsequences would be avoided. This situation can be accomplished with the algorithms we have already presented by doing a long, linear chain of splits rather than a binary (or n-ary) tree of splits. Unfortunately, that can take a long time, and may not match the structure of the user code. In a parallel computation, it is difficult to ensure that all γ values are different without using a shared data structure that may require synchronized access, which is just what we hope to avoid.

But then we realized that perhaps it is good enough for different PRNG objects to have distinct γ values with high probability, and this can be achieved if they are chosen pseudorandomly. Furthermore, our intuition was that it is okay in practice for there to be occasional (that is, rare) collisions of γ values, because two PRNG objects with the same γ value are likely to have distinct seed values, and in fact seed values that are distant from each other within the periodic cycle of seed values generated by that γ value. (We must, of course, take the Birthday Paradox into account when calculating how rare is rare enough.)
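To get a feel for the arithmetic behind "rare enough," a back-of-the-envelope birthday-bound calculation can be sketched in Java (our own illustration; the specific numbers are not from the text):

```java
public class BirthdayBound {
    public static void main(String[] args) {
        // Suppose L generators each draw a gamma value pseudorandomly
        // from the 2^63 odd 64-bit integers. By the usual birthday
        // approximation, the chance that any two collide is about
        // L*(L-1)/2 divided by 2^63.
        double L = 1 << 20;  // about a million generators
        double p = L * (L - 1) / 2.0 / Math.pow(2.0, 63);
        System.out.println(p < 1e-7);  // collisions are vanishingly rare
    }
}
```

Even with a million generators the estimated collision probability is below one in ten million, which illustrates why pseudorandom choice of γ values is acceptable in practice.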

Now, the principal reason for doing arithmetic modulo a prime number was to support the strong LSS proof that dot-product collisions are rare. If we are no longer worried about dot-product collisions (thanks to the high probability that γ values are different), then maybe arithmetic modulo a power of 2 suffices (a weaker version of the LSS proof still applies, but with a bound on the probability of dot-product collision that is about two orders of magnitude larger than for the prime-modulus case). Using arithmetic modulo a power of 2 makes it unnecessary to restrict γ values to a range such as [13, 2^64), because the speed tricks for doing arithmetic modulo a prime number are no longer relevant.

All these considerations led us to the SPLITMIX strategy: to create a new PRNG, simply use an existing PRNG instance to generate a new seed and a new γ value pseudorandomly. In practice, we tweak this idea slightly. In the next section we present the specific 64-bit implementation of this strategy that has been incorporated into Java JDK8.

3. The Class SplittableRandom

Figure 16 declares a Java class SplittableRandom that uses 64-bit arithmetic to generate a pseudorandom sequence of 64-bit values (from which 32-bit int or 64-bit double values may then be derived). It has two 64-bit fields seed and gamma; for any given instance of SplittableRandom, the seed value is mutable and the gamma value is unchanging. To ensure that every instance has period 2^64, it is necessary to require that the gamma value always be odd; thus each instance actually has 127 bits of state, not 128. The private constructor is trivial, taking two arguments and using them to initialize the seed and gamma fields.

The method nextSeed simply adds gamma into seed and returns the result. (How could it be any simpler?)

The methods mix64, mix32, mix64variant13, and mixGamma "mix" (that is, "scramble and blend") the bits of a 64-bit argument to produce a result. Each of the first three computes a bijective function on 64-bit values; mix32 furthermore discards 32 bits to produce a 32-bit result. The method mix32 is simply a version of mix64 optimized for the case where only the high-order 32 bits of the result will be used (there is no need to use an XOR to update the low-order bits, only to discard them). The method mix64variant13 is David Stafford's Mix13 variant of the MurmurHash3 finalizer [33].

The value DOUBLE_ULP is the positive difference between 1.0 and the smallest double value larger than 1.0; it is used for deriving a double value from a 64-bit long value.

A predefined gamma value is needed for initializing "root" instances of SplittableRandom (that is, instances not produced by splitting an already existing instance). We chose the odd integer closest to 2^64/φ, where φ = (1 + √5)/2 is the golden ratio, and call it GOLDEN_GAMMA.

A globally shared, atomically accessed seed value is also needed for creating root instances; we call it defaultGen. It is initialized by procedure initialSeed.

Figure 17 contains declarations of public methods for class SplittableRandom. The two constructors are simple: if a seed is provided, it is used along with GOLDEN_GAMMA, but if a seed is not provided, then the shared defaultGen variable is updated once but used to provide two distinct values that are mixed by methods mix64 and mixGamma to produce seed and gamma values for the new instance.


package java.util;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.LongStream;
import java.util.stream.IntStream;
import java.util.stream.DoubleStream;

public final class SplittableRandom {
  private long seed;
  private final long gamma; // An odd integer

  private SplittableRandom(long seed,
                           long gamma) {
    // Note that "gamma" should always be odd.
    this.seed = seed; this.gamma = gamma; }

  private long nextSeed() {
    return (seed += gamma); }

  private static long mix64(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
    return z ^ (z >>> 33); }

  private static int mix32(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
    z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
    return (int)(z >>> 32); }

  private static long mix64variant13(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31); }

  private static long mixGamma(long z) {
    z = mix64variant13(z) | 1L;
    int n = Long.bitCount(z ^ (z >>> 1));
    if (n < 24) z ^= 0xaaaaaaaaaaaaaaaaL;
    return z; } // This result is always odd.

  private static final double
    DOUBLE_ULP = 1.0 / (1L << 53);

  private static final long
    GOLDEN_GAMMA = 0x9e3779b97f4a7c15L; // odd

  private static final AtomicLong defaultGen =
    new AtomicLong(initialSeed());

  // Public methods go here
}

Figure 16. Private methods of class SplittableRandom

The pseudorandom generation of values is then straightforward. To generate a new 64-bit long value, simply compute the next seed and feed it to mix64; to generate a new 32-bit int value, feed the next seed to mix32. To generate a new double value, generate a 64-bit long value, discard 11 bits, and multiply the remaining 53 bits by DOUBLE_ULP, thus producing a value chosen uniformly from the range [0, 1).
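The double derivation can be checked directly (a small self-contained sketch; the constant and shift are the ones described in the text):

```java
public class DoubleRange {
    public static void main(String[] args) {
        // DOUBLE_ULP is the difference between 1.0 and the next double above it.
        final double DOUBLE_ULP = 1.0 / (1L << 53);
        // The smallest and largest 53-bit values map to the ends of [0, 1):
        System.out.println(0L * DOUBLE_ULP == 0.0);
        System.out.println(((1L << 53) - 1) * DOUBLE_ULP < 1.0);
        // Any 64-bit value, with its low 11 bits discarded, lands in [0, 1):
        double d = (0xdeadbeefdeadbeefL >>> 11) * DOUBLE_ULP;
        System.out.println(0.0 <= d && d < 1.0);
    }
}
```

Because the retained 53 bits fit exactly in a double significand, each of the 2^53 possible results is exact and equally spaced in [0, 1).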

The split method is equally straightforward; in effect, it creates a new instance of SplittableRandom by simply choosing its seed and gamma values "randomly"; to be specific, it generates two new seed values and mixes them using methods mix64 and mixGamma.

public SplittableRandom(long seed) { // Okay
  this(seed, GOLDEN_GAMMA); }

public SplittableRandom() { // Preferred
  long s = defaultGen.getAndAdd(
    2 * GOLDEN_GAMMA);
  this.seed = mix64(s);
  this.gamma = mixGamma(s + GOLDEN_GAMMA); }

public long nextLong() {
  return mix64(nextSeed()); }

public int nextInt() {
  return mix32(nextSeed()); }

public double nextDouble() {
  return (nextLong() >>> 11) * DOUBLE_ULP; }

public SplittableRandom split() {
  return new SplittableRandom(
    mix64(nextSeed()),
    mixGamma(nextSeed())); }

public LongStream longs(long size) { ... }
public IntStream ints(long size) { ... }
public DoubleStream doubles(...) { ... }
// Other convenience methods as well

Figure 17. Public methods of class SplittableRandom

Not shown in Figure 17 but implemented in Java JDK8 are versions of methods longs, ints, and doubles that take no argument and produce an indefinitely long stream rather than a stream of a prespecified length, as well as a variety of convenience methods that allow the user to specify ranges from which pseudorandom values should be uniformly chosen. For example, prng.nextInt(0, 6) returns a value chosen uniformly from the set {0, 1, 2, 3, 4, 5}, and prng.ints(10000, 0, 6) returns a stream of 10000 values chosen uniformly and independently from that same set.
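Bounded generation through the JDK8 class can be exercised as follows (a usage sketch against java.util.SplittableRandom; the particular seed is arbitrary):

```java
import java.util.SplittableRandom;

public class DiceDemo {
    public static void main(String[] args) {
        SplittableRandom prng = new SplittableRandom(42L);
        // One value chosen uniformly from {0, 1, 2, 3, 4, 5}:
        int roll = prng.nextInt(0, 6);
        System.out.println(0 <= roll && roll < 6);
        // A stream of 10000 values from the same set;
        // every element must lie in the requested range:
        long count = prng.ints(10000, 0, 6)
                         .filter(v -> 0 <= v && v < 6)
                         .count();
        System.out.println(count);
    }
}
```

The stream form integrates with the JDK8 Stream library, so such a stream can also be processed in parallel.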

While we have chosen not to provide a method for "jumping ahead" [16] in a sequence, to do so would be easy:

public void jump(long n) {
  seed += (gamma * n); }

and for negative n this also jumps backward.

The mixing function mix64 is easily inverted:

private static long unmix64(long z) {
  z = (z ^ (z >>> 33)) * 0x9cb4b2f8129337dbL;
  z = (z ^ (z >>> 33)) * 0x4f74430c22a54005L;
  return z ^ (z >>> 33); }

because

0xff51afd7ed558ccdL * 0x4f74430c22a54005L == 1
0xc4ceb9fe1a85ec53L * 0x9cb4b2f8129337dbL == 1

(using long arithmetic, that is, modulo 2^64); such modular inverses are easy to compute. Note also that the transformation z = z ^ (z >>> 33) is self-inverse. Therefore, for any 64-bit value q, unmix64(mix64(q)) == q and mix64(unmix64(q)) == q. If p and q are values produced by consecutive calls to the nextLong method of a specific instance of SplittableRandom (with no intervening calls to other methods such as split), then the current value of
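These identities can be verified mechanically; the following self-contained sketch reproduces mix64 and unmix64 from the text and checks the round trip on a few sample values:

```java
public class UnmixCheck {
    private static long mix64(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }
    private static long unmix64(long z) {
        // Apply the inverse steps in reverse order, using the
        // modular inverses of the two mix64 multipliers.
        z = (z ^ (z >>> 33)) * 0x9cb4b2f8129337dbL;
        z = (z ^ (z >>> 33)) * 0x4f74430c22a54005L;
        return z ^ (z >>> 33);
    }
    public static void main(String[] args) {
        boolean ok = true;
        // Java long multiplication is modulo 2^64, so the
        // modular-inverse identities are directly checkable:
        ok &= 0xff51afd7ed558ccdL * 0x4f74430c22a54005L == 1L;
        ok &= 0xc4ceb9fe1a85ec53L * 0x9cb4b2f8129337dbL == 1L;
        for (long q : new long[]{0L, 1L, -1L, 0x0123456789abcdefL}) {
            ok &= unmix64(mix64(q)) == q && mix64(unmix64(q)) == q;
        }
        System.out.println(ok);
    }
}
```

The round trip succeeds because each xor-shift by 33 is self-inverse and each multiplication is undone by multiplying by its modular inverse.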


seed for that instance must be unmix64(q), and the gamma value for that instance must be unmix64(q) - unmix64(p). From these values, all future behavior of the instance can be predicted. Therefore SplittableRandom and other members of the SPLITMIX family of PRNG algorithms should not be used for applications (such as security and cryptography) where unpredictability is an important characteristic.

The method mixGamma takes an input, mixes its bits using mix64variant13, and then forces the low bit to be 1. At first we thought that would suffice, but then we tested PRNG objects with "sparse" γ values whose representations have either very few 1-bits or very few 0-bits, and found that such cases produce pseudorandom sequences that DieHarder regards as "weak" just a little more often than usual. A little thought shows that γ values of the form 000...0011...111 and 111...1100...000 could also perform poorly; our informal explanation is as follows. Any γ value would be fine if the mixing function were perfect; but for a less-than-perfect mixing function, the more bits change from one seed to the next, the more effective is avalanching in producing apparently random sequences; so we can "help" the mixing function by ensuring that the nextSeed method changes many bits in the seed. Long runs of 0-bits or of 1-bits in the γ value do not cause bits of the seed to flip; an approximate proxy for how many bits of the seed will flip might be the number of bit pairs of the form 01 or 10 in the candidate γ value z. Therefore we require that the number of such pairs, as computed by Long.bitCount(z ^ (z >>> 1)), exceed 24; if it does not, then the candidate z is replaced by the XOR of z and 0xaaaaaaaaaaaaaaaaL, a constant chosen so that (a) the low bit of z remains 1, and (b) every bit pair of the form 00 or 11 becomes either 01 or 10, and likewise every bit pair of the form 01 or 10 becomes either 00 or 11, so the new value necessarily has more than 24 bit pairs whose bits differ. Testing shows that this trick appears to be effective. The threshold value 24 was chosen somewhat arbitrarily as being large enough to ensure good bit-flipping in successive seeds but small enough that the correction is required relatively rarely, for

(∑_{k=0}^{23} C(63, k)) / 2^63 ≈ 2.15%

of all candidates. (Our first implementation simply rejected such candidates using a do–while loop; we changed it to the XOR method purely to avoid using a loop. While the loop would perform a second iteration only rarely, we prefer loopless algorithms for parallel implementation on SIMD or GPU architectures. For the same reason we prefer to avoid conditionals, but this case can be handled well using a conditional move instruction.) At most four distinct 64-bit inputs to the method mixGamma will be mapped to the same final γ value.
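The bit-pair correction can be traced on the sparsest odd candidate (a small sketch using the constants from the text):

```java
public class GammaPairs {
    public static void main(String[] args) {
        long z = 1L;  // a maximally "sparse" odd candidate
        int n = Long.bitCount(z ^ (z >>> 1));
        System.out.println(n);  // only one 01/10 bit pair
        // Too few differing pairs, so apply the XOR correction:
        if (n < 24) z ^= 0xaaaaaaaaaaaaaaaaL;
        System.out.println(z & 1L);  // low bit is still 1
        // The corrected value has far more than 24 differing pairs:
        System.out.println(Long.bitCount(z ^ (z >>> 1)));
    }
}
```

Flipping every other bit turns each 00 or 11 pair into a differing pair, so a candidate with n differing pairs becomes one with 63 − n, which exceeds 24 whenever n < 24.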

If the calls to methods nextSeed and mix64 are inlined, then the body of the method nextLong is straight-line code consisting of a read of gamma, a read and a write of seed, and 9 64-bit arithmetic/logical operations (2 multiplies, 1 add, 3 shifts, and 3 XOR operations). Likewise, aside from the allocation of a new SplittableRandom object, the work performed by method split is straight-line code if the if statement in method mixGamma is compiled using a conditional move instruction. Suppose, then, that one has a SIMD architecture with n-way parallelism (n might be 4 for Intel SSE, 32 for a modern GPU, or 65536 for an old-fashioned Connection Machine CM-2 [34]). A special SIMD version of the split method can start with one PRNG object, put its seed and gamma values into the first elements of two arrays of length n, and then use ⌈log2 n⌉ doubling steps to perform the split computation on each seed/gamma pair already in the arrays, thus doubling the number of such pairs in the arrays. This initialization can be done just once and the resulting seed/gamma pairs used many times. After that, a SIMD version of method nextLong can process all n seed/gamma pairs in parallel using width-n SIMD instructions on the two arrays. (Alternatively, one can give each thread a copy of the same PRNG state, then have thread k jump forward 2k states and then perform a split operation to obtain its own PRNG; this works because the jump operation is also amenable to SIMD implementation.)
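In scalar Java the doubling initialization looks like the following sketch (our own illustration using java.util.SplittableRandom; a real SIMD version would operate on the seed and gamma arrays directly):

```java
import java.util.SplittableRandom;

public class SplitArray {
    // Build 2^k generators by k doubling steps: at each step, every
    // existing generator splits off one new generator.
    static SplittableRandom[] splitInto(SplittableRandom root, int k) {
        SplittableRandom[] gens = new SplittableRandom[1 << k];
        gens[0] = root;
        for (int size = 1; size < gens.length; size *= 2)
            for (int i = 0; i < size; i++)
                gens[size + i] = gens[i].split();
        return gens;
    }
    public static void main(String[] args) {
        SplittableRandom[] gens = splitInto(new SplittableRandom(7L), 5);
        System.out.println(gens.length);  // 32 generators after 5 doubling steps
        boolean allCreated = true;
        for (SplittableRandom g : gens) allCreated &= (g != null);
        System.out.println(allCreated);
    }
}
```

Each doubling step performs independent split operations on disjoint generators, which is exactly the structure a width-n SIMD implementation can exploit.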

4. Some Remarks about Avalanche Statistics

The idea behind the avalanche effect is that a good bit-mixing function will have the property that changing a single bit of the input will likely cause many (about half) of the output bits to change. Feistel [8] first used the term "avalanche," and Webster and Tavares [36] formulated the Strict Avalanche Criterion: whenever a single input bit is flipped, each of the output bits should flip with probability (approximately) 1/2, averaged over all possible inputs. Measurement of this criterion can be approximated by testing only a fraction of all possible inputs.

Appleby used this criterion as the "energy function" in a simulated annealing process to explore a specific space of potential 8-operation bit-mixing functions (both 32-bit and 64-bit versions). The results are known as the MurmurHash3 finalizers [2]. Stafford remarked that a non-random set of inputs might make a better training set, did his own extensive simulated-annealing experiments, and reported a set of fourteen 64-bit variants that have better avalanche statistics than the MurmurHash3 64-bit finalizer. Neither Appleby nor Stafford described the precise energy function used to drive the simulated annealing process. However, Stafford reported both maximum error and mean error from the avalanche matrix for each of his variants, where "error" is presumably deviation from the "perfect" probability of 1/2.

We have conducted our own searches for good 64-bit bit-mixing functions. We tested each candidate µ using a set of 64-bit input values chosen at random (and using several different PRNG algorithms for this purpose, including SPLITMIX and java.util.Random). From this set of N input values, the 64 × 64 avalanche matrix A was constructed as a histogram: for each input value v and for every 0 ≤ i < 64 and 0 ≤ j < 64, a_ij is incremented if and only if (µ(v) ⊕ µ(v ⊕ 2^i)) ∧ 2^j is nonzero, where ⊕ is bitwise XOR and ∧ is bitwise AND on binary integers. The absolute value


long[][] A = new long[64][64];
for (long n = 0; n < N; n++) {
  long v = rng.nextLong(), w = mix64(v);
  for (int i = 0; i < 64; i++) {
    long x = w ^ mix64(v ^ (1L << i));
    for (int j = 0; j < 64; j++) {
      if (((x >>> j) & 1) != 0) A[i][j] += 1;
    } } }
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    double v = A[i][j] - N / 2.0;
    sumsq += v*v; } }
double result = sumsq / ((N / 4.0) * 4096.0);

Figure 18. Calculating sum-of-squares avalanche statistic
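For reference, the Figure 18 fragment can be made into a complete runnable program (a sketch: we supply mix64 from Figure 16 as the function under test, a SplittableRandom as the source of inputs, and N = 10000):

```java
import java.util.SplittableRandom;

public class AvalancheStat {
    private static long mix64(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }
    public static void main(String[] args) {
        final long N = 10000;
        SplittableRandom rng = new SplittableRandom(12345L);
        long[][] A = new long[64][64];
        for (long n = 0; n < N; n++) {
            long v = rng.nextLong(), w = mix64(v);
            for (int i = 0; i < 64; i++) {
                // Bits set in x are output bits that flipped when
                // input bit i was flipped.
                long x = w ^ mix64(v ^ (1L << i));
                for (int j = 0; j < 64; j++)
                    if (((x >>> j) & 1) != 0) A[i][j] += 1;
            }
        }
        double sumsq = 0.0;
        for (int i = 0; i < 64; i++)
            for (int j = 0; j < 64; j++) {
                double d = A[i][j] - N / 2.0;
                sumsq += d * d;
            }
        double result = sumsq / ((N / 4.0) * 4096.0);
        // For a good mixing function this statistic should be near 1.
        System.out.println(result > 0.5 && result < 2.0);
    }
}
```

Running this with a larger N sharpens the estimate; the normalization makes the expected value 1 for an ideally random mixing function, as discussed below.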

of the difference of each histogram entry and N/2, normalized by dividing by N, represents avalanche "error" (deviation from the desired 50% probability of flipping).

A compromise for minimizing both maximum and mean is to minimize the root-mean-square (equivalently, the sum-of-squares). For our simulated-annealing searches, we let a state be a tuple of multipliers and shift distances; we let the neighbors of a state be a copy with either (a) just one bit of one multiplier flipped, or (b) just one shift distance increased or decreased by 1; and we used the sum-of-squares of the avalanche matrix entries as the energy function. This proved effective up to a certain point, beyond which additional computational effort resulted in no further improvement; we were unable to drive the energy any closer to zero.

A little thought reveals why. The avalanche histogram is a set of 64 × 64 = 4096 variables, each of which is, in effect, the sum of N values selected from the set {0, 1}. If the mixing function is "good" (that is, behaving as if each output bit were "perfectly random") then each variable would be the sum of N "coin flips", in other words, a random variable chosen from a binomial distribution. For large N, this binomial distribution can be approximated as a normal (Gaussian) distribution with mean N/2 and variance N/4. The sum-of-squares of those 4096 variables is therefore (approximately) a chi-squared distribution with 4096 degrees of freedom. The Java code in Figure 18 performs this calculation, normalizing with respect to both N and the number of matrix entries 4096 so that the result has expected mean 1. The energies of the mixing functions that we bottomed out on were right around that expected value for the mean every time; we also measured this avalanche statistic for MurmurHash3 (with N = 10^6) and got 1.0502 on one run, and had similar results for the Stafford variants. So there is every reason to believe that if we had found a mixing function with a much lower energy than that, it would be worse, in that its behavior would differ in a statistically noticeable way from truly random behavior. We conclude that the MurmurHash3 and Stafford mixing functions are quite good by this measure, and one can do perhaps only a tiny bit better.

5. Measurements of Quality

We used the DieHarder test suite version 3.31.1 [5] to test the algorithm used in class SplittableRandom. For this purpose, we recoded the algorithms into C so that they could be linked directly with DieHarder, rather than using the file I/O interface (which is considerably slower). DieHarder uses 32-bit inputs, so the 64-bit outputs from our algorithms are broken into two halves, supplying first the low-order half, then the high-order half, to the single input stream of 32-bit values. We used only the switches -a (selecting all tests) and -g n (to select the generator algorithm). DieHarder reports any p-value outside the range [10^-6, 1 - 10^-6] as indicating that the tested generator has "failed." DieHarder also flags all p-values outside the range [5 × 10^-3, 1 - 5 × 10^-3]; such values, if they are not clear failures, are reported as "weak." For each generator variant we ran DieHarder 7 times and we report the total number of weak and failed values over all 7 runs, omitting results for tests opso, oqso, dna, and sums, each of which the DieHarder software itself describes as "Suspect" or "Do Not Use"; thus each run of DieHarder performs 110 distinct tests, so 7 runs collectively perform 770 tests. (LSS used DieHarder version 2.28.1 [31], so the number (and quality) of tests shown here is slightly different from what they report.)

We also used the BigCrush test of TestU01 [18, 32], again recoding into C for direct linking. TestU01 expects to test 64-bit double values, so we use 53 out of every 64 bits generated by our algorithm to generate a double value as for our method nextDouble. TestU01 regards any p-value outside the range [10^-10, 1 - 10^-10] as indicating "clear failure" [18] of the tested generator. The TestU01 software flags all p-values outside the range [10^-3, 1 - 10^-3]; such values, if they are not clear failures, are regarded as "suspect." For each generator variant we ran BigCrush 3 times and we report the total number of suspect p-values and clear failures over all 3 runs; each run of BigCrush performs 160 distinct tests, so 3 runs collectively perform 480 tests. With 480 tests at two-sided p = 0.001, we expect even a perfect generator to exhibit 0.96 "suspect" values on average.

We tested sequential use of the generate operation for a single PRNG object using 16 different γ values, 8 of which have few 01 or 10 bit pairs and 8 of which have many. For comparison, we also used DieHarder to test 9 of the "standard" algorithms supplied with DieHarder, including the venerable Mersenne Twister [23]. Summary results are shown in Table 1. These results show "poor" γ values having slightly more "weak" results than others, but not consistently so; the average is only slightly higher. We believe that more investigation is required. The average number of "weak" scores over all 16 gamma values tested is 15.3, which is close to the number of "weak" scores for such excellent algorithms as R_mersenne_twister and AES_OFB.

To test the behavior of split PRNG objects, we used threestrategies: (a) Form a balanced binary tree of splits to create


                                DieHarder            TestU01
                              weak    failed     suspect   failure
γ value                      (of 770)            (of 480)
"poor":
  0x0000000000000001L          19       0           0         0
  0x0000000000000003L          18       0           2         0
  0x0000000000000005L          12       0           1         0
  0x0000000000000009L          19       0           0         0
  0x0000010000000001L          14       0           0         0
  0xffffffffffffffffL          12       1           0         0
  0x0000000000ffffffL          15       0           2         0
  0xffffff0000000001L          15       0           2         0
"okay":
  0x0000000000555555L          11       0           1         0
  0x1111111111110001L          17       0           1         0
  0x7777777777770001L          15       1           1         0
  0x7f7f7f7f33333333L          12       0           3         0
  0x5555550000000001L          16       0           2         0
random:
  0xc45a11730cc8ffe3L          16       0           2         0
  0x2b13b77d0b289bbdL          15       0           1         0
  0x40ead42ca1cd0131L          19       0           0         0

  AES_OFB                      18       0           —         —
  R_knuth_taocp                13       0           —         —
  R_marsaglia_multic.          13      28           —         —
  R_mersenne_twister           15       0           —         —
  R_super_duper                16      19           —         —
  R_wichmann_hill              19       0           —         —
  Threefish_OFB                24       0           —         —
  kiss                         17       0           —         —
  superkiss                    13      14           —         —

Table 1. Test results for γ values used sequentially

2^k PRNG objects, then round-robin interleave their streams; (b) start with one object, use it to generate one number, then replace it with a split-off object; (c) start with one object, split off a second object, then use the first object to generate one number, then replace the first object with the second. The results are shown in Table 2.

For comparison, we did three runs of TestU01 BigCrush on java.util.Random; 19 tests produced clear failure on all three runs. These included 9 Birthday Spacings tests, 8 ClosePairs tests, a WeightDistrib test, and a CouponCollector test. This confirms L'Ecuyer's observation that java.util.Random tends to fail Birthday Spacings tests [17].

In early experiments, we tested algorithms ParallelApp, Splittable64, and Splittable128 (Sections 2.1–2.3) using DieHarder, with results similar to those in Table 1. We have not yet tried TestU01 on these algorithms.

6. Measurements of Speed

To evaluate the performance consequences of introducing class SplittableRandom for Java JDK8, we compared it against two existing alternatives in the Java core library. The class java.util.Random uses the UNIX rand48 algorithm and is thread-safe, serializing generation from a common shared generator when accessed by multiple threads. Class java.util.concurrent.ThreadLocalRandom, introduced in JDK7, also uses rand48, but lazily creates a new generator per thread using Java's ThreadLocal facilities.

                                   DieHarder            TestU01
                                 weak    failed     suspect   failure
number of PRNGs                 (of 770)            (of 480)
binary tree of splits:
    2                              17       0          0         0
    4                               9       0          2         0
    8                              20       0          0         0
   16                              14       0          0         0
   32                              13       0          1         0
   64                              19       0          3         0
  128                              18       0          1         0
  256                              15       0          1         0
chained splits:
  generate then split              17       0          1         0
  split then generate              20       1          1         0

Table 2. Test results for split PRNG objects used in parallel

Comparisons are shown for two representative computer systems with recent 64-bit processors, both with 32 effective cores spread over multiple sockets (four Intel i7 X7560 chips vs. two AMD 6278 chips; the Intel system is hyperthreaded to allow 64 hardware threads, but experiments used a maximum of 32 threads). Both run Linux 2.6.39 kernels. Measurements of ThreadLocalRandom ("TLR7") used JDK7; measurements of java.util.Random ("JUR8") and SplittableRandom ("SR8") used OpenJDK JDK8 early access build 103. All entries represent the median of five runs using loops with 2^26 iterations, following a warmup period of 15 shorter runs.

Table 3 compares sequential performance for a simple loop that sums nextLong, nextDouble, or nextInt results. Throughput for SplittableRandom is faster (by factors ranging from 1.22 to 8.29) than either alternative, except for being 15%–20% slower than ThreadLocalRandom when generating 32-bit values (nextInt), the only measured case for which the use of cheaper (32-bit) arithmetic instructions in the rand48 algorithm leads to a performance advantage.

Figure 19 compares parallel performance using the Java JDK8 Stream library to compute sums; for example, to compute the sum of n values of type long from a pseudorandom number generator prng, we used this expression:

prng.longs(n).parallel().sum()

The Stream library subdivides expressions like this into multithreaded subcomputations using the ForkJoin framework. In the case of SplittableRandom, each subdivision is associated with a split-off generator, while for java.util.Random, which has no split method, all threads use the same generator. Such a stream expression can also be computed sequentially by omitting the call to the parallel method: prng.longs(n).sum(). Using SplittableRandom produces performance that scales approximately linearly with the number of threads with a slope of roughly 5/8 (approximately 17× on the AMD machine and 22× on the Intel machine when using 32 threads).


          Millions of generated values per second      Speedup of SR8 over the others
            AMD 6278             Intel X7560             AMD 6278       Intel X7560
        JUR8   TLR7   SR8     JUR8   TLR7   SR8         JUR8   TLR7     JUR8   TLR7
Long    29.8  143.8  247.1    33.2  175.9  274.1       ×8.29  ×1.71    ×8.25  ×1.55
Double  28.8  149.2  189.5    33.1  176.4  215.7       ×6.57  ×1.27    ×6.51  ×1.22
Int     57.5  302.7  248.4    65.3  288.2  250.2       ×4.32  ×0.82    ×3.83  ×0.86

Table 3. Sequential throughput (using a sequential loop that sums generated pseudorandom values)

[Four panels of throughput curves: SplittableRandom on AMD 6278, SplittableRandom on Intel i7, java.util.Random on AMD 6278, and java.util.Random on Intel i7. Each panel plots millions of additions per second against number of threads (0 to 32) for int, long, and double generation.]

Figure 19. Parallel stream throughput (1 to 32 threads, millions of generated values per second, median of three runs)

On the other hand, when using java.util.Random, throughput drops precipitously (by a factor of 4 or more) when going from 1 thread to 2 threads, and adding more threads beyond that only makes things worse. Because a single synchronized generator is being shared among all threads, we would not expect any speedup on the generation of pseudorandom numbers; we might have hoped that other parts of the application might enjoy some parallel speedup. But the measurements tell a different story; while we have not done a detailed performance analysis, it appears that contention among threads for the shared generator (possibly compounded by poor memory system behavior as ownership of the generator state is transferred among threads) swamps any gains from parallel execution. The obvious way to mitigate synchronization overhead is for each thread to have its own PRNG state rather than trying to share state, while still maintaining good statistical quality, which java.util.Random was never designed to do.

Figure 20 shows measurements of parallel scaling of a Monte Carlo algorithm using SplittableRandom.
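The structure of such a benchmark can be sketched as follows (our own illustration, not the actual benchmark code: each task receives a split-off generator and counts points inside the quarter circle):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MonteCarloPi {
    // Count how many of n random points in the unit square fall
    // inside the unit quarter circle.
    static long countInside(SplittableRandom prng, long n) {
        long inside = 0;
        for (long i = 0; i < n; i++) {
            double x = prng.nextDouble(), y = prng.nextDouble();
            if (x * x + y * y < 1.0) inside++;
        }
        return inside;
    }
    public static void main(String[] args) throws Exception {
        final long perTask = 250_000;
        final int tasks = 4;
        SplittableRandom root = new SplittableRandom(2014L);
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        List<Future<Long>> futures = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            // One independent generator per task; no locking needed.
            SplittableRandom prng = root.split();
            futures.add(pool.submit(() -> countInside(prng, perTask)));
        }
        long inside = 0;
        for (Future<Long> f : futures) inside += f.get();
        pool.shutdown();
        double pi = 4.0 * inside / (tasks * perTask);
        System.out.println(Math.abs(pi - Math.PI) < 0.01);
    }
}
```

With a million points in total, the standard error of the estimate is well under 0.01, so the printed check should hold comfortably.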

We also measured the performance difference between a purely sequential version of our benchmarks and the parallel version using just one thread. Performance is roughly the same: for the calculation of π, the sequential version was 1% slower on AMD 6278 and 1% faster on Intel X7560 when using SplittableRandom. We observed a reversed effect when using java.util.Random: the sequential version was 2% faster on AMD 6278 and 2% slower on Intel X7560. For stream summation, the two versions sometimes differed in execution time by as much as 16%, but we also observed similar (unexplained) run-to-run variation when comparing runs of each version separately.

LSS do not report much about absolute performance of DOTMIX, but they do remark that, for one benchmark, using DOTMIX was about 2.3 times as slow as using Mersenne Twister [20, §6]. Sean Luke reports that his optimized Java implementation of Mersenne Twister [21] is about 1/3 faster than using java.util.Random, and our Table 3 indicates that using SplittableRandom is about 4 to 8 times as fast as using java.util.Random. These data imply that SplittableRandom is roughly 6 to 12 times as fast as DOTMIX. Given the need for DOTMIX to chase pointers among stack frames and to perform multiplications modulo a prime number, this estimate seems plausible.

[Plot: speedup relative to the data point for 1 thread versus number of threads (0 to 32), for the Intel i7 and AMD 6278 systems; title: Monte Carlo calculation of π using SplittableRandom.]

Figure 20. Parallel scaling for calculation of π

7. Other Related Work

L'Ecuyer [15, Figure 3] describes a way to create a PRNG of good quality and very long period by combining two or more multiplicative linear congruential generators, and remarks that the period of the combined generator can be "split" (partitioned) into separate sections easily because each of the underlying generators can be so partitioned.

Park and Miller [26], surveying the situation in 1988, demonstrated that many PRNG algorithms then in use were of terrible quality, and recommended a specific prime-modulus multiplicative linear-congruential ("Lehmer") generator z_{n+1} = 16807 z_n mod (2^31 − 1), and challenged future PRNG designers to do at least as well. They noted that the multiplier 16807, while not necessarily the absolute best choice, was quite good and furthermore small enough to permit certain implementation tricks to improve speed.
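The Park–Miller recurrence is simple enough to state directly in code. The sketch below (class and method names are ours) uses 64-bit arithmetic to avoid overflow; Park and Miller's implementation tricks, such as Schrage's method, instead keep the computation within 32 bits.

```java
public class Lehmer {
    // "Minimal standard" generator of Park and Miller:
    //   z_{n+1} = 16807 * z_n mod (2^31 - 1)
    private static final long M = (1L << 31) - 1;  // 2^31 - 1, a Mersenne prime
    private long z;

    public Lehmer(long seed) {
        // seed must lie in [1, M-1]; 0 is a fixed point and must be avoided
        this.z = seed;
    }

    public long next() {
        z = (16807 * z) % M;  // 16807 * (2^31 - 2) fits comfortably in 64 bits
        return z;
    }
}
```

Starting from seed 1, the first few outputs are 16807, 282475249, 1622650073, ...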

Burton and Page [6] describe partitioning the period of a sequential PRNG in three ways: alternation (elsewhere called "leapfrogging" or "lagging"), halving ("jumping ahead" by a fixed amount, typically to cut the stream in half), and jumping to a random point in the cycle. In effect, our split method takes this third approach after randomly choosing among a large number of possible cycles.
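To make the first of these techniques concrete, here is a hedged sketch of leapfrog partitioning, in which stream k of n consumes outputs k, k+n, k+2n, ... of a single base sequence. The illustration is ours and uses Knuth's MMIX linear-congruential constants as the base generator, not any generator from this paper.

```java
public class LeapfrogStream {
    // Base sequence: a 64-bit LCG with Knuth's MMIX constants.
    private long state;
    private final int stride;

    // Stream `offset` of `stride` streams carved out of the sequence
    // produced from `seed`: it sees base outputs offset+1, offset+1+stride, ...
    public LeapfrogStream(long seed, int offset, int stride) {
        this.state = seed;
        this.stride = stride;
        for (int i = 0; i < offset; i++) step();  // advance to this stream's start
    }

    private long step() {
        state = state * 6364136223846793005L + 1442695040888963407L;
        return state;
    }

    public long next() {
        long v = step();                          // this stream's next output
        for (int i = 1; i < stride; i++) step();  // skip the other streams' outputs
        return v;
    }
}
```

With stride 2, the streams at offsets 0 and 1 interleave to reproduce the base sequence exactly; the cost is that each stream performs stride state transitions per output, which is one reason splitting is attractive by comparison.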

In the mid-1990s, Augustsson [3] implemented L'Ecuyer's algorithm in purely functional form as part of the Haskell standard library System.Random; the code now in that library, dated 2001, contains a kernel (Figure 21) with two functions stdNext and stdSplit. The implementation of stdNext is a faithful rendition of L'Ecuyer's algorithm [15, Figure 3], but the stdSplit method does not split the period in the manner suggested by L'Ecuyer; instead, it uses

stdNext :: StdGen -> (Int, StdGen)
stdNext (StdGen s1 s2) =
  (fromIntegral z', StdGen s1'' s2'') where
    z' = if z < 1 then z + 2147483562 else z
    z = s1'' - s2''
    k = s1 `quot` 53668
    s1' = 40014 * (s1 - k * 53668) - k * 12211
    s1'' = if s1' < 0 then s1' + 2147483563 else s1'
    k' = s2 `quot` 52774
    s2' = 40692 * (s2 - k' * 52774) - k' * 3791
    s2'' = if s2' < 0 then s2' + 2147483399 else s2'

stdSplit :: StdGen -> (StdGen, StdGen)
stdSplit std@(StdGen s1 s2) = (left, right) where
  -- no statistical foundation for this!
  left  = StdGen new_s1 t2
  right = StdGen t1 new_s2
  new_s1 | s1 == 2147483562 = 1
         | otherwise        = s1 + 1
  new_s2 | s2 == 1   = 2147483398
         | otherwise = s2 - 1
  StdGen t1 t2 = snd (next std)

Figure 21. Haskell library System.Random kernel (2001)

an ad hoc method that, by its own admission, has "no statistical foundation," but is not that different in structure from SPLITMIX except that it fails to try to compute "random" values for initializing the new StdGen objects.

Hellekalek [12] pointed out the difficulty of finding good-quality PRNG algorithms for parallel use, and that splitting via "leapfrogging" can produce unexpected surprises.

L'Ecuyer [17] describes severe failures of the algorithm used by java.util.Random (as well as the PRNG facilities of Visual Basic and Excel as of 2001 and the "Lehmer 16807" generator recommended by Park and Miller [26]) to pass a "birthday spacings" test. He also comments, "In the Java class java.util.Random, RNG streams can be declared and constructed dynamically, without limit on their number. However, no precaution seems to have been taken regarding the independence of these streams."

L'Ecuyer et al. [19] describe an object-oriented C++ PRNG package RngStream that supports repeatedly splitting its very long period (approximately 2^191) into streams and substreams. (We discuss this package at length in Section 1.)

Salmon et al. [30] describe a parallel PRNG algorithm with period 2^128 that furthermore relies on a key that allows selection of one of 2^64 independent streams. They remark that their paper "revisits the underutilized design approach of using a trivial f and building complexity into the output function"; their f corresponds to our nextSeed method and their output function to our bit-mixing function, so in that respect we follow in their footsteps. They suggest that f might be as simple as f(x) = x + 1, and so refer to their methods as "counter-based"; they rely on powerful (perhaps cryptographically strong) output functions. Our approach differs in that we are willing to choose f more carefully in order to reduce the cost of the output (bit-mixing) function. Salmon et al. also make use of the golden ratio and Weyl sequences to generate "round keys" for AES encryption (though, curiously, they use 64 bits of the golden ratio for the high half of the initial round key and 64 bits of √3 − 1 for the low half).
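The counter-based design point can be sketched in a few lines. The illustration below is ours: it pairs the trivial transition f(x) = x + 1 with the 64-bit mixing function this paper calls mix64variant13 (Stafford's variant of the MurmurHash3 finalizer), rather than with the cryptographic output functions Salmon et al. actually use.

```java
public class CounterPrng {
    // All statistical quality comes from the output function; the state
    // transition is just a counter increment.
    public static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    private long counter;

    public CounterPrng(long key) {
        this.counter = key;  // the starting counter value selects the stream
    }

    public long next() {
        return mix64(counter++);  // f(x) = x + 1, then mix
    }
}
```

Because next() depends only on the counter value, such a generator needs no sequential dependence at all: output n can be computed directly as mix64(key + n), which is what makes the approach attractive on SIMD and GPU hardware.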

Claessen and Pałka [7] remark on an application that exposes a severe defect of the stdSplit function in the Haskell standard library System.Random, then describe a superior implementation of the same purely functional API that is similar in spirit to LSS: it generates random values by encoding the path in the split tree as a sequence of numbers and then applying a cryptographically strong hash function. Their path encoding and hash function are designed to allow successive pseudorandom values to be computed incrementally in constant time, independent of the path length. We have not yet made comparative measurements, but it appears that their approach produces pseudorandom streams of higher quality, while our approach is likely much faster.

8. Conclusions and Future Work

Inspired by Leiserson, Schardl, and Sukha, we have traced a winding development path that led us to the SPLITMIX family of algorithms for parallel generation of sets and sequences of pseudorandom values. They should not be used for cryptographic or security applications, because they are too predictable (the mixing functions are easily inverted, and two successive outputs suffice to reconstruct the internal state), but because they pass standard statistical test suites such as DieHarder and TestU01, they appear to be suitable for "everyday" use such as in Monte Carlo algorithms. One version seems especially suitable for use as a replacement for java.util.Random, because it produces sequences of higher quality, is faster in sequential use, is easily parallelized for use in JDK8 stream expressions, and is amenable to efficient implementation on SIMD and GPU architectures.

Each SplittableRandom object produces a stream of pseudorandom values of period 2^64; this period is relatively short, but if values were generated sequentially at a rate of one per nanosecond, it would take over five centuries to cycle through the period once. In practice, one would generate such a large quantity of pseudorandom values in parallel, using splitting, which gives each thread (or core) a different γ value and therefore a different permutation of the 2^64 values. Since almost 2^63 distinct γ values may be used, the state space of a SplittableRandom object has almost 2^127 states. (Salmon et al. [30] make this same point, and declare: "A PRNG with a period of 2^19937 is not substantially superior to a family of 2^64 PRNGs, each with a period of 2^130.") If this is insufficient for certain applications, a version using arithmetic of higher precision (128 bits or more) may suffice. However, a specific attraction of SplittableRandom is that it has the relatively small "likelihood of accidental overlap" of a PRNG with 2^127 states but the computational speed of a 64-bit algorithm. Code size and data size are quite small.

In the future we would like to subject these algorithms to even more rigorous testing. It may also be that even more effective mixing functions will be discovered. Finally, we hope that theoreticians will be able to illuminate the strengths and weaknesses of this family of algorithms, and to that end we offer the following observations. In an alternate version of the code for SplittableRandom (Figure 16), we used mix64 rather than mix64variant13 in method mixGamma. In the end we chose not to share code and used mix64variant13, not because of any hard evidence or solid theory, but simply because of an intuition (or engineering superstition?) that, all else being equal, it is best to avoid opportunities for accidental correlation. On the other hand, we chose to use GOLDEN_GAMMA in two places, and one would think the same intuition might lead us to use a different value (perhaps the odd integer closest to 2^64/δ_S, where δ_S = 1 + √2 is the silver ratio?). And even our choice of the odd integer closest to 2^64/φ was based only on the intuition that it might be a good idea to keep γ values "well spread out" and the fact that prefixes of the Weyl sequence generated by 1/φ are known to be "well spread out" [14, exercise 6.4-9]. Finally, guided by intuition and informal argument, we made a somewhat arbitrary decision to suppress certain γ values in the method mixGamma. And yet we are haunted by the memory of the Enigma cipher machines, which were carefully designed so that "bad" encryptions could never occur by guaranteeing that no letter would ever map to itself, so that the word "LONDON" would never be encrypted as the word "LONDON"; yet that avoidance of "bad" encryptions was the Achilles' heel that allowed the codebreaking machinery (the "cryptologic bombes" designed by Alan Turing and Gordon Welchman) to be so effective. By analogy, we worry that our effort to avoid "bad" random sequences that we believe would occur only rarely has in fact just introduced an unnecessary and undesirable bias. True randomness requires not that "miracles" or "disasters" be absent, but that they be present in due proportion, and human intuition is notoriously poor at judging this proportion. In the future, all of the choices outlined in this paragraph should be questioned and tested against better theory than we have so far mustered. It would be a delightful outcome if, in the end, the best way to split off a new PRNG is indeed simply to "pick one at random."
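For the record, the "odd integer closest to 2^64/φ" can be recomputed directly. This sketch (ours; the class name is hypothetical) reproduces the GOLDEN_GAMMA constant 0x9E3779B97F4A7C15:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

public class GoldenGamma {
    // Compute the odd 64-bit integer closest to 2^64 / phi, where
    // phi = (1 + sqrt 5) / 2 is the golden ratio.
    public static long goldenGamma() {
        MathContext mc = new MathContext(50);
        BigDecimal sqrt5 = BigDecimal.valueOf(5).sqrt(mc);  // BigDecimal.sqrt needs Java 9+
        BigDecimal phi = BigDecimal.ONE.add(sqrt5).divide(BigDecimal.valueOf(2), mc);
        BigDecimal x = new BigDecimal(BigInteger.ONE.shiftLeft(64)).divide(phi, mc);
        BigInteger floor = x.toBigInteger();  // truncation = floor for positive x
        // If floor(x) is odd it is the closest odd integer (floor + 2 is farther);
        // if floor(x) is even, floor + 1 is closer than floor - 1, because the
        // fractional part of x is nonnegative.
        BigInteger odd = floor.testBit(0) ? floor : floor.add(BigInteger.ONE);
        return odd.longValue();  // wraps to the signed 64-bit pattern used in Java
    }

    public static void main(String[] args) {
        System.out.printf("0x%016X%n", goldenGamma());
    }
}
```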

Acknowledgments

We thank Claire Alvis for implementing our first prototype for Fortress, Martin Buchholz for suggesting that we test sparse γ values, and Brian Goetz, Paul Sandoz, Peter Levart, Kasper Neilsen, Mike Duigou, and Aleksey Shipilev for discussion of the suitability of these algorithms for Java JDK8.

References

[1] Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt. The Fortress Language Specification Version 1.0, March 2008.

[2] Austin Appleby. MurmurHash3, April 3, 2011. Project wiki entry. http://code.google.com/p/smhasher/wiki/MurmurHash3 Accessed Sept. 10, 2013.

[3] Lennart Augustsson. Personal communication, Aug. 30, 2013.

[4] Richard P. Brent. Note on Marsaglia's Xorshift random number generators. Journal of Statistical Software, 11(4):1–5, Aug. 2004.

[5] Robert G. Brown, Dirk Eddelbuettel, and David Bauer. Dieharder: A random number test suite, version 3.31.1, 2003–2006. http://www.phy.duke.edu/~rgb/General/dieharder.php Accessed Sept. 10, 2013.

[6] F. Warren Burton and Rex L. Page. Distributed random number generation. J. Functional Programming, 2(2):203–212, April 1992.

[7] Koen Claessen and Michał Pałka. Splittable pseudorandom number generators using cryptographic hashing. In Proc. ACM SIGPLAN Haskell Symp., pages 47–58, 2013.

[8] Horst Feistel. Cryptography and computer privacy. Scientific American, 228(5):15–23, May 1973.

[9] Gregory W. Fischer, Ziv Carmon, Dan Ariely, Gal Zauberman, and Pierre L'Ecuyer. Good parameters and implementations for combined multiple recursive random number generators. Operations Research, 47(1):159–164, Jan. 1999.

[10] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification (first edition). Addison-Wesley, 1996.

[11] James Gosling, Bill Joy, Guy Steele, Gilad Bracha, and Alex Buckley. The Java Language Specification, Java SE 7 Edition. Addison-Wesley, Upper Saddle River, New Jersey, 2013.

[12] P. Hellekalek. Don't trust parallel Monte Carlo! In Proc. Twelfth Workshop on Parallel and Distributed Simulation (PADS 98), pages 82–89. IEEE, 1998.

[13] Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, third edition, 1998.

[14] Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, second edition, 1998.

[15] Pierre L'Ecuyer. Efficient and portable combined random number generators. Comm. ACM, 31(6):742–751, June 1988.

[16] Pierre L'Ecuyer. Uniform random number generators. In Proc. 30th Winter Simulation Conf., WSC '98, pages 97–104. IEEE Computer Society Press, 1998.

[17] Pierre L'Ecuyer. Software for uniform random number generation: Distinguishing the good and the bad. In Proc. 33rd Winter Simulation Conf., WSC '01, pages 95–105, Washington, DC, USA, 2001. IEEE Computer Society.

[18] Pierre L'Ecuyer and Richard Simard. TestU01: A C library for empirical testing of random number generators. ACM Trans. Math. Softw., 33(4), August 2007.

[19] Pierre L'Ecuyer, Richard Simard, E. Jack Chen, and W. David Kelton. An object-oriented random-number package with many long streams and substreams. Operations Research, 50(6):1073–1075, Nov. 2002.

[20] Charles E. Leiserson, Tao B. Schardl, and Jim Sukha. Deterministic parallel random-number generation for dynamic-multithreading platforms. In Proc. 17th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pages 193–204, New York, 2012. ACM.

[21] Sean Luke. Documentation for the Mersenne Twister in Java, October 2004. http://www.cs.gmu.edu/~sean/research/mersenne Accessed March 11, 2014.

[22] George Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1–6, Jul. 2003.

[23] Makoto Matsumoto and Takuji Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul., 8(1):3–30, January 1998.

[24] Martin Odersky. The Scala Language Specification, Version 2.7. EPFL Lausanne, Switzerland, 2009.

[25] Oracle. Interface Spliterator<T>. Documentation for Java Platform Standard Ed. 8 DRAFT ea-b106. http://download.java.net/jdk8/docs/api/java/util/Spliterator.html Accessed Sept. 13, 2013.

[26] Stephen K. Park and Keith W. Miller. Random number generators: Good ones are hard to find. Comm. ACM, 31(10):1192–1201, October 1988.

[27] Ronald L. Rivest, M. J. B. Robshaw, R. Sidney, and Y. L. Yin. The RC6 block cipher, version 1.1, August 20, 1998. Unpublished paper. http://people.csail.mit.edu/rivest/pubs/RRSY98.pdf.

[28] C. S. Roberts. Implementing and testing new versions of a good, 48-bit, pseudo-random number generator. Bell System Technical Journal, 61(8):2053–2063, Oct. 1982.

[29] Mutsuo Saito and Makoto Matsumoto. A deviation of CURAND: Standard pseudorandom number generator in CUDA for GPGPU, Feb. 2012. Slides presented at the Tenth Intl. Conf. Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. http://www.mcqmc2012.unsw.edu.au/slides/MCQMC2012_Matsumoto.pdf.

[30] John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw. Parallel random numbers: As easy as 1, 2, 3. In Proc. 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pages 16:1–16:12. ACM, 2011.

[31] Tao B. Schardl. Personal communication, Aug. 13, 2012.

[32] Richard Simard. TestU01 version 1.2.3, August 2009. Website at http://www.iro.umontreal.ca/~simardr/testu01/tu01.html.

[33] David Stafford. Better bit mixing: Improving on MurmurHash3's 64-bit finalizer, September 28, 2011. Blog "Twiddling the Bits." http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html Accessed Sept. 10, 2013.

[34] Thinking Machines Corporation. CM-2 Technical Summary. Technical Report HA87-4, Cambridge, Massachusetts, April 1987. https://archive.org/details/06Kahle001885.

[35] John Walker. HotBits: Genuine random numbers, generated by radioactive decay. Website at http://www.fourmilab.ch/hotbits/.

[36] A. F. Webster and Stafford E. Tavares. On the design of S-boxes. In Advances in Cryptology (CRYPTO '85), August 18–22, 1985, Proceedings, volume 218 of Lecture Notes in Computer Science, pages 523–534. Springer, 1985.
