a parallel monte carlo transport algorithm using a pseudo-random tree to guarantee reproducibility

Parallel Computing 4 (1987) 281-290, North-Holland

A parallel Monte Carlo transport algorithm using a pseudo-random tree to guarantee reproducibility

Paul FREDERICKSON and Robert HIROMOTO Computing and Communications Division, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.

John LARSON Cray Research Inc., Chippewa Falls, WI 54729, U.S.A.

Received June 1986

Abstract. We present a parallel Monte Carlo photon transport algorithm that ensures the reproducibility of results. The important feature of this parallel implementation is the introduction of a pair of pseudo-random number generators. This pair of generators is structured in such a manner as to ensure minimal correlation between the two sequences of pseudo-random numbers produced. We term this structure a 'pseudo-random tree'. Using this structure, we are able to reproduce results exactly in an asynchronous parallel processing environment. The algorithm tracks the history of photons as they interact with two carbon cylinders joined end to end. The algorithm was implemented on both a Denelcor HEP and a CRAY X-MP/48. We describe the algorithm and the pseudo-random tree structure and present speedup results of our implementation.

Keywords. Parallel Monte Carlo photon transport algorithm, pseudo-random number generator, parallel implementation, Denelcor HEP, CRAY X-MP/48, speedup results.

1. Introduction

The parallel processing of a single computational task can introduce side effects that add significant programming complexity to the verification of program correctness. When each of several asynchronous instruction execution streams follows a strictly deterministic execution flow and acts upon an independent partition of the problem's data structures, no side effects occur, and the sequential results are easily obtained exactly. However, two side effects may occur when all parallel instruction streams access and update a single memory location, or when they asynchronously use, or are controlled by, data values that are defined recursively. The first effect, which arises in accumulation into a global counter or sum, is a result of the different roundoff errors that occur when the elements of a sum are accumulated in various orders. The only way to obtain the exact sum produced by the sequential algorithm is to compute it sequentially. In order to compute the sums in parallel, we tolerate these differences and call the parallel sum exact (which is true for integer sums because there is no roundoff error). The second effect, nondeterministic execution flow, is a result of the dependence of the flow of an instruction stream on the time when the instructions are executed. A broad class of algorithms having this unpredictable behavior are those that use the Monte Carlo method [7]. This method resorts to the use of statistical techniques, coupled with a source of 'random' numbers, to induce a model of a stochastic system. Central to the Monte Carlo method is the use of a single, long, statistically valid sequence of pseudo-random numbers [5]. Based on the values of these numbers within the sequence, various decision branches are followed. Any reordering of these random numbers from one problem execution to the next (assuming that the problem is itself unchanged) will necessarily give different, though statistically valid, results. It is evident, then, that processing algorithms based on the Monte Carlo method in a parallel, asynchronous environment adds a greater level of complexity to the burden of proving program correctness.

0167-8191/87/$3.50 © 1987, Elsevier Science Publishers B.V. (North-Holland)

To resolve the induced nondeterministic behavior of a parallel Monte Carlo algorithm, we took a simple but typical Monte Carlo photon (particle) transport simulation program, developed it as a parallel algorithm, and implemented a rather simple tracking scheme based upon a pseudo-random tree [3] to ensure reproducibility. Using a minimum of two random number generators to structure a pseudo-random tree for our problem configuration, we were able to demonstrate program correctness, independent of the processing order and the number of processes executing in parallel. We further show that the overhead in using the pseudo-random tree does not degrade parallel performance. Finally, we describe an extremely interesting error in the photon transport program that might not have been detected in a fully parallel execution without the aid of the pseudo-random tree described in this paper.
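The first side effect discussed in this section, the dependence of a floating-point sum on accumulation order, can be seen in a few lines. The following is a minimal sketch in Python; the function name `accumulate` and the sample values are chosen purely for illustration and do not come from the transport program.

```python
def accumulate(values):
    """Sum the values strictly left to right, as a sequential loop would."""
    total = 0
    for v in values:
        total += v
    return total


# The same four floating-point terms, accumulated in two different orders.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1.0, 1.0, 1e16, -1e16]

# The same idea with integer tallies, where no roundoff can occur.
int_a = [10**16, 1, -10**16, 1]
int_b = [1, 1, 10**16, -10**16]

if __name__ == "__main__":
    print(accumulate(order_a), accumulate(order_b))  # the two float sums differ
    print(accumulate(int_a), accumulate(int_b))      # the two integer sums agree
```

In double precision the spacing between representable numbers near 1e16 is 2, so adding 1.0 to 1e16 is lost to rounding in the first ordering but not in the second. Integer tallies have no such problem, which is why a parallel integer sum can be called exact.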

2. Description of the Monte Carlo algorithm

Our problem is a gamma ray (photon) transport simulation using the Monte Carlo method. The program tracks the interaction of gamma rays passing through two carbon cylinders joined end to end. The problem model is linear in that the flux of particles does not affect the material cross sections. During the gamma ray's lifetime, various physical processes (illustrated in Figs. 1-3) are selected on the basis of a random number and may affect the progress of the particles (photons). These processes include Compton scattering (the collision of a photon with a free electron), pair production (the absorption of a photon in the presence of a nuclear or electron field, producing an electron-positron pair with a total energy equal to that of the photon), and photoelectric absorption (the absorption of a photon by a bound electron, which is then ejected as a free electron from its orbit). A much more detailed account of each of these physical processes may be found in [1].

Fig. 1. Compton scattering, where the incoming photon with energy $\gamma_1$ scatters off a free electron $e^-$ and emerges with energy $\gamma_2$.

Fig. 2. Pair production, where the formation of an electron-positron ($e^{\pm}$) pair results in the production of two photons with energy $\gamma$.

Using the Monte Carlo method, we generated a sequence of pseudo-random numbers for sampling the photon energy, the distance to the next collision within the carbon geometry, the isotropic scattering of the photon under Compton scattering, the probability of photon absorption by the photoelectric effect, and numerous other processes. Statistical importance sampling techniques are employed by assigning weights to the occurrence of various physical processes. Particle splitting and Russian roulette are two such sampling techniques [1] used. Particle splitting increases the particle sample size: a particle of weight $w$ may be replaced with $n$ identical particles but with weights $w_1, \ldots, w_n$, where $w_1 + \cdots + w_n = w$. These particles are then processed independently. On the other hand, if the number of particles becomes too large, the Russian roulette technique selects a particle and with some probability $p$ discards it from the sample. If the selected particle is not discarded but allowed to proceed, its weight is multiplied by $(1 - p)^{-1}$. This process is repeated until the number of particles is brought to a manageable computation size. Furthermore, particle cutoff techniques in the form of weight cutoff and energy cutoff routines are employed to reduce particle tracking time. Although the Monte Carlo method may not be a direct algorithmic approach to solving the problem, it does provide a powerful technique for the solution of many otherwise intractable problems [4].
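The weight bookkeeping behind particle splitting and Russian roulette can be sketched as follows. This is an illustrative Python fragment, not the code of the transport program: the function names are hypothetical, and equal daughter weights $w/n$ are assumed as one simple choice satisfying $w_1 + \cdots + w_n = w$.

```python
def split(weight, n):
    """Replace one particle of the given weight by n particles.

    Assumption: the weight is shared equally, so the n weights sum to
    the original weight.
    """
    return [weight / n] * n


def russian_roulette(weight, p, u):
    """Play Russian roulette on one particle.

    p is the discard probability and u is a uniform number in [0, 1).
    Returns None if the particle is discarded; otherwise returns the
    survivor's weight, multiplied by 1 / (1 - p).
    """
    if u < p:
        return None
    return weight / (1.0 - p)


def expected_weight_after_roulette(weight, p):
    """Survival probability times survivor weight (the expected weight)."""
    return (1.0 - p) * (weight / (1.0 - p))
```

Both operations leave the expected total weight unchanged, which is why they alter the sample size without biasing the tallies.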

3. Pseudo-random sequences and pseudo-random trees

True random sequences are no longer used in numerical computation. Many years ago they were replaced by pseudo-random sequences, with reproducibility being one of the primary reasons for the changeover. Pseudo-random sequences have the property that the entire sequence can be reconstructed if the initial seed X and the generator S are known, which makes reproducible Monte Carlo simulation possible. By far the most commonly used pseudo-random sequences are Lehmer sequences, or linear congruential sequences. For these the transformation S is simply

$$S(X) = (aX + c) \bmod m,$$

and the properties of the pseudo-random sequence are determined by the integers a, c, and m.
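A minimal Python sketch of such a sequence is given below. The constants `A`, `C`, and `M` are illustrative values only (they are not taken from this paper), and no claim is made here about their statistical quality.

```python
# Illustrative constants only; the paper does not state the values it used.
M = 2**31
A = 1103515245
C = 12345


def step(x):
    """One application of the transformation S(X) = (aX + c) mod m."""
    return (A * x + C) % M


def sequence(seed, n):
    """Return the first n successors of the seed."""
    out = []
    x = seed
    for _ in range(n):
        x = step(x)
        out.append(x)
    return out


def to_unit(x):
    """Map an integer state to a floating-point number in [0, 1)."""
    return x / M
```

Calling `sequence(seed, n)` twice with the same seed returns the same list, which is the reproducibility property described above.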

Fig. 3. Photoelectric effect with fluorescence photon fp.


Fig. 4. A tree drawn pseudo-randomly.

For more detail on pseudo-random sequences, and Lehmer sequences in particular, refer to Knuth [5], the review article of Niederreiter [8], or the original paper of Lehmer [6].

A pseudo-random tree is a rather natural generalization of a pseudo-random sequence. In place of the transformation S there are two transformations L and R, which can act on each node in the tree, and the integer X at any node is followed by both a right successor R(X) and a left successor L(X). Pseudo-random trees were introduced [3] in order to bring reproducibility to parallel Monte Carlo simulations, and this is the primary point that we discuss here.

Fig. 5. The same tree drawn again pseudo-randomly but with a random branch missing.

We recommend a particular family of pseudo-random trees, which we refer to as Lehmer trees because they are simple generalizations of Lehmer sequences. Given any node X in the tree, the two successors L(X) and R(X) of X are defined by

$$L(X) = (a_L X + c_L) \bmod m, \qquad R(X) = (a_R X + c_R) \bmod m.$$

The five constants $a_L$, $a_R$, $c_L$, $c_R$, and $m$ determine all of the statistical properties of the pseudo-random tree, and together with the seed $X_0$ at the root node they determine a unique Lehmer tree. Our reasons for recommending this particular family of pseudo-random trees are threefold. First, they are easily and portably implemented on almost any computer. Second, the theory seems simple enough that we can expect a firm foundation to be developed, analogous to the theory for Lehmer sequences. Finally, we have no good reason to believe that any other family of pseudo-random trees offers any advantages.
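The two successor maps can be sketched in Python as below. The five constants are hypothetical placeholders chosen for illustration; the paper does not list the values it used, and these are not claimed to give good statistical properties.

```python
# Hypothetical constants for illustration; not the values used in the paper.
M = 2**31
A_L, C_L = 1103515245, 12345
A_R, C_R = 69069, 1


def left(x):
    """Left successor L(X) = (a_L X + c_L) mod m."""
    return (A_L * x + C_L) % M


def right(x):
    """Right successor R(X) = (a_R X + c_R) mod m."""
    return (A_R * x + C_R) % M


def node(root_seed, path):
    """Return the integer at the node reached from the root by a path
    written as a string of 'L' and 'R' steps, e.g. 'LRR'."""
    x = root_seed
    for move in path:
        if move == "L":
            x = left(x)
        elif move == "R":
            x = right(x)
        else:
            raise ValueError("path may contain only 'L' and 'R'")
    return x
```

The value at any node depends only on the root seed and the path to that node, so it can be computed without visiting any other part of the tree.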

The trees in Figs. 4 and 5 were drawn pseudo-randomly, starting from the same root seed. The decision to create a branch at one of the nodes was arbitrarily changed for the tree in Fig. 5. Observe that the rest of the tree remains the same, which would not have been the case for a tree drawn using a pseudo-random sequence.
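The effect of removing a branch can be imitated in a small Python experiment. This is a sketch under assumptions: the constants are the same hypothetical ones as before, a "drawing" is reduced to a dictionary mapping each node's path to its number, and the single-sequence variant hands out numbers in depth-first visiting order.

```python
# Hypothetical constants for illustration; not the values used in the paper.
M = 2**31
A_L, C_L = 1103515245, 12345
A_R, C_R = 69069, 1


def left(x):
    return (A_L * x + C_L) % M


def right(x):
    return (A_R * x + C_R) % M


def draw_with_tree(seed, depth, pruned=frozenset()):
    """Assign numbers from the Lehmer tree; each node's value depends
    only on its own path. Subtrees rooted at paths in `pruned` are skipped."""
    nodes = {}
    stack = [("", seed)]
    while stack:
        path, x = stack.pop()
        if path in pruned:
            continue
        nodes[path] = x
        if len(path) < depth:
            stack.append((path + "R", right(x)))
            stack.append((path + "L", left(x)))
    return nodes


def draw_with_sequence(seed, depth, pruned=frozenset()):
    """Same traversal, but every visited node takes the next number of
    one single sequence, so values depend on the visiting order."""
    nodes = {}
    x = seed
    stack = [""]
    while stack:
        path = stack.pop()
        if path in pruned:
            continue
        x = right(x)
        nodes[path] = x
        if len(path) < depth:
            stack.append(path + "R")
            stack.append(path + "L")
    return nodes
```

With the tree, every node that survives the pruning keeps its value; with the single sequence, every node visited after the missing branch receives a different number.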

4. Parallel implementation

Our primary concern in the introduction of pseudo-random trees is to maintain the reproducibility of a computation even on asynchronous parallel computers. The implementation is almost totally parallel. The tracking of each particle and its daughter particles is independent of all other histories. Each original particle is given a unique random 'particle seed', which in turn is used to generate a unique sequence of random numbers used in doing the appropriate physics for that particle. A Lehmer tree is used for the production of a left sequence of particle seeds and a corresponding right sequence of physics decision seeds. (See Fig. 6.) If a particular step of the Monte Carlo computation has a pseudo-random seed as one of its input variables, the result of the computation should not depend on when the step was executed in comparison with the other computation steps, nor should the result depend on which processor executes the step, or on how many processors are in use. A Monte Carlo computation often has branch points in the computation flow, as when a new particle is created after a collision. When this happens, a new daughter particle seed Y = L(X) is produced with the left step of the generator and is set aside with the other data needed for this branch of the computation. The computation then continues along the main branch of the computational flow, driven by the pseudo-random sequence produced by the right transformations.

Fig. 6. Monte Carlo Lehmer tree.

A global particle bank is implemented into which these daughter particles are deposited for future processing. When any process completes the particle history it is currently working on, as happens when that particle exits the region of interest, it removes a particle from the global particle bank and continues that particle's history. Only if the bank is empty does it create a new particle. This policy tends to minimize the needed capacity of the bank.

During each track of a particle's history, local tally bins are provided for the accumulation of statistics used, for example, in sample biasing. The accumulation of these local statistics couples one particle history to another and requires the local sums to be accumulated into a global tally bin, which forms a critical section of the parallel program. As implemented, each photon history is initiated by an available process using a self-scheduling technique.

The use of the Lehmer tree for pseudo-random number generation, together with the bank data structure for saving the environment and seed of generated particles, guarantees the tracking of identical particle histories independent of the order in which the particles are processed in parallel.
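The order independence claimed here can be checked on a toy model. The Python sketch below is not the transport program: the "physics" in `track` is invented (fixed absorption and branching thresholds), the constants are hypothetical, and parallel scheduling is imitated sequentially by varying which banked particle is removed next.

```python
import random

# Hypothetical constants for illustration; not the values used in the paper.
M = 2**31
A_L, C_L = 1103515245, 12345
A_R, C_R = 69069, 1


def left(x):
    return (A_L * x + C_L) % M


def right(x):
    return (A_R * x + C_R) % M


def track(seed, generation, max_steps=50, max_generation=3):
    """Follow one toy history. Physics decisions use right steps; each
    daughter receives the seed L(X) of the current state X."""
    x = seed
    collisions = 0
    daughters = []
    for _ in range(max_steps):
        x = right(x)
        u = x / M
        if u < 0.3:  # toy rule: the particle is absorbed or escapes
            break
        collisions += 1
        if u > 0.8 and generation < max_generation:  # toy rule: branch
            daughters.append((left(x), generation + 1))
    return collisions, daughters


def simulate(n_source, root_seed, pick):
    """Process histories, taking banked particles before creating new
    source particles. pick(bank) chooses which banked particle is next."""
    bank = []
    results = []
    source_seed = root_seed
    created = 0
    while bank or created < n_source:
        if bank:
            seed, generation = bank.pop(pick(bank))
        else:
            source_seed = left(source_seed)  # left sequence of particle seeds
            created += 1
            seed, generation = source_seed, 0
        collisions, daughters = track(seed, generation)
        results.append((seed, generation, collisions))
        bank.extend(daughters)
    return sorted(results)


def lifo(bank):
    return len(bank) - 1


def fifo(bank):
    return 0


def shuffled(rng):
    return lambda bank: rng.randrange(len(bank))
```

Because every history is driven only by the seed it carries, `simulate` returns the same sorted list of histories for `lifo`, `fifo`, and a randomly shuffled removal order.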

5. Detection of a programming error

As a result of demanding reproducible results, independent of the number of parallel processes spawned, we were able to detect a logical programming error that would otherwise have gone unnoticed. The programming error was subtle and went undetected in the sequential execution because the results seemed statistically reasonable. This sequential modeling error manifested itself in the differentiation between particle and process state variables. Here we refer to particle state variables as those variables that uniquely distinguish one particle from another. Process state variables, on the other hand, are those program parameters that, combined with the actual program instruction stream, define the underlying independent parallel automata. This differentiation was essential in analyzing the programming error, which presented itself as a parallel race condition.

As implemented, the banking and unbanking of daughter particles were allowed to proceed in parallel among all processes. To ensure that banking and unbanking conflicts were properly arbitrated, a critical section of code was formulated about each of the banking procedures that allowed only one action of each type to proceed at any given moment. Using this heap of particle tasks, any (undetermined) process may access the heap for any additional or remaining particle task to be processed. In formulating the parallel segments of the program for this parallel heap construct, we paid strict attention to ensuring that all particle and program parameters were transformed into local variables (in the sense that they were hidden from all other parallel processes). Under this implementation the correct parallel computational flow was assumed to have taken on the flow indicated by Fig. 7. Note that each bubble represents a random acquisition of a daughter particle from the bank or heap of particles, while the attached arrows indicate the processing of these particles in an independent and parallel mode. It is important to understand that each one of these vertically directed segments is independent and may be processed in any one of a number of parallel processing automata.

Fig. 7. The proper parallel flow using the heap construct.

Unfortunately, the heap construct requires not only locality of particle state variables for parallel independence but also a clear delineation between particle and process state variables. By unintentionally including a particle state variable with those defining the process state variables, a variable-type mismatch was introduced, which inadvertently introduced a correlation between otherwise independent source and banked particle histories processed within a single process (or processor). In particular, the program failed to retain this variable-type distinction when processing those particles accessed from the global particle heap. A particle variable, though properly updated for each source particle, was used without modification in tracking the histories of particles acquired from the heap. Since all variables were indeed local to the actions of each parallel process, the task of screening for global dependencies proved fruitless. Only by deactivating the various parallel segments of the program was the error detected. Figure 8 attempts to illustrate the flow of the error-induced dependence, indicated by dashed arrows, between particles.
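The class of error described above can be illustrated with a deliberately small, hypothetical example in Python. The paper does not say which variable was involved; here a particle weight is assumed, and the names `tally_buggy`, `tally_fixed`, and `daughter_score` are invented for the illustration.

```python
def tally_buggy(process_state, particle, from_bank):
    """Hypothetical buggy scoring: a particle quantity is kept in the
    process-local state and refreshed only for source particles."""
    if not from_bank:
        process_state["weight"] = particle["weight"]
    # A banked particle silently reuses the value left by the last source.
    return process_state["weight"]


def tally_fixed(particle):
    """Corrected scoring: the quantity travels with the particle itself."""
    return particle["weight"]


source_a = {"name": "A", "weight": 1.0}
source_b = {"name": "B", "weight": 0.5}
daughter = {"name": "D", "weight": 0.25}


def daughter_score(last_source):
    """Score recorded for the banked daughter by a process whose most
    recent source particle was last_source (buggy version)."""
    state = {}
    tally_buggy(state, last_source, from_bank=False)
    return tally_buggy(state, daughter, from_bank=True)
```

In this sketch the score of the banked particle depends on which process happens to remove it from the bank, even though every variable is local to its process. Results of this kind change with the schedule, so a demand for exact reproducibility exposes them.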

Fig. 8. Parallel flow with mismatched variable interactions.

Had it not been for imposing exact reproducibility of results, the debugging of the parallel program would not have been instigated. Instead, the irregularities in the results would simply have been attributed to the inherent behavior of the Monte Carlo method, complicated by its interaction with an asynchronous parallel processing environment. It seems clear that the concept of the pseudo-random tree (though it adds a level of parallel complexity) is a very important tool for developing parallel Monte Carlo algorithms in a parallel processing environment. The advantages that the concept of pseudo-random trees holds may further be realized in sequential Monte Carlo algorithms [9]. Admittedly, the programming oversight described above is rather mundane and even embarrassing. Yet it serves to illustrate the important need not only for programming tools such as a pseudo-random tree but also for parallel debugging and data dependence analysis tools (static and dynamic) as part of an overall parallel programming environment.

Table 1. Denelcor HEP parallel execution times in seconds

No. of processors    10k particles    100k particles
1                    138.22           1371.57
2                    74.62            737.31
3                    53.99            526.30
4                    43.40            424.26
5                    37.28            361.80
6                    33.44            321.37
7                    30.56            293.19
8                    28.66            272.83
9                    27.30            257.59
10                   25.94            246.06
11                   25.23            238.04
12                   24.49            233.04
13                   24.86            230.20
14                   24.48            228.54
15                   24.17            22~A9
16                   25.00            228.56

6. Results

The Monte Carlo photon transport problem was run on both a Denelcor HEP and a CRAY X-MP/48. On the HEP, which had a single process execution module, we executed the parallel problem with up to 16 parallel processes. A single problem was executed using two different starting particle batches. Table 1 lists the corresponding parallel execution times. The speedup results, scaled by their respective sequential execution times, are plotted for comparison in Fig. 9. Clearly the parallelism is a function of the problem size being solved.

Fig. 9. Speedups for the Monte Carlo code on the Denelcor HEP (speedup versus number of processes, for 10K and 100K particles).

The same problem was executed on the CRAY X-MP/48, using from one to four parallel processors. Table 2 lists the execution times for ten thousand, one hundred thousand, and one million particles. The corresponding speedup profiles for the respective problem sizes are shown in Fig. 10. These results not only provided a means to compare the parallel performance in terms of speedup but also provided a simple means to detect other possible programming errors sensitive to differing synchronization patterns that may develop on different multiprocessor systems.

Table 2. Cray X-MP/48 parallel execution times in seconds

No. of processors    10k particles    100k particles    10^6 particles
1                    1.96             19.40             194.0
2                    1.01             9.91              98.9
3                    0.69             6.76              67.4
4                    0.59             5.31              52.5
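The speedups plotted for the CRAY X-MP/48 follow directly from the timings in Table 2. The short Python sketch below recomputes them; speedup is taken as the one-processor time divided by the p-processor time, and the efficiency (speedup divided by p) is an additional derived quantity not reported in the paper.

```python
# Execution times in seconds, copied from Table 2 (CRAY X-MP/48).
cray_times = {
    "10k": {1: 1.96, 2: 1.01, 3: 0.69, 4: 0.59},
    "100k": {1: 19.40, 2: 9.91, 3: 6.76, 4: 5.31},
    "1e6": {1: 194.0, 2: 98.9, 3: 67.4, 4: 52.5},
}


def speedup(times):
    """Speedup on p processors: one-processor time over p-processor time."""
    return {p: times[1] / t for p, t in times.items()}


def efficiency(times):
    """Parallel efficiency: speedup divided by the number of processors."""
    return {p: s / p for p, s in speedup(times).items()}


if __name__ == "__main__":
    for size, times in cray_times.items():
        row = ", ".join(f"{p}: {s:.2f}" for p, s in speedup(times).items())
        print(f"{size:>5} particles  speedup  {row}")
```

For one million particles on four processors this gives 194.0 / 52.5, or about 3.70, while the ten-thousand-particle run reaches only about 3.32, consistent with the dependence of speedup on problem size.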

7. Conclusion

Our computational experience with pseudo-random trees has more than met our expectations. We first installed the Monte Carlo program on the Denelcor HEP. Our demand for reproducibility allowed us to find a very elusive bug, one which would have seriously distorted the results if it had been left in the code. We were then able to move the code to the CRAY X-MP/48 and again verify the reproducibility of answers for all test cases. Moreover, we were able to obtain a speedup of 3.70 on the X-MP when we went from one to four processors.

We have demonstrated that the challenge of parallel processing a Monte Carlo algorithm correctly may be met by ensuring reproducibility of results independent of the number of parallel processes spawned. Our approach was to introduce a simple generalization of the Lehmer sequence to form a pseudo-random tree. The advantages of the pseudo-random tree include: ensured reproducibility of results, which implies a deterministic computational flow under asynchronous processing; detection of program bugs; and the ability to measure true parallel performance of a given Monte Carlo algorithm for a given parallel architecture [2].

Fig. 10. Speedups for the same problem on the CRAY X-MP/48 (speedup versus number of processors, for 10K and $10^6$ particles).

There are of course disadvantages that one must also be aware of when considering the use of such a pseudo-random tree: the addition of (at least) a second random number generator certainly adds programming complexity; the particle seed must now be 'carried' by the particle (i.e., an additional variable must now be incorporated into the particle descriptor list), which requires only p (equal to the number of parallel processes or processors) additional memory locations; and obtaining and banking particle seed values incurs additional execution overhead. This last overhead is not excessive in typical Monte Carlo algorithms designed to solve particle transport problems.

Acknowledgment

We would like to thank Jeff Newberry of Portales, New Mexico, a Co-Op student at Los Alamos, for the use of his highly artistic graphics.

References

[1] L.L. Carter and E.D. Cashwell, Particle-transport simulation with the Monte Carlo method, Technical Information Center, Energy Research and Development Administration, 1975.

[2] Y. Chauvet, Multitasking a vectorized Monte Carlo algorithm on the CRAY X-MP/2, Cray Channels (1984).

[3] P. Frederickson, R. Hiromoto, T. Jordan, B. Smith and T. Warnock, Pseudo-random trees in Monte Carlo, Parallel Comput. 1 (2) (1984) 175-180.

[4] J.M. Hammersley and D.C. Handscomb, Monte Carlo Methods (Wiley, New York, 1964).

[5] D.E. Knuth, The Art of Computer Programming, Vol. 2 (Addison-Wesley, Reading, MA, 1981).

[6] D.H. Lehmer, Proc. 2nd Symposium on Large-Scale Digital Calculation Machinery (Harvard University Press, Cambridge, MA, 1951).

[7] N. Metropolis and S. Ulam, The Monte Carlo method, J. Amer. Stat. Assoc. 44 (1949) 335-341.

[8] H. Niederreiter, Quasi-Monte Carlo methods and pseudo-random numbers, Bull. Amer. Math. Soc. 84 (1978) 957-1041.

[9] T. Warnock, Synchronization of random number generators, Congress. Numer. 37 (1983) 135-144.
