
Computer Physics Communications 62 (1991) 198-216
North-Holland

Multi-million particle molecular dynamics
I. Design considerations for vector processing

D.C. Rapaport
Physics Department, Bar-Ilan University, Ramat-Gan 52100, Israel *
and
Höchstleistungsrechenzentrum, Kernforschungsanlage Jülich, W-5170 Jülich, Germany

* Permanent address.

    Received 12 January 1990

Recent progress in developing enhanced methods for carrying out molecular dynamics simulation on vector supercomputers is described. The techniques in general use for rapid evaluation of the interactions between particles require modification in order to allow efficient implementation within the pipelined processing environment intrinsic to practically all supercomputers. These modifications, while effective in terms of processor utilization, consume substantial amounts of storage, and methods of reducing these requirements have had to be developed. The techniques discussed in this paper have been used in feasibility tests involving systems with up to 2.5 million particles.

    1. Introduction

This is the first of two papers that address the problem of implementing extremely large-scale molecular dynamics simulations on modern supercomputers. There are two outstanding architectural characteristics that are common to essentially all machines of this class, namely, that the computations are carried out by means of vector processing, and that performance is enhanced by distributing the computational effort across several processing units. The first feature is practically universal, while the second is becoming increasingly widespread. In order to carry out simulations that involve systems containing from a few hundred thousand to as many as several million particles on machines of this type, it is necessary, depending on hardware, to incorporate either or both of these architectural features into the computational algorithms. The manner of doing this for a vector processor is discussed in the present paper; in the second paper of the series [1] (hereinafter II) the approach used for distributed processing is described. Both papers deal with extensions of work originally described in ref. [2].

Vector processing is to computation what the production line is to manufacturing; for no matter how fast a particular generation of electronic device technology allows a computer to process data, if the machine is given vector-processing hardware based on similar technology it will be able to perform many of its tasks a great deal faster. Distributing the computational load over several processors will of course lead to further improvements, as will become apparent in II. The hurdle faced when implementing algorithms on vector machines is the need to ensure that the data is organized so that as much of the computation as possible makes effective use of the vector hardware. This is often a non-trivial and, on occasion, even an impossible task.

The majority of the enormous body of work carried out using molecular dynamics simulation [3] over the past three decades has not involved systems of the sizes addressed here. The reason for this is simple: for the majority of problems there is absolutely nothing to be gained by studying such large systems, and a few hundred to a few thousand particles generally suffice. The physics that underlies the phenomena modeled in most molecular dynamics studies involves short-ranged spatial correlations: correlated motion that extends over a distance of order several times the mean interatomic separation can be adequately accommodated within systems whose edges are typically of length some ten times the mean separation (even smaller sizes are still employed on occasion). Periodic boundaries are generally used in order to reduce finite-size effects; otherwise, a substantial fraction of the particles would lie close to a boundary, and relatively few in the bulk interior.

On the other hand, there do exist problems in which the characteristic length scales of the phenomena of interest are orders of magnitude greater than the mean interatomic separation; examples include polymers [4], incommensurate surface phases [5], and spontaneous structure formation in hydrodynamic flow [6]. Such problems cannot be seriously studied without resorting to systems of at least several tens of thousands of particles, and it is reasonable to expect the demand for simulations of this (and even greater) magnitude to grow with time as the value of the simulational approach becomes more widely appreciated.

It is a fact of life that algorithms designed for small problems do not always prove suitable when the problems are scaled up by several orders of magnitude. This is especially true in the case of molecular dynamics. The algorithm used for small numbers of particles is little more than trivial [7]. While such an algorithm is adequate, perhaps even optimal, for systems containing up to a few hundred particles, some form of enhancement is required to deal with larger systems, and it has long been recognized that the introduction of certain bookkeeping techniques, namely cells and/or neighbor lists [7,8], can greatly enhance the performance, even by orders of magnitude. Until recently, however, it was thought that such techniques were only marginally useful on vector processors, but it has since been demonstrated that both bookkeeping schemes can be very effective on computers of this type [2,9].

The availability of improved algorithms has made possible extensive simulations of systems containing as many as 2 × 10⁵ particles [10], taking full advantage of the benefits of vector processing. The bookkeeping schemes exact a penalty, however, namely one of storage, and the substantial price paid in terms of storage requirements can make it unreasonable to consider systems of even greater size unless an inordinate amount of storage is available. An examination of the algorithms reveals that this appetite for storage can be overcome at the expense of further algorithmic complication, but, more importantly, with only minor impact on the efficiency of the computation. Test runs involving over 2 × 10⁶ particles have been used to establish the effectiveness of the approach. In the large-system limit the fraction of storage required for the bookkeeping functions tends to zero. The focus of this paper is on the interaction computations, which constitute the bulk of the work in any molecular dynamics simulation. There are of course many other details [7], such as integrating the equations of motion, constructing the initial state, and making measurements of various properties in the course of the simulation; these consume a relatively small fraction of the effort, and demand little by way of special treatment when transferred to a vector environment.

2. Molecular dynamics: conventional methods

    2.1. Basic approach

In order to provide a framework for introducing novel methods for handling molecular dynamics calculations in other than the most conventional computing milieux, we begin with a summary of the most straightforward approach for simple systems. The term "simple" as used here means that the molecules of the system are reduced to particles having spherical symmetry, with interactions defined in terms of two-body forces whose strength and direction depend only on the relative separation of the particles involved. Many, perhaps the majority, of current applications of molecular dynamics deal with more complex molecules [3,7], which may have a rigid or flexible internal structure, interactions involving several force centers on each molecule or an explicit dependence on relative orientation, even three-body potentials or polarizability.

For a broad range of problems it suffices to consider the simplest of particles, the atom, whose only structural feature is the volume it occupies to the exclusion of all others. Such a particle is exemplified by the hard sphere (or hard disk in two dimensions). While a truly hard sphere cannot be represented by a differentiable potential, the step potential that characterizes the hard sphere can be replaced by a suitably shaped differentiable function that acts repulsively over a very limited range and diverges rapidly when the separation drops below the effective core diameter (typically following an r⁻¹² law), but not so rapidly as to prevent numerical integration of the equations of motion. A common example of such a potential is one derived from the Lennard-Jones form by an appropriate shift and truncation, namely

$$U(r) = 4\left(r^{-12} - r^{-6} + \tfrac{1}{4}\right), \qquad \mathbf{F}(r) = 48\left(r^{-14} - \tfrac{1}{2}\,r^{-8}\right)\mathbf{r}, \qquad r < r_c,$$

with both quantities zero for $r \ge r_c$, where the cutoff is $r_c = 2^{1/6}$ (in reduced units).
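A minimal C sketch of this pair interaction (an illustration under the stated reduced units, not code from the paper; the function name and interface are invented for the example):

/* Shifted, truncated Lennard-Jones pair interaction in reduced units.
   rc2 = r_c^2 = 2^(1/3).  Given the squared separation r2, returns the
   pair energy and stores in *f the scalar factor such that the force
   on atom i due to atom j is f * (r_i - r_j). */
static const double rc2 = 1.2599210498948732;

double lj_pair(double r2, double *f)
{
    if (r2 >= rc2) { *f = 0.0; return 0.0; }
    double ri2 = 1.0 / r2;                  /* r^-2 */
    double ri6 = ri2 * ri2 * ri2;           /* r^-6 */
    *f = 48.0 * ri6 * (ri6 - 0.5) * ri2;    /* 48(r^-14 - r^-8/2)    */
    return 4.0 * ri6 * (ri6 - 1.0) + 1.0;   /* 4(r^-12 - r^-6 + 1/4) */
}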

2.2. Cell data organization

Beyond the requirement that the minimum cell edge exceed r_c, there is no precise recipe for determining optimal cell size. The preferred size is one in which the mean cell occupancy is close to unity; in a high-density system this will also be the smallest size allowed, but at lower densities unit cell occupancy rather than minimum cell size might prove a more effective criterion, to avoid excessive processing of empty cells. The ideal solution is an empirical one that amounts to varying the cell size for the system at the state point to be studied and determining the value at which the simulation runs fastest.

There is also the question of how to represent the information describing cell membership. Given the permissible range of cell occupancies determined by the extreme limits of local density fluctuation, the most economical approach, storage-wise, is to use a linked list of cell occupants. A separate list is used for each cell, and all storage needed by the lists can be taken from a common pool whose overall size is just N. To complete this particular data access scheme an additional set of pointers is introduced, so that for each cell there is a pointer to the first atom it contains; starting from this atom, the linked list provides access to the remaining atoms in the cell. Assuming the total number of cells to be of order N, the storage required to implement this scheme is also proportional to N.
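The following C sketch shows one way to build such linked lists (an illustration of the scheme just described, not the paper's code; the names, 0-based indexing and the split of the pointer pool into head/next arrays are choices made for the example):

/* Assign N atoms with coordinates rx,ry,rz (each component in [0,L))
   to an Mx x My x Mz cell array with cell edges wx,wy,wz.  head[c]
   holds the first atom of cell c (-1 if empty) and next[i] the next
   atom in the same cell as atom i; together they play the role of
   the common pointer pool described in the text. */
void build_cells(int N, const double *rx, const double *ry, const double *rz,
                 int Mx, int My, int Mz, double wx, double wy, double wz,
                 int *head, int *next)
{
    for (int c = 0; c < Mx * My * Mz; c++) head[c] = -1;
    for (int i = 0; i < N; i++) {
        int c = ((int)(rx[i] / wx) * My + (int)(ry[i] / wy)) * Mz
              + (int)(rz[i] / wz);
        next[i] = head[c];       /* prepend atom i to its cell's list */
        head[c] = i;
    }
}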

The algorithm is summarized below. Two techniques for improving computational efficiency are included. One is to use precomputed tables instead of evaluating the interactions; the other is to eliminate the need for dealing with periodic boundaries by making replicas [2] of atoms within distance r_c of the boundaries that are suitably offset to the opposite sides of the system.

The edges of the simulated region are of length L_x, ..., the cell array used for the interactions is of size N_c = M_x × M_y × M_z, and w_x, ... are the cell edge lengths. The coordinates of atom i are r_xi, ...; to reduce the work in computing cell membership the coordinates would normally range from 0 to L_x (etc.) rather than be centered about the origin, but to allow space for the shifted replica atoms the actual coordinate range of non-replica atoms is changed to w_x ≤ r_xi < w_x + L_x (etc.). Replication then proceeds as follows (the unshifted coordinate components of each replica are simply copied):

n ← N
for x, y, z do
  for i = 1 to n do
    if r_xi < w_x + r_c then
      n ← n + 1;  r_xn ← r_xi + L_x
    endif
    if r_xi ≥ w_x + L_x − r_c then
      n ← n + 1;  r_xn ← r_xi − L_x
    endif
  enddo
enddo
N ← n

Assignment to cells results in a set of linked lists in which the pointers associated with both cells and atoms are stored in a common set of N + N_c array elements {p_i}; pointers between atoms are stored first, followed by pointers to the first atoms in the cells.

for c = N + 1 to N + N_c do p_c ← 0
for i = 1 to N do
  c ← ([r_xi/w_x] × M_y + [r_yi/w_y]) × M_z + [r_zi/w_z] + N + 1
  p_i ← p_c;  p_c ← i
enddo

The interaction calculations consider each cell, pair it with its neighbors (actually just half of them), and then examine all pairs of atoms that appear in the linked lists. The organization used here differs from the more familiar form of the computation [2] in that the outermost loop is over the offsets between adjacent cells rather than the cells themselves; there is no loss of efficiency, and the rearrangement hints at subsequent developments.

The acceleration components of atom i are a_xi, ..., and E_u is the total potential energy. The table entries used for the force (which is identical to acceleration in reduced units) and energy terms are denoted by F_j and U_j; these are tabulated for r ≤ r_c at fixed r² increments of size Δ_tab, chosen so as not to introduce significant numerical error beyond that already present in the integration method. The tables are of length L_tab, so that Δ_tab = r_c²/L_tab. The adjustments δ_xk, δ′_xk, etc. that are made to the cell loop limits have values 0 or 1 depending on the offset index k; the ranges are chosen to ensure that only the correct cell pairings are considered (determination of the actual values is left as an exercise). The offsets themselves are s_xk, ...; they equal 0 or ±1, and the three components of each offset can be combined into a single value (as will be done in section 4). The case k = 1 corresponds to cells being paired with themselves, and covers the intracell interactions.

for i = 1 to N do a_xi ← 0 (etc.)
E_u ← 0
for k = 1 to 14 do
  for m_x = δ_xk to M_x − 1 − δ′_xk do
    for m_y = δ_yk to M_y − 1 − δ′_yk do
      for m_z = δ_zk to M_z − 1 − δ′_zk do
        m ← (m_x × M_y + m_y) × M_z + m_z + N + 1
        m′_x ← m_x + s_xk (etc.)
        m′ ← (m′_x × M_y + m′_y) × M_z + m′_z + N + 1
        i ← p_m
        while i ≠ 0 do
          i′ ← p_{m′}
          while i′ ≠ 0 do
            if k > 1 ∨ i > i′ then
              d_x ← r_xi − r_xi′ (etc.)
              j ← [(d_x² + ...)/Δ_tab] + 1
              if j ≤ L_tab then
                a_xi ← a_xi + F_j d_x (etc.);  a_xi′ ← a_xi′ − F_j d_x (etc.)
                E_u ← E_u + U_j
              endif
            endif
            i′ ← p_{i′}
          enddo
          i ← p_i
        enddo
      enddo
    enddo
  enddo
enddo


While the linked-list method is suitable for scalar processors, the fact that it requires accessing memory in what amounts to a haphazard fashion means that it ceases to be effective when optimal performance requires that data be read from and written to memory sequentially, a feature of all modern vector supercomputers. Alternative approaches that extend the cell technique in a manner compatible with vectorization will be discussed in section 4.

    2.3. Neighbor-list data organization

The observation that in a fluid of moderate to high density the environment of each atom changes only gradually (relative to the size of time step used for integrating the equations of motion) suggests that information on neighborhood relationships continues to be valid, at least approximately, for a certain period of time subsequent to its original generation. The neighborhood is defined to be a spherical (circular in two dimensions) region with radius r_n > r_c. If a list of all the atoms present in the neighborhood of a given atom is prepared [8], then it is clear that this information will remain useful, in the sense that it still contains all the interaction partners of that atom, for a period spanning several time steps; the actual duration of this period depends on the maximum velocity of the atoms involved (as well as on r_n itself). The gain in performance over the cell method depends on the ratio of the volume of the neighborhood region to the combined volume of all the cells which would have to be examined otherwise. Continual monitoring of the atomic displacements can be used as a means of determining when regeneration of the neighbor lists is required, namely the earliest instant at which an atom not originally in the neighborhood could possibly become an interaction partner. Insofar as the representation of lists of neighbors is concerned, the information can be stored as a sequence of atom pairs, or in a condensed format where data is grouped according to one of the neighbors; in either case, the fact that the neighbor relationship is commutative halves the total storage requirement.

The amount of storage needed for the neighbor data depends directly on the number of occupants of the neighborhood region. The radius r_n is set equal to r_c plus a value representing the thickness of the bordering shell (or annulus in two dimensions). The larger r_n, the less frequent the time-consuming operation of regenerating the neighbor data, but once the proportion of atom pairs that are classified as neighbors but which are separated by more than r_c becomes substantial, the performance will begin to drop; the optimal size must once again be determined by experimentation. The process of preparing the neighbor lists should utilize the cell approach as a preliminary step, with the cell size now chosen to exceed r_n rather than just r_c.
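One common form of the displacement-monitoring test (a standard criterion sketched here in C as an assumption, not the paper's exact bookkeeping) rebuilds the lists once the two most-traveled atoms could together have bridged the shell of thickness r_n − r_c:

#include <math.h>

/* dx,dy,dz hold each atom's accumulated displacement since the lists
   were last built; rebuild when the two largest displacements could
   have closed the gap rn - rc between a pair of atoms. */
int need_refresh(int N, const double *dx, const double *dy,
                 const double *dz, double rn, double rc)
{
    double dmax1 = 0.0, dmax2 = 0.0;
    for (int i = 0; i < N; i++) {
        double d = sqrt(dx[i]*dx[i] + dy[i]*dy[i] + dz[i]*dz[i]);
        if (d > dmax1) { dmax2 = dmax1; dmax1 = d; }
        else if (d > dmax2) { dmax2 = d; }
    }
    return dmax1 + dmax2 > rn - rc;
}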

The storage cost can be substantial: the neighbor-list approach exhibits an obvious largess in terms of memory utilization in order to gain speed; for very large systems the method might not be viable for lack of memory. Neighbor lists can be used on a vector computer in a straightforward manner provided each atom has a moderately large number of neighbors (this excludes the case of very short-range forces), and further improvements in performance are possible using a variant of the approach described in section 4. The extension of the vectorized cell method by the use of partitioning (section 5) that is intended for dealing with extremely large systems does not apply to neighbor lists. The reason for this is that the neighbor data is generated once for the entire system and then used over the course of several time steps, whereas the partitioned approach is designed for storage economy, so that only the bare minimum of data is retained for those parts of the system not under immediate consideration.

    3. Vector processing

3.1. Architecture of vector computers

The vector supercomputer [11] represents a compromise between the ability to maximize performance for only a limited set of operations and a need for the fastest possible computations over a broad range of


problems. The dominance of the former consideration is reflected in the fact that performance figures quoted by manufacturers are almost always beyond the reach of the user [12], often by an order of magnitude or more (these unachievable figures are sometimes referred to as "machoflops"); the situations where the performance potential of the supercomputer is far from realized are all too frequent.

What distinguishes algorithms that vectorize effectively is the manner in which data is accessed and the nature of the processing involved. The reason for a preferred mode of operation is that the processor handles memory access and arithmetic in a pipelined fashion, with a resulting throughput substantially greater than what would be possible if each operation were to be carried out separately. The pipelining is only possible if the same operation is performed repeatedly on a set of data items arranged in a specific manner; the preferred manner generally involves data stored in consecutive memory locations, although evenly spaced items may be equally acceptable (with certain restrictions imposed by memory interleaving). Any deviation from a general operational pattern of this kind results in reduced performance. However, with the exception of limited kinds of computation, mainly involving matrices, which adhere precisely to this prescription, such a state of perfection is rarely (if ever) attained. In addition to the data organizational requirements, each vectorized operation has a fixed startup period independent of the number of data items processed; this can sometimes be made to overlap (fully or partly) with a previous vector operation. A paradoxical consequence of this overhead is that if the vectors are too short, vector processing leads to reduced performance; the minimal vector length requirements vary and depend on both the type of operation and the machine itself. The issue is how to achieve the best performance given the preferred manner of operation of the hardware.
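The startup effect is often summarized by a standard timing model (a textbook formulation, not taken from this paper): for a vector of length n,

$$t(n) = t_0 + n\,\tau, \qquad r(n) = \frac{n}{t(n)} = \frac{r_\infty}{1 + n_{1/2}/n},$$

where $r_\infty = 1/\tau$ is the asymptotic processing rate and $n_{1/2} = t_0/\tau$ is the vector length at which half of $r_\infty$ is achieved; vectors much shorter than $n_{1/2}$ run at a small fraction of the machine's potential.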

    3.2. Vector operations

The term "vector" as used here has nothing to do with the vectors of physics and mathematics; it merely denotes a sequence of data items that are processed as a single entity by the hardware. The data items themselves may be integers, floating-point numbers, memory addresses, single bits: whatever the machine is prepared to accept. A vector is characterized by data type, the number of items involved, and the starting address in memory. In a language such as Fortran, the name of the vector denotes the default starting address, although this can be altered with an explicit index. The length would be either the size to which the vector is dimensioned or some smaller value. The data type might be implicit in the name or specified separately. Some language implementations (such as Cyber Fortran) allow the full description of a vector to be summarized in a single quantity called a descriptor.

The concise notation for describing vector operations introduced previously [2] will be used here. The vector x stands for an ordered set of n elements {x_1, x_2, ..., x_n}; the final index will be shown only if not apparent from the context. A subvector of x will be denoted by x[n_1 ... n_2], or just x[n_1] if the upper limit is obvious. A typical arithmetic operation z ← x + y stands for

for i = 1 to n do z_i ← x_i + y_i.

An example of a comparison operation with output stored as a bit vector (where a set bit corresponds to the test being satisfied) is b ← x > y, equivalent to

for i = 1 to n do b_i ← x_i > y_i.

Other operations used are the sum over the elements of a vector, Σx, and the count of the number of one-bits in a bit vector, #b.

    To help the user implement an algorithm whose intrinsic data organization bears little resemblance tothat needed for efficient vector processing, the instruction sets of most vector computers include somecapability for reorganizing data at a relatively high rate, generally intermediate between the vector andscalar processing speeds. Different approaches to dealing with data reorganization exist, and not all are to


be found on all machines. Furthermore, even when a particular scheme for rearranging data is implemented in hardware, the questions of how fast such operations are carried out relative to peak computation speed, and whether the compiler is even capable of utilizing the hardware feature, must be taken into account.

The two principal schemes for reordering data are known as gather-scatter and compress-expand. Data gathering uses a vector of indices c to access (in no particular order, as far as the computer is concerned) some or all of the elements of a set of items (items can be accessed more than once), which are then stored consecutively in another vector. The notation used is z ← x @ c, corresponding to the loop

for i = 1 to n do z_i ← x_{c_i}.

The scatter operation is the converse, in that the index vector is used to help store a consecutive set of data items in some alternative order in another (possibly longer) vector; not all elements of the destination vector need be affected, and destination elements may actually be stored into several times (assuming this is meaningful for the particular context). The notation z @ c ← x is shorthand for

for i = 1 to n do z_{c_i} ← x_i.

Compression involves selecting a subset of data items from a vector and storing them consecutively, in the same order, in another vector; expansion is the converse. Because data order is preserved under these operations, addressing information can be represented by means of a bit vector b; this provides an extremely compact alternative for handling sparse data compared to the index vector needed for gather and scatter. The compression operation is denoted by z ← x ↓ b, representing the loop

j ← 0
for i = 1 to n do
  if b_i = 1 then j ← j + 1;  z_j ← x_i
enddo

while expansion z ← x ↑ b corresponds to

j ← 0
for i = 1 to n do
  if b_i = 1 then j ← j + 1;  z_i ← x_j
  else z_i ← 0
enddo

In the event that no order is required, the proportion of elements participating in an operation on a subset of a vector might be used to determine whether gathering or compression is preferable, provided the choice exists.

One further operation will be introduced here, namely index selection. The operation c ← @(x > 0) (for example) stands for

j ← 0
for i = 1 to n do
  if x_i > 0 then j ← j + 1;  c_j ← i
enddo

with c the resulting vector of indices showing which elements of x satisfy the given condition. The length (j) of the index vector is also a product of the operation. In terms of bit vectors the example is equivalent to

b ← x > 0;  c ← {1, 2, ..., n} ↓ b;  j ← #b.
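A C sketch of these primitives as explicit loops may make the semantics concrete (the function names and 0-based indexing are choices made for this illustration; on the machines discussed, each corresponds to a single hardware operation or library call):

/* z <- x @ c : gather */
void gather(int n, double *z, const double *x, const int *c)
{ for (int i = 0; i < n; i++) z[i] = x[c[i]]; }

/* z @ c <- x : scatter */
void scatter(int n, double *z, const double *x, const int *c)
{ for (int i = 0; i < n; i++) z[c[i]] = x[i]; }

/* z <- x (compress) b : pack flagged elements; returns packed length */
int compress(int n, double *z, const double *x, const unsigned char *b)
{
    int j = 0;
    for (int i = 0; i < n; i++)
        if (b[i]) z[j++] = x[i];
    return j;
}

/* z <- x (expand) b : unpack, zero-filling unflagged positions */
void expand(int n, double *z, const double *x, const unsigned char *b)
{
    int j = 0;
    for (int i = 0; i < n; i++)
        z[i] = b[i] ? x[j++] : 0.0;
}

/* c <- @(x > 0) : index selection; returns the index count */
int index_select(int n, int *c, const double *x)
{
    int j = 0;
    for (int i = 0; i < n; i++)
        if (x[i] > 0.0) c[j++] = i;
    return j;
}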


    3.3. Efficiency and portability

Ideally, one of the tasks of the compiler should be to produce a machine translation of the source program delivering close to optimal performance on the designated hardware, without any special effort on the part of the author of the program. Unfortunately, such an idealized situation is rare indeed. Judging by the achievements to date, compiler efficiency is an even more complex issue than hardware efficiency, and compiler performance (even among products from the same manufacturer) exhibits considerable variation. Irrespective of whether the logical structure of the algorithm is too complex to be analyzed by an automated procedure, or whether the compiler simply has not been taught to recognize certain basic computational patterns, the onus is on the programmer to meet the requirements of the compiler, and, if no alternative exists, to resort to additional measures (see further) that will ensure an efficient, although less intelligible and less portable, program.

Even when the compiler is competent at mapping the source program to the hardware, there are situations in which certain relatively simple constructs may, in principle, prevent the compiler vectorizing parts of the program. One example involves operations whose general form implies a potential dependence on something that has only just been computed; such operations are not generally vectorizable because of the manner in which vector pipelining restricts data dependence. In those instances where it is known that no dependence exists, the capability of conveying such information by means of compiler directives (that are not part of the actual language) ought to help the compiler perform its task. The capacity for aiding the compiler in this way varies.

The alternative to total reliance on the compiler is to use machine instructions directly. This can always be done by programming in assembly language, but is best avoided (with intelligibility in mind) in favor of a sometimes available alternative which allows access to hardware via subroutine calls from higher-level languages such as Fortran. On the Cyber 205 and ETA processors, for example, q8vgathr, q8vscatr, q8vcmprs and q8vxpnd do exactly what their names suggest; on Cray computers the functions gather and scatter are available, while a set of functions with names such as whenne can be used to carry out index selection. These functions correspond directly to the vector operations introduced above.

Even when it appears that the portions of the program consuming most of the execution time have been fully vectorized, there is usually no indication given as to whether the machine code produced is the most efficient possible. The machine may, for example, be capable of achieving a given result in more than one way, while the judicious employment of temporary registers, or the simultaneous use of multiple functional units in the processor (typically by feeding the results of one vector operation into the next, a process known as chaining), can result in substantial performance gains. Only by analyzing the assembly language listing produced by the compiler is it possible to determine whether the performance level reflects what the machine is really capable of achieving, but it is doubtful whether such a thorough analysis is often carried out; the megaflop rate attained may well have little to do with the overall efficiency of the computation.

The issue of program portability is an acute one. Vector supercomputers tend to be very sensitive to program and data structure (for reasons already given), and a program that runs well on one brand of machine can fail to perform as expected on another, unless modifications are made. Performance can even vary substantially among different models of a particular product line, depending on the kinds of instructions implemented in hardware, the degree to which functional units are replicated, memory organization and bandwidth, as well as other more subtle factors, such as the timing of individual instructions, that might be entirely unknown to the user. Performance can also change between different versions (or releases) of a compiler, and there is no guarantee that the code generation and optimization capabilities improve monotonically with time. These considerations apply especially to vectorized implementations of molecular dynamics algorithms which, as pointed out in the course of this article, tend to require machine-dependent adaptations in order to run efficiently.


4. Layer data organization

4.1. Inhibiting vectorization

A molecular dynamics algorithm based on cells involves a set of linked lists, one per cell, in which the list elements contain the identities of the atoms belonging to the cell at a given instant. As pointed out in section 2, the reason for preferring linked lists over sequential storage is that the number of atoms per cell can fluctuate considerably; the alternative requires that the storage reserved for each cell allow for the possibility of maximal occupancy.

Linked lists are handled very inefficiently on a vector processor, since the use of pointers to connect related data items inhibits vectorization. The cell technique, which has shown itself to be very effective on scalar computers, must be modified in a way that renders it vectorizable. Obviously it would be unreasonable to abandon the use of cells entirely and return to the original method which considers all pairs of atoms; though fully vectorizable, and even efficient for small systems, there comes a point at which the gain due to vectorization can no longer compensate for the O(N²) dependence. The layer method of reorganizing cell data, which will now be described, provides a solution that retains the benefits of the cell framework.

    4.2. Layers

In the cell version of the algorithm (section 2.2), the interaction computations involve a series of nested loops. The outermost loop is over the possible offsets between pairs of neighboring cells, including the case of zero offset where cells are paired with themselves. Scanning the cell array is the responsibility of the next series of loops. The two innermost loops generate pairings of occupants from the mutually offset cells, with the case of zero offset incorporating a test to ensure that atom pairs are considered once only. Note that it is the innermost loops that have the fewest numbers of iterations, because of low mean cell occupancy (typically unity). This fact rules out the possibility of vectorization. An essential requirement for effective vector processing [11] is that the vector lengths are adequate to amortize fixed startup costs over maximum useful computation. While the loop order just described fails to obey this criterion, a reordering so that cell scanning (the cells themselves, not their contents) is done during the innermost loop would constitute a satisfactory solution. How is this realized in practice?

The scheme calls for a reorganization of the cell data. Instead of representing cell occupancy using linked lists, the identities of atoms in the cells are placed in a set of arrays; each array contains one element per cell, and the total number of arrays is not less than the maximum expected cell occupancy. These arrays will be referred to as layers, and in fact amount to a return to the approach that was dismissed earlier where a fixed amount of storage is allocated for each cell; alternative ways for overcoming the storage problem will be presented. While scanning the atoms during cell assignment, the first atom encountered in a given cell is assigned to the corresponding position in the first layer, the second atom in the cell (if any) to the second layer, and so on. Unfilled layer positions are assigned a value distinct from all valid atom identity numbers (such as zero). Layer generation is carried out as shown below, with N_L signifying the number of layers generated.

Initially {c_i, i = 1, ..., N} are the cells to which the atoms (including replicas) belong, but as atoms are assigned to layers the corresponding c_i are zeroed. {e_jm, j = 1, ..., N_c} describe the contents of the m-th layer, and {s_i, i = 1, ..., n} are the (n) atoms remaining at the end of each layer. Several atoms may be assigned to a particular layer position, but only the last assignment is effective; the other atoms will be candidates for subsequent layers.

for i = 1 to N do
  c_i ← ([r_xi/w_x] × M_y + [r_yi/w_y]) × M_z + [r_zi/w_z] + 1
  s_i ← i
enddo
n ← N;  m ← 0
while n > 0 do
  m ← m + 1
  for j = 1 to N_c do e_jm ← 0
  for i = 1 to n do e_{c_{s_i},m} ← s_i
  for j = 1 to N_c do
    if e_jm ≠ 0 then c_{e_jm} ← 0
  enddo
  j ← 0
  for i = 1 to n do
    if c_{s_i} ≠ 0 then
      j ← j + 1;  s_j ← s_i
    endif
  enddo
  n ← j
enddo
N_L ← m
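A scalar C sketch of this construction (illustrative only; the 0-based indexing, the value -1 for empty positions and the fixed layer capacity maxL are choices made for the example):

#include <stdlib.h>

/* cell[i] is the (0-based) cell of atom i; layer[m*Nc + j] receives
   the atom occupying cell j in layer m, or -1 if that position is
   empty.  Returns the number of layers generated. */
int build_layers(int N, int Nc, const int *cell, int maxL, int *layer)
{
    int *s = malloc(N * sizeof *s);   /* atoms not yet assigned */
    for (int i = 0; i < N; i++) s[i] = i;
    int n = N, m = 0;
    while (n > 0 && m < maxL) {
        int *e = layer + m * Nc;
        for (int j = 0; j < Nc; j++) e[j] = -1;
        /* several atoms may hit one position; the last one wins */
        for (int i = 0; i < n; i++) e[cell[s[i]]] = s[i];
        /* the remaining atoms become candidates for the next layer */
        int k = 0;
        for (int i = 0; i < n; i++)
            if (e[cell[s[i]]] != s[i]) s[k++] = s[i];
        n = k;
        m++;
    }
    free(s);
    return m;
}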

The interaction computation, based on the layers just constructed, consists of pairing layers using all allowed offsets, with an innermost loop that processes pairs of atoms specified in the layers. Only in cases where both layer positions specify valid atoms is the calculation actually carried out. The details appear elsewhere [2] and will not be repeated here; the algorithm can also be deduced from the vectorized versions described below (but there would be little point to a non-vectorized implementation).

Two vectorized forms of the layer-based interaction computation have been constructed, each designed bearing in mind the specific hardware features of the target processor. One version, developed for use on the (functionally identical) Cyber 205 and ETA machines, represents cells requiring attention in each layer by means of bit vectors. The other version, developed for Cray computers but having wider applicability, employs sets of atom indices to represent layer contents and does not attempt to condense the information. Both approaches lead to fully vectorized computations, but the inability to represent sparse data on the Cray with the aid of bit vectors means that the storage scheme for the layers is inefficient; an alternative scheme that uses storage more economically is described in section 5.

The pipelined nature of vector processing imposes certain restrictions on the data contained in the vectors, the most significant being that processing of each data item can be carried out independently of the others. The implication is that a particular atom can be mentioned no more than once in a set of atoms that are processed in a single vector operation. Use of layers guarantees this to be the case, since an atom can only appear once in a layer (even when a layer is paired with itself, the two appearances of each atom are in separate vectors). The compiler will not be aware of this fact, however, and it is necessary to inform it, by means of the special directives referred to earlier, that certain loops involved in the layer processing can be safely vectorized without fear of data dependence.

4.3. Vectorized layer algorithm using bit vectors

The computation begins with replication of atoms within r_c of each of the boundaries. The vector notation discussed in section 3 is used here.

n ← N
for x, y, z do
  b ← r_x[1...n] < w_x + r_c;  n_1 ← #b
  r_x[n+1] ← (r_x ↓ b) + L_x
  b ← r_x[1...n] ≥ w_x + L_x − r_c;  n_2 ← #b
  r_x[n+n_1+1] ← (r_x ↓ b) − L_x
  n ← n + n_1 + n_2
enddo
N ← n


The next stage is layer assignment, resulting in a series of compressed layers that are accessed using bit vectors. The identities of the atoms packed into the layers are stored in the vector e^(c), and bit vectors {b_m}, in which a single bit corresponds to each cell in the m-th layer, are used to associate atoms with occupied cells. The quantities q_m and n_m are the starting position in e^(c) of data for the m-th layer, and the number of atoms in that layer. The total length of e^(c) is N, while the storage needed for all the b_m amounts to N_c × N_L bits. Other temporary quantities make an appearance, but their meanings should be obvious. Note that the first test for c ≠ 0 produces an all-ones bit vector and is unnecessary here, but would be needed if partitioning (section 5) is used, since gaps could then appear in the vectors holding the atom data.

c_i ← ([r_xi/w_x] × M_y + [r_yi/w_y]) × M_z + [r_zi/w_z] + 1  (i = 1, ..., N)
m ← 0;  q_0 ← 0;  n_0 ← 0
s ← {1, 2, ..., N};  b[1...N] ← c ≠ 0
while #b > 0 do
  m ← m + 1;  q_m ← q_{m−1} + n_{m−1}
  e[1...N_c] ← 0;  e @ (c ↓ b) ← s ↓ b
  b_m[1...N_c] ← e ≠ 0;  n_m ← #b_m
  e^(c)[q_m+1] ← e ↓ b_m
  c @ (e ↓ b_m) ← 0;  b ← c ≠ 0
enddo
N_L ← m

Interaction calculations based on these compressed layers follow. The offset values s_k are linear combinations of the s_xk, ... (section 2). Since s_k can be negative, the bit vector indices shown here can also become negative (they would not contribute to the final result, but might lead to invalid memory references); to ensure positive indices each bit vector is augmented by a constant margin [2] and the index adjusted accordingly; these margins are omitted here. An additional bit vector b* is used to distinguish boundary cells, which contain only replica atoms, from interior cells; it is used to ensure that all pairings involve at least one interior cell. Valid pairings between occupied cells in the layers are collected in the bit vector b. Vectors r_x, a_x, etc., are all of length N_c and hold data for expanded layers; vectors such as r_x^(c) hold packed data after rearrangement according to layers. The energy calculation is omitted. F (and later also U) holds the tabulated interaction terms; the possibility that a vectorized evaluation of the interaction function might be faster than table lookup should not be overlooked. All 27 offsets are used when distinct layers are paired, whereas only 13 are needed when a layer is paired with itself (k = 1 corresponds to zero offset and is skipped in this case).

While this algorithm follows the lines of one described previously [2], several changes have been made, to show that alternative implementations are possible, as well as to bring out the similarity with the subsequent version based on index vectors. Here, the acceleration updates are done when the layer data is in expanded form; if the atoms are separated by a distance greater than r_c the accelerations are still updated, but using the final zero entry in the table; replica atom accelerations are eliminated after undoing the initial layer rearrangement, simply by truncating a_x, ... to length N. (For the case m′ = m, a_x and a′_x correspond to a single vector that is updated twice per offset; the present notation does not adequately convey this fact and, ideally, a separate sequence in which a′ is replaced by a should have been included. The descriptor variable mentioned in section 3.2 readily handles this case without any modification, and is implicit in the syntax used here.) The operations ∧ and ∨ denote bitwise Boolean "and" and "or".

r_x^(c)[1...N] ← r_x[1...N] @ e^(c) (etc.);  a^(c) ← 0
for m = 1 to N_L do
  r_x[1...N_c] ← r_x^(c)[q_m+1] ↑ b_m (etc.);  a_x ← a_x^(c)[q_m+1] ↑ b_m (etc.)
  for m′ = m to N_L do
    if m′ ≠ m then
      r′_x ← r_x^(c)[q_{m′}+1] ↑ b_{m′} (etc.);  a′_x ← a_x^(c)[q_{m′}+1] ↑ b_{m′} (etc.)
      k_min ← 1;  k_max ← 27
    else
      r′_x ← r_x (etc.);  a′_x ← a_x (etc.);  k_min ← 2;  k_max ← 14
    endif
    for k = k_min to k_max do
      b[1...N_c] ← b_m ∧ b_{m′}[s_k] ∧ (b* ∨ b*[s_k])
      if #b > 0 then
        d_x ← r_x ↓ b − r′_x ↓ b[s_k] (etc.)
        j ← min([(d_x × d_x + ...)/Δ_tab] + 1, L_tab)
        t ← F @ j;  d_x ← t × d_x (etc.)
        a_x ← a_x + d_x ↑ b (etc.);  a′_x ← a′_x − d_x ↑ b[s_k] (etc.)
      endif
    enddo
    if m′ ≠ m then a_x^(c)[q_{m′}+1] ← a′_x ↓ b_{m′} (etc.)
  enddo
  a_x^(c)[q_m+1] ← a_x ↓ b_m (etc.)
enddo
a_x[1...N] @ e^(c) ← a_x^(c) (etc.)

In order to overcome problems associated with a Cyber 205 hardware restriction on maximum vector length, as well as to reduce a more general requirement for temporary storage used during the computation, a scheme for spatially subdividing the system during the layer construction was developed [2] (the forerunner of the slice method described in section 5). At each time step the initial assignment of atoms to cells is carried out for the entire system (taking care to break up vectorized loops that become too long), but the layer data is then grouped according to which part of the system it addresses. Provided adjacent subdivisions are extended to overlap by an amount r_c, each group of layer data can be treated independently during the interaction computations. Atoms lying in the overlap regions have some of their interaction terms computed twice, but the bookkeeping ensures that this extra data is readily identified and not used subsequently. The effort expended on duplicate interactions is small, depending on the amount of overlap. The approach proved to be quite effective and was used in the large-scale production runs, but is contingent on the ability to pack sparse data with the aid of bit vectors.

    4.4. Vectorized layer algorithm using index vectors

The formulation in terms of index vectors might give the impression of a more concise algorithm than before, but this has little bearing on whether the implementation is more efficient on a computer that supports both approaches. As before, the computation begins with replication, but now based on index vectors generated with the aid of the unary @ operator.

n ← N
for x, y, z do
  q[1...n_1] ← @(r_x[1...n] < w_x + r_c)
  r_x[n+1] ← r_x @ q + L_x
  q[1...n_2] ← @(r_x[1...n] ≥ w_x + L_x − r_c)
  r_x[n+n_1+1] ← r_x @ q − L_x
  n ← n + n_1 + n_2
enddo
N ← n

The layer assignment follows. As in the bit-vector version, the test applied to c must allow for gaps in the atom data. Note that when s is compacted at the end of each iteration, a new value for its length, n, is also produced. There are several ways of employing @ operations to arrange atoms into layers; the algorithm that follows is just one example.

c_i ← ([r_xi/w_x] × M_y + [r_yi/w_y]) × M_z + [r_zi/w_z] + 1  (i = 1, ..., N);  s ← {1, 2, ..., N}
m ← 0;  n ← N
while n > 0 do
  m ← m + 1;  e_m[1...N_c] ← 0
  e_m @ (c @ s[1...n]) ← s
  c @ e_m ← 0
  s[1...n] ← s @ (@(c @ s ≠ 0))
enddo
N_L ← m

Finally, the interactions are evaluated. The vector g*, with one element per cell, is used exactly as the bit vector b* earlier, to distinguish boundary cells from interior cells (elements of g* are 0 or 1). Vector q is filled with indices of cell positions that correspond to offset cell pairs in which at least one of the cells is not a boundary cell and both are occupied. The margins mentioned previously are again omitted from the description, and are now also required for the vectors e_m. The potential energy is computed here; to avoid spurious contributions from replica atoms, it is evaluated separately for each atom (in vector u) and accumulated at the end.

a_x[1...N] ← 0 (etc.);  u[1...N] ← 0
for m = 1 to N_L do
  for m′ = m to N_L do
    if m′ ≠ m then k_min ← 1;  k_max ← 27
    else k_min ← 2;  k_max ← 14
    endif
    for k = k_min to k_max do
      q[1...n] ← @((g* ∨ g*[s_k]) × e_m × e_{m′}[s_k] ≠ 0)
      p ← e_m @ q;  p′ ← e_{m′}[s_k] @ q
      d_x ← r_x @ p − r_x @ p′ (etc.)
      j ← min([(d_x × d_x + ...)/Δ_tab] + 1, L_tab)
      t ← F @ j;  d_x ← t × d_x (etc.)
      a_x @ p ← a_x @ p + d_x (etc.);  a_x @ p′ ← a_x @ p′ − d_x (etc.)
      t ← U @ j;  u @ p ← u @ p + t;  u @ p′ ← u @ p′ + t
    enddo
  enddo
enddo
E_u ← Σ u[1...N]
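Once the index vectors p and p′ have been formed, the remaining work is a gather-compute-scatter sweep over the pairs. A C sketch of that step (illustrative only; it reuses the hypothetical lj_pair function from the earlier sketch in place of the paper's table lookup):

double lj_pair(double r2, double *f);   /* from the earlier sketch */

/* npair pair indices p[],pp[]; no atom is repeated within either
   vector (the property guaranteed by the layer construction), so
   the updates are free of data dependence. */
void pair_forces(int npair, const int *p, const int *pp,
                 const double *rx, const double *ry, const double *rz,
                 double *ax, double *ay, double *az)
{
    for (int i = 0; i < npair; i++) {
        double dx = rx[p[i]] - rx[pp[i]];    /* gather  */
        double dy = ry[p[i]] - ry[pp[i]];
        double dz = rz[p[i]] - rz[pp[i]];
        double f;
        lj_pair(dx*dx + dy*dy + dz*dz, &f);  /* compute */
        ax[p[i]] += f * dx;  ax[pp[i]] -= f * dx;   /* scatter */
        ay[p[i]] += f * dy;  ay[pp[i]] -= f * dy;
        az[p[i]] += f * dz;  az[pp[i]] -= f * dz;
    }
}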

    4.5. Performance

On the Cyber 205 and ETA-10 computers extensive production runs using two-dimensional systems with as many as 2 × 10⁵ atoms have been carried out in exploring the applicability of molecular dynamics simulation to the modeling of fluid flow instability. Test runs of up to 5 × 10⁵ atoms were also conducted. Moderately high densities were used: the values of area, or volume, per atom were 2.0 and 1.4 in two and three dimensions. With cell size chosen to give an average cell occupancy near unity, the first layer is almost fully populated, and the occupancy of subsequent layers drops sharply to zero after three to five layers. The time required per atom step on the Cyber was 4.8 μs with the potential energy included in the computation (25 μs in three dimensions), or 4.1 μs without, irrespective of system size (beyond a minimum of several thousand atoms), with calculations carried out in 32-bit arithmetic. (On the ETA, which should have given similar performance figures, a substantial but unexplained size dependence was noted, possibly attributable to paging.)

Tests were also run using single processors of multiprocessor Cray XMP/48 and YMP systems. The tests on the XMP only considered systems of up to approximately 7000 atoms at this juncture (the partitioning scheme of section 5 was adopted for larger systems on the YMP) and used 64-bit arithmetic. In two dimensions the time per atom step on the XMP was 5.2 μs, while the YMP performed the same computation in 3.4 or 3.9 μs depending on the compiler (CFT and CFT77, respectively, the latter requiring assistance from the special vector subroutines discussed in section 3). In three dimensions the XMP required 23.5 μs for the corresponding computation. One surprising detail emerged from these measurements: the bulk of the processing time on the Cray XMP (of the order of 70-80%) was spent in the steps leading to the construction of the index vectors p and p′, and not in evaluating the interactions. The times required when layers are not used are typically an order of magnitude larger.

    4.6. Layers and neighbor lists

The layer approach can also be applied to neighbor lists. The idea is to associate the layers with cells large enough to cover the neighborhood range r_n, and then use the layers to generate neighbor tables segmented in a manner that permits no atom to appear more than once per segment. The subsequent processing of such sets of atoms is then fully vectorizable. This scheme is practically identical to the implementation of the layer approach just described, the principal difference being that the sets of indices (p and p′) would be stored for use over several time steps. For a three-dimensional system similar to that tested here the Cray XMP required only 4.3 μs per atom step, with the neighbor lists being refreshed once every 29 time steps [9]; given the reduced number of interacting pairs that have to be considered, this result is hardly surprising. The storage requirements for this method are approximately doubled, and if the neighbor list refresh rate is increased for any reason (such as in studies of fluid flow with stationary walls that produce shearing, or even by the use of a larger time step where accuracy permits) the benefits would be less. Partitioning schemes aimed at reducing storage are of course not applicable.

    5. Partitioning for storage economy

    5.1. Schemes for partitioning

If the layer data can be compressed and the information needed for reconstruction stored as compact bit vectors, the storage overhead resulting from the introduction of layers is, for practical purposes, negligible, amounting to a single index variable per atom (in e^(c)) together with one bit per cell (in b_m) for each layer. If the efficient bit-vector representation of sparse data is not supported by the processor in question then, when the simulations become large enough that memory utilization becomes a serious problem, it is necessary to consider approaches to subdividing the system, but in a manner different from that outlined previously, which was specifically designed for use with bit vectors.

The method of choice is to partition the system spatially and treat each of the subsystems separately. As before, the separation cannot be complete, since interactions will occur between atoms on opposite sides of boundaries between subsystems, and atoms must also be allowed to cross these boundaries. In a manner reminiscent of the approach which might be adopted for multiprocessor systems, the data for each subsystem is stored separately, and when atoms do cross boundaries their associated data is explicitly transferred from one storage area to another. The storage overhead associated with layers is then proportional to the number of atoms per partition rather than the total N, and for very large systems that are split into a substantial number of subsystems (ideally without incurring a costly performance penalty) the relative increase in storage needed to deal with layers falls to a low level.

The benefits of the partitioned approach go beyond mere economy of storage. Modern computer

systems tend to utilize a hierarchy of storage methods, and this can be put to use in a computation which is organized so that each part of the system is processed essentially on its own (how the coupling between parts is handled will be addressed below, but the basic idea remains unaltered). The main processor memory is the place where the application keeps its data; even faster cache storage and yet faster sets of registers may also exist, but these are of limited size and often beyond user control. Main memory is an expensive commodity on the fastest of machines, but it is possible to augment storage by using what is sometimes known as solid-state disk: cheaper memory accessed in much larger blocks than normal, but an efficient approach provided accesses are carried out in a manner similar to the way a disk is used (namely by means of blocks of data rather than individual items). The next stage in the hierarchy is a real disk, and again there is a tradeoff: more and cheaper storage, but slower response times. Virtual memory systems operate in this way without the user being aware. A multilevel memory could well prove suitable for a subdivided computation, since those parts of the system not actually being processed do not need to remain in main memory; the computation proceeds in a predictable fashion, so that data for each part of the system need only be delivered to main memory just prior to processing, and afterwards the data are allowed to migrate to more economical levels in the storage hierarchy. Naturally an intricate organizational problem of this kind would only be attempted when the systems are large, typically 10⁶ atoms or more.

There is a certain amount of flexibility in the way the partitioning is carried out, with the simplest, one-dimensional, approach being used here: the system is cut into slices that span the entire region in all directions except one. (The apparently optimal partitioning scheme is one which minimizes the surface to volume ratio, but the gain can easily be outweighed by the extra computation required.) The computational scheme described here considers each slice once per time step, even when the boundaries are periodic (so that the first and last slices are adjacent). The slices are treated in cyclic order, and three adjacent slices are required at any instant; only the central one is actually being processed, while the other two are either contributing data for inter-slice interactions or dealing with atoms that cross slice boundaries. The slice thickness, and hence the number of slices, must be optimized by experiment; if the slice is too thick storage costs will grow, but too thin a slice will call for additional work to deal with interactions across boundaries.

    5.2. Data operations

Several sequences of data transfer and transformation appear in the various versions of the molecular dynamics algorithm for subdivided systems, irrespective of whether one (as described here) or several (as in II) processors are involved.

The "copy" operation is used to transfer atom coordinates between independently processed subregions of the system; the subregions are processed either sequentially within the same processor, or concurrently by distinct processors. For short-range interactions, only atoms close to subregion edges will be involved. If a periodic boundary lies between the two subregions in question, then an appropriate shift in the affected component of the coordinates is required. While it is usually only the coordinates that are required in computing forces, other data associated with the atoms (such as indices that might be used in grouping atoms into polymers) might also be needed. Copied data is discarded once the interactions have been computed; the integration of the equations of motion for these atoms takes place while processing the subregion to which they belong.

The "move" operation transfers all the data associated with an atom that has drifted between subregions to the storage area associated with the new subregion (this may be in the same or another processor). Periodic boundaries are once again taken into account. The version of the data in the original subregion is flagged as invalid and the storage made available for reuse.
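A C sketch of the copy operation for an x-direction slice decomposition ("move" is analogous, but transfers all per-atom data and invalidates the source entries); the Buf type and the function interface are invented for the illustration:

typedef struct { int n; double *rx, *ry, *rz; } Buf;

/* Copy atoms within rc of the slice's high edge xhi into out,
   applying shift (0, or +/-Lx across a periodic boundary) to the
   affected coordinate component. */
void copy_hi(const Buf *slice, double xhi, double rc, double shift,
             Buf *out)
{
    out->n = 0;
    for (int i = 0; i < slice->n; i++)
        if (slice->rx[i] >= xhi - rc) {
            out->rx[out->n] = slice->rx[i] + shift;
            out->ry[out->n] = slice->ry[i];
            out->rz[out->n] = slice->rz[i];
            out->n++;
        }
}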


The number of sets of copy and move operations required per time step, as well as the actual data involved in each, depend on the integration method used. In contrast to other integration methods, the simple leapfrog technique [7] requires one evaluation of the interactions per time step, makes only a single computation of coordinates and velocities, involves no higher time derivatives of the coordinates than the second (the accelerations), and does not require information from steps prior to the current one. Higher-order methods will require more data to be transferred (typically either accelerations from earlier time steps or, equivalently, higher derivatives of the acceleration at the current step), while use of a predictor-corrector solver will require two sets of move operations per time step.
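For reference, one leapfrog step amounts to the following (a C sketch; only the x components are shown, y and z being identical):

/* One leapfrog step of size dt:
   v(t+dt/2) = v(t-dt/2) + dt*a(t), then r(t+dt) = r(t) + dt*v(t+dt/2). */
void leapfrog_step(int N, double dt, double *rx, double *vx,
                   const double *ax)
{
    for (int i = 0; i < N; i++) {
        vx[i] += dt * ax[i];
        rx[i] += dt * vx[i];
    }
}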

The "replicate" operation, introduced earlier, is used to handle periodic boundaries in directions not involved in the spatial subdivision, and is of course needed only if the dimensionality of the subdivision is less than that of the system itself (e.g. a two- or three-dimensional system cut into slices). The need to address the issue of periodicity when computing interactions is thereby eliminated.

5.3. Algorithm for partitioned systems

The scheme based on a one-dimensional subdivision is described here. No explicit reference is made to any memory hierarchy, but this could easily be added, even at the control level outside the actual program (this is operating system dependent). Extension to allow overlap of computation with the transfer of data in and out of main memory is comparatively straightforward, but provision would have to be made for additional buffering.

    Slices are subdivisions of the region in the x-direction, with the full size being used in the y- and z-directions. Periodicity in the x-coordinate is accommodated by retaining data from the first slice (both the copy before and the move after the integration) for use with the last slice. Note that the first slice is input while the last is still being processed, and the last is output during processing of the first on the subsequent time step. This scheme for reducing transfers does not completely cover the initial and final time steps in a sequence; there a few extra transfers are required. Storage for three active slices is provided (since transfers do not overlap computation) and these are used in cyclic fashion. The absolute slice numbers are used both for accessing external storage and for determining the spatial limits of the slice. The storage used for the N_s slices of the system is denoted by S[n] (n = 0, ..., N_s - 1), with B[b] (b = 0, 1, 2) denoting storage for the three slices currently receiving attention. (Use of indices beginning from zero simplifies the arithmetic for determining adjacent slices.)
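    The adjacency arithmetic is illustrated by the following C fragment; note that since the % operator in C may yield a negative result, (b + 2) % 3 is used in place of (b - 1) mod 3:

        /* Sketch of the cyclic use of the three slice buffers: advance the
           current buffer index and locate its two neighbors. */
        #include <stdio.h>

        int main(void)
        {
            int b = 0;
            for (int n = 0; n < 6; n++) {
                b = (b + 1) % 3;         /* buffer of the slice being processed */
                int b_hi = (b + 1) % 3;  /* neighbor on the high-x side */
                int b_lo = (b + 2) % 3;  /* (b - 1) mod 3, kept non-negative */
                printf("b = %d  hi = %d  lo = %d\n", b, b_hi, b_lo);
            }
            return 0;
        }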

    To make the approach more flexible and, in particular, to make it easier to adapt to a multiprocessing environment, several buffers are used to hold data destined for transfer between slices. Not all are really necessary for a one-processor implementation, since some of the data could be transferred directly between storage areas associated with neighboring slices, but the clarity is improved at minimal cost. Buffers labeled C_h and C_l are used to hold data for copy operations in the high and low x-directions, while C_w is for copied data necessitated by the periodic wraparound between first and last slices. Buffers M_h, M_l and M_w play similar roles in the move operations.
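    In C, this working storage might be declared as follows (a sketch only; the capacities and the Atom type are carried over from the illustrative fragments above):

        /* Sketch of the working storage: three cyclically reused slice
           buffers plus copy and move buffers for the high-x, low-x and
           wraparound transfers; MAX_SLICE and MAX_EDGE are assumed
           capacity limits, not values from the actual program. */
        #define MAX_SLICE 40000
        #define MAX_EDGE   4000

        Atom slice_buf[3][MAX_SLICE];   /* B[0], B[1], B[2] */
        Atom copy_hi_buf[MAX_EDGE], copy_lo_buf[MAX_EDGE], copy_wrap_buf[MAX_EDGE];
        Atom move_hi_buf[MAX_EDGE], move_lo_buf[MAX_EDGE], move_wrap_buf[MAX_EDGE];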

    The description which follows shows a schematic outline of the computations involved in this approach. Low-level details are omitted. The symbol ← denotes a transfer of all relevant data for a copy, move or slice input/output operation. Functions such as copy_hi() are responsible for copying or moving data to a buffer in the high-x or low-x directions from the specified slice, and making the necessary corrections for periodic wraparound (specified by the second argument); the actual molecular dynamics work, including replication, is done by process().

    B[2] ← S[N_s - 1];  b ← 0
    for num_steps iterations do
        b_h ← (b + 1) mod 3;  B[b_h] ← S[0];  C_w ← copy_hi(B[b_h], L_x)
        for n = 0 to N_s - 1 do
            b ← (b + 1) mod 3;  b_h ← (b + 1) mod 3;  b_l ← (b - 1) mod 3
            B[b_h] ← S[(n + 1) mod N_s]
            process(B[b])
            if n ≠ 0 then
                M_l ← move_lo(B[b], 0);  B[b_l] ← B[b_l] + M_l
                S[(n - 1) mod N_s] ← B[b_l];  B[b] ← B[b] + M_h
            else
                M_w ← move_lo(B[b], L_x)
            if n = N_s - 1 then
                M_h ← move_hi(B[b], L_x);  B[b] ← B[b] + M_w
            else
                M_h ← move_hi(B[b], 0)
        enddo
        S[N_s - 1] ← B[b]
    enddo

    As atoms move out of a slice, vacancies will appear in the arrays used for storage. Vacated storage need not be recovered immediately, and it may be sufficient to flag the elements in question by setting the atom identifiers to zero (presuming identifiers are used in the calculation) to eliminate them from further processing; such an approach does not preclude vectorization using the methods of section 4. Since very few atoms move between slices at each time step, the appearance of vacancies is slow, but, especially for vector processing, excessive fragmentation of storage will lead to reduced performance. The storage arrays will have to be compressed periodically to eliminate gaps. The calculation must continually monitor storage utilization to ensure that there is always sufficient space for copied and moved atoms.
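    Such a compression pass is simple; a C sketch (again using the illustrative Atom layout introduced earlier):

        /* Sketch of periodic storage compaction: squeeze out entries whose
           identifier has been zeroed, returning the reduced atom count so
           that the trailing slots become available for reuse. */
        int compact_slice(Atom *slice, int n)
        {
            int m = 0;
            for (int i = 0; i < n; i++) {
                if (slice[i].id != 0)
                    slice[m++] = slice[i];
            }
            return m;                /* atoms remaining after compression */
        }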

    5.4. Performance

    Timing measurements were carried out on the Cray Y-MP for a series of two-dimensional systems ranging in size up to 2.56 × 10^6 atoms. The additional computation time required to deal with subdivision was minimal, amounting to no more than five percent (the time increased from 3.9 to 4.1 μs per atom step), and attributable principally to the additional computations involved in processing interactions across boundaries. Less than 16 × 10^6 words of storage were required for the largest system, which was subdivided into 80 subregions; if the leapfrog method is used for integrating the equations of motion, only four words of storage must be reserved for atoms not in the subregion being processed (or five words if atom identifiers are also required).

    5.5. Application to shared-memory multiprocessors

    While the partitioned approach has much in common with the implementation using a set of communicating processors, each with its own private storage (see II), it was pointed out above that it can also form the basis for a version of the program designed to run on a multiprocessor system using shared memory. One of the problems encountered when several processors attempt to use memory that is common to all of them is access conflict. A spatial subdivision of the computation ensures that each processor will spend most of its time working on its own private data, and even when data is transferred between subregions there is little cause for conflict; the spatial subdivision approach should therefore prove effective in such a situation. In fact two levels of partitioning are envisaged for a multiprocessor

    implementation of this kind, one that splits the system among the processors, the other that economizes on the storage needed for the layers in each processor. This is a subject for future exploration.

    6. Conclusion

    Supercomputers, with their uncompromising insistence on careful management of data, present a challenge when it comes to implementing algorithms whose data is not structured in the required way. Molecular dynamics simulation, especially in cases where the interactions are limited to a very short range, provides an example of such a problem. However, by careful reformulation of the algorithm, it is possible to arrive at a computational scheme whose data is organized in a manner that can be processed efficiently by a vector computer. While the performance achieved in this way might still be far from the theoretical maxima claimed for such machines, owing to the considerable amount of data rearrangement that goes on throughout the computation, it is substantially better than what would otherwise be achieved. Despite the fact that these methods call for extra storage, further enhancement of the algorithms keeps such requirements to a minimum.

    Acknowledgements

    I would like to thank David Landau (University of Georgia), Kurt Binder (University of Mainz) and Dietrich Stauffer (KFA Jülich) for their hospitality during the periods that much of this work was carried out. Burkhard Dünweg and Shlomo Harari are thanked for helpful discussions.

    References

    [1] D.C. Rapaport, Comput. Phys. Commun. 62 (1991) 217, this issue.
    [2] D.C. Rapaport, Comput. Phys. Rep. 9 (1988) 1.
    [3] G. Ciccotti and W.G. Hoover, eds., Molecular Dynamics Simulation of Statistical Mechanical Systems, Proceedings of the Enrico Fermi International School of Physics, Course XCVII, Varenna, 1985 (North-Holland, Amsterdam, 1986).
    [4] K. Kremer, G.S. Grest and I. Carmesin, Phys. Rev. Lett. 61 (1988) 566.
    [5] F.F. Abraham, Adv. Phys. 35 (1986) 1.
    [6] D.C. Rapaport, Phys. Rev. A 36 (1987) 3288.
    [7] M.P. Allen and D.J. Tildesley, Computer Simulation of Liquids (Oxford Univ. Press, Oxford, 1987).
    [8] L. Verlet, Phys. Rev. 159 (1967) 98.
    [9] G.S. Grest, B. Dünweg and K. Kremer, Comput. Phys. Commun. 55 (1989) 269.
    [10] D.C. Rapaport, to be published.
    [11] R.W. Hockney and C.R. Jesshope, Parallel Computers, 2nd ed. (Adam Hilger, Bristol, 1988).
    [12] J.J. Dongarra, Argonne National Lab. Math. and Comp. Sci. Tech. Memo no. 23 (1988).