
MSc in High Performance Computing
Computational Chemistry Module

Parallel Molecular Dynamics (ii)

Bill Smith
CCLRC Daresbury Laboratory
[email protected]

Basic MD Parallelization Strategies

Recap:

● Last Lecture
  – Computing Ensemble
  – Hierarchical Control
  – Replicated Data

● This Lecture
  – Systolic Loops
  – Domain Decomposition

Systolic Loops: SLS-G Algorithm

[Diagram: 2P data packets, two per node, arranged in a loop around the P processors: Proc 0 holds packets 1 and 2P, Proc 1 holds 2 and 2P-1, ..., Proc (P-2) holds P-1 and P+2, Proc (P-1) holds P and P+1.]

● Systolic Loop algorithms
  – Compute the interactions between (and within) 'data packets'
  – Data packets are then transferred between nodes to permit calculation of all possible pair interactions

Systolic Loop (SLS-G) Algorithm

● Systolic Loop Single-Group
● Features:
  – P processing nodes, N molecules
  – 2P groups ('packets') of n molecules (N = 2Pn)
  – For each time step:
    • (a) calculate intra-group forces
    • (b) calculate inter-group forces
    • (c) move data packets one 'pulse'
    • (d) repeat (b)-(c) 2P-1 times
    • (e) integrate equations of motion
  (a schematic sketch of this pulse schedule follows below)
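Below is a minimal Python sketch (not from the lecture) of one standard way to realise such a pulse schedule: a round-robin rotation in which one packet stays fixed and the other 2P-1 circulate, so that every pair of packets meets on some node. The actual SLS-G pulse pattern may differ in detail; the packet bookkeeping and the coverage check are invented for illustration.

```python
# Serial simulation of a systolic-loop (round-robin) pulse schedule:
# 2P packets sit in 2P slots, two per node; one packet stays fixed and the
# other 2P-1 rotate one slot per pulse, so 2P-1 pulses cover all packet pairs.
from itertools import combinations

P = 4                          # number of processing nodes
slots = list(range(2 * P))     # slot i initially holds packet i
met = set()                    # packet pairs whose inter-group forces were done

for pulse in range(2 * P - 1):
    # node p currently holds the packets in slots p and 2P-1-p
    for p in range(P):
        a, b = slots[p], slots[2 * P - 1 - p]
        met.add(tuple(sorted((a, b))))          # step (b): inter-group forces
    # step (c): one 'pulse' -- the packet in slot 0 stays put, the remaining
    # 2P-1 packets rotate one place through the other slots
    slots = [slots[0]] + [slots[-1]] + slots[1:-1]

# P*(2P-1) computed pairs cover all C(2P,2) pairs, i.e. each pair met once
assert met == set(combinations(range(2 * P), 2))
print(len(met), "packet pairs covered in", 2 * P - 1, "pulses on", P, "nodes")
```

In a real distributed implementation each rotation step becomes a send/receive of one packet between neighbouring nodes; here the schedule is simulated serially so that its coverage can be checked.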

SLS-G Communications Pattern

Systolic Loop Performance Analysis (i)

Processing Time:

  $T_p = n(n-1)\,\tau_f + (2P-1)\,n^2\,\tau_f'$

Communications Time:

  $T_c = (2P-1)\,n\,\tau_c$

with $n = N/2P$ and $\tau_f' = \tau_f/6$

Systolic Loop Performance Analysis (ii)

Fundamental Ratio:

  $R_{cp} = \frac{T_c}{T_p} = \frac{2P(2P-1)\,\tau_c}{(N-2P)\,\tau_f + (2P-1)\,N\,\tau_f'}$

Large N (N >> P):

  $R_{cp} \approx \frac{2P(2P-1)\,\tau_c}{N\,[\,\tau_f + (2P-1)\,\tau_f'\,]} \propto \frac{1}{N}$  (at fixed P the communication overhead vanishes as the system grows)

Small N (N ~ 2P, i.e. n ~ 1):

  $R_{cp} \approx \frac{\tau_c}{\tau_f'}$  (a constant: communications remain a fixed fraction of the compute time)

Systolic Loop Algorithms

● Advantages
  – Good load balancing
  – Portable between parallel machines
  – Good type 1 scaling with system size and processor count
  – Memory requirement fully distributed
  – Asynchronous communications

● Disadvantages
  – Complicated communications strategy
  – Complicated force fields difficult

Domain Decomposition (Scalar - 2D)

Domain Decomposition (Parallel - 2D)

[Figure: the 2D system divided into four domains A, B, C and D, one per processor.]

Domain Decomposition (Parallel - 3D)

[Figure: panels (a) and (b) - the simulation cell divided into 3D processor domains.]

Domain Decomposition MD

● Features:
  – Short range potential cut off (rcut << Lcell)
  – Spatial decomposition of atoms into domains
  – Map domains onto processors
  – Use link cells in each domain
  – Pass border link cells to adjacent processors
  – Calculate forces, solve equations of motion
  – Re-allocate atoms leaving domains
  (a sketch of the decomposition and halo step follows below)
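A minimal serial numpy sketch of the decomposition bookkeeping listed above, assuming a cubic box, a 2x2x2 processor grid and link cells no smaller than the cutoff; all sizes, names and the random coordinates are invented for illustration (a real code would also subtract each domain's origin and exchange the border cells with its neighbours).

```python
# Spatial decomposition + link cells: assign atoms to domains, bin one domain's
# atoms into link cells, and flag the border cells that form the halo.
import numpy as np

rng = np.random.default_rng(0)
L, rcut, N = 20.0, 2.5, 1000
px = py = pz = 2                          # 2 x 2 x 2 domain grid (P = 8)
coords = rng.uniform(0.0, L, size=(N, 3))

# map each atom to a domain (spatial decomposition)
dom_size = np.array([L / px, L / py, L / pz])
dom_idx = np.floor(coords / dom_size).astype(int)     # (N, 3) domain index

# within domain (0,0,0), bin its atoms into link cells of side >= rcut
# (this domain's origin is at 0, so no origin shift is needed here)
home = np.all(dom_idx == [0, 0, 0], axis=1)
local = coords[home]
m = int(np.floor(dom_size[0] / rcut))                 # link cells per side
cell_size = dom_size / m
cell_idx = np.floor(local / cell_size).astype(int)

# border link cells (index 0 or m-1 in any direction) are the halo data that
# must be copied to the adjacent domains before the force calculation
border = np.any((cell_idx == 0) | (cell_idx == m - 1), axis=1)
print(f"domain (0,0,0): {local.shape[0]} atoms in {m}**3 link cells, "
      f"{border.sum()} of them in border cells to export")
```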

Domain Decomposition Performance Analysis (i)

● Processing Time:

  $T_p = \frac{n\,m^3}{2}\left[(n-1)\,\tau_f + 27\,n\,\tau_f'\right]$

● Communications Time:

  $T_c = 6\,(n\,m^2)\,\tau_c$

● with $n = N/(P\,m^3)$ and $\tau_f' = 3\tau_f/4$
● and $m^3$ is the number of link cells per node.

NB: O(N) Algorithm

Domain Decomposition Performance Analysis (ii)

Fundamental Ratio:

  $R_{cp} = \frac{T_c}{T_p} = \frac{12\,\tau_c}{m\left[(n-1)\,\tau_f + 27\,n\,\tau_f'\right]} = \frac{12\,P\,m^2\,\tau_c}{(N - P\,m^3)\,\tau_f + 27\,N\,\tau_f'}$

Large N Case 1 (N >> P and m fixed):

  $R_{cp} \approx \frac{12\,P\,m^2\,\tau_c}{N\,(\tau_f + 27\,\tau_f')}$

Large N Case 2 (N >> P and n fixed, i.e. $m^3 \propto N/P$):

  $R_{cp} \approx \frac{12\,\tau_c}{(n-1)\,\tau_f + 27\,n\,\tau_f'}\left(\frac{n\,P}{N}\right)^{1/3}$

Small N (N ~ P and m ~ 1, so n ~ 1):

  $R_{cp} \approx \frac{12\,\tau_c}{27\,\tau_f'}$  (a constant: communications stay a fixed fraction of the compute time)

Domain Decomposition MD

● Advantages:
  – Predominantly local communications
  – Good load balancing (if the system is isotropic!)
  – Good type 1 scaling
  – Ideal for huge systems (10^5 atoms or more)
  – Simple communication structure
  – Fully distributed memory requirement
  – Dynamic load balancing possible

● Disadvantages
  – Problems with mapping/portability
  – Sub-optimal type 2 scaling
  – Requires short potential cut off
  – Complex force fields tricky

Domain Decomposition: Intramolecular Forces

[Figure: the force field definition, expressed in global atomic indices, has to be mapped onto the local atomic indices held by the processor domains P0, P1, P2, ... Difficult!]

(a toy illustration of this index bookkeeping follows below)
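A toy Python illustration (invented data, hypothetical names) of why this mapping is awkward: bonded terms are defined with global atomic indices, but each domain stores a re-indexed local subset, so every intramolecular term needs a global-to-local lookup plus a test that all of its atoms are actually available on that domain.

```python
# Force-field definition: bonds as pairs of *global* atom indices
bonds = [(0, 1), (1, 2), (2, 3), (4, 5)]

# Which global atoms each processor domain currently holds (resident + halo)
domain_atoms = {0: [0, 1, 2], 1: [2, 3, 4, 5]}

# Per-domain tables translating global -> local indices
global_to_local = {p: {g: i for i, g in enumerate(atoms)}
                   for p, atoms in domain_atoms.items()}

for p, table in global_to_local.items():
    # a bond can only be evaluated on domain p if *all* its atoms are there
    local_bonds = [(table[i], table[j]) for i, j in bonds
                   if i in table and j in table]
    print(f"domain {p}: evaluates bonds (local indices) {local_bonds}")
```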

Coulombic Forces: Smoothed Particle-Mesh Ewald

The crucial part of the SPME method is the conversion of the Reciprocal Space component of the Ewald sum into a form suitable for Fast Fourier Transforms (FFT). Thus:

  $U_{recip} = \frac{1}{2V\epsilon_0}\sum_{\mathbf{k}\neq 0}\frac{\exp(-k^2/4\alpha^2)}{k^2}\left|\sum_{j=1}^{N}q_j\exp(i\,\mathbf{k}\cdot\mathbf{r}_j)\right|^2$

becomes:

  $U_{recip} = \frac{1}{2V\epsilon_0}\sum_{k_1,k_2,k_3}G^T(k_1,k_2,k_3)\,Q(k_1,k_2,k_3)$

where G and Q are 3D grid arrays (see later).

Ref: Essmann et al., J. Chem. Phys. 103, 8577 (1995)

SPME: Spline Scheme

Central idea - share discrete charges on a 3D grid:

  $\exp(2\pi i\,u_j k/K) \approx b(k)\sum_{l}M_n(u_j-l)\exp(2\pi i\,k\,l/K)$

with

  $b(k) = \exp(2\pi i\,(n-1)k/K)\left[\sum_{l=0}^{n-2}M_n(l+1)\exp(2\pi i\,k\,l/K)\right]^{-1}$

Cardinal B-Splines $M_n(u)$ - in 1D:

  $M_n(u) = \frac{1}{(n-1)!}\sum_{k=0}^{n}(-1)^k\frac{n!}{k!\,(n-k)!}\left[\max(u-k,0)\right]^{n-1}$

Recursion relation:

  $M_n(u) = \frac{u}{n-1}M_{n-1}(u) + \frac{n-u}{n-1}M_{n-1}(u-1)$
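A short Python sketch (assumed, not the lecture's code) of the cardinal B-spline evaluated via the recursion relation above, together with a check that the n interpolation weights for a charge at fractional grid coordinate u sum to one.

```python
def M(n, u):
    """Cardinal B-spline of order n >= 2, non-zero only for 0 < u < n."""
    if n == 2:
        return 1.0 - abs(u - 1.0) if 0.0 <= u <= 2.0 else 0.0
    # recursion relation from the slide
    return (u * M(n - 1, u) + (n - u) * M(n - 1, u - 1.0)) / (n - 1)

n = 4                       # interpolation order
u = 2.37                    # fractional grid coordinate of some charge
weights = [M(n, u - k) for k in range(int(u) - n + 1, int(u) + 1)]
print(weights, sum(weights))    # weights on n consecutive grid points, sum ~ 1.0
```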

SPME: Building the Arrays

  $Q(k_1,k_2,k_3) = \sum_{j=1}^{N}q_j\sum_{n_1,n_2,n_3}M_n(u_{1j}-k_1-n_1K_1)\,M_n(u_{2j}-k_2-n_2K_2)\,M_n(u_{3j}-k_3-n_3K_3)$

is the charge array and $Q^T(k_1,k_2,k_3)$ its discrete Fourier transform.

$G^T(k_1,k_2,k_3)$ is the discrete Fourier transform of the function:

  $G(k_1,k_2,k_3) = \frac{\exp(-k^2/4\alpha^2)}{k^2}\,B(k_1,k_2,k_3)\left(Q^T(k_1,k_2,k_3)\right)^{*}$

with

  $B(k_1,k_2,k_3) = \left|b_1(k_1)\right|^2\left|b_2(k_2)\right|^2\left|b_3(k_3)\right|^2$

(a small 1D numerical check of this charge-spreading + FFT construction follows below)
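As a plausibility check of this construction, the following 1D Python sketch (invented sizes and charges, not the lecture's code) spreads a few charges onto a grid with B-splines, applies b(k), and compares the FFT of the resulting Q array with the exact structure factor $\sum_j q_j \exp(2\pi i\,k\,u_j/K)$; the two agree closely for the low-k modes that dominate the reciprocal-space sum.

```python
import numpy as np

def M(n, u):                     # cardinal B-spline, as in the previous sketch
    if n == 2:
        return 1.0 - abs(u - 1.0) if 0.0 <= u <= 2.0 else 0.0
    return (u * M(n - 1, u) + (n - u) * M(n - 1, u - 1.0)) / (n - 1)

K, n = 32, 6                                   # grid points, spline order (even)
rng = np.random.default_rng(1)
q = rng.normal(size=5)                         # a few point charges
u = rng.uniform(0.0, K, size=5)                # their fractional grid coordinates

# 1D charge array Q: each charge is spread over n nearby grid points
Q = np.zeros(K)
for qj, uj in zip(q, u):
    for l in range(int(uj) - n + 1, int(uj) + 1):
        Q[l % K] += qj * M(n, uj - l)

# b(k) as defined on the spline slide
k = np.arange(K)
denom = sum(M(n, l + 1.0) * np.exp(2j * np.pi * k * l / K) for l in range(n - 1))
b = np.exp(2j * np.pi * (n - 1) * k / K) / denom

# exact structure factor versus its spline + FFT approximation
S_exact = np.array([np.sum(q * np.exp(2j * np.pi * kk * u / K)) for kk in k])
S_spme = b * K * np.fft.ifft(Q)           # = b(k) * sum_l Q[l] exp(+2*pi*i*k*l/K)

print("max |error| over the 8 lowest k modes:",
      np.max(np.abs(S_exact[:8] - S_spme[:8])))
```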

SPME Parallelisation

● Handle real space terms using short range force methods
● Reciprocal space terms - options:
  – Fully replicated Q array construction and FFT (R. Data)
  – Atomic partition of Q array, replicated FFT (R. Data)
    • Easily done, acceptable for few processors
    • Limits imposed by RAM, global sum required
    (see the sketch below)
  – Domain decomposition of Q array, distributed FFT
    • Required for large Q array and many processors
    • Atoms 'shared' between domains - potentially awkward
    • Requires distributed FFT - implies comms dependence
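A minimal mpi4py sketch of the second option ("atomic partition of Q array, replicated FFT"): each rank fills Q only for its own atoms, a global sum makes the array whole on every processor, and the 3D FFT is then done redundantly on each. Charge spreading is reduced to nearest-grid-point assignment to keep the sketch short; all names and sizes are assumptions, not the lecture's code.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

K = 16                                         # grid points per dimension
rng = np.random.default_rng(rank)              # this rank's share of the atoms
q = rng.normal(size=100)
u = rng.uniform(0.0, K, size=(100, 3))         # fractional grid coordinates

# partial charge array from local atoms only (nearest grid point, not B-splines)
Q = np.zeros((K, K, K))
idx = np.floor(u).astype(int) % K
np.add.at(Q, (idx[:, 0], idx[:, 1], idx[:, 2]), q)

# global sum replicates the complete Q array on every processor ...
comm.Allreduce(MPI.IN_PLACE, Q, op=MPI.SUM)

# ... so the FFT can be done serially (and redundantly) on each of them
Q_T = np.fft.fftn(Q)
if rank == 0:
    print(f"{nproc} ranks, replicated Q array of {Q.size} points transformed")
```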

SPME: Parallel Approaches

● SPME is generally faster than the conventional Ewald sum. The algorithm scales as O(N log N).
  – In Replicated Data: build the FFT array in pieces on each processor and make it whole by a global sum for the FFT operation.
  – In Domain Decomposition: build the FFT array in pieces on each processor and keep it that way for the distributed FFT operation (the FFT 'hides' all the implicit communications).

● Characteristics of FFTs
  – Fast (!) - O(M log M) operations, where M is the number of points in the grid
  – Global operations - to perform an FFT you need all the points
  – This makes it difficult to write an efficient FFT that scales well.

Traditional Parallel FFTs

● Strategy
  – Distribute the data by planes
  – Each processor has a complete set of points in the x and y directions, so it can do those Fourier transforms
  – Redistribute the data so that a processor holds all the points in z
  – Do the z transforms

● Characteristics
  – Allows efficient implementation of the serial FFTs (use a library routine)
  – In practice, for large enough 3D FFTs, it can scale reasonably
  – However the distribution does not usually map onto the domain decomposition of the simulation - this implies large amounts of data redistribution
  (the axis-by-axis factorisation that this strategy relies on is checked in the sketch below)
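The plane-wise strategy works because a multidimensional FFT factorises into independent 1D FFTs along each axis, so a processor that holds complete lines of points in one direction can transform them without communication. A quick numpy check of that identity (array shape arbitrary):

```python
import numpy as np

a = np.random.default_rng(2).normal(size=(8, 12, 10))

step = np.fft.fft(a, axis=0)        # all x-lines transformed independently
step = np.fft.fft(step, axis=1)     # then all y-lines
step = np.fft.fft(step, axis=2)     # then all z-lines

assert np.allclose(step, np.fft.fftn(a))   # identical to the full 3D transform
```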

Daresbury Advanced 3-D FFT (DAFT)

● Takes data distributed as in the MD domain decomposition.
● So do a distributed-data FFT in the x direction
  – Then the y
  – And finally the z
● The disadvantage is that it cannot use a library routine for the 1D FFT (not quite true - sub-FFTs can be done on each domain)
● Scales quite well - e.g. on 512 procs, an 8x8x8 proc grid, a 1D FFT need only scale to 8 procs
● Totally avoids data redistribution costs
● Communication is by rows/columns
● In practice DAFT wins (on the machines we have compared) and the coding is also simpler!

Domain Decomposition: Load Balancing Issues

● Domain decomposition according to spatial domains sometimes presents severe load balancing problems
  – Material can be inhomogeneous
  – Some parts may require different amounts of computation
    • E.g. an enzyme in a large bath of water
● Strategies can include
  – Dynamic load balancing: re-distribution (migration) of atoms from one processor to another
    • Need to carry around the associated data on bonds, angles, constraints…
  – Redistribution of parts of the force calculation
    • E.g. NAMD

Domain Decomposition: Dynamic Load Balancing

Can be applied in 3D (but not easily!)

Ref: Boillat, Bruge, Kropf, J. Comput. Phys. 96, 1 (1991)

NAMD: Dynamic Load Balancing

● NAMD exploits MD as a tool to understand the structure and function of biomolecules
  – proteins, DNA, membranes
● NAMD is a production-quality MD program
  – Active use by biophysicists (science publications)
  – 50,000+ lines of C++ code
  – 1000+ registered users
  – Features and "accessories" such as
    • VMD: visualization and analysis
    • BioCoRE: collaboratory
    • Steered and Interactive Molecular Dynamics
● Load balancing ref:
  – L.V. Kale, M. Bhandarkar and R. Brunner, Lecture Notes in Computer Science 1998, 1457, 251-261.

NAMD: Initial Static Balancing

● Allocate patches (link cells) to processors so that
  – Each processor has (approximately) the same number of atoms
  – Neighbouring patches share the same processor if possible
● Weighting the workload on each processor
  – Calculate forces internal to each patch (weight ~ n_p^2/2)
  – Calculate forces between patches (i.e. one compute object) on the same processor (weight ~ w*n1*n2). The factor w depends on the connection (face-face > edge-edge > corner-corner)
  – If the two patches are on different processors - send a proxy patch to the lesser loaded processor.
● Dynamic load balancing is used during the simulation run.

NAMD: Dynamic Load Balancing (i)

● Balance is maintained by a Distributed Load Balance Coordinator, which monitors on each processor:
  – Background load (non-migratable work)
  – Idle time
  – Migratable compute objects and their associated compute load
  – The patches that compute objects depend upon
  – The home processor of each patch
  – The proxy patches required by each processor
● The monitored data is used to determine the load balancing

NAMD: Dynamic Load Balancing (ii)

● Greedy load balancing strategy:
  – Sort migratable compute objects in order of heaviest load
  – Sort processors in order of 'hungriest'
  – Share out compute objects so that the hungriest-ranked processor gets the largest compute object available
  – BUT: this does not take into account the communication cost
● Modification:
  – Identify the least loaded processors with:
    • Both patches or proxies needed to complete a compute object (no comms)
    • One patch necessary for a compute object (moderate comms)
    • No patches for a compute object (high comms)
  – Allocate the compute object to the processor giving the best compromise in cost (compute plus communication).
  (a toy sketch of the greedy step follows below)
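A toy Python sketch of the greedy step, loosely modelled on the description above: compute objects are taken heaviest-first and each is assigned to the currently least loaded processor, with an invented additive penalty when none of the object's patches are resident there. NAMD's real balancer (Kale et al.) also weighs proxies and background load as listed above; this is only an illustration with made-up data.

```python
import heapq

# (load, object id, processors that already hold its patches) -- toy data
objects = [(9.0, "A", {0}), (7.5, "B", {1, 2}), (6.0, "C", {0, 1}),
           (4.0, "D", {3}), (2.5, "E", {2}), (1.0, "F", {1})]
nproc, comm_penalty = 4, 1.5

loads = [(0.0, p) for p in range(nproc)]      # min-heap of (current load, proc)
heapq.heapify(loads)
assignment = {}

for work, name, home_procs in sorted(objects, key=lambda o: o[0], reverse=True):
    load, proc = heapq.heappop(loads)         # the 'hungriest' processor
    # placing an object away from its patches costs extra communication
    cost = work + (0.0 if proc in home_procs else comm_penalty)
    assignment[name] = proc
    heapq.heappush(loads, (load + cost, proc))

print("assignment:", assignment)
print("final loads:", sorted(loads, key=lambda t: t[1]))
```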

Impact of Measurement-based Load Balancing

The End