TRANSCRIPT
MSc in High Performance Computing
Computational Chemistry Module
Parallel Molecular Dynamics (ii)
Bill Smith
CCLRC Daresbury [email protected]
Computational Science & Engineering Department
Basic MD Parallelization Strategies

Recap:
● Last Lecture
– Computing Ensemble
– Hierarchical Control
– Replicated Data
● This Lecture
– Systolic Loops
– Domain Decomposition
Systolic Loops: SLS-G Algorithm

[Figure: P processors (Proc 0, Proc 1, ..., Proc (P-2), Proc (P-1)) arranged in a ring; each holds two data packets - 1 to P along the top row, 2P down to P+1 along the bottom - which circulate between nodes.]

● Systolic Loop algorithms
– Compute the interactions between (and within) 'data packets'
– Data packets are then transferred between nodes to permit calculation of all possible pair interactions
Systolic Loop (SLS-G) Algorithm
● Systolic Loop Single-Group
● Features:
– P processing nodes, N molecules
– 2P groups ('packets') of n molecules (N = 2Pn)
– For each time step:
• (a) calculate intra-group forces
• (b) calculate inter-group forces
• (c) move data packets one 'pulse'
• (d) repeat (b)-(c) 2P-1 times
• (e) integrate equations of motion
(A minimal sketch of the pulse communication follows below.)
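Below is a minimal sketch of steps (b)-(c) as an MPI ring exchange. It is an assumption-laden illustration: the force routines are empty stubs, only the lower packet row circulates (the full SLS-G schedule moves packets as in the diagram above), and all names and sizes are invented.

```c
/* Sketch of the SLS-G pulse: each node holds two packets and passes
   one of them around the ring after every inter-packet force step.
   NPART, the stubs and the simplified schedule are illustrative only. */
#include <mpi.h>

#define NPART 8                                /* particles per packet (n) */

static void intra_forces(double *pkt) { (void)pkt; }        /* step (a) stub */
static void inter_forces(double *a, double *b) { (void)a; (void)b; } /* step (b) stub */

int main(int argc, char **argv)
{
    int rank, P;
    double upper[3 * NPART] = {0}, lower[3 * NPART] = {0};  /* two packets per node */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int right = (rank + 1) % P, left = (rank + P - 1) % P;

    intra_forces(upper);                       /* step (a) */
    intra_forces(lower);

    for (int pulse = 0; pulse < 2 * P - 1; ++pulse) {        /* step (d) */
        inter_forces(upper, lower);            /* step (b) */
        /* step (c): pulse the circulating packet one node round the loop */
        MPI_Sendrecv_replace(lower, 3 * NPART, MPI_DOUBLE,
                             right, 0, left, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    /* step (e): integrate equations of motion (omitted) */

    MPI_Finalize();
    return 0;
}
```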
SLS-G Communications Pattern
Systolic Loop Performance Analysis (i)

Processing Time:

$$T_p = \frac{n(n-1)}{2}\,\tau_f + (2P-1)\,\frac{n^2}{2}\,\tau_f' = \frac{n}{2}\left[(n-1)\,\tau_f + (2P-1)\,n\,\tau_f'\right]$$

Communications Time:

$$T_c = (2P-1)\,n\,\tau_c$$

with

$$n = N/2P, \qquad \tau_f' = \tau_f/6$$

(τ_f is the time to compute one pair force; τ_c is the time to communicate one molecule's data.)
Systolic Loop Performance Analysis (ii)

Fundamental Ratio:

$$R_{cp} = \frac{T_c}{T_p} = \frac{2(2P-1)(2P)\,\tau_c}{(N-2P)\,\tau_f + (2P-1)\,N\,\tau_f'}$$

Large N (N >> P):

$$R_{cp} \approx \frac{4P(2P-1)\,\tau_c}{N\left[\tau_f + (2P-1)\,\tau_f'\right]}$$

Small N (N ~ 2P, so n = 1):

$$R_{cp} = \frac{4P(2P-1)\,\tau_c}{2P(2P-1)\,\tau_f'} = \frac{2\,\tau_c}{\tau_f'}$$

(A worked evaluation of this model follows below.)
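As a concrete check, the sketch below tabulates T_p, T_c and the ratio R_cp against processor count, using the formulas as reconstructed on these slides; the values of N, τ_f and τ_c are invented for illustration.

```c
/* Evaluate the SLS-G performance model for sample parameters.
   tau_f and tau_c are made-up times, not measurements. */
#include <stdio.h>

int main(void)
{
    double tau_f  = 1.0e-6;          /* time per pair force (assumed) */
    double tau_fp = tau_f / 6.0;     /* tau_f' = tau_f/6, as on the slide */
    double tau_c  = 5.0e-6;          /* per-molecule communication time (assumed) */
    int N = 10000;

    for (int P = 2; P <= 64; P *= 2) {
        double n  = (double)N / (2 * P);                       /* n = N/2P */
        double Tp = 0.5 * n * ((n - 1) * tau_f + (2 * P - 1) * n * tau_fp);
        double Tc = (2 * P - 1) * n * tau_c;
        printf("P=%3d  Tp=%9.4f s  Tc=%8.4f s  Rcp=%7.5f\n", P, Tp, Tc, Tc / Tp);
    }
    return 0;
}
```

As the Large N limit predicts, R_cp stays small while N >> P and grows as the packets shrink with increasing P.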
Systolic Loop Algorithms
● Advantages
– Good load balancing
– Portable between parallel machines
– Good type 1 scaling with system size and processor count
– Memory requirement fully distributed
– Asynchronous communications
● Disadvantages
– Complicated communications strategy
– Complicated force fields are difficult to implement
Domain Decomposition (Scalar - 2D)
Domain Decomposition (Parallel - 2D)

[Figure: a 2D system divided into four processor domains A, B, C, D.]
Domain Decomposition (Parallel - 3D)

[Figure: panels (a) and (b) showing a 3D domain decomposition.]
Domain Decomposition MD
● Features:
– Short range potential cut-off (r_cut << L_cell)
– Spatial decomposition of atoms into domains
– Map domains onto processors
– Use link cells in each domain
– Pass border link cells to adjacent processors (see the halo-exchange sketch below)
– Calculate forces, solve equations of motion
– Re-allocate atoms leaving domains
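A minimal sketch of the 'pass border link cells' step, using an MPI Cartesian grid and one halo exchange per axis; the buffer sizes, packing and names are assumptions for illustration, not the lecture's code.

```c
/* Halo exchange skeleton for a 3D domain decomposition: each domain
   sends its border link-cell coordinates to the neighbouring domain
   along each axis. NMAX and the empty buffers are placeholders. */
#include <mpi.h>

#define NMAX 32                       /* max atoms per border slab (assumed) */

int main(int argc, char **argv)
{
    int rank, P, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Dims_create(P, 3, dims);                   /* factor P into a 3D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    double border[3 * NMAX] = {0};    /* coords in our +axis border cells */
    double halo[3 * NMAX];            /* incoming neighbour border cells */

    for (int axis = 0; axis < 3; ++axis) {
        int lo, hi;
        MPI_Cart_shift(cart, axis, 1, &lo, &hi);   /* neighbours on this axis */

        /* send our +axis border, receive the -axis neighbour's border;
           a second exchange in the opposite direction would follow */
        MPI_Sendrecv(border, 3 * NMAX, MPI_DOUBLE, hi, 0,
                     halo,   3 * NMAX, MPI_DOUBLE, lo, 0,
                     cart, MPI_STATUS_IGNORE);
    }

    /* link-cell force evaluation over own + halo cells, integration,
       and re-allocation of migrating atoms would follow */

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```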
Domain Decomposition Performance Analysis (i)

● Processing Time:

$$T_p = m^3\left[\frac{n(n-1)}{2}\,\tau_f + \frac{27}{2}\,n^2\,\tau_f'\right]$$

● Communications Time:

$$T_c = 6\,m^2\,n\,\tau_c$$

● with

$$n = N/Pm^3, \qquad \tau_f' = 4\tau_f/3$$

● and m^3 is the number of link cells per node.

NB: O(N) Algorithm
Domain Decomposition Performance Analysis (ii)

Fundamental Ratio:

$$R_{cp} = \frac{T_c}{T_p} = \frac{12\,P\,m^2\,\tau_c}{(N - Pm^3)\,\tau_f + 27\,N\,\tau_f'}$$

Large N Case 1 (N >> P and m fixed):

$$R_{cp} \approx \frac{12\,P\,m^2\,\tau_c}{N\left(\tau_f + 27\,\tau_f'\right)}$$

Large N Case 2 (N >> P and m^3 ∝ N/P, i.e. n fixed):

$$R_{cp} \approx \frac{12\,\tau_c}{(n-1)\,\tau_f + 27\,n\,\tau_f'}\left(\frac{nP}{N}\right)^{1/3}$$

Small N (N = P and m = 1, so n = 1):

$$R_{cp} \approx \frac{12\,\tau_c}{27\,\tau_f'}$$
Domain Decomposition MD
● Advantages:
– Predominantly local communications
– Good load balancing (if system is isotropic!)
– Good type 1 scaling
– Ideal for huge systems (10^5 or more atoms)
– Simple communication structure
– Fully distributed memory requirement
– Dynamic load balancing possible
● Disadvantages:
– Problems with mapping/portability
– Sub-optimal type 2 scaling
– Requires short potential cut-off
– Complex force fields tricky
Domain Decomposition: Intramolecular Forces

[Figure: the force field definition refers to global atomic indices, which must be mapped onto the local atomic indices held by each processor domain (P0, P1, P2). Difficult!]
Coulombic Forces: Smoothed Particle-Mesh Ewald

The crucial part of the SPME method is the conversion of the reciprocal space component of the Ewald sum into a form suitable for Fast Fourier Transforms (FFT). Thus:

$$U_{recip} = \frac{1}{2V\varepsilon_0}\sum_{\mathbf{k}\neq\mathbf{0}}\frac{\exp(-k^2/4\alpha^2)}{k^2}\left|\sum_{j=1}^{N}q_j\exp(i\,\mathbf{k}\cdot\mathbf{r}_j)\right|^2$$

becomes:

$$U_{recip} = \frac{1}{2V\varepsilon_0}\sum_{k_1,k_2,k_3}G^T(k_1,k_2,k_3)\,Q(k_1,k_2,k_3)$$

where G and Q are 3D grid arrays (see later).

Ref: Essmann et al., J. Chem. Phys. (1995) 103 8577
SPME: Spline Scheme

Central idea - share discrete charges on 3D grid:

$$\exp(2\pi i\,u\,k/K) \approx b(k)\sum_{l}M_n(u-l)\,\exp(2\pi i\,k\,l/K)$$

where

$$b(k) = \exp(2\pi i(n-1)k/K)\left[\sum_{l=0}^{n-2}M_n(l+1)\,\exp(2\pi i\,k\,l/K)\right]^{-1}$$

Cardinal B-Splines M_n(u) - in 1D:

$$M_n(u) = \frac{1}{(n-1)!}\sum_{k=0}^{n}(-1)^k\,\frac{n!}{k!(n-k)!}\,\max(u-k,0)^{n-1}$$

$$M_2(u) = 1 - |u-1| \ \ (0 \le u \le 2), \qquad M_2(u) = 0 \ \text{otherwise}$$

Recursion relation:

$$M_n(u) = \frac{u}{n-1}\,M_{n-1}(u) + \frac{n-u}{n-1}\,M_{n-1}(u-1)$$
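The recursion relation translates directly into code. Below is a small, self-contained sketch (names invented) that evaluates M_n(u) from the M_2 base case; for SPME these values become the weights that spread each charge over n grid points per axis.

```c
/* Cardinal B-spline M_n(u) from the recursion on this slide (n >= 2). */
#include <math.h>
#include <stdio.h>

static double Mn(int n, double u)
{
    if (n == 2)                                   /* M_2(u) = 1 - |u-1| on [0,2] */
        return (u >= 0.0 && u <= 2.0) ? 1.0 - fabs(u - 1.0) : 0.0;
    return (u * Mn(n - 1, u) + (n - u) * Mn(n - 1, u - 1.0)) / (n - 1);
}

int main(void)
{
    for (double u = 0.0; u <= 4.0; u += 0.5)      /* 4th-order spline values */
        printf("M4(%.1f) = %.6f\n", u, Mn(4, u));
    return 0;
}
```

Note the partition-of-unity property: for any fractional offset u, the n weights M_n(u), M_n(u+1), ..., M_n(u+n-1) sum to 1, which is what makes the spread charge exactly conserve the total charge.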
SPME: Building the Arrays

$$Q(k_1,k_2,k_3) = \sum_{j=1}^{N} q_j \sum_{n_1,n_2,n_3} M_n(u_{1j}-k_1-n_1K_1)\,M_n(u_{2j}-k_2-n_2K_2)\,M_n(u_{3j}-k_3-n_3K_3)$$

is the charge array and Q^T(k1,k2,k3) its discrete Fourier transform.

G^T(k1,k2,k3) is the discrete Fourier Transform of the function:

$$G(k_1,k_2,k_3) = \frac{\exp(-k^2/4\alpha^2)}{k^2}\,B(k_1,k_2,k_3)\,\left(Q^T(k_1,k_2,k_3)\right)^{*}$$

with

$$B(k_1,k_2,k_3) = \left|b_1(k_1)\right|^2\left|b_2(k_2)\right|^2\left|b_3(k_3)\right|^2$$
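A toy sketch of building the charge array Q from the formula above, on a single node and in real space only (the FFTs and the G array are not shown); the grid size, spline order and particle data are invented.

```c
/* Spread point charges onto a K^3 grid with order-n B-spline weights,
   i.e. build the SPME charge array Q. Illustrative values throughout. */
#include <math.h>
#include <stdio.h>

#define K 8                           /* grid points per side (assumed) */
#define ORDER 4                       /* spline order n */

static double Mn(int n, double u)     /* recursion from the previous slide */
{
    if (n == 2) return (u >= 0.0 && u <= 2.0) ? 1.0 - fabs(u - 1.0) : 0.0;
    return (u * Mn(n - 1, u) + (n - u) * Mn(n - 1, u - 1.0)) / (n - 1);
}

static double Q[K][K][K];             /* the charge array */

/* spread one charge q at fractional coordinates (s1,s2,s3) in [0,1) */
static void spread(double q, double s1, double s2, double s3)
{
    double u1 = s1 * K, u2 = s2 * K, u3 = s3 * K;
    for (int a = 0; a < ORDER; ++a)
        for (int b = 0; b < ORDER; ++b)
            for (int c = 0; c < ORDER; ++c) {
                int k1 = ((int)u1 - a + K) % K;        /* periodic wrap */
                int k2 = ((int)u2 - b + K) % K;
                int k3 = ((int)u3 - c + K) % K;
                Q[k1][k2][k3] += q * Mn(ORDER, u1 - (int)u1 + a)
                                   * Mn(ORDER, u2 - (int)u2 + b)
                                   * Mn(ORDER, u3 - (int)u3 + c);
            }
}

int main(void)
{
    spread(1.0, 0.53, 0.50, 0.47);    /* one unit charge, arbitrary position */
    double total = 0.0;
    for (int a = 0; a < K; ++a)
        for (int b = 0; b < K; ++b)
            for (int c = 0; c < K; ++c) total += Q[a][b][c];
    printf("total gridded charge = %f (should be 1)\n", total);
    return 0;
}
```

In the real method, Q would then be Fourier transformed, multiplied by G^T, and transformed back to obtain energies and forces.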
SPME Parallelisation
● Handle real space terms using short range force methods
● Reciprocal space terms options:
– Fully replicated Q array construction and FFT (R. Data)
– Atomic partition of Q array, replicated FFT (R. Data)
• Easily done, acceptable for few processors
• Limits imposed by RAM, global sum required
– Domain decomposition of Q array, distributed FFT
• Required for large Q array and many processors
• Atoms 'shared' between domains - potentially awkward
• Requires distributed FFT - implies comms dependence
SPME: Parallel Approaches
● SPME is generally faster than the conventional Ewald sum in most applications. The algorithm scales as O(N log N)
– In Replicated Data: build the FFT array in pieces on each processor and make it whole by a global sum for the FFT operation.
– In Domain Decomposition: build the FFT array in pieces on each processor and keep it that way for the distributed FFT operation (the FFT 'hides' all the implicit communications)
● Characteristics of FFTs
– Fast (!) - O(M log M) operations, where M is the number of points in the grid
– Global operations - to perform an FFT you need all the points
– This makes it difficult to write an efficient, well-scaling FFT.
Traditional Parallel FFTs
● Strategy
– Distribute the data by planes
– Each processor has a complete set of points in the x and y directions, so it can do those Fourier transforms
– Redistribute the data so that a processor holds all the points in z
– Do the z transforms
● Characteristics
– Allows efficient implementation of the serial FFTs (use a library routine)
– In practice, large enough 3D FFTs can scale reasonably
– However, the distribution does not usually map onto the domain decomposition of the simulation - implies large amounts of data redistribution (see the sketch below)
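To make the redistribution step concrete, here is a minimal sketch of the slab transpose using MPI_Alltoall; the grid size, packing layout and variable names are assumptions for illustration, and the FFTs themselves are elided.

```c
/* Re-slab a z-plane-distributed M^3 grid so each processor instead
   holds an x-slab, ready for the z transforms. P must divide M. */
#include <mpi.h>
#include <stdlib.h>

#define M 16                          /* grid dimension (assumed) */

int main(int argc, char **argv)
{
    int rank, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int S = M / P;                    /* planes per processor */

    /* z-slab layout: local[z][y][x] with z = rank*S .. rank*S+S-1 */
    double *local = malloc((size_t)S * M * M * sizeof *local);
    double *send  = malloc((size_t)S * M * M * sizeof *send);
    double *recv  = malloc((size_t)S * M * M * sizeof *recv);
    for (int i = 0; i < S * M * M; ++i) local[i] = (double)rank;

    /* ... x and y transforms run here on complete rows ... */

    /* pack: the block destined for proc p holds x in [p*S, p*S+S) */
    for (int p = 0; p < P; ++p)
        for (int z = 0; z < S; ++z)
            for (int y = 0; y < M; ++y)
                for (int x = 0; x < S; ++x)
                    send[(((size_t)p * S + z) * M + y) * S + x] =
                        local[((size_t)z * M + y) * M + p * S + x];

    /* the all-to-all exchange that re-slabs the grid */
    MPI_Alltoall(send, S * M * S, MPI_DOUBLE,
                 recv, S * M * S, MPI_DOUBLE, MPI_COMM_WORLD);

    /* recv now holds every z for our x-range: z transforms run here */

    free(local); free(send); free(recv);
    MPI_Finalize();
    return 0;
}
```

The cost of this exchange, repeated every time step, is exactly the data-redistribution overhead that DAFT (next slide) is designed to avoid.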
Daresbury Advanced 3-D FFT (DAFT)
● Takes data distributed as the MD domain decomposition.
● So do a distributed-data FFT in the x direction
– Then the y
– And finally the z
● Disadvantage is that it cannot use a library routine for the 1D FFT (not quite true - can do sub-FFTs on each domain)
● Scales quite well - e.g. on 512 procs, an 8x8x8 proc grid, a 1D FFT need only scale to 8 procs
● Totally avoids data redistribution costs
● Communication is by rows/columns
● In practice DAFT wins (on the machines we have compared) and the coding is simpler!
Domain Decomposition: Load Balancing Issues

● Domain decomposition according to spatial domains sometimes presents severe load balancing problems
– Material can be inhomogeneous
– Some parts may require different amounts of computation
• E.g. an enzyme in a large bath of water
● Strategies can include
– Dynamic load balancing: re-distribution (migration) of atoms from one processor to another
• Need to carry around associated data on bonds, angles, constraints...
– Redistribution of parts of the force calculation
• E.g. NAMD
Domain Decomposition: Dynamic Load Balancing

Can be applied in 3D (but not easily!)

Ref: Boillat, Bruge, Kropf, J. Comput. Phys., 96 1 (1991)
NAMD: Dynamic Load Balancing
● NAMD exploits MD as a tool to understand the structure and function of biomolecules
– proteins, DNA, membranes
● NAMD is a production quality MD program
– Active use by biophysicists (science publications)
– 50,000+ lines of C++ code
– 1000+ registered users
– Features and “accessories” such as
• VMD: visualization and analysis
• BioCoRE: collaboratory
• Steered and Interactive Molecular Dynamics
● Load balancing ref:
– L.V. Kale, M. Bhandarkar and R. Brunner, Lecture Notes in Computer Science 1998, 1457, 251-261.
NAMD: Initial Static Balancing
● Allocate patches (link cells) to processors so that
– Each processor has the same number of atoms (approx.)
– Neighbouring patches share the same processor if possible
● Weighting the workload on each processor
– Calculate forces internal to each patch (weight ~ n_p^2/2)
– Calculate forces between patches (i.e. one compute object) on the same processor (weight ~ w*n1*n2). Factor w depends on the connection (face-face > edge-edge > corner-corner)
– If two patches are on different processors, send a proxy patch to the lesser-loaded processor.
● Dynamic load balancing is used during the simulation run.
NAMD: Dynamic Load Balancing (i)
● Balance is maintained by a Distributed Load Balance Coordinator, which monitors on each processor:
– Background load (non-migratable work)
– Idle time
– Migratable compute objects and their associated compute load
– The patches that compute objects depend upon
– The home processor of each patch
– The proxy patches required by each processor
● The monitored data is used to determine load balancing
NAMD: Dynamic Load Balancing (ii)
● Greedy load balancing strategy (see the sketch below):
– Sort migratable compute objects in order of heaviest load
– Sort processors in order of 'hungriest'
– Share out compute objects so the hungriest-ranked processor gets the largest compute object available
– BUT: this does not take into account communication cost
● Modification:
– Identify the least loaded processors with:
• Both patches or proxies to complete a compute object (no comms)
• One patch necessary for a compute object (moderate comms)
• No patches for a compute object (high comms)
– Allocate the compute object to the processor giving the best compromise in cost (compute plus communication).
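A minimal sketch of the greedy step above, ignoring the communication-cost modification; the object loads and processor count are invented for illustration.

```c
/* Greedy balancing: heaviest compute object goes to the currently
   least-loaded ('hungriest') processor. Toy data throughout. */
#include <stdio.h>
#include <stdlib.h>

#define NOBJ  6
#define NPROC 3

static int heavier_first(const void *a, const void *b)
{
    double d = *(const double *)b - *(const double *)a;
    return (d > 0) - (d < 0);
}

int main(void)
{
    double obj[NOBJ]   = {5.0, 9.0, 2.0, 7.0, 4.0, 1.0}; /* object loads */
    double proc[NPROC] = {0.0, 0.0, 0.0};                /* accumulated loads */

    qsort(obj, NOBJ, sizeof obj[0], heavier_first);      /* heaviest first */

    for (int i = 0; i < NOBJ; ++i) {
        int hungriest = 0;                               /* least-loaded proc */
        for (int p = 1; p < NPROC; ++p)
            if (proc[p] < proc[hungriest]) hungriest = p;
        proc[hungriest] += obj[i];
        printf("object load %.1f -> proc %d\n", obj[i], hungriest);
    }
    for (int p = 0; p < NPROC; ++p)
        printf("proc %d total load %.1f\n", p, proc[p]);
    return 0;
}
```

NAMD's modification would, before each assignment, prefer among the lightly loaded processors one that already holds the object's patches or proxies, trading a little compute imbalance for lower communication cost.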
Impact of Measurement-based Load Balancing