![Page 1: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/1.jpg)
Scalability and interoperable libraries in NAMD
Laxmikant (Sanjay) KaleTheoretical Biophysics group
and
Department of Computer Science
University of Illinois at Urbana-Champaign
![Page 2: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/2.jpg)
Contributors
• PI s : – Laxmikant Kale, Klaus Schulten, Robert Skeel
• NAMD 1: – Robert Brunner, Andrew Dalke, Attila Gursoy, Bill
Humphrey, Mark Nelson
• NAMD2: – M. Bhandarkar, R. Brunner, A. Gursoy, J. Philips,
N.Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ..
![Page 3: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/3.jpg)
Middle layers
Applications
Parallel Machines
“Middle Layers”:Languages, Tools, Libraries
![Page 4: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/4.jpg)
Molecular Dynamics
• Collection of [charged] atoms, with bonds• Newtonian mechanics• At each time-step
– Calculate forces on each atom
• bonds:
• non-bonded: electrostatic and van der Waal’s
– Calculate velocities and Advance positions
• 1 femtosecond time-step, millions needed!• Thousands of atoms (1,000 - 100,000)
![Page 5: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/5.jpg)
Cut-off radius
• Use of cut-off radius to reduce work– 8 - 14 Å
– Faraway charges ignored!
• 80-95 % work is non-bonded force computations• Some simulations need faraway contributions
– Periodic systems: Ewald, Particle-Mesh Ewald
– Aperiodic systems: FMA
• Even so, cut-off based computations are important:– near-atom calculations are part of the above
– multiple time-stepping is used: k cut-off steps, 1 PME/FMA
![Page 6: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/6.jpg)
Scalability
• The Program should scale up to use a large number of processors. – But what does that mean?
• An individual simulation isn’t truly scalable• Better definition of scalability:
– If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
![Page 7: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/7.jpg)
Isoefficiency
• Quantify scalability – (Work of Vipin Kumar, U. Minnesota)
• How much increase in problem size is needed to retain the same efficiency on a larger machine?
• Efficiency : Seq. Time/ (P · Parallel Time)– parallel time =
• computation + communication + idle
![Page 8: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/8.jpg)
Traditional Approaches
• Replicated Data:– All atom coordinates stored on each processor
– Non-bonded Forces distributed evenly
– Analysis: Assume N atoms, P processors
• Computation: O(N/P)
• Communication: O(N log P)
• Communication/Computation ratio: P log P
• Fraction of communication increases with number of processors, independent of problem size!
– So, not scalable by this definition
![Page 9: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/9.jpg)
Atom decomposition
• Partition the Atoms array across processors– Nearby atoms may not be on the same processor
– Communication: O(N) per processor
– Communication/Computation: O(N)/(N/P): O(P)
– Again, not scalable by our definition
![Page 10: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/10.jpg)
Force Decomposition
• Distribute force matrix to processors– Matrix is sparse, non uniform
– Each processor has one block
– Communication:
– Ratio:
• Better scalability in practice – (can use 100+ processors)
– Plimpton:
– Hwang, Saltz, et al:
• 6% on 32 Pes 36% on 128 processor
– Yet not scalable in the sense defined here!
P
N
P
![Page 11: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/11.jpg)
Spatial Decomposition
• Allocate close-by atoms to the same processor• Three variations possible:
– Partitioning into P boxes, 1 per processor
• Good scalability, but hard to implement
– Partitioning into fixed size boxes, each a little larger than the cutoff distance
– Partitioning into smaller boxes
• Communication: O(N/P): – so, scalable in principle
![Page 12: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/12.jpg)
Spatial Decomposition in NAMD
• NAMD 1 used spatial decomposition• Good theoretical isoefficiency, but for a fixed size
system, load balancing problems• For midsize systems, got good speedups up to 16
processors….• Use the symmetry of Newton’s 3rd law to
facilitate load balancing
![Page 13: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/13.jpg)
Spatial Decomposition
But the load balancing problems are still severe:
![Page 14: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/14.jpg)
![Page 15: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/15.jpg)
FD + SD
• Now, we have many more objects to load balance:– Each diamond can be assigned to any processor
– Number of diamonds (3D):
• 14·Number of Patches
![Page 16: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/16.jpg)
Bond Forces
• Multiple types of forces:– Bonds(2), Angles(3), Dihedrals (4), ..
– Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:– Send message to all neighbors,
– receive forces from them
– 26*2 messages per patch!
![Page 17: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/17.jpg)
Bonded Forces:• Assume one patch per processor:
– an angle force involving atoms in patches:
• (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
• is calculated in patch: (max{xi}, max{yi}, max{zi})
B
CA
![Page 18: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/18.jpg)
Implementation
• Multiple Objects per processor– Different types: patches, pairwise forces, bonded forces,
– Each may have its data ready at different times
– Need ability to map and remap them
– Need prioritized scheduling
• Charm++ supports all of these
![Page 19: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/19.jpg)
Charm++
• Parallel C++ with Data Driven Objects• Object Groups:
– global object with a “representative” on each PE
• Asynchronous method invocation• Prioritized scheduling• Mature, robust, portable• http://charm.cs.uiuc.edu
![Page 20: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/20.jpg)
Data driven execution
Scheduler Scheduler
Message Q Message Q
![Page 21: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/21.jpg)
Load Balancing
• Is a major challenge for this application– especially for a large number of processors
• Unpredictable workloads– Each diamond (force object) and patch encapsulate variable
amount of work
– Static estimates are inaccurate
• Measurement based Load Balancing Framework– Robert Brunner’s recent Ph.D. thesis
– Very slow variations across timesteps
![Page 22: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/22.jpg)
Bipartite graph balancing
• Background load:– Patches (integration, ..) and bond-related forces:
• Migratable load:– Non-bonded forces
– bond-related forces involving atoms of the same patch
• Bipartite communication graph – between migratable and non-migratable objects
• Challenge:– Balance Load while minimizing communication
![Page 23: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/23.jpg)
Load balancing
• Collect timing data for several cycles• Run heuristic load balancer
– Several alternative ones
• Re-map and migrate objects accordingly– Registration mechanisms facilitate migration
• Needs a separate talk!
![Page 24: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/24.jpg)
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
4500000
5000000
Processors
Tim
emigratable work
non-migratable work
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
4500000
Processors
Tim
e migratable work
non-migratable work
![Page 25: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/25.jpg)
Performance: size of system
# ofatoms
Procs 1 2 4 8 16 32 64 128 160
bR Time 1.14 0.58 .315 .158 .086 .0483,762atoms
Speedup 1.0 1.97 3.61 7.20 13.2 23.7
ER-ERE Time 6.115 3.099 1.598 .810 .397 0.212 0.123 0.09836,573atoms
Speedup (1.97) 3.89 7.54 14.9 30.3 56.8 97.9 123
ApoA-I Time 10.76 5.46 2.85 1.47 0.729 0.382 0.32192,224atoms
Speedup (3.88) 7.64 14.7 28.4 57.3 109 130
Performance data on Cray T3E
![Page 26: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/26.jpg)
Performance: various machines
Procs 1 2 4 8 16 32 64 128 160 192
T3E Time 6.12 3.10 1.60 0.810 0.397 0.212 0.123 0.098
- ---------
Speedup (1.97) 3.89 7.54 14.9 30.3 56.8 97.9 123
Origin Time 8.28 4.20 2.17 1.07 0.542 0.271 0.152
2000-------
Speedup 1.0 1.96 3.80 7.74 15.3 30.5 54.3
ASCI- Time 28.0 13.9 7.24 3.76 1.91 1.01 0.500 0.279 0.227 0.196
Red ---------
Speedup 1.0 2.01 3.87 7.45 14.7 27.9 56.0 100 123 143
NOWs Time 24.1 12.4 6.39 3.69
HP735/125
Speedup 1.0 1.94 3.77 6.54
![Page 27: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/27.jpg)
Speedup
0
20
40
60
80
100
120
140
160
180
200
220
240
0 20 40 60 80 100 120 140 160 180 200 220 240
Processors
Sp
eed
up
Speedup
Perfect Speedup
![Page 28: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/28.jpg)
Recent Speedup Results: ASCI RedSpeedup on ASCI Red: Apo-A1
0
100
200
300
400
500
600
700
0 200 400 600 800 1000 1200
Processors
Sp
eed
up
![Page 29: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/29.jpg)
Recent Results on Linux ClusterSpeedup on Linux Cluster
0
10
20
30
40
50
60
70
80
0 20 40 60 80 100 120
Processors
Sp
eed
up
![Page 30: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/30.jpg)
Recent Results on Origin 2000Performance on Origin 2000
0
10
20
30
40
50
60
70
80
90
0 20 40 60 80 100 120
Processors
Sp
ee
du
p
![Page 31: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/31.jpg)
Multi-paradigm programming
• Long-range electrostatic interactions– Some simulations require this
– Contributions of faraway atoms can be calculated infrequently
– PVM based library, DPMTA
• developed at Duke by John Board et al
• Patch life cycle• Better expressed as a thread
![Page 32: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/32.jpg)
Converse
• Supports multi-paradigm programming• Provides portability• Makes it easy to implement RTS for new paradigms• Several languages/libraries:
– Charm++, threaded MPI, PVM, Java, md-perl, pc++, Nexus, Path, Cid, CC++, DP, Agents,..
![Page 33: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/33.jpg)
Namd2 with Converse
![Page 34: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/34.jpg)
NAMD2
• In production use – Internally for about a year
– Several simulations completed/published
• Fastest MD program? We think so• Modifiable/extensible
– Steered MD
– Free energy calculations
![Page 35: Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois](https://reader034.vdocument.in/reader034/viewer/2022051622/5697bf711a28abf838c7df88/html5/thumbnails/35.jpg)
Real Application for CS research?
• Benefits– Subtle and complex research problems uncovered only
with real application
– Satisfaction of “real” concrete contribution
– With careful planning, you can truly enrich the “middle layers”
– Bring back a rich variety of relevant CS problems
– Apply to other domains: Rockets? Casting?