Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
NIH Resource for Macromolecular Modeling and Bioinformatics, http://www.ks.uiuc.edu/
Beckman Institute, UIUC
Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
James Phillips, Beckman Institute, University of Illinois, http://www.ks.uiuc.edu/Research/namd/
Chao Mei, Parallel Programming Lab, University of Illinois, http://charm.cs.illinois.edu/
UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers
Theoretical and Computational Biophysics Group
Biomolecular simulations are our computational microscope
Ribosome: synthesizes proteins from genetic information, target for antibiotics
Silicon nanopore: bionanodevice for sequencing DNA efficiently
Our goal for NAMD is practical supercomputing for NIH researchers
• 44,000 users can’t all be computer experts.
  – 11,700 have downloaded more than one version.
  – 2,300 citations of NAMD reference papers.
• One program for all platforms.
  – Desktops and laptops – setup and testing
  – Linux clusters – affordable local workhorses
  – Supercomputers – free allocations on TeraGrid
  – Blue Waters – sustained petaflop/s performance
• User knowledge is preserved.
  – No change in input or output files.
  – Run any simulation on any number of cores.
• Available free of charge to all.
Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.
NAMD uses a hybrid force-spatial parallel decomposition

• Spatially decompose data and communication.
• Separate but related work decomposition.
• “Compute objects” facilitate an iterative, measurement-based load balancing system.
Kale et al., J. Comp. Phys. 151:283-312, 1999.
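The hybrid decomposition can be sketched in miniature: atoms are binned into cutoff-sized spatial “patches,” and each pair of neighboring patches gives rise to one compute object. The following is a minimal illustration with hypothetical helper names, not NAMD’s actual data structures:

```python
from collections import defaultdict

def assign_to_patches(atoms, cutoff):
    """Bin atoms into cutoff-sized cells ("patches") by position.

    Returns a dict mapping integer cell coordinates to atom indices.
    """
    patches = defaultdict(list)
    for i, (x, y, z) in enumerate(atoms):
        key = (int(x // cutoff), int(y // cutoff), int(z // cutoff))
        patches[key].append(i)
    return patches

def neighbor_pairs(patches):
    """Enumerate pairs of patches that may hold interacting atoms: a
    patch paired with itself or any patch within one cell in each
    dimension. Each pair stands in for one "compute object"."""
    pairs = set()
    for (cx, cy, cz) in patches:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    other = (cx + dx, cy + dy, cz + dz)
                    if other in patches:
                        pairs.add(tuple(sorted([(cx, cy, cz), other])))
    return pairs
```

Because the patch size is tied to the cutoff, every interacting atom pair is guaranteed to fall within a patch or between neighboring patches.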
Charm++ overlaps NAMD algorithms
Objects are assigned to processors, queued as data arrives, and executed in priority order.
Phillips et al., SC2002.
NAMD adjusts grainsize to match parallelism to processor count
• Tradeoff between parallelism and overhead
• Maximum patch size is based on the cutoff
• Ideally one or more patches per processor
  – To double, split in the x, y, z dimensions
  – The number of computes grows much faster!
• Hard to automate completely
  – Also need to select the number of PME pencils
• Computes partitioned in the outer atom loop
  – Old: heuristic based on distance and atom count
  – New: measurement-based compute partitioning
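The measurement-based partitioning can be illustrated with a toy model: instead of guessing work from distance and atom count, use the time each compute actually took in a prior timing step to decide how finely to split it. A sketch with hypothetical names, not NAMD code:

```python
import math

def partition_computes(measured_time, target):
    """Decide how many pieces to split each compute object into.

    measured_time: compute id -> seconds from a prior timing step.
    A compute measured at t seconds becomes ceil(t / target) pieces,
    so every resulting piece lands near the target grainsize.
    """
    return {c: max(1, math.ceil(t / target))
            for c, t in measured_time.items()}
```

For example, with a target grainsize of 1.0 s, a compute measured at 3.1 s would be split into 4 pieces while one at 0.9 s stays whole.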
Measurement-based grainsize tuning enables scalable implicit solvent simulation
Before: heuristic (256 cores)
After: measurement-based (512 cores)
The age of petascale biomolecular simulation is near
Larger machines enable larger simulations
2002 Gordon Bell Award
PSC Lemieux: 3,000 cores
ATP synthase: 300K atoms
Blue Waters: 300,000 cores, 1.2M threads
Chromatophore: 100M atoms
Target is still 100 atoms per thread
Scale brings other challenges
• Limited memory per core
• Limited memory per node
• Finicky parallel filesystems
• Limited inter-node bandwidth
• Long load balancer runtimes
Which is why we collaborate with PPL!
Challenges in 100M-atom Biomolecule Simulation
• How to overcome the sequential bottleneck?
  – Initialization
  – Output of trajectory & restart data
• How to achieve good strong-scaling results?
  – Charm++ runtime
Loading Data into System (1)
• Traditionally done on a single core
  – Works when the molecule is small
• Result for the 100M-atom system:
  – Memory: 40.5 GB!
  – Time: 3,301.9 sec!
Loading Data into System (2)
• Compression scheme
  – An atom “signature” represents the common attributes of an atom
  – Supports more science simulation parameters
  – However, still not enough:
• Memory: 12.8 GB!
• Time: 125.5 sec!
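The signature idea can be sketched as simple deduplication: atoms that share the same static parameters point at one shared record instead of each carrying a full copy. A minimal model with hypothetical names:

```python
def build_signatures(atoms):
    """Deduplicate per-atom static parameter tuples into shared
    "signatures": each unique tuple is stored once in a table, and
    every atom keeps only a small integer index into that table."""
    table = {}       # signature tuple -> index
    indices = []     # per-atom signature index
    for params in atoms:
        if params not in table:
            table[params] = len(table)
        indices.append(table[params])
    signatures = [None] * len(table)
    for sig, idx in table.items():
        signatures[idx] = sig
    return signatures, indices
```

In a 100M-atom system the number of distinct signatures is tiny compared to the atom count, which is why the table shrinks memory so dramatically.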
Loading Data into System (3)
• Parallelizing initialization
  – #input procs: a parameter chosen either by the user or auto-computed at runtime
  – First, each input proc loads 1/N of all atoms
  – Second, atoms are shuffled with neighbor procs for later spatial decomposition
• Good enough, e.g. with 600 input procs:
  – Memory: 0.19 GB
  – Time: 12.4 sec
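The first phase of parallel input, assigning each of the N input procs a contiguous 1/N slice of the atoms, can be sketched as follows (hypothetical helper, not NAMD code; the second, shuffling phase is omitted):

```python
def split_input(n_atoms, n_input_procs):
    """Partition n_atoms into contiguous, near-equal slices so that
    input proc p reads atoms [start, end) independently of the others.
    Any remainder is spread one atom at a time over the first procs."""
    base, extra = divmod(n_atoms, n_input_procs)
    slices = []
    start = 0
    for p in range(n_input_procs):
        size = base + (1 if p < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices
```

Each proc then only ever holds its own slice in memory, which is how the 40.5 GB single-core footprint drops to 0.19 GB per input proc.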
Output Trajectory & Restart Data (1)
• At least 4.8 GB of output to the file system per output step
  – A target of tens of ms per step makes this even more critical
• Parallelizing output
  – Each output proc is responsible for a portion of the atoms
• Output to a single file for compatibility
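Parallel output to a single shared file can be modeled with fixed-size records and per-proc byte offsets: because each proc owns a disjoint atom range, writes never overlap. A sketch using plain file I/O (hypothetical names; a real run would use parallel I/O across many nodes):

```python
import struct

RECORD = struct.Struct("<3f")  # one x, y, z coordinate record

def create_frame_file(path, n_atoms):
    """Pre-size the shared frame file so every proc can write in place."""
    with open(path, "wb") as f:
        f.truncate(n_atoms * RECORD.size)

def write_portion(path, start_atom, coords):
    """Each output proc seeks to a fixed byte offset derived from its
    atom range and writes only its own portion, so all procs target one
    shared file with no coordination beyond the initial partition."""
    with open(path, "r+b") as f:
        f.seek(start_atom * RECORD.size)
        for xyz in coords:
            f.write(RECORD.pack(*xyz))
```

Writing one file this way preserves compatibility with existing trajectory readers, the point made in the bullet above.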
Output Issue (1)
Output Issue (2)
• Write multiple independent files
• Post-process them into a single file
Initial Strong Scaling on Jaguar

(Figure: performance at 6,720, 53,760, 107,520, and 224,076 cores)
Multi-threading MPI-based Charm++ Runtime
• Exploits multicore nodes
• Portable, as it is based on MPI
• On each node:
  – A “processor” is represented as a thread
  – N “worker” threads share 1 “communication” thread
  – Worker threads: handle only computation
  – Communication thread: handles only network messages
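The worker/communication split can be modeled with ordinary threads: workers only compute and enqueue results, and a single comm thread is the only one that touches the “network” (here just a list). A toy sketch, not the Charm++ implementation:

```python
import queue
import threading

def run_node(n_workers, tasks):
    """Model of SMP mode on one node: N worker threads compute and hand
    results to a shared queue; a single communication thread is the only
    one that "sends" (here: appends to a list standing in for the NIC)."""
    outgoing = queue.Queue()
    sent = []

    def worker(chunk):
        for t in chunk:
            outgoing.put(t * t)          # stand-in for a force computation
        outgoing.put(None)               # signal: this worker is done

    def comm_thread():
        done = 0
        while done < n_workers:
            msg = outgoing.get()
            if msg is None:
                done += 1
            else:
                sent.append(msg)         # stand-in for a network send

    chunks = [tasks[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    comm = threading.Thread(target=comm_thread)
    for th in threads + [comm]:
        th.start()
    for th in threads + [comm]:
        th.join()
    return sorted(sent)
```

Funneling all sends through one thread is what makes intra-node pointer passing safe, and it is also the source of the comm-thread bottleneck discussed later.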
Benefits of SMP Mode (1)
• Intra-node communication is faster
  – Messages are transferred as a pointer
• Program launch time is reduced
  – 224K cores: ~6 min down to ~1 min
• Transparent to application developers
  – A correct Charm++ program runs in both non-SMP and SMP mode
Benefits of SMP Mode (2)

• Memory footprint reduced further
  – Read-only data structures are shared
  – Memory footprint of the MPI library is reduced
  – On average, a 7X reduction!
• Better cache performance

Enables the 100M-atom run on Intrepid (BlueGene/P, 2 GB/node)
Potential Bottleneck on Communication Thread
• Overlapping computation and communication alleviates the problem to some extent
Node-aware Communication
• In the runtime: multicast, broadcast, etc.
  – E.g., a series of broadcasts at startup: 2.78X reduction
• In the application: multicast tree
  – Incorporates knowledge of the computation to guide construction of the tree
  – Least loaded node used as the intermediate node
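One plausible way to build such a tree, assumed here purely for illustration, is to sort nodes by load and let the least loaded ones act as forwarding (intermediate) nodes, so the busiest nodes end up as leaves:

```python
def build_multicast_tree(root, nodes, load, fanout=2):
    """Build a multicast spanning tree over nodes, preferring the least
    loaded nodes as intermediate (forwarding) nodes so heavily loaded
    nodes do not also pay the cost of forwarding messages."""
    others = sorted((n for n in nodes if n != root), key=lambda n: load[n])
    tree = {n: [] for n in nodes}   # node -> children
    frontier = [root]
    i = 0
    while i < len(others):
        next_frontier = []
        for parent in frontier:
            for _ in range(fanout):
                if i >= len(others):
                    break
                child = others[i]
                i += 1
                tree[parent].append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return tree
```

The `load` map, `fanout`, and the greedy sort are hypothetical modeling choices; the slide only states that the least loaded node serves as the intermediate node.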
Handle Burst of Messages (1)
• A global barrier after each timestep, due to the constant-pressure algorithm
• The burst is amplified because there is only 1 comm thread per node
Handle Burst of Messages (2)
• Work flow of the comm thread
  – Alternates between send/release/receive modes
• Dynamic flow control
  – Governs when to exit one mode for another
  – E.g., 12.3% for 4,480 nodes (53,760 cores)
Hierarchical Load Balancer
• A centralized balancer has large memory consumption
• Processors are divided into groups
• Load balancing is done within each group
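The hierarchical scheme can be modeled simply: partition processors into fixed-size groups and balance load only inside each group, so no single processor ever has to gather global state. A toy model in which intra-group balancing is assumed perfect:

```python
def hierarchical_balance(loads, group_size):
    """Model of hierarchical load balancing: processors are divided into
    groups of group_size and load is redistributed only inside each
    group (here, idealized to the group average). Memory and decision
    cost stay proportional to the group, not the whole machine."""
    balanced = []
    for start in range(0, len(loads), group_size):
        group = loads[start:start + group_size]
        avg = sum(group) / len(group)
        balanced.extend([avg] * len(group))
    return balanced
```

The tradeoff this exposes: imbalance between groups can remain (group averages may differ), which a real hierarchical balancer addresses with an additional cross-group level.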
Improvement due to Load Balancing
Performance Improvement of SMP over non-SMP on Jaguar
Strong Scaling on Jaguar (2)

(Figure: performance at 6,720, 53,760, 107,520, and 224,076 cores)
Weak Scaling on Intrepid (~1,466 atoms/core)

(Figure: systems of 2M, 6M, 12M, 24M, 48M, and 100M atoms)

1. The 100M-atom system runs ONLY in SMP mode.
2. Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap.
Conclusion and Future Work
• I/O bottleneck solved by parallelization
• An approach that optimizes both the application and its underlying runtime
  – SMP mode in the runtime
• Continue to improve performance
  – PME calculation
• Integrate and optimize new science codes
Acknowledgement
• Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation
• David Tanner for the implicit solvent work
• Machines: Jaguar@NCCS and Intrepid@ANL, supported by DOE
• Funding: NIH, NSF
Thanks