Topology-Aware MPI Communication and Scheduling for Petascale Systems
PIs: D. K. Panda (The Ohio State University), K. Schulz and B. Barth (Texas Advanced Computing Center), and A. Majumdar (San Diego Supercomputer Center)
Sections:
Motivation, Vision and Problem Statement
Framework and Approach
Network Topology of TACC Ranger
Current Job Allocation Strategies and their Impact: A Case Study with TACC Ranger
Application-Level Performance Impact: A Case Study with MPCUGLES
Topology-Aware MPI_Gather Design
Topology-Aware MPI_Scatter Design
Conclusions and Continuing Work

[Figure panels: job allocation for the entire system; jobs using 16-4800 cores; jobs using 4800-16000 cores; jobs using 16000-64000 cores]
Table 1: Data Collected from TACC Ranger System
Modern networks (like InfiniBand and 10 GigE) are capable of providing topology and routing information
Research Challenge:
Can the next-generation petascale systems provide topology-aware MPI communication, mapping and scheduling which can improve performance and scalability for a range of scientific applications?
• On the left, we compare the performance of MPCUGLES on the normal batch queue with runs conducted on an exclusive queue
• We observe that performance may be impacted by up to 15%
• On the right, we compare the performance of MPCUGLES on the normal batch queue, but with special randomization of hostfiles
• We observe greater variance in performance, with up to a 16% difference between the best-case and worst-case runs
[Figure: Framework and Approach. MPI applications (turbulence prediction, earthquake modeling, flow modeling, kinetic simulation) supply application hints and profiling information to a layer providing topology-aware task mapping and topology-aware communication (point-to-point and collectives, with integrated evaluation) through a topology information interface. A dynamic state and topology management framework maintains a topology graph and a network status graph over a unified abstraction layer, fed by an enhanced subnet management layer (topology discovery, traffic monitoring) on the high-performance interconnect and by an Ethernet network management system. The job scheduler performs topology-aware scheduling and receives performance feedback. Legend: dependency, profiling information.]
[Figure: Ranger network topology — Racks 1 through 82, each connected to InfiniBand Magnum switches 1 and 2]
• Ranger's compute nodes use a blade-based configuration
• 12 blades per chassis and 4 chassis per rack, with 82 racks in all, for a total of 3,936 compute nodes
• Each chassis embeds a NEM (Network Express Module) combining a 24-port leaf switch with 12 dual-rail SDR HCAs, one per node
• Each NEM is connected to the core Magnum switches with four 12X connectors, two to each Magnum
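This blade/chassis/rack hierarchy can be expressed as a small distance function. Below is a minimal sketch in Python, not taken from the actual scheduler or MPI stack: node coordinates `(rack, chassis)` are a hypothetical encoding, and inter-rack traffic is pessimistically charged the worst-case hop count, since the actual route through the Magnum spine is not modeled.

```python
def hops(node_a, node_b):
    """Approximate switch hops between two Ranger nodes given
    hypothetical (rack, chassis) coordinates.

    Same chassis: the message stays inside the NEM leaf switch (0 hops).
    Same rack, different chassis: 1 hop in the leaf stage.
    Different racks: 3-5 hops across the Magnum spine; we assume 5
    (the worst case) because routing is not modeled here.
    """
    rack_a, chassis_a = node_a
    rack_b, chassis_b = node_b
    if rack_a != rack_b:
        return 5
    if chassis_a != chassis_b:
        return 1
    return 0
```

A scheduler or communication library can use such a function to rank candidate process placements before committing to one.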
[Figure: Batch Queue with Normal Ordering on 192 cores — performance normalized to the exclusive queue (y-axis 0 to 1.3) across 8 runs]
[Figure: Batch Queue with Random Ordering on 192 cores — performance normalized to the exclusive queue (y-axis 0 to 1.3) across 6 runs]
Research Questions:
(1) What are the topology-aware communication and scheduling requirements of petascale applications?
(2) How can we design a network topology and state management framework with static and dynamic network information?
(3) How can we design topology-aware point-to-point and collective communication schemes?
(4) How can we design topology-aware task mapping and scheduling schemes?
(5) How can we design a flexible topology information interface?
• On the left, we compare the performance of the default “Binomial Tree” algorithm under quiet and busy network conditions
• We observe that the algorithm is impacted by background traffic
• On the right, we compare the performance of the proposed topology-aware algorithm under quiet and busy conditions
• The proposed algorithm outperforms the default “Binomial Tree” algorithm under both quiet and busy conditions
• 23% performance improvement under quiet network conditions and 10% under busy conditions
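The two-phase idea behind such a topology-aware gather can be sketched in plain Python. This is not the MVAPICH2 implementation, only a schematic model of the message pattern: ranks first gather onto a leader within their leaf switch (cheap local hops), and only the leaders send aggregated messages across the expensive inter-switch links. The leader-election rule (lowest rank per leaf) is an assumption for illustration.

```python
from collections import defaultdict

def topology_aware_gather(data_by_rank, leaf_of_rank, root=0):
    """Schematic two-phase, leaf-switch-aware gather (illustrative only).

    Phase 1: every rank sends its data to a leader on its own leaf switch,
             so these messages never cross the spine.
    Phase 2: each leader sends one aggregated message to the root, so only
             a handful of messages traverse inter-switch links.
    Returns the gathered data in rank order and the number of
    inter-switch messages the pattern would generate.
    """
    groups = defaultdict(list)
    for rank in sorted(data_by_rank):
        groups[leaf_of_rank[rank]].append(rank)

    gathered_at_root = {}
    inter_switch_msgs = 0
    for ranks in groups.values():
        leader = ranks[0]                           # lowest rank on the leaf
        at_leader = {r: data_by_rank[r] for r in ranks}  # phase 1 (intra-leaf)
        if leaf_of_rank[leader] != leaf_of_rank[root]:
            inter_switch_msgs += 1                  # phase 2 (leader -> root)
        gathered_at_root.update(at_leader)
    return [gathered_at_root[r] for r in sorted(gathered_at_root)], inter_switch_msgs
```

With four ranks split across two leaf switches, only one message crosses the spine, versus up to three for a placement-oblivious flat gather.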
• The graphs present an analysis of the jobs run on the TACC Ranger system in September 2009
• There were a total of 19,441 multi-node jobs, most of which used 16-4800 cores
• We observe that for the majority of jobs, the average inter-node distance is significantly greater than the best possible
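The "average inter-node distance" metric used in this analysis can be computed as the mean hop count over all node pairs in an allocation. A sketch under stated assumptions: nodes are identified by hypothetical `(rack, chassis)` coordinates and hop counts follow the Ranger measurements, with the inter-rack worst case assumed.

```python
from itertools import combinations

def hops(a, b):
    """Hypothetical hop model for (rack, chassis) node coordinates."""
    (rack_a, chassis_a), (rack_b, chassis_b) = a, b
    if rack_a != rack_b:
        return 5          # crosses the Magnum spine (worst case assumed)
    if chassis_a != chassis_b:
        return 1          # same rack, different chassis
    return 0              # same chassis: stays in the NEM leaf switch

def avg_inter_node_distance(nodes):
    """Mean pairwise hop distance over all node pairs in a job allocation."""
    pairs = list(combinations(nodes, 2))
    return sum(hops(a, b) for a, b in pairs) / len(pairs)

# A scheduler that scatters four nodes across four racks versus one that
# packs them into a single chassis (each chassis holds 12 nodes):
scattered = [(rack, 0) for rack in range(4)]
packed = [(0, 0)] * 4
```

The gap between the two values is exactly what a topology-aware scheduler tries to close.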
• Modern high-end computing (HEC) systems enable scientists to tackle grand challenge problems
• Design and deployment of such ultra-scale HEC systems is being fueled by the increasing use of multi-core/many-core architectures and commodity networking technologies like InfiniBand.
• As a recent example, the TACC Ranger system was deployed with a total of 62,976 cores using a fat-tree InfiniBand interconnect to provide a peak performance of 579 TFlops.
• Most current petascale applications are written using the Message Passing Interface (MPI) programming model.
• By necessity, large-scale systems that support MPI are built using hierarchical topologies (multiple levels involving intra-socket, intra-node, intra-blade, intra-rack, and multi-stages within a high-speed switch).
• Current-generation MPI libraries and schedulers do not take these various levels into account when optimizing communication
• Consequently, many applications see non-optimal performance and scalability
Process Location            | Number of Hops                | MPI Latency (us)
----------------------------|-------------------------------|-----------------
Intra-Rack, Intra-Chassis   | 0 hops (in leaf switch)       | 1.57
Intra-Rack, Inter-Chassis   | 1 hop (in leaf switch)        | 2.04
Inter-Rack                  | 3 hops (across spine switch)  | 2.45
Inter-Rack                  | 5 hops (across spine switch)  | 2.85
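The measurements above suggest latency grows roughly linearly with hop count. A quick least-squares fit of latency = base + per_hop × hops to the four data points (illustrative only; the two-parameter linear model is our assumption, not something the measurements were originally fit to):

```python
# Fit latency = base + per_hop * hops to the four measured points.
hop_counts = [0, 1, 3, 5]
latencies = [1.57, 2.04, 2.45, 2.85]   # microseconds

n = len(hop_counts)
mean_h = sum(hop_counts) / n
mean_l = sum(latencies) / n

# Ordinary least squares for a single predictor.
per_hop = (sum((h - mean_h) * (l - mean_l) for h, l in zip(hop_counts, latencies))
           / sum((h - mean_h) ** 2 for h in hop_counts))
base = mean_l - per_hop * mean_h
```

The fit puts the per-hop cost at roughly a quarter of a microsecond on top of a base latency of about 1.7 us, which is why collapsing multi-hop messages into leaf-local ones pays off for small-message collectives.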
Results of current work:
• We have observed a major impact on end applications if schedulers and communication libraries are not topology aware
• We have proposed topology aware collective communication algorithms for MPI_Gather and MPI_Scatter
• Our proposed algorithms outperform the default implementation under both quiet and busy network conditions
Continuing work:
• Work towards a topology-aware scheduling scheme
• Dynamically adapt more collective algorithms according to topology by interfacing with schedulers
• Gather more data from real-world application runs
• Integrated solutions will be available in future versions of MVAPICH/MVAPICH2 software
Publications:
• K. Kandalla, H. Subramoni and D. K. Panda, “Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather”, Workshop on Communication Architecture for Clusters (CAC), held in conjunction with IPDPS 2010
Additional Personnel:
• Hari Subramoni, Krishna Kandalla, Sayantan Sur, Karen Tomko (OSU)
• Mahidhar Tatineni, Yifeng Cui, Dmitry Pekurovsky (SDSC)
• We conduct experiments with the MVAPICH2 stack, one of the most popular MPI implementations over InfiniBand, currently used by more than 1,050 organizations worldwide (http://www.mvapich.cse.ohio-state.edu)
• On the left, we compare the performance of the default “Binomial Tree” algorithm under quiet and busy network conditions
• We observe that the default algorithm is very sensitive to background traffic, with degradation of up to 21% for large messages
• On the right, we compare the performance of the proposed topology-aware algorithm
• The proposed algorithm outperforms the default “Binomial Tree” algorithm under both quiet and busy conditions
• Over 50% performance improvement even when the network was busy
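The scatter design mirrors the gather: the root sends each leaf-switch leader a single bundle containing the chunks for every rank on that leaf, and leaders then deliver individual chunks locally. The sketch below is a schematic Python model of that message pattern, not the MVAPICH2 implementation; the bundling scheme and lowest-rank leader rule are illustrative assumptions.

```python
from collections import defaultdict

def topology_aware_scatter(chunks, leaf_of_rank, root=0):
    """Schematic two-phase, leaf-switch-aware scatter (illustrative only).

    Phase 1: the root sends each leaf leader one bundle holding the chunks
             for every rank on that leaf, so few messages cross the spine.
    Phase 2: each leader delivers individual chunks within its own leaf,
             where hops are cheap.
    Returns a mapping of rank -> delivered chunk.
    """
    groups = defaultdict(list)
    for rank in sorted(leaf_of_rank):
        groups[leaf_of_rank[rank]].append(rank)

    delivered = {}
    for ranks in groups.values():
        bundle = {r: chunks[r] for r in ranks}   # phase 1: root -> leaf leader
        delivered.update(bundle)                 # phase 2: leader -> leaf ranks
    return delivered
```

As with the gather, the number of spine-crossing messages drops from one per remote rank to one per remote leaf switch.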