


Topology-Aware MPI Communication and Scheduling for Petascale Systems

PIs: D. K. Panda (The Ohio State University), K. Schulz and B. Barth (Texas Advanced Computing Center), and A. Majumdar (San Diego Supercomputer Center)

Sections: Motivation; Vision and Problem Statement; Framework and Approach; Network Topology of TACC Ranger; Current Job Allocation Strategies and their Impact: A Case Study with TACC Ranger; Application Level Performance Impact: A Case Study with MPCUGLES; Topology-Aware MPI_Gather Design; Topology-Aware MPI_Scatter Design; Conclusions and Continuing Work

[Figure panels: Job allocation for the entire system; Jobs using 16-4800 cores; Jobs using 4800-16000 cores; Jobs using 16000-64000 cores]

Table 1: Data Collected from TACC Ranger System

Modern networks (like InfiniBand and 10 GigE) are capable of providing topology and routing information

Research Challenge:

Can the next-generation petascale systems provide topology-aware MPI communication, mapping and scheduling which can improve performance and scalability for a range of scientific applications?

• On the left, we compare the performance of MPCUGLES on the Normal Batch Queue with runs conducted on an exclusive queue

• We observe that performance may be degraded by up to 15%

• On the right, we compare the performance of MPCUGLES on the Normal Batch Queue, but with special randomization of hostfiles

• We observe greater variance in performance, with up to a 16% difference between the best-case and worst-case runs

[Framework diagram: MPI applications (turbulence prediction, earthquake modeling, flow modeling, kinetic simulation) supply application hints and profiling information. These feed topology-aware task mapping and topology-aware communication (collectives and point-to-point, with integrated evaluation), both of which query a topology information interface. Behind the interface, a dynamic state and topology management framework maintains a topology graph and a network status graph on top of a unified abstraction layer, which in turn draws on an Ethernet network management system (topology discovery, traffic monitoring) and an enhanced subnet management layer for the high-performance interconnect. A job scheduler performs topology-aware scheduling and receives performance feedback. Arrows in the legend denote dependencies.]

[Topology diagram: Racks 1 through 82, with each chassis uplinked to InfiniBand Magnum switches 1 and 2]

• Ranger's compute nodes use a blade-based configuration

• 12 blades per chassis and 4 chassis per rack, across 82 racks, for a total of 3,936 compute nodes (see the index-to-coordinates sketch after this list)

• Each chassis embeds a NEM (network express module) combining a 24-port leaf switch with 12 dual-rail SDR HCAs, one per node

• Each NEM is connected to the core Magnum switch(es) with four 12X connectors, two to each Magnum
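To make the chassis/rack arithmetic concrete, here is a small sketch that maps a linear node index onto (rack, chassis, blade) coordinates using the 12-blades-per-chassis, 4-chassis-per-rack layout above. The numbering scheme is an assumption for illustration, not Ranger's actual hostname convention.

```c
/* Illustrative decomposition of a linear node index into Ranger's
 * blade/chassis/rack hierarchy (12 blades per chassis, 4 chassis
 * per rack).  The linear numbering itself is assumed. */
#include <stdio.h>

#define BLADES_PER_CHASSIS 12
#define CHASSIS_PER_RACK    4

typedef struct { int rack, chassis, blade; } coord_t;

static coord_t node_coord(int node_id) {
    coord_t c;
    c.blade   = node_id % BLADES_PER_CHASSIS;
    c.chassis = (node_id / BLADES_PER_CHASSIS) % CHASSIS_PER_RACK;
    c.rack    = node_id / (BLADES_PER_CHASSIS * CHASSIS_PER_RACK);
    return c;
}

int main(void) {
    coord_t c = node_coord(1234);
    /* 1234 = 25*48 + 2*12 + 10 -> rack 25, chassis 2, blade 10 */
    printf("node 1234 -> rack %d, chassis %d, blade %d\n",
           c.rack, c.chassis, c.blade);
    return 0;
}
```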

[Graphs: MPCUGLES performance normalized to the exclusive queue, plotted per run (y-axis 0 to 1.3). Left: Batch Queue with Normal Ordering on 192 cores, runs 0-7. Right: Batch Queue with Random Ordering on 192 cores, runs 0-5.]

Research Questions:

(1) What are the topology-aware communication and scheduling requirements of petascale applications?

(2) How to design a network topology and state management framework with static and dynamic network information?

(3) How to design topology-aware point-to-point and collective communication schemes?

(4) How to design topology-aware task mapping and scheduling schemes?

(5) How to design a flexible topology information interface?

• On the left, we compare the performance of the default “Binomial Tree” algorithm under quiet and busy network conditions

• We observe that the algorithm is impacted by background traffic

• On the right, we compare the performance of the proposed topology-aware algorithm under quiet and busy conditions

• The proposed algorithm outperforms the default “Binomial Tree” algorithm under both quiet and busy conditions (a sketch of the two-level idea follows this list)

• 23% performance improvement under quiet network conditions and 10% under busy conditions
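To illustrate the flavor of such a design, the C sketch below (the general two-level idea, not the authors' actual MVAPICH2 implementation) scatters in two stages: the root sends one block per chassis leader across the spine, then each leader scatters inside its own leaf switch. It assumes contiguous, equal-sized chassis groups, the root at rank 0, and a hypothetical chassis_of() topology query.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RANKS_PER_CHASSIS 16   /* illustrative; a real value would come
                                  from the topology interface */

/* Hypothetical topology query: which chassis does a rank live in? */
static int chassis_of(int rank) { return rank / RANKS_PER_CHASSIS; }

/* Two-level scatter: root -> chassis leaders (across the spine),
 * then leader -> local ranks (inside the leaf switch).  Assumes the
 * root is rank 0 (hence a chassis leader) and that chassis groups
 * are contiguous and equally sized. */
static void topo_scatter_int(const int *sendbuf, int count,
                             int *recvbuf, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Level 1: one communicator per chassis. */
    MPI_Comm chassis_comm;
    MPI_Comm_split(comm, chassis_of(rank), rank, &chassis_comm);
    int lrank, lsize;
    MPI_Comm_rank(chassis_comm, &lrank);
    MPI_Comm_size(chassis_comm, &lsize);

    /* Level 2: chassis leaders (local rank 0) form their own comm. */
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    int *stage = NULL;
    if (lrank == 0)
        stage = malloc((size_t)lsize * count * sizeof(int));

    /* Root hands each leader its chassis-sized block... */
    if (lrank == 0)
        MPI_Scatter(sendbuf, lsize * count, MPI_INT,
                    stage,   lsize * count, MPI_INT, 0, leader_comm);

    /* ...and each leader fans it out without crossing the spine. */
    MPI_Scatter(stage, count, MPI_INT, recvbuf, count, MPI_INT,
                0, chassis_comm);

    free(stage);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&chassis_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *send = NULL, recv[4];
    if (rank == 0) {
        send = malloc((size_t)size * 4 * sizeof(int));
        for (int i = 0; i < size * 4; i++) send[i] = i;
    }
    topo_scatter_int(send, 4, recv, MPI_COMM_WORLD);
    printf("rank %d received %d..%d\n", rank, recv[0], recv[3]);

    free(send);
    MPI_Finalize();
    return 0;
}
```

The benefit of the staging is that only one message per chassis crosses the heavily shared spine links; all remaining traffic stays inside a leaf switch.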

• The graphs present an analysis of the jobs run on the TACC Ranger system in September '09

• There were a total of 19,441 multi-node jobs, most of which used 16-4800 cores

• We observe that for the majority of jobs, the average inter-node distance is significantly greater than the best possible (a sketch of this metric follows this list)
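The metric behind this analysis can be stated concretely as the average switch-hop distance over all pairs of a job's nodes. Below is a minimal sketch under assumed inputs; topo_hops() is an illustrative stand-in for a real topology query derived from subnet-manager routing tables, with hop values echoing the leaf/spine tiers described elsewhere on this poster.

```c
/* A minimal sketch, under assumed inputs, of the allocation metric
 * discussed above: average switch-hop distance over all node pairs
 * in a job. */
#include <stdio.h>

/* Illustrative distances: nodes grouped 12 per chassis, 48 per rack. */
static int topo_hops(int a, int b) {
    if (a / 12 == b / 12) return 1;   /* same chassis */
    if (a / 48 == b / 48) return 3;   /* same rack    */
    return 5;                         /* across racks */
}

static double avg_internode_distance(const int *nodes, int n) {
    long sum = 0, pairs = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++, pairs++)
            sum += topo_hops(nodes[i], nodes[j]);
    return pairs ? (double)sum / (double)pairs : 0.0;
}

int main(void) {
    int compact[]   = { 0, 1, 2, 3 };       /* one chassis          */
    int scattered[] = { 0, 50, 100, 150 };  /* four different racks */
    printf("compact:   %.2f hops\n", avg_internode_distance(compact, 4));
    printf("scattered: %.2f hops\n", avg_internode_distance(scattered, 4));
    return 0;
}
```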

• Modern high-end computing (HEC) systems enable scientists to tackle grand challenge problems

• Design and deployment of such ultra-scale HEC systems is being fueled by the increasing use of multi-core/many-core architectures and commodity networking technologies like InfiniBand.

• As a recent example, the TACC Ranger system was deployed with a total of 62,976 cores using a fat-tree InfiniBand interconnect to provide a peak performance of 579 TFlops.

• Most current petascale applications are written using the Message Passing Interface (MPI) programming model.

• By necessity, large-scale systems that support MPI are built using hierarchical topologies (multiple levels involving intra-socket, intra-node, intra-blade, intra-rack, and multi-stages within a high-speed switch).

• Current-generation MPI libraries and schedulers do not take these various levels into account when optimizing communication

• Consequently, many applications see sub-optimal performance and scalability

Process Location   Number of Hops                MPI Latency (us)
Intra-Rack         0 Hops in Leaf Switch         1.57
Intra-Chassis      1 Hop in Leaf Switch          2.04
Inter-Chassis      3 Hops Across Spine Switch    2.45
Inter-Rack         5 Hops Across Spine Switch    2.85

Results of current work:

• We have observed a major impact on end applications when schedulers and communication libraries are not topology aware

• We have proposed topology-aware collective communication algorithms for MPI_Gather and MPI_Scatter

• Our proposed algorithms outperform the default implementations under both quiet and busy network conditions

Continuing work:

• Work towards a topology-aware scheduling scheme

• Adapt more collective algorithms dynamically according to topology, interfacing with schedulers

• Gather more data from real-world application runs

• Integrated solutions will be available in future versions of MVAPICH/MVAPICH2 software

Publications:

• K. Kandalla, H. Subramoni and D. K. Panda, “Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather”, Communication Architecture for Clusters (CAC) Workshop, held in conjunction with IPDPS 2010

Additional Personnel:

• Hari Subramoni, Krishna Kandalla, Sayantan Sur, Karen Tomko (OSU)

• Mahidhar Tatineni, Yifeng Cui, Dmitry Pekurovsky (SDSC)

• We conduct experiments with the MVAPICH2 stack, one of the most popular MPI implementations over InfiniBand, currently used by more than 1,050 organizations worldwide (http://www.mvapich.cse.ohio-state.edu)

• On the left, we compare the performance of the default “Binomial Tree” algorithm under quiet and busy conditions

• We observe that the default algorithm is very sensitive to background traffic, with degradation of up to 21% for large messages

• On the right, we compare the performance of the proposed topology-aware algorithm

• The proposed algorithm outperforms the default “Binomial Tree” algorithm under both quiet and busy conditions (see the gather sketch after this list)

• Over 50% performance improvement even when the network was busy
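The gather case mirrors the scatter sketch shown earlier: gather inside each chassis first, then let only the leaders cross the spine. This is again a minimal sketch under the same assumptions (contiguous, equal-sized chassis groups, root at rank 0, hypothetical chassis_of() topology query), not the proposed algorithm itself.

```c
/* Two-level gather, the mirror of the scatter sketch: intra-chassis
 * gather to each leader, then an inter-chassis gather among leaders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RANKS_PER_CHASSIS 16   /* illustrative group size */
static int chassis_of(int rank) { return rank / RANKS_PER_CHASSIS; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same communicator construction as in the scatter sketch. */
    MPI_Comm chassis_comm, leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, chassis_of(rank), rank, &chassis_comm);
    int lrank, lsize;
    MPI_Comm_rank(chassis_comm, &lrank);
    MPI_Comm_size(chassis_comm, &lsize);
    MPI_Comm_split(MPI_COMM_WORLD, lrank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    int item = rank;                      /* one value per rank */
    int *stage = NULL, *result = NULL;
    if (lrank == 0) stage  = malloc((size_t)lsize * sizeof(int));
    if (rank  == 0) result = malloc((size_t)size  * sizeof(int));

    /* Step 1: intra-chassis gather, stays inside the leaf switch. */
    MPI_Gather(&item, 1, MPI_INT, stage, 1, MPI_INT, 0, chassis_comm);
    /* Step 2: inter-chassis gather among leaders only. */
    if (lrank == 0)
        MPI_Gather(stage, lsize, MPI_INT, result, lsize, MPI_INT,
                   0, leader_comm);

    if (rank == 0)
        printf("root gathered %d values (last = %d)\n", size, result[size - 1]);

    free(stage);
    free(result);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&chassis_comm);
    MPI_Finalize();
    return 0;
}
```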