DISP: Optimizations Towards Scalable MPI Startup

Huansong Fu†, Swaroop Pophale*, Manjunath Gorentla Venkata*, Weikuan Yu†
†Florida State University    *Oak Ridge National Laboratory




Outline

• Background and motivation
  – Issues with MPI startup
  – Cost analysis
• Design of DISP
  – Delayed initialization
  – Module sharing
  – Prediction-based topology setup
• Experiments
• Conclusion


Increasing Scale of HPC Systems

• The scale of High Performance Computing (HPC) systems is increasing rapidly.

Table: Top500 List (Nov 2016)

Rank  Name       Cores       Rpeak (TFlop/s)  Rank in Nov 2015
1     Sunway     10,649,600  93,014.6         N/A
2     Tianhe-2   3,120,000   33,862.7         1
3     Titan      560,640     17,590.0         2
4     Sequoia    1,572,864   17,173.2         3
5     Cori       622,336     14,014.7         N/A
6     Oakforest  556,104     13,554.6         N/A


MPI Startup at Scale

• For the last 20 years, the Message Passing Interface (MPI) has been the de facto parallel programming system on HPC systems.
• However, MPI startup has a serious performance issue at scale.

[Figure: Time (ms) of MPI_Init versus 1000 MPI_Reduce operations for 2–1024 processes; MPI_Init is annotated as 10x and 24x more costly at the larger scales.]


MPI Startup Breakdown

• The initialization of the communicator object and the collective module is particularly non-scalable.

Startup Phase:
  – Get the backend framework ready: opal_init, orte_init, …
  – Init OMPI, including the global communicator and collective module: ompi_mpi_init, ompi_comm_init, mca_coll_select, … (global comm init, coll init)

Work Phase:
  – Initialize subsequent communicators: ompi_comm_split, ompi_comm_create, ompi_comm_dup, … (sub-comm init)
  – Run collectives with communicators: barrier, bcast, reduce, …


Issues with Comm & Coll Init

• There can be many communicators for various uses.
  – For every communicator, an identical communicator object must be created on every participating process.
  – A communicator object contains its basic info (rank, size, c_id, …) and a collective module that orchestrates collective communication.
• Thus, the computation time and memory consumption grow linearly with the number of processes.

[Figure: Processes 0, 1, and 2 each hold their own objects for Comms A, B, and C — 9 communicator objects and collective modules in total.]


Multi-Level Collective

• Cheetah is a popular framework that provides a suite of fast collectives. Its module for OpenMPI is called Multi-Level (ML).
• ML has a hierarchical structure. A process from one group communicates with other groups' members through a higher-level root.
• Every process needs to set up a global topology in order to communicate.


Cost Analysis of MPI_Init

• We study the time and memory cost of MPI_Init using OpenMPI's default collective module (i.e., Tuned) and ML.
• The two show different performance behaviors.
  – Generally, the init of ML performs worse than the init of Tuned.
  – The init of ML scales particularly badly in terms of time.

[Fig.: Cost of MPI_Init using ML — time (s) and memory consumption (MB) vs. number of processes, reaching 16 s and 24 MB at the largest scale.]


Cost Breakdown of ML Init

• The topology setup needs to conduct many all-to-all collective communications across all participating processes in the corresponding communicator.
• The time to finish this inter-process communication can occupy most of the ML init. It is the essential cause that makes ML init even more non-scalable.

[Fig.: Inter-process communication vs. other costs as a percentage of total ML init cost for 2–4096 processes; the communication share grows from 44% at 2 processes to 93% at 4096 (44%, 47%, 63%, 64%, 89%, 86%, 85%, 86%, 87%, 87%, 88%, 93%).]


Related Works & Our Solution

• Previous studies have recognized the performance issue of communicator initialization, but most of them have not identified and addressed the non-scalable initialization of the collective module, especially ML.
• We propose a hybrid solution — Delayed Initialization with Sharing and Prediction (DISP):
  1. Delayed Initialization
  2. Module Sharing
  3. Prediction-based Topology Setup


Delayed Initialization

• Delay the initialization of a communicator until it is actually used. Instead of a full-fledged communicator, we create a shadow communicator that contains only its basic info.
  – This removes the cost of unused modules.
  – Delayed initialization also facilitates module sharing between successive identical communicators, which removes the initialization cost of identical modules.

[Figure: Timeline comparison. The old process performs the global comm init and all sub-comm inits (some never used) during MPI startup; the new process performs only a shallow init at startup and defers to on-demand init, with module sharing, when collectives actually run.]


Module Sharing

• Temporal sharing: the collective module is shared between identical communicators.
• Spatial sharing: the collective module is shared between MPI processes on the same node. Only the root process on that node initializes the module.

[Figure: On Node 1, identical communicators (comm A and comm A) reuse one collective module within a process (temporal sharing), while Process 0 and Process 1 share a module across processes on the node (spatial sharing).]


Prediction-based Topology Setup

• Based on system specifics, every process predicts the topology without exchanging information with others.
• Our prediction algorithm computes the following information:
  ① Highest and lowest hierarchy level;
  ② Ranks of all participating processes;
  ③ All group lists that contain the ranks of the members;
  ④ Routing table of how a process can be reached by another one.

[Figure: A three-level hierarchy (Levels 0–2) annotated with ① levels, ② ranks, ③ groups, and ④ routing.]


Experimental Setup

• Testbed: all experiments are conducted on Titan's Cray XK6 machines.
  – 16-core AMD Opteron 6200 series processor.
  – 32 GB of DDR3 memory.
  – Connected through a Gemini interconnect.
  – 600 TB of total storage.
• Software: OpenMPI 1.8.8 and Cheetah 1.0.0.
• Benchmarks: NAS Parallel Benchmarks v3.3 (customized) and the MVAPICH MPI benchmark suite v4.4.1.


Overall Improvement

• The real improvement is the difference between DISP's improvement to the startup phase and its delay to the work phase.
• DISP improves ML by a bigger factor than Tuned because of ML's longer initialization cost.

[Fig. 1: Improvement vs. Delay — time (ms) of startup improvement and work delay, for both Tuned and ML, across the NPB programs bt, cg, ep, is, ft, lu, mg, and sp; the gap between each pair of bars is the real improvement.]


Memory Savings

• Delayed initialization saves the memory of unused communicators, and module sharing saves that of reusable collective modules.
• Actual savings depend on the ratio between the size of the collective module and the size of the communicator object.

[Fig. 1: Memory consumption (MB) vs. number of processes for Tuned, Orig vs. DISP — average savings 8.6%. Fig. 2: The same for ML — average savings 85.7%.]


Benefit of Prediction-based Setup

• By speeding up the initialization of the collective module, topology prediction significantly reduces the cost of MPI initialization calls.

[Fig. 1: MPI_Init() time (ms) vs. number of processes, Orig vs. DISP — up to a 70.0% improvement. Fig. 2: MPI_Comm_split & MPI_Comm_create times (ms), Orig vs. DISP — 63.8% and 74.9% improvements.]


Conclusion

• Issues with the communicator and collective module can significantly diminish MPI's scalability to thousands or more processes. We have examined this impact in terms of time and memory cost.
• By prudently delaying initialization and sharing reusable collective modules, we can efficiently reduce both time and memory costs.
• The costly topology setup of the multi-level collective module can be well mitigated by a prediction-based approach without affecting the collective module's functionality.


Acknowledgment


Thank You and Questions?