Job Startup at Exascale: Challenges and Solutions
Hari Subramoni
Network Based Computing Laboratory
The Ohio State University
http://nowlab.cse.ohio-state.edu/
Current Trends in HPC
• Supercomputing systems are scaling rapidly
  • Multi-/many-core architectures
  • High-performance interconnects
• Core density (per node) is increasing
  • Improvements in manufacturing technology
  • More performance per watt
• Hybrid programming models are popular for developing applications
  • Message Passing Interface (MPI)
  • Partitioned Global Address Space (PGAS)
[Images: Stampede2 @ TACC, Sunway TaihuLight]
Fast and scalable job startup is essential!
Why is Job Startup Important?
• Development and debugging
• Regression / acceptance testing
• Checkpoint-restart
Towards Exascale: Challenges to Address
• Dynamic allocation of resources
• Leveraging high-performance interconnects
• Exploiting opportunities for overlap
• Minimizing memory usage
[Figure: job startup performance vs. memory overhead, contrasting the state of the art with what is needed towards exascale]
Challenge: Avoid All-to-all Connectivity
[Figure: breakdown of initialization time (seconds) vs. number of processes (32 to 4K); components: Connection Setup, PMI Exchange, Memory Registration, Shared Memory Setup]
Average number of communicating peers per process:

Application   Processes   Average Peers
BT            64          8.7
              1024        10.6
EP            64          3.0
              1024        5.0
MG            64          9.5
              1024        11.9
SP            64          8.8
              1024        10.7
2D Heat       64          5.3
              1024        5.4
Connection setup phase takes 85% of initialization time with 4K processes
Applications rarely require full all-to-all connectivity
On-demand Connection Management
[Sequence diagram: on-demand connection establishment between Process 1 and Process 2, each running a main thread and a connection manager thread]
• Process 1's main thread issues a Put/Get targeting P2, creates a QP, transitions it to Init, and enqueues the send
• Process 1's connection manager thread sends a Connect Request carrying (LID, QPN) and (address, size, rkey) to Process 2
• Process 2's connection manager thread creates a QP, transitions it to Init and then RTR, and returns a Connect Reply carrying its (LID, QPN) and (address, size, rkey)
• Process 1 transitions its QP to RTR and then RTS; the connection is established and the queued send is dequeued
• Process 2 transitions its QP to RTS; the connection is established on its side and the Put/Get to P2 proceeds over it
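For reference, the QP state machine above can be sketched with libibverbs. This is a minimal sketch, not MVAPICH2's actual connection manager code: it assumes the peer's (LID, QPN) has already arrived via the Connect Request/Reply, and the names setup_qp_to_rts and remote_info are illustrative.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct remote_info { uint16_t lid; uint32_t qpn; };   /* carried in the connect messages */

/* Drive a freshly created RC QP through INIT -> RTR -> RTS once the
 * peer's (LID, QPN) is known from the Connect Request/Reply. */
static int setup_qp_to_rts(struct ibv_qp *qp, uint8_t port, const struct remote_info *peer)
{
    struct ibv_qp_attr attr;

    /* RESET -> INIT */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = port;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* INIT -> RTR: needs the remote LID and QPN exchanged above */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = peer->qpn;
    attr.rq_psn             = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = peer->lid;
    attr.ah_attr.port_num   = port;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS: the queued Put/Get can now be dequeued and issued */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                         IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                         IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}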
Results - On-demand Connections
[Figure: performance of OpenSHMEM initialization and Hello World; time taken (seconds) vs. number of processes (16 to 8K) for Hello World and start_pes, static vs. on-demand connections]
[Figure: execution time (seconds) of the OpenSHMEM NAS Parallel Benchmarks (BT, EP, MG, SP), static vs. on-demand connections]
Initialization is 29.6 times faster; total execution time is 35% better
On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda. 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
Challenge: Exploit High-performance Interconnects in PMI
• Used for network address exchange (put/fence/get pattern, sketched below), heterogeneity detection, etc.
• Used by major parallel programming frameworks
• Uses TCP/IP for transport
• Not efficient for moving large amounts of data
• Required to bootstrap high-performance interconnects
[Figure: breakdown of MPI_Init in MVAPICH2; time taken (seconds) vs. number of processes (32 to 8K): PMI Exchanges, Shared Memory, Other]
PMI = Process Management Interface
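The PMI exchange measured above typically follows a put / fence / get pattern. Below is a minimal sketch of that pattern using the PMI2 API (assuming Slurm's slurm/pmi2.h); the encode_local_address() helper and the "addr-<rank>" key naming are illustrative assumptions, not part of the standard.

#include <slurm/pmi2.h>
#include <stdio.h>

extern const char *encode_local_address(void);   /* hypothetical: serializes the local network address */

void exchange_addresses(int rank, int size, const char *jobid)
{
    char key[PMI2_MAX_KEYLEN], val[PMI2_MAX_VALLEN];
    int vallen;

    /* 1. Publish this process's network address under a per-rank key. */
    snprintf(key, sizeof(key), "addr-%d", rank);
    snprintf(val, sizeof(val), "%s", encode_local_address());
    PMI2_KVS_Put(key, val);

    /* 2. Synchronize: the process manager propagates all key/value pairs
     *    over its TCP/IP tree so every node can read them. */
    PMI2_KVS_Fence();

    /* 3. Pull every peer's address: O(P) lookups per process, all over TCP,
     *    which is why this phase dominates MPI_Init at scale. */
    for (int peer = 0; peer < size; peer++) {
        snprintf(key, sizeof(key), "addr-%d", peer);
        PMI2_KVS_Get(jobid, PMI2_ID_NULL, key, val, sizeof(val), &vallen);
        /* use val to fill in the connection table entry for 'peer' */
    }
}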
PMIX_Ring: A Scalable Alternative
▪ Exchange data with only the left and right neighbors over TCP
▪ Exchange the bulk of the data over the high-speed interconnect (e.g., InfiniBand, Omni-Path); a ring allgather sketch follows below
int PMIX_Ring(char value[], char left[], char right[], …)
[Figure: comparison of PMI operations; time taken (seconds) vs. number of processes (16 to 16K): Fence, Put, Gets, Ring]
PMIX_Ring is more scalable
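To illustrate what the ring layout enables, here is a minimal sketch: PMIX_Ring delivers only the left and right neighbors' addresses over TCP, after which the full address table can be assembled in P-1 forwarding steps over the fabric. The send_to_right()/recv_from_left() helpers and the fixed ADDR_LEN are assumptions standing in for whatever InfiniBand/Omni-Path transport the library actually uses; a real implementation would use non-blocking operations so the send and receive do not deadlock.

#include <string.h>

#define ADDR_LEN 64

extern void send_to_right(const char *buf, int len);   /* hypothetical: over the HCA */
extern void recv_from_left(char *buf, int len);         /* hypothetical: over the HCA */

/* Ring allgather of endpoint addresses; the left/right neighbor connections
 * were bootstrapped with the (LID, QPN) strings returned by PMIX_Ring. */
void ring_allgather_addresses(int rank, int size,
                              const char my_addr[ADDR_LEN],
                              char table[][ADDR_LEN])
{
    memcpy(table[rank], my_addr, ADDR_LEN);

    /* In step s, forward the address that arrived s steps ago; after
     * size-1 steps every process holds every peer's address. */
    for (int s = 0; s < size - 1; s++) {
        int send_idx = (rank - s + size) % size;       /* address to forward right */
        int recv_idx = (rank - s - 1 + size) % size;   /* address arriving from the left */
        send_to_right(table[send_idx], ADDR_LEN);
        recv_from_left(table[recv_idx], ADDR_LEN);
    }
}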
Results - PMIX_Ring
33% improvement in MPI_Init; total execution time is 20% better
[Figure: performance of MPI_Init and Hello World with PMIX_Ring; time taken (seconds) vs. number of processes (16 to 8K): Hello World (Fence), Hello World (Ring), MPI_Init (Fence), MPI_Init (proposed)]
[Figure: NAS benchmarks with 1K processes, Class B data; time taken (seconds) for EP, MG, CG, FT, BT, SP: Fence vs. Ring]
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda. Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Challenge: Exploit Overlap in Application Initialization
• PMI operations are progressed by the process manager
• The MPI/PGAS library is not involved
• Can be overlapped with other initialization tasks / application computation (see the usage sketch below)
• Put + Fence + Get combined into a single function: Allgather
int PMIX_KVS_Ifence(PMIX_Request *request)
int PMIX_Iallgather(const char value[], char buffer[], PMIX_Request *request)
int PMIX_Wait(PMIX_Request request)
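A minimal usage sketch of the overlap these interfaces enable, assuming the prototypes above are available: start the exchange, perform intra-node work that needs no remote addresses, and wait only when the address table is actually required. The buffer sizes and the setup_shared_memory() helper are illustrative.

extern void setup_shared_memory(void);        /* hypothetical: intra-node initialization work */

void overlapped_init(void)
{
    char my_endpoint_info[64];                /* local address, filled in elsewhere */
    static char all_endpoint_info[64 * 4096]; /* gathered table; size is illustrative */
    PMIX_Request req;

    /* Start the exchange: Put + Fence + Get folded into one non-blocking call. */
    PMIX_Iallgather(my_endpoint_info, all_endpoint_info, &req);

    /* Overlap the TCP-based exchange with work that needs no remote addresses. */
    setup_shared_memory();

    /* Block only when the remote address table is actually needed. */
    PMIX_Wait(req);
}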
Results - Non-blocking PMI Collectives
Near-constant MPI_Init at any scale; Allgather is 38% faster than Fence
[Figure: performance of MPI_Init; time taken (seconds) vs. number of processes (64 to 16K): Fence, Ifence, Allgather, Iallgather]
[Figure: comparison of Fence and Allgather; time taken (seconds) vs. number of processes (64 to 16K): PMI2_KVS_Fence vs. PMIX_Allgather]
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
Challenge: Minimize Memory Footprint
• Address table and similar information is stored in the PMI Key-value store (KVS)
• All processes in the node duplicate the KVS
• PPN redundant copies per node
[Diagram: the process manager holds the PMI key-value store (KVS); Processes 1 through N on the node each keep their own duplicate copy of the KVS]
PPN = Number of Processes per Node
Shared Memory based PMI
• Process manager creates and populates the shared memory region
• MPI processes directly read from shared memory
• Dual shared-memory-region hash-table design for performance and memory efficiency (see the layout sketch below)
[Diagram: dual shared memory regions; a hash table region with per-bucket head/tail indices and a key-value store region with (key, value, next) entries and a top counter]
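A minimal sketch of a layout consistent with the figure, under the assumption that the hash-table region stores per-bucket head/tail indices while the KVS region stores (key, value, next) entries appended by the process manager; the field names and sizes are illustrative, not MVAPICH2's actual structures.

#include <stdint.h>
#include <string.h>

#define KVS_KEY_LEN   64
#define KVS_VAL_LEN  256
#define KVS_EMPTY    (-1)

/* Region 1: hash table, one bucket per hash value. */
struct kvs_bucket {
    int32_t head;   /* index of the first entry in the KVS region, or KVS_EMPTY */
    int32_t tail;   /* index of the last entry, for O(1) append by the manager */
};

/* Region 2: key-value store, an append-only array of entries. */
struct kvs_entry {
    char    key[KVS_KEY_LEN];
    char    value[KVS_VAL_LEN];
    int32_t next;   /* next entry in the same bucket, or KVS_EMPTY */
};

struct kvs_header {
    int32_t top;            /* number of entries currently populated */
    int32_t num_buckets;
};

/* Read path used by the MPI processes: the process manager is the only
 * writer, so readers simply follow the chain for their bucket. */
static const char *kvs_get(const struct kvs_header *hdr,
                           const struct kvs_bucket *table,
                           const struct kvs_entry *store,
                           const char *key, unsigned hash)
{
    for (int32_t i = table[hash % hdr->num_buckets].head;
         i != KVS_EMPTY; i = store[i].next) {
        if (strncmp(store[i].key, key, KVS_KEY_LEN) == 0)
            return store[i].value;
    }
    return 0;
}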
Shared Memory based PMI
PMI Gets are 1000x faster; memory footprint reduced by O(PPN)
[Figure: time taken by one PMI_Get (milliseconds) vs. number of processes (1 to 32): Default vs. Shmem]
[Figure: PMI memory usage per node (MB, log scale) vs. number of processes (32K to 1M): Fence - Default, Allgather - Default, Fence - Shmem, Allgather - Shmem]
SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda. 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16)
Efficient Intra-node Topology Discovery
[Diagram: Previous design — every process on the node runs hwloc and reads the /proc/ filesystem, causing contention. Current design — a single process runs hwloc against /proc/ and shares the result with the other processes, so there is no contention]
Significant improvement on many-core systems
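A minimal sketch of the "discover once, share the result" approach using hwloc's XML export/import. The hwloc calls are real API (hwloc 1.x signatures; hwloc 2.x adds a trailing flags argument to the export call), but whether MVAPICH2 shares the topology exactly this way is an assumption.

#include <hwloc.h>

/* Node-local leader: read /proc and /sys once and export the topology as XML. */
int leader_export_topology(hwloc_topology_t *topo, char **xmlbuf, int *xmllen)
{
    hwloc_topology_init(topo);
    hwloc_topology_load(*topo);    /* the only process on the node touching /proc */
    return hwloc_topology_export_xmlbuffer(*topo, xmlbuf, xmllen);
    /* caller copies *xmlbuf into a shared-memory region visible to all local
     * ranks, then releases it with hwloc_free_xmlbuffer(*topo, *xmlbuf) */
}

/* Other ranks: rebuild the topology from the shared XML buffer without
 * touching the /proc filesystem themselves. */
int follower_import_topology(hwloc_topology_t *topo, const char *xmlbuf, int xmllen)
{
    hwloc_topology_init(topo);
    hwloc_topology_set_xmlbuffer(*topo, xmlbuf, xmllen);
    return hwloc_topology_load(*topo);
}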
Startup Performance on KNL + Omni-Path
[Figure: MPI_Init and Hello World on Oakforest-PACS; time taken (seconds) vs. number of processes (64 to 64K) for MVAPICH2-2.3a]
[Figure: MPI_Init on TACC Stampede-KNL; MPI_Init time (seconds) vs. number of processes (64 to 232K) for Intel MPI 2018 beta and MVAPICH2 2.3a]
• MPI_Init takes 22 seconds on 231,956 processes on 3,624 KNL nodes (Stampede – full scale)
• 8.8 times faster than Intel MPI at 128K processes (courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node
Summary
[Diagram: job startup performance vs. memory required to store endpoint information. P marks the PGAS state of the art, M the MPI state of the art, and O the optimized PGAS/MPI designs. Optimizations labeled on the diagram: (a) on-demand connections, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) shared memory based PMI]
• Near-constant MPI/OpenSHMEM initialization at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM with 16,384 processes (1,024 nodes)
• Full-scale startup on Stampede2 in under 22 seconds with 232K processes
• O(PPN) reduction in PMI memory footprint
Optimized designs are available in MVAPICH2 and MVAPICH2-X 2.3b
Thank You!
http://go.osu.edu/mvapich-startup
http://mvapich.cse.ohio-state.edu/