Job Startup at Exascale: Challenges and Solutions
Hari Subramoni
Network Based Computing Laboratory
The Ohio State University
http://nowlab.cse.ohio-state.edu/
Current Trends in HPC
• Supercomputing systems are scaling rapidly
  • Multi-/many-core architectures
  • High-performance interconnects
• Core density (per node) is increasing
  • Improvements in manufacturing technology
  • More performance per watt
• Hybrid programming models are popular for developing applications
  • Message Passing Interface (MPI)
  • Partitioned Global Address Space (PGAS)
[Images: Stampede2 @ TACC, Sunway TaihuLight]
Fast and scalable job startup is essential!
Why is Job Startup Important?
• Development and debugging
• Regression / acceptance testing
• Checkpoint-restart
Towards Exascale: Challenges to Address
• Dynamic allocation of resources
• Leveraging high-performance interconnects
• Exploiting opportunities for overlap
• Minimizing memory usage
[Figure: job startup performance vs. memory overhead, contrasting the state of the art with what is needed towards exascale]
Challenge: Avoid All-to-all Connectivity
[Figure: breakdown of initialization time (seconds) vs. number of processes (32 to 4K); components: Connection Setup, PMI Exchange, Memory Registration, Shared Memory Setup]
Average number of communicating peers per process:

Application   Processes   Average Peers
BT            64          8.7
              1024        10.6
EP            64          3.0
              1024        5.0
MG            64          9.5
              1024        11.9
SP            64          8.8
              1024        10.7
2D Heat       64          5.3
              1024        5.4
Connection setup phase takes 85% of initialization time with 4K processes
Applications rarely require full all-to-all connectivity
On-demand Connection Management
[Sequence diagram: on-demand connection establishment between Process 1 and Process 2, each running a main thread and a connection manager thread]
• Process 1's main thread issues a Put/Get targeting P2, creates a QP, transitions it to Init, and enqueues the send
• Process 1's connection manager thread sends a Connect Request carrying (LID, QPN) and (address, size, rkey) to Process 2
• Process 2's connection manager thread creates a QP, transitions it to Init and then RTR, and returns a Connect Reply carrying its (LID, QPN) and (address, size, rkey)
• Process 1 transitions its QP to RTR and then RTS; the connection is established and the queued send is dequeued
• Process 2 transitions its QP to RTS; the connection is established on its side and the Put/Get to P2 proceeds over it
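For reference, the QP state machine above can be sketched with libibverbs. This is a minimal sketch, not MVAPICH2's actual connection manager code: it assumes the peer's (LID, QPN) has already arrived via the Connect Request/Reply, and the names setup_qp_to_rts and remote_info are illustrative.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct remote_info { uint16_t lid; uint32_t qpn; };   /* carried in the connect messages */

/* Drive a freshly created RC QP through INIT -> RTR -> RTS once the
 * peer's (LID, QPN) is known from the Connect Request/Reply. */
static int setup_qp_to_rts(struct ibv_qp *qp, uint8_t port, const struct remote_info *peer)
{
    struct ibv_qp_attr attr;

    /* RESET -> INIT */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = port;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* INIT -> RTR: needs the remote LID and QPN exchanged above */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = peer->qpn;
    attr.rq_psn             = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = peer->lid;
    attr.ah_attr.port_num   = port;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS: the queued Put/Get can now be dequeued and issued */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                         IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                         IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}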
Results - On-demand Connections
[Figure: performance of OpenSHMEM initialization and Hello World; time taken (seconds) vs. number of processes (16 to 8K) for Hello World and start_pes, static vs. on-demand connections]
[Figure: execution time (seconds) of the OpenSHMEM NAS Parallel Benchmarks (BT, EP, MG, SP), static vs. on-demand connections]
Initialization is 29.6 times faster; total execution time is 35% better
On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda. 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
Challenge: Exploit High-performance Interconnects in PMI
• Used for network address exchange (put/fence/get pattern, sketched below), heterogeneity detection, etc.
• Used by major parallel programming frameworks
• Uses TCP/IP for transport
• Not efficient for moving large amounts of data
• Required to bootstrap high-performance interconnects
[Figure: breakdown of MPI_Init in MVAPICH2; time taken (seconds) vs. number of processes (32 to 8K): PMI Exchanges, Shared Memory, Other]
PMI = Process Management Interface
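The PMI exchange measured above typically follows a put / fence / get pattern. Below is a minimal sketch of that pattern using the PMI2 API (assuming Slurm's slurm/pmi2.h); the encode_local_address() helper and the "addr-<rank>" key naming are illustrative assumptions, not part of the standard.

#include <slurm/pmi2.h>
#include <stdio.h>

extern const char *encode_local_address(void);   /* hypothetical: serializes the local network address */

void exchange_addresses(int rank, int size, const char *jobid)
{
    char key[PMI2_MAX_KEYLEN], val[PMI2_MAX_VALLEN];
    int vallen;

    /* 1. Publish this process's network address under a per-rank key. */
    snprintf(key, sizeof(key), "addr-%d", rank);
    snprintf(val, sizeof(val), "%s", encode_local_address());
    PMI2_KVS_Put(key, val);

    /* 2. Synchronize: the process manager propagates all key/value pairs
     *    over its TCP/IP tree so every node can read them. */
    PMI2_KVS_Fence();

    /* 3. Pull every peer's address: O(P) lookups per process, all over TCP,
     *    which is why this phase dominates MPI_Init at scale. */
    for (int peer = 0; peer < size; peer++) {
        snprintf(key, sizeof(key), "addr-%d", peer);
        PMI2_KVS_Get(jobid, PMI2_ID_NULL, key, val, sizeof(val), &vallen);
        /* use val to fill in the connection table entry for 'peer' */
    }
}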
PMIX_Ring: A Scalable Alternative
▪ Exchange data with only the left and right neighbors over TCP
▪ Exchange the bulk of the data over the high-speed interconnect (e.g., InfiniBand, Omni-Path); a ring allgather sketch follows below
int PMIX_Ring(char value[], char left[], char right[], …)
[Figure: comparison of PMI operations; time taken (seconds) vs. number of processes (16 to 16K): Fence, Put, Gets, Ring]
PMIX_Ring is more scalable
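To illustrate what the ring layout enables, here is a minimal sketch: PMIX_Ring delivers only the left and right neighbors' addresses over TCP, after which the full address table can be assembled in P-1 forwarding steps over the fabric. The send_to_right()/recv_from_left() helpers and the fixed ADDR_LEN are assumptions standing in for whatever InfiniBand/Omni-Path transport the library actually uses; a real implementation would use non-blocking operations so the send and receive do not deadlock.

#include <string.h>

#define ADDR_LEN 64

extern void send_to_right(const char *buf, int len);   /* hypothetical: over the HCA */
extern void recv_from_left(char *buf, int len);         /* hypothetical: over the HCA */

/* Ring allgather of endpoint addresses; the left/right neighbor connections
 * were bootstrapped with the (LID, QPN) strings returned by PMIX_Ring. */
void ring_allgather_addresses(int rank, int size,
                              const char my_addr[ADDR_LEN],
                              char table[][ADDR_LEN])
{
    memcpy(table[rank], my_addr, ADDR_LEN);

    /* In step s, forward the address that arrived s steps ago; after
     * size-1 steps every process holds every peer's address. */
    for (int s = 0; s < size - 1; s++) {
        int send_idx = (rank - s + size) % size;       /* address to forward right */
        int recv_idx = (rank - s - 1 + size) % size;   /* address arriving from the left */
        send_to_right(table[send_idx], ADDR_LEN);
        recv_from_left(table[recv_idx], ADDR_LEN);
    }
}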
Results - PMIX_Ring
33% improvement in MPI_Init; total execution time is 20% better
[Figure: performance of MPI_Init and Hello World with PMIX_Ring; time taken (seconds) vs. number of processes (16 to 8K): Hello World (Fence), Hello World (Ring), MPI_Init (Fence), MPI_Init (proposed)]
[Figure: NAS benchmarks with 1K processes, Class B data; time taken (seconds) for EP, MG, CG, FT, BT, SP: Fence vs. Ring]
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda. Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Challenge: Exploit Overlap in Application Initialization
• PMI operations are progressed by the process manager
• The MPI/PGAS library is not involved
• Can be overlapped with other initialization tasks / application computation (see the usage sketch below)
• Put + Fence + Get combined into a single function: Allgather
int PMIX_KVS_Ifence(PMIX_Request *request)
int PMIX_Iallgather(const char value[], char buffer[], PMIX_Request *request)
int PMIX_Wait(PMIX_Request request)
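A minimal usage sketch of the overlap these interfaces enable, assuming the prototypes above are available: start the exchange, perform intra-node work that needs no remote addresses, and wait only when the address table is actually required. The buffer sizes and the setup_shared_memory() helper are illustrative.

extern void setup_shared_memory(void);        /* hypothetical: intra-node initialization work */

void overlapped_init(void)
{
    char my_endpoint_info[64];                /* local address, filled in elsewhere */
    static char all_endpoint_info[64 * 4096]; /* gathered table; size is illustrative */
    PMIX_Request req;

    /* Start the exchange: Put + Fence + Get folded into one non-blocking call. */
    PMIX_Iallgather(my_endpoint_info, all_endpoint_info, &req);

    /* Overlap the TCP-based exchange with work that needs no remote addresses. */
    setup_shared_memory();

    /* Block only when the remote address table is actually needed. */
    PMIX_Wait(req);
}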
Results - Non-blocking PMI Collectives
Near-constant MPI_Init at any scale; Allgather is 38% faster than Fence
[Figure: performance of MPI_Init; time taken (seconds) vs. number of processes (64 to 16K): Fence, Ifence, Allgather, Iallgather]
[Figure: comparison of Fence and Allgather; time taken (seconds) vs. number of processes (64 to 16K): PMI2_KVS_Fence vs. PMIX_Allgather]
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
Challenge: Minimize Memory Footprint
• Address table and similar information is stored in the PMI Key-value store (KVS)
• All processes in the node duplicate the KVS
• PPN redundant copies per node
[Diagram: the process manager holds the PMI key-value store (KVS); Processes 1 through N on the node each keep their own duplicate copy of the KVS]
PPN = Number of Processes per Node
Shared Memory based PMI
• Process manager creates and populates the shared memory region
• MPI processes directly read from shared memory
• Dual shared-memory-region hash-table design for performance and memory efficiency (see the layout sketch below)
[Diagram: dual shared memory regions; a hash table region with per-bucket head/tail indices and a key-value store region with (key, value, next) entries and a top counter]
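A minimal sketch of a layout consistent with the figure, under the assumption that the hash-table region stores per-bucket head/tail indices while the KVS region stores (key, value, next) entries appended by the process manager; the field names and sizes are illustrative, not MVAPICH2's actual structures.

#include <stdint.h>
#include <string.h>

#define KVS_KEY_LEN   64
#define KVS_VAL_LEN  256
#define KVS_EMPTY    (-1)

/* Region 1: hash table, one bucket per hash value. */
struct kvs_bucket {
    int32_t head;   /* index of the first entry in the KVS region, or KVS_EMPTY */
    int32_t tail;   /* index of the last entry, for O(1) append by the manager */
};

/* Region 2: key-value store, an append-only array of entries. */
struct kvs_entry {
    char    key[KVS_KEY_LEN];
    char    value[KVS_VAL_LEN];
    int32_t next;   /* next entry in the same bucket, or KVS_EMPTY */
};

struct kvs_header {
    int32_t top;            /* number of entries currently populated */
    int32_t num_buckets;
};

/* Read path used by the MPI processes: the process manager is the only
 * writer, so readers simply follow the chain for their bucket. */
static const char *kvs_get(const struct kvs_header *hdr,
                           const struct kvs_bucket *table,
                           const struct kvs_entry *store,
                           const char *key, unsigned hash)
{
    for (int32_t i = table[hash % hdr->num_buckets].head;
         i != KVS_EMPTY; i = store[i].next) {
        if (strncmp(store[i].key, key, KVS_KEY_LEN) == 0)
            return store[i].value;
    }
    return 0;
}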
Shared Memory based PMI
PMI Gets are 1000x faster; memory footprint reduced by O(PPN)
[Figure: time taken by one PMI_Get (milliseconds) vs. number of processes (1 to 32): Default vs. Shmem]
[Figure: PMI memory usage per node (MB, log scale) vs. number of processes (32K to 1M): Fence - Default, Allgather - Default, Fence - Shmem, Allgather - Shmem]
SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda. 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16)
Efficient Intra-node Topology Discovery
[Diagram: Previous design — every process on the node runs hwloc and reads the /proc/ filesystem, causing contention. Current design — a single process runs hwloc against /proc/ and shares the result with the other processes, so there is no contention]
Significant improvement on many-core systems
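A minimal sketch of the "discover once, share the result" approach using hwloc's XML export/import. The hwloc calls are real API (hwloc 1.x signatures; hwloc 2.x adds a trailing flags argument to the export call), but whether MVAPICH2 shares the topology exactly this way is an assumption.

#include <hwloc.h>

/* Node-local leader: read /proc and /sys once and export the topology as XML. */
int leader_export_topology(hwloc_topology_t *topo, char **xmlbuf, int *xmllen)
{
    hwloc_topology_init(topo);
    hwloc_topology_load(*topo);    /* the only process on the node touching /proc */
    return hwloc_topology_export_xmlbuffer(*topo, xmlbuf, xmllen);
    /* caller copies *xmlbuf into a shared-memory region visible to all local
     * ranks, then releases it with hwloc_free_xmlbuffer(*topo, *xmlbuf) */
}

/* Other ranks: rebuild the topology from the shared XML buffer without
 * touching the /proc filesystem themselves. */
int follower_import_topology(hwloc_topology_t *topo, const char *xmlbuf, int xmllen)
{
    hwloc_topology_init(topo);
    hwloc_topology_set_xmlbuffer(*topo, xmlbuf, xmllen);
    return hwloc_topology_load(*topo);
}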
Startup Performance on KNL + Omni-Path
[Figure: MPI_Init and Hello World on Oakforest-PACS; time taken (seconds) vs. number of processes (64 to 64K) for MVAPICH2-2.3a]
[Figure: MPI_Init on TACC Stampede-KNL; MPI_Init time (seconds) vs. number of processes (64 to 232K) for Intel MPI 2018 beta and MVAPICH2 2.3a]
• MPI_Init takes 22 seconds on 231,956 processes on 3,624 KNL nodes (Stampede – full scale)
• 8.8 times faster than Intel MPI at 128K processes (courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node
Summary
[Diagram: job startup performance vs. memory required to store endpoint information. P marks the PGAS state of the art, M the MPI state of the art, and O the optimized PGAS/MPI designs. Optimizations labeled on the diagram: (a) on-demand connections, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) shared memory based PMI]
• Near-constant MPI/OpenSHMEM initialization at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM with 16,384 processes (1,024 nodes)
• Full-scale startup on Stampede2 in under 22 seconds with 232K processes
• O(PPN) reduction in PMI memory footprint
Optimized designs are available in MVAPICH2 and MVAPICH2-X 2.3b
Thank You!
http://go.osu.edu/mvapich-startup
http://mvapich.cse.ohio-state.edu/