MPI+Threads: Runtime Contention and Remedies
Abdelhalim Amer*, Huiwei Lu+, Yanjie Wei#, Pavan Balaji+, Satoshi Matsuoka*
*Tokyo Institute of Technology
+Argonne National Laboratory
#Shenzhen Institute of Advanced Technologies, Chinese Academy of Sciences
PPoPP’15, February 7–11, 2015, San Francisco, CA, USA.
The Message Passing Interface (MPI)
• Standard library specification (not a language)
• Several implementations
  – MPICH and derivatives: MVAPICH, Intel-MPI, Cray-MPI, …
  – OpenMPI
• A large portion of legacy HPC applications use MPI
• Not just message passing:
  – Remote Memory Access (RMA)
Why MPI + X?
• Core density is increasing
• Other resources do not scale at the same rate
  – Memory per core is shrinking
  – Network endpoints
• Sharing resources within nodes is becoming necessary
• X: shared-memory programming
  – Threads: OpenMP, TBB, …
  – MPI shared memory!
  – PGAS
(Figure: evolution of the memory capacity per core in the Top500 list [1].)
[1] Peter Kogge. PIM & memory: The need for a revolution in architecture. The Argonne Training Program on Extreme-Scale Computing (ATPESC), 2013.
MPI+Threads Interoperation
MPI_Init_thread(…, required, …)
• MPI_THREAD_SINGLE
  – No additional threads
• MPI_THREAD_FUNNELED
  – Only the master thread communicates
• MPI_THREAD_SERIALIZED
  – Multithreaded communication, serialized
• MPI_THREAD_MULTIPLE
  – No restrictions
The levels range from most restrictive (low thread-safety costs) to most flexible (high thread-safety costs).
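For illustration, a minimal C sketch of requesting the most flexible level and checking what the library actually grants (standard MPI API, nothing implementation-specific):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request full multithreading support; the library returns the
       highest level it can actually provide in 'provided'. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
    /* ... communicate from multiple threads only if granted ... */
    MPI_Finalize();
    return 0;
}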
Test Environment

Fusion cluster at Argonne National Laboratory:

  Architecture        Nehalem
  Processor           Xeon E5540
  Clock frequency     2.6 GHz
  Number of sockets   2
  Cores per socket    4
  L3 size             8192 KB
  L2 size             256 KB
  Number of nodes     310
  Interconnect        Mellanox QDR
  MPI library         MPICH
  Network module      Nemesis:MXM
Contention in Multithreaded Communication

(Figure: message rate [10^3 msgs/s] vs. message size [bytes]. Left: multithreaded point-to-point BW between two processes, P0 ↔ P1, with 1, 2, 4, and 8 threads per node. Right: multi-process point-to-point BW, P0–P3 ↔ P4–P7, with 1, 2, 4, and 8 processes per node.)
Dimensions of Thread-Safety
• Critical section granularity
  – Shorter is better, but more complex
• Synchronization mechanism
  – How to hand off to the next thread?
    • Atomic ops, memory barriers, system calls, NUMA-awareness
  – Arbitration: who enters the CS?
    • Fairness: random, FIFO, priority
(Diagram: threads entering a critical section, characterized by its length, hand-off mechanism, and arbitration.)
Reducing Contention by Refining Critical Section Granularity
Balaji, Pavan, et al. "Fine-grained multithreading support for hybrid threaded MPI programming." International Journal of High Performance Computing Applications 24.1 (2010): 49–57.
Thread-Safety in MPICH
• GCS: global CS only
• POCS: per-object CS supported
• Supports a 1:1 threading model: only sees kernel threads
(Diagram: the MPICH stack — MPI/MPID headers and thread layer above the CH3, PAMID, MRail, and PSM devices; CH3 includes the Nemesis channel (IB, MXM, …) and the Sock channel (TCP). The current work targets Nemesis (GCS); MVAPICH's MRail uses a GCS; Blue Gene's PAMID supports a POCS.)
Baseline Thread-Safety in MPICH/Nemesis: Pthread Mutex
• Global critical section
• Implementation: NPTL pthread mutex
  – CAS in user space
  – Futex wait/wake in contended cases
  – Arbitration: fastest thread first → possible unfairness
(Diagram: pthread_mutex_lock. The fast path is a user-space CAS; under contention, threads sleep in the kernel via FUTEX_WAIT, are woken by FUTEX_WAKE, and retry the CAS before entering the critical section.)
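To make that lock/unlock path concrete, here is a minimal futex-based mutex in the NPTL style described above (an illustrative sketch, not MPICH's or glibc's actual code; state encoding: 0 = free, 1 = locked, 2 = locked with possible waiters):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_state = 0;

static void lock(void)
{
    int expected = 0;
    /* Fast path: uncontended CAS entirely in user space. */
    if (atomic_compare_exchange_strong(&lock_state, &expected, 1))
        return;
    /* Slow path: mark the lock contended and sleep in the kernel
       until woken; whoever wins the exchange first gets the lock
       ("fastest thread first"). */
    while (atomic_exchange(&lock_state, 2) != 0)
        syscall(SYS_futex, &lock_state, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void unlock(void)
{
    /* Wake one sleeper only if someone may be waiting. */
    if (atomic_exchange(&lock_state, 0) == 2)
        syscall(SYS_futex, &lock_state, FUTEX_WAKE, 1, NULL, NULL, 0);
}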
Unfairness May Occur!
(Diagram: threads T0–T3 on four cores issuing CASes on a shared mutex. With a flat memory, access should be random; with a hierarchical memory (private L1s, shared L2s), access is biased by the proximity to the cache containing the mutex.)
Fairness Analysis
• Bandwidth benchmark
• Unfairness levels
  – Core level: a single thread is monopolizing the lock
  – Socket level: threads on the same socket are monopolizing the lock
• Bias factor
  – How much a fair arbitration is biased
  – Bias factor = 1 → fair arbitration
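The slides do not spell out how the bias factor is computed; one natural reading (my assumption, not necessarily the paper's exact definition) is the share of lock acquisitions observed for the dominant thread (or socket), normalized by its fair share:

\text{bias factor} \;=\; \frac{\text{observed acquisitions by the dominant thread (or socket)}}{\text{expected acquisitions under fair arbitration}}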
(Figure: fairness analysis of the BW benchmark with 8 threads — bias factor vs. message size [bytes] at the core and socket levels.)
Internals of an MPI Runtime and Mutex
(Diagram: the threads' resource-acquisition sequence plotted against the communication progress engine's work-availability sequence over time. With a mutex, threads frequently acquire the runtime when no work is available to them, paying a hand-off penalty: resource acquisitions are wasted.)
Consequences of Unfair Arbitration
• DR: dangling requests
  – Completed but not yet freed
  – We want to keep this number low
• With the mutex, the count reaches about 40% of the maximum (figure below)
(Figure: average number of dangling requests vs. message size [bytes].)
Simple Solution: Force FIFO
• Ticket lock
  – Busy waiting
  – FIFO arbitration
(Diagram: resource-acquisition timelines under the mutex vs. the ticket lock — fairness (FIFO) reduces wasted resource acquisitions and shrinks the penalty.)
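A minimal ticket lock in C11 atomics (an illustrative sketch, not MPICH's actual code; the names ticket_lock_t, ticket_acquire, and ticket_release are mine). Each thread takes a ticket with a fetch-and-add and spins until served, so hand-off order is exactly arrival order:

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* ticket dispenser */
    atomic_uint now_serving;  /* ticket currently allowed in */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l)
{
    /* Take a ticket; threads are served strictly in ticket order. */
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me)
        ;  /* busy wait */
}

static void ticket_release(ticket_lock_t *l)
{
    /* Hand off to the next ticket holder. */
    atomic_fetch_add(&l->now_serving, 1);
}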
(Figure: average number of dangling requests vs. message size [bytes], mutex vs. ticket lock.)
Preliminary Throughput Results
(Figures, 8 cores/node: message rate [10^3 msgs/s] for mutex vs. ticket lock — vs. number of threads per node under compact and scatter bindings, and vs. message size [bytes].)
Can we do better?
• Critical section constraints
  – Threads have to yield when blocking in the progress engine
  – To respect MPI progress semantics
• Observations
  – Most MPI calls do useful work the first time they enter the runtime
  – A thread starts polling if its operation is not completed
(Diagram: simplified execution flow of a thread-safe MPI implementation with critical sections; a sketch of the poll-and-yield loop follows.)
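A simplified sketch of that flow (the names cs_enter, cs_exit, progress_poll, and wait_for_completion are mine, for illustration): a blocking call polls the progress engine inside the critical section and yields between attempts so other threads can enter.

#include <sched.h>

extern void cs_enter(void);            /* global critical section */
extern void cs_exit(void);
extern int  progress_poll(void *req);  /* nonzero once 'req' completes */

void wait_for_completion(void *req)
{
    cs_enter();
    while (!progress_poll(req)) {  /* no useful work completed this pass */
        cs_exit();                 /* let another thread drive progress */
        sched_yield();
        cs_enter();                /* re-enter and poll again */
    }
    cs_exit();
}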
Can we do better? (cont.)
• Idea:
  – Two priority levels: high and low
  – All threads start with high priority (1)
  – A thread falls to low priority (2) if its operation is
    • Blocking
    • Failed to complete immediately
• 3 ticket locks:
  – One for mutual exclusion in each priority level
  – Another for high-priority threads to block lower ones
(Diagram: simplified execution flow of a thread-safe MPI implementation with critical sections; one possible lock construction follows.)
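One way such a priority lock could be assembled from three ticket locks plus a waiter count (a plausible sketch of the scheme the slide outlines, building on the ticket_lock_t above; the authors' actual construction may differ, and the high_count counter is my addition):

#include <stdatomic.h>

/* 'block' provides overall mutual exclusion; 'high' and 'low'
 * serialize threads within each priority level; low-priority threads
 * stand aside while any high-priority thread is present. */
ticket_lock_t high, low, block;
atomic_int high_count = 0;

void cs_enter_high(void)
{
    atomic_fetch_add(&high_count, 1);
    ticket_acquire(&high);   /* exclusion among high-priority threads */
    ticket_acquire(&block);  /* overall exclusion */
}
void cs_exit_high(void)
{
    ticket_release(&block);
    ticket_release(&high);
    atomic_fetch_sub(&high_count, 1);
}

void cs_enter_low(void)
{
    ticket_acquire(&low);    /* exclusion among low-priority threads */
    while (atomic_load(&high_count) > 0)
        ;                    /* let high-priority threads go first */
    ticket_acquire(&block);
}
void cs_exit_low(void)
{
    ticket_release(&block);
    ticket_release(&low);
}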
Preliminary Throughput Results: N2N Benchmark
(Figure: message rate [10^3 msgs/s] vs. message size [bytes], ticket vs. priority lock.)
Evaluation
Two-Sided Pt2Pt with 32 cores
(Figures: latency [µs] and message rate [10^3 msgs/s] vs. message size [bytes] for the single-threaded baseline and the mutex, ticket, and priority locks; the throughput plot annotates an ~8x gap.)
ARMCI-MPI + Async. Progress
(Diagram: a process P issues MPI_Put() to a target whose dedicated progress thread drives asynchronous completion; Put, Get, and Accumulate are benchmarked.)
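For context, a minimal sketch of one-sided communication with the standard MPI-3 RMA calls that ARMCI-MPI layers the ARMCI interface over (the function put_example is mine; the benchmark itself is not shown here):

#include <mpi.h>

/* Minimal MPI-3 RMA sketch: each rank exposes a window and rank 0
 * writes one double into rank 1's memory with MPI_Put.
 * Requires at least two ranks. */
void put_example(void)
{
    double buf = 3.14, win_mem = 0.0;
    MPI_Win win;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&win_mem, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open the access epoch */
    if (rank == 0)
        MPI_Put(&buf, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* displacement */, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);             /* complete the Put */

    MPI_Win_free(&win);
}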
(Figures: data transfer rate [10^3 elements/s] vs. data element size [bytes] for Put, Get, and Accumulate, comparing mutex, ticket, and priority locks.)
3D 7-Pt Stencil
Strong Scaling with 64 Nodes
(Figure, left: performance [GFlops] vs. problem size per core [bytes], comparing mutex, ticket, and priority locks. Figure, right: execution breakdown — percentage of time spent in MPI, computation, and OpenMP synchronization vs. problem size per core [bytes]. Diagram: domain decomposition.)
MPI+OpenMP Graph500 BFS
(Figures: performance [MTEPS] vs. number of threads per node on 16 nodes with compact binding, and weak scaling — performance [MTEPS] vs. number of cores — comparing mutex, ticket, and priority locks.)
while (1) {
    #pragma omp parallel
    {
        Process_Current_Level();
        Synchronize();
    }
    MPI_Allreduce(/* ... */ &QueueLength /* ... */);
    if (QueueLength == 0)
        break;
}
Genome Assembly: SWAP-Assembler
• Blocking Send/Recv
• Two threads per process (see the sketch below)
  – One sending
  – The other receiving
• Strong scaling with 1 million reads, each with 36 nucleotides
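A minimal sketch of this two-threads-per-process pattern (illustrative; the names sender, receiver, and exchange_with are mine, not SWAP-Assembler's). Both threads call MPI concurrently, so MPI_THREAD_MULTIPLE is required:

#include <mpi.h>
#include <pthread.h>

/* One thread blocks in MPI_Send while the other blocks in MPI_Recv,
 * concurrently within the same process. */
static void *sender(void *arg)
{
    char buf[64] = "read data";
    int peer = *(int *)arg;
    MPI_Send(buf, sizeof buf, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
    return NULL;
}

static void *receiver(void *arg)
{
    char buf[64];
    int peer = *(int *)arg;
    MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return NULL;
}

/* Launch both threads; 'peer' is the partner rank. */
void exchange_with(int peer)
{
    pthread_t s, r;
    pthread_create(&s, NULL, sender, &peer);
    pthread_create(&r, NULL, receiver, &peer);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
}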
(Figure: strong-scaling results — execution time [s] vs. number of cores, comparing mutex, ticket, and priority locks.)
Summary and Future Directions
• Critical section arbitration plays an important role in communication performance
• By changing the arbitration, substantial improvements were observed
• Further improvement requires a synergy of all the dimensions of thread-safety
  – Smarter arbitration
    • Message-driven, to further reduce resource waste
  – Low-latency hand-off (NUMA-aware synchronization)
  – Reduced serialization through finer-grained critical sections