MPI+Threads: Runtime Contention and Remedies
Abdelhalim Amer*, Huiwei Lu+, Yanjie Wei#, Pavan Balaji+, Satoshi Matsuoka*
*Tokyo Institute of Technology
+Argonne National Laboratory
#Shenzhen Institute of Advanced Technologies, Chinese Academy of Sciences
PPoPP’15, February 7–11, 2015, San Francisco, CA, USA.
The Message Passing Interface (MPI)
• Standard library specification (not a language)
• Several implementations
  – MPICH and derivatives: MVAPICH, Intel-MPI, Cray-MPI, …
  – OpenMPI
• A large portion of legacy HPC applications use MPI
• Not just message passing:
  – Remote Memory Access (RMA)
Why MPI + X?
• Core density is increasing
• Other resources do not scale at the same rate
  – Memory per core is shrinking
  – Network endpoints
• Sharing resources within nodes is becoming necessary
• X: shared-memory programming
  – Threads: OpenMP, TBB, …
  – MPI shared memory!
  – PGAS
(Figure: evolution of the memory capacity per core in the Top500 list [1].)
[1] Peter Kogge. PIM & memory: The need for a revolution in architecture. The Argonne Training Program on Extreme-Scale Computing (ATPESC), 2013.
MPI+Threads Interoperation
MPI_Init_thread(…, required, …)
• MPI_THREAD_SINGLE
  – No additional threads
• MPI_THREAD_FUNNELED
  – Only the master thread communicates
• MPI_THREAD_SERIALIZED
  – Multithreaded communication, serialized
• MPI_THREAD_MULTIPLE
  – No restrictions
The levels range from most restrictive (low thread-safety costs) to most flexible (high thread-safety costs).
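For illustration, a minimal C sketch of requesting the most flexible level and checking what the library actually grants (standard MPI API, nothing implementation-specific):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request full multithreading support; the library returns the
       highest level it can actually provide in 'provided'. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
    /* ... communicate from multiple threads only if granted ... */
    MPI_Finalize();
    return 0;
}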
Test Environment

Fusion cluster at Argonne National Laboratory:

  Architecture        Nehalem
  Processor           Xeon E5540
  Clock frequency     2.6 GHz
  Number of sockets   2
  Cores per socket    4
  L3 size             8192 KB
  L2 size             256 KB
  Number of nodes     310
  Interconnect        Mellanox QDR
  MPI library         MPICH
  Network module      Nemesis:MXM
Contention in Multithreaded Communication

(Figure: message rate [10^3 msgs/s] vs. message size [bytes]. Left: multithreaded point-to-point BW between two processes, P0 ↔ P1, with 1, 2, 4, and 8 threads per node. Right: multi-process point-to-point BW, P0–P3 ↔ P4–P7, with 1, 2, 4, and 8 processes per node.)
Dimensions of Thread-Safety
• Critical section granularity
  – Shorter is better, but more complex
• Synchronization mechanism
  – How to hand off to the next thread?
    • Atomic ops, memory barriers, system calls, NUMA-awareness
  – Arbitration: who enters the CS?
    • Fairness: random, FIFO, priority
(Diagram: threads entering a critical section, characterized by its length, hand-off mechanism, and arbitration.)
Reducing Contention by Refining Critical Section Granularity
Balaji, Pavan, et al. "Fine-grained multithreading support for hybrid threaded MPI programming." International Journal of High Performance Computing Applications 24.1 (2010): 49–57.
Thread-Safety in MPICH
• GCS: global CS only
• POCS: per-object CS supported
• Supports a 1:1 threading model: only sees kernel threads
(Diagram: the MPICH stack — MPI/MPID headers and thread layer above the CH3, PAMID, MRail, and PSM devices; CH3 includes the Nemesis channel (IB, MXM, …) and the Sock channel (TCP). The current work targets Nemesis (GCS); MVAPICH's MRail uses a GCS; Blue Gene's PAMID supports a POCS.)
Baseline Thread-Safety in MPICH/Nemesis: Pthread Mutex
• Global critical section
• Implementation: NPTL pthread mutex
  – CAS in user space
  – Futex wait/wake in contended cases
  – Arbitration: fastest thread first → possible unfairness
(Diagram: pthread_mutex_lock. The fast path is a user-space CAS; under contention, threads sleep in the kernel via FUTEX_WAIT, are woken by FUTEX_WAKE, and retry the CAS before entering the critical section.)
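To make that lock/unlock path concrete, here is a minimal futex-based mutex in the NPTL style described above (an illustrative sketch, not MPICH's or glibc's actual code; state encoding: 0 = free, 1 = locked, 2 = locked with possible waiters):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_state = 0;

static void lock(void)
{
    int expected = 0;
    /* Fast path: uncontended CAS entirely in user space. */
    if (atomic_compare_exchange_strong(&lock_state, &expected, 1))
        return;
    /* Slow path: mark the lock contended and sleep in the kernel
       until woken; whoever wins the exchange first gets the lock
       ("fastest thread first"). */
    while (atomic_exchange(&lock_state, 2) != 0)
        syscall(SYS_futex, &lock_state, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void unlock(void)
{
    /* Wake one sleeper only if someone may be waiting. */
    if (atomic_exchange(&lock_state, 0) == 2)
        syscall(SYS_futex, &lock_state, FUTEX_WAKE, 1, NULL, NULL, 0);
}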
Unfairness May Occur!
(Diagram: threads T0–T3 on four cores issuing CASes on a shared mutex. With a flat memory, access should be random; with a hierarchical memory (private L1s, shared L2s), access is biased by the proximity to the cache containing the mutex.)
Fairness Analysis
• Bandwidth benchmark
• Unfairness levels
  – Core level: a single thread is monopolizing the lock
  – Socket level: threads on the same socket are monopolizing the lock
• Bias factor
  – How much a fair arbitration is biased
  – Bias factor = 1 → fair arbitration
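The slides do not spell out how the bias factor is computed; one natural reading (my assumption, not necessarily the paper's exact definition) is the share of lock acquisitions observed for the dominant thread (or socket), normalized by its fair share:

\text{bias factor} \;=\; \frac{\text{observed acquisitions by the dominant thread (or socket)}}{\text{expected acquisitions under fair arbitration}}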
(Figure: fairness analysis of the BW benchmark with 8 threads — bias factor vs. message size [bytes] at the core and socket levels.)
Internals of an MPI Runtime and Mutex
(Diagram: the threads' resource-acquisition sequence plotted against the communication progress engine's work-availability sequence over time. With a mutex, threads frequently acquire the runtime when no work is available to them, paying a hand-off penalty: resource acquisitions are wasted.)
Consequences of Unfair Arbitration
• DR: dangling requests
  – Completed but not yet freed
  – We want to keep this number low
• With the mutex, the count reaches about 40% of the maximum (figure below)
(Figure: average number of dangling requests vs. message size [bytes].)
Simple Solution: Force FIFO
• Ticket lock
  – Busy waiting
  – FIFO arbitration
(Diagram: resource-acquisition timelines under the mutex vs. the ticket lock — fairness (FIFO) reduces wasted resource acquisitions and shrinks the penalty.)
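A minimal ticket lock in C11 atomics (an illustrative sketch, not MPICH's actual code; the names ticket_lock_t, ticket_acquire, and ticket_release are mine). Each thread takes a ticket with a fetch-and-add and spins until served, so hand-off order is exactly arrival order:

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* ticket dispenser */
    atomic_uint now_serving;  /* ticket currently allowed in */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l)
{
    /* Take a ticket; threads are served strictly in ticket order. */
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me)
        ;  /* busy wait */
}

static void ticket_release(ticket_lock_t *l)
{
    /* Hand off to the next ticket holder. */
    atomic_fetch_add(&l->now_serving, 1);
}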
(Figure: average number of dangling requests vs. message size [bytes], mutex vs. ticket lock.)
Preliminary Throughput Results
(Figures, 8 cores/node: message rate [10^3 msgs/s] for mutex vs. ticket lock — vs. number of threads per node under compact and scatter bindings, and vs. message size [bytes].)
Can we do better?
• Critical section constraints
  – Threads have to yield when blocking in the progress engine
  – To respect MPI progress semantics
• Observations
  – Most MPI calls do useful work the first time they enter the runtime
  – A thread starts polling if its operation is not completed
(Diagram: simplified execution flow of a thread-safe MPI implementation with critical sections; a sketch of the poll-and-yield loop follows.)
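A simplified sketch of that flow (the names cs_enter, cs_exit, progress_poll, and wait_for_completion are mine, for illustration): a blocking call polls the progress engine inside the critical section and yields between attempts so other threads can enter.

#include <sched.h>

extern void cs_enter(void);            /* global critical section */
extern void cs_exit(void);
extern int  progress_poll(void *req);  /* nonzero once 'req' completes */

void wait_for_completion(void *req)
{
    cs_enter();
    while (!progress_poll(req)) {  /* no useful work completed this pass */
        cs_exit();                 /* let another thread drive progress */
        sched_yield();
        cs_enter();                /* re-enter and poll again */
    }
    cs_exit();
}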
Can we do better? (cont.)
• Idea:
  – Two priority levels: high and low
  – All threads start with high priority (1)
  – A thread falls to low priority (2) if its operation is
    • Blocking
    • Failed to complete immediately
• 3 ticket locks:
  – One for mutual exclusion in each priority level
  – Another for high-priority threads to block lower ones
(Diagram: simplified execution flow of a thread-safe MPI implementation with critical sections; one possible lock construction follows.)
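One way such a priority lock could be assembled from three ticket locks plus a waiter count (a plausible sketch of the scheme the slide outlines, building on the ticket_lock_t above; the authors' actual construction may differ, and the high_count counter is my addition):

#include <stdatomic.h>

/* 'block' provides overall mutual exclusion; 'high' and 'low'
 * serialize threads within each priority level; low-priority threads
 * stand aside while any high-priority thread is present. */
ticket_lock_t high, low, block;
atomic_int high_count = 0;

void cs_enter_high(void)
{
    atomic_fetch_add(&high_count, 1);
    ticket_acquire(&high);   /* exclusion among high-priority threads */
    ticket_acquire(&block);  /* overall exclusion */
}
void cs_exit_high(void)
{
    ticket_release(&block);
    ticket_release(&high);
    atomic_fetch_sub(&high_count, 1);
}

void cs_enter_low(void)
{
    ticket_acquire(&low);    /* exclusion among low-priority threads */
    while (atomic_load(&high_count) > 0)
        ;                    /* let high-priority threads go first */
    ticket_acquire(&block);
}
void cs_exit_low(void)
{
    ticket_release(&block);
    ticket_release(&low);
}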
Preliminary Throughput Results: N2N Benchmark
(Figure: message rate [10^3 msgs/s] vs. message size [bytes], ticket vs. priority lock.)
Evaluation
Two-Sided Pt2Pt with 32 cores
(Figures: latency [µs] and message rate [10^3 msgs/s] vs. message size [bytes] for the single-threaded baseline and the mutex, ticket, and priority locks; the throughput plot annotates an ~8x gap.)
ARMCI-MPI + Async. Progress
(Diagram: a process P issues MPI_Put() to a target whose dedicated progress thread drives asynchronous completion; Put, Get, and Accumulate are benchmarked.)
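For context, a minimal sketch of one-sided communication with the standard MPI-3 RMA calls that ARMCI-MPI layers the ARMCI interface over (the function put_example is mine; the benchmark itself is not shown here):

#include <mpi.h>

/* Minimal MPI-3 RMA sketch: each rank exposes a window and rank 0
 * writes one double into rank 1's memory with MPI_Put.
 * Requires at least two ranks. */
void put_example(void)
{
    double buf = 3.14, win_mem = 0.0;
    MPI_Win win;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&win_mem, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open the access epoch */
    if (rank == 0)
        MPI_Put(&buf, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* displacement */, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);             /* complete the Put */

    MPI_Win_free(&win);
}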
(Figures: data transfer rate [10^3 elements/s] vs. data element size [bytes] for Put, Get, and Accumulate, comparing mutex, ticket, and priority locks.)
3D 7-Pt Stencil
Strong Scaling with 64 Nodes
(Figure, left: performance [GFlops] vs. problem size per core [bytes], comparing mutex, ticket, and priority locks. Figure, right: execution breakdown — percentage of time spent in MPI, computation, and OpenMP synchronization vs. problem size per core [bytes]. Diagram: domain decomposition.)
MPI+OpenMP Graph500 BFS
(Figures: performance [MTEPS] vs. number of threads per node on 16 nodes with compact binding, and weak scaling — performance [MTEPS] vs. number of cores — comparing mutex, ticket, and priority locks.)
while (1) {
    #pragma omp parallel
    {
        Process_Current_Level();
        Synchronize();
    }
    MPI_Allreduce(/* ... */ &QueueLength /* ... */);
    if (QueueLength == 0)
        break;
}
Genome Assembly: SWAP-Assembler
• Blocking Send/Recv
• Two threads per process (see the sketch below)
  – One sending
  – The other receiving
• Strong scaling with 1 million reads, each with 36 nucleotides
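A minimal sketch of this two-threads-per-process pattern (illustrative; the names sender, receiver, and exchange_with are mine, not SWAP-Assembler's). Both threads call MPI concurrently, so MPI_THREAD_MULTIPLE is required:

#include <mpi.h>
#include <pthread.h>

/* One thread blocks in MPI_Send while the other blocks in MPI_Recv,
 * concurrently within the same process. */
static void *sender(void *arg)
{
    char buf[64] = "read data";
    int peer = *(int *)arg;
    MPI_Send(buf, sizeof buf, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
    return NULL;
}

static void *receiver(void *arg)
{
    char buf[64];
    int peer = *(int *)arg;
    MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return NULL;
}

/* Launch both threads; 'peer' is the partner rank. */
void exchange_with(int peer)
{
    pthread_t s, r;
    pthread_create(&s, NULL, sender, &peer);
    pthread_create(&r, NULL, receiver, &peer);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
}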
(Figure: strong-scaling results — execution time [s] vs. number of cores, comparing mutex, ticket, and priority locks.)
Summary and Future Directions
• Critical section arbitration plays an important role in communication performance
• By changing the arbitration, substantial improvements were observed
• Further improvement requires a synergy of all the dimensions of thread-safety
  – Smarter arbitration
    • Message-driven, to further reduce resource waste
  – Low-latency hand-off (NUMA-aware synchronization)
  – Reduced serialization through finer-grained critical sections