@spcl eth evaluating the cost of atomic operations on ...spcl.inf.ethz.ch @spcl_eth atomics:...

Post on 01-Jun-2020

9 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

spcl.inf.ethz.ch

@spcl_eth

MACIEJ BESTA, HERMANN SCHWEIZER, TORSTEN HOEFLER

Evaluating the Cost of Atomic Operations

on Modern Architectures

spcl.inf.ethz.ch

@spcl_eth

LARGE-SCALE IRREGULAR GRAPH PROCESSING

spcl.inf.ethz.ch

@spcl_eth

LARGE-SCALE IRREGULAR GRAPH PROCESSING

spcl.inf.ethz.ch

@spcl_eth

LARGE-SCALE IRREGULAR GRAPH PROCESSING

spcl.inf.ethz.ch

@spcl_eth

LARGE-SCALE IRREGULAR GRAPH PROCESSING

spcl.inf.ethz.ch

@spcl_eth

LARGE-SCALE IRREGULAR GRAPH PROCESSING

[1] A. Lumsdaine et al. Challenges in Parallel Graph Processing. Parallel Processing Letters. 2007.

spcl.inf.ethz.ch

@spcl_eth

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

Hardware

PACT’15

HPDC’15

HPDC’14

HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

Hardware

Topologies

PACT’15

SC14

HPDC’15

HPDC’14

HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

Hardware

Topologies

OS, middleware

PACT’15

SC14

ICS’15 HPDC’15

HPDC’14

HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

Hardware

Topologies

OS, middleware

Algorithms

PACT’15

SC14

ICS’15 HPDC’15

HPDC’14

SC13

HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

Hardware

Topologies

OS, middleware

Algorithms

Programming models

PACT’15

SC14

ICS’15 HPDC’15

HPDC’14

SC13

HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

Hardware

Topologies

OS, middleware

Algorithms

Programming models

PACT’15

SC14

ICS’15 HPDC’15

HPDC’14

SC13

HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

spcl.inf.ethz.ch

@spcl_eth

Com

puting a

bstr

actions

Hardware

Topologies

OS, middleware

Algorithms

Programming models

PACT’15

SC14

ICS’15 HPDC’15

HPDC’14

SC13

HPDC’16

A BRIEF SUMMARY OF RESEARCH I DO IN HPC

Most of these layers require

efficient synchronization

spcl.inf.ethz.ch

@spcl_eth

SYNCHRONIZATION MECHANISMS

LOCKS

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

LOCKS

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

LOCKS

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

LOCKS

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

LOCKS

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

LOCKS

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

LOCKS

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Intuitive

semantics

SYNCHRONIZATION MECHANISMS

LOCKS

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Intuitive

semantics

SYNCHRONIZATION MECHANISMS

LOCKS

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Intuitive

semantics

SYNCHRONIZATION MECHANISMS

LOCKS

Serialization

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Possibly complex

protocols

Intuitive

semantics

SYNCHRONIZATION MECHANISMS

LOCKS

Serialization

An example

graph

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Possibly complex

protocols

Intuitive

semantics

SYNCHRONIZATION MECHANISMS

LOCKS

Serialization

An example

graph

High performance

distributed locks?

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access

patterns

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

High

performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access

patterns

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

High

performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access

patterns

Very common, truly

hardware mechanizm

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

High

performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access

patterns

Very common, truly

hardware mechanizm

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Complex

protocols

High

performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Complex access

patterns

Very common, truly

hardware mechanizm

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Complex

protocols

High

performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Subtle

issues (ABA

problem, ...)Complex access

patterns

Very common, truly

hardware mechanizm

spcl.inf.ethz.ch

@spcl_eth

Proc qProc p

Complex

protocols

High

performance

SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

Subtle

issues (ABA

problem, ...)Complex access

patterns

Do we really

understand their

performance?

Very common, truly

hardware mechanizm

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: POPULARITY

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: POPULARITY

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: POPULARITY

[PPoPP’14] Fast

Concurrent Lock-Free

Binary Search Trees

[PPoPP’14] A General

Technique for Non-

blocking Trees

[PPoPP’14] Practical

Concurrent Binary Search

Trees via Logical Ordering

[PPoPP’14] A Practical

Wait-Free Simulation for

Lock-Free Data Structures

[PPoPP’15]

[PPoPP’15]

[SPAA’15]

[SPAA’16]

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: POPULARITY

[PPoPP’14] Fast

Concurrent Lock-Free

Binary Search Trees

[PPoPP’14] A General

Technique for Non-

blocking Trees

[PPoPP’14] Practical

Concurrent Binary Search

Trees via Logical Ordering

[PPoPP’14] A Practical

Wait-Free Simulation for

Lock-Free Data Structures

[PPoPP’15]

[PPoPP’15]

[SPAA’15]

[SPAA’16]Used in so many designs...

But do we really know their

raw performance?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: POPULARITY

[PPoPP’14] Fast

Concurrent Lock-Free

Binary Search Trees

[PPoPP’14] A General

Technique for Non-

blocking Trees

[PPoPP’14] Practical

Concurrent Binary Search

Trees via Logical Ordering

[PPoPP’14] A Practical

Wait-Free Simulation for

Lock-Free Data Structures

[PPoPP’15]

[PPoPP’15]

[SPAA’15]

[SPAA’16]Used in so many designs...

But do we really know their

raw performance?

Raw performance... Of what?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache level?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache level?

Locality?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

Performance

metrics?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

Performance

metrics?

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

Contention?

Performance

metrics?

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Atomic?

Contention?

Performance

metrics?

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Atomic?

Contention?

Operand

size?

Performance

metrics?

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Atomic?

Contention?

Operand

size?

Performance

metrics?

Alignment?

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Atomic?

Contention?

Operand

size?

Performance

metrics?

Alignment?

Mechanisms?

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

Compare-and-

Swap (CAS)

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

Compare-and-

Swap (CAS)

Fetch-and-Add

(FAA)

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Atomic?

Compare-and-

Swap (CAS)

Fetch-and-Add

(FAA)

Swap (SWP)

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Performance

metrics?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Performance

metrics?

Latency

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Performance

metrics?

Bandwidth

Latency

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Modified: in one

cache and dirty

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Modified: in one

cache and dirty

Exclusive: in one

cache and clean

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Modified: in one

cache and dirty

Exclusive: in one

cache and clean

Shared: in >1

cache and clean

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Cache

coherence

state?

Modified: in one

cache and dirty

Exclusive: in one

cache and clean

Shared: in >1

cache and clean

Invalid: garbage

data

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Architecture

spcl.inf.ethz.ch

@spcl_eth

ATOMICS: PERFORMANCE DIMENSIONS

Architecture

spcl.inf.ethz.ch

@spcl_eth

RESEARCH QUESTIONS

spcl.inf.ethz.ch

@spcl_eth

How do we model the

performance of

atomics?

RESEARCH QUESTIONS

spcl.inf.ethz.ch

@spcl_eth

What is the

performance

difference between

various atomics?

How do we model the

performance of

atomics?

RESEARCH QUESTIONS

spcl.inf.ethz.ch

@spcl_eth

What is the

performance

difference between

various atomics?

What is the influence

of various parameters

and mechanisms?

How do we model the

performance of

atomics?

RESEARCH QUESTIONS

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

LATENCY MODEL

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

LATENCY MODEL

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Read for ownership

LATENCY MODEL

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Read for ownership

LATENCY MODEL

Cache

coherence state

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Read for ownership

= max(read latency, invalidation latency)

LATENCY MODEL

Cache

coherence state

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

= max(read latency, invalidation latency)

LATENCY MODEL

Cache

coherence state

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

= max(read latency, invalidation latency)

LATENCY MODEL

Cache

coherence state

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

= max(read latency, invalidation latency)

= constant

LATENCY MODEL

Cache

coherence state

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

= max(read latency, invalidation latency)

= constant

LATENCY MODEL

Cache

coherence state

Atomic

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

= max(read latency, invalidation latency)

= constant

LATENCY MODEL

Cache

coherence state

Atomic

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

Atomic

= max(read latency, invalidation latency)

= constant

LATENCY MODEL

Cache

coherence state

Atomic

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

AtomicCache

coherence state

= max(read latency, invalidation latency)

= constant

LATENCY MODEL

Cache

coherence state

Atomic

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

AtomicCache

coherence state

= constant

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

= max(read latency, invalidation latency)

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

AtomicCache

coherence state

= read latency

= constant

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

AtomicCache

coherence state

= read latency

= constant

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

predictions observed

data

mean of

observed data

spcl.inf.ethz.ch

@spcl_eth

LATENCY

HASWELL, EXCLUSIVE

spcl.inf.ethz.ch

@spcl_eth

CAS FAA

LATENCY

BULLDOZER, EXCLUSIVE

spcl.inf.ethz.ch

@spcl_eth

LATENCY

HASWELL, EXCLUSIVE Alignment?

spcl.inf.ethz.ch

@spcl_eth

64 bit 128 bit

LATENCY

BULLDOZER, EXCLUSIVE Operand

size?

spcl.inf.ethz.ch

@spcl_eth

BANDWIDTH

HASWELL, ATOMICS

spcl.inf.ethz.ch

@spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

spcl.inf.ethz.ch

@spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of

different atomics in most

scenarios

spcl.inf.ethz.ch

@spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of

different atomics in most

scenarios

CAS is the fastest for

some cases

spcl.inf.ethz.ch

@spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of

different atomics in most

scenarios

CAS is the fastest for

some cases

Unaligned atomics

should be avoided at all

costs

spcl.inf.ethz.ch

@spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of

different atomics in most

scenarios

CAS is the fastest for

some cases

Unaligned atomics

should be avoided at all

costs

No parallel execution

(low bandwidth) even if

there are no data deps

spcl.inf.ethz.ch

@spcl_eth

CONCLUSIONS

PERFORMANCE INSIGHTS

The same latency of

different atomics in most

scenarios

CAS is the fastest for

some cases

Small operand sizes

give best performance

Unaligned atomics

should be avoided at all

costs

No parallel execution

(low bandwidth) even if

there are no data deps

spcl.inf.ethz.ch

@spcl_eth

Atomics Read

LATENCY

HASWELL, EXCLUSIVE

spcl.inf.ethz.ch

@spcl_eth

Atomics Read

LATENCY

HASWELL, MODIFIED

spcl.inf.ethz.ch

@spcl_eth

LATENCY

IVY BRIDGE, EXCLUSIVE

CAS

spcl.inf.ethz.ch

@spcl_eth

LATENCY

IVY BRIDGE, EXCLUSIVE

CAS

spcl.inf.ethz.ch

@spcl_eth

LATENCY

XEON PHI, MODIFIED / EXCLUSIVE

CAS

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

AtomicCache

coherence state

= constant

LATENCY MODEL

EXCLUSIVE OR MODIFIED STATE

= max(read latency, invalidation latency)

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

AtomicCache

coherence state

= constant

LATENCY MODEL

SHARED STATE

= max(read latency, invalidation latency)

spcl.inf.ethz.ch

@spcl_eth

Core

Cache Cache Cache

…Cache line

Cache line Read for ownership

Execute

AtomicCache

coherence state

= invalidation latency

= constant

LATENCY MODEL

SHARED STATE

spcl.inf.ethz.ch

@spcl_eth

Atomics Read

LATENCY

HASWELL, SHARED

spcl.inf.ethz.ch

@spcl_eth

How to force cache coherence state

F(M): Write cache line (invalidates all copies)

F(E): F(M) flush read

F(S): F(E) read by some other core

D. Hackenberg, D. Molka, W. Nagel. Comparing Cache Architectures and Coherency Protocols on x86-64

Multicore SMP Systems. MICRO’09

top related