zvika guz 1, oved itzhak 1, idit keidar 1, avinoam kolodny 1, avi mendelson 2, and uri c. weiser 1...

Post on 19-Dec-2015

217 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Zvika Guz1, Oved Itzhak1, Idit Keidar1, Avinoam Kolodny1, Avi Mendelson2, and Uri C. Weiser1

Threads vs. Caches: Modeling the Behavior of Parallel Workloads

1Technion – Israel Institute of Technology, 2Microsoft Corporation

Challenges: Single-core performance trend is gloomy

Exploit chip-multiprocessors with multithreaded applications

The memory gap is paramount Latency, bandwidth, power

2

Chip-Multiprocessor Era

2[Figure: Hennessy and Patterson, Computer Architecture- A Quantitative approach]

Two basic remedies: Cache – Reduce the number of out-of-die memory accesses Multi-threading – Hide memory accesses behind threads execution

How do they play together? How do we make the most out of them?

The many-core span Cache-Machines ↔ MT-Machines

A high-level analytical model Performance curves study

Few examples

Summary

3

Outline

3

The many-core span Cache-Machines ↔ MT-Machines

A high-level analytical model Performance curves study

Few examples

Summary

4

Outline

4

Cache-Machines vs. MT-Machines

# of Threads

Cache/Thread

Thread Context

Cache

Cache Architecture

Region

Many-Core – CMP with many, simple cores Tens hundreds of Processing Elements (PEs)

MT Architecture

Region

Intel’s Larrabee

Nvidia’s GT200

5

Nvidia’s Fermi

Cache

Core

Multi-Core

Region

Uni-Processor

Region

Cache

cccc

What are the basic tradeoffs? How will workloads behave across the range?

Predicting performance

The many-core span Cache-Machines ↔ MT-Machines

A high-level analytical model Performance curves study

Few examples

Summary

6

Outline

6

Use both cache and many threads to shield memory access The uniform framework renders the comparison meaningful We derive simple, parameterized equations for performance, power, BW,..

A Unified Machine Model

7

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

Cache

To Memory

Threads Architectural States

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C

C

C

C

C C

C C

C C

C C

C

C

C

C

Cache Machines

8

C

Many cores (each may have its private L1) behind a shared cache

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C

C

C

C

Cache

To Memory

C

C

C

C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

# Threads

Performance

Cache Non Effective point (CNE)

Memory latency shielded by multiple thread execution

Multi-Thread Machines

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

To Memory

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

C C

Threads Architectural States

Ban

dw

idth

L

imit

atio

ns

# Threads

PerformanceMax performance

executionMemory access

9

Analysis (1/3) Given a ratio of memory access instructions rm (0≤rm≤1)

Every 1/rm instruction accesses memory A thread executes 1/rm instructions

Then stalls for tavg cycles

tavg=Average Memory Access Time (AMAT) [cycles]

10

Cache

Thread Context

t [cycles]

ld

1CPIexerm

avgt

ld

PE stays idle unless filled with instructions from other threads Each thread occupies the PE for additional cycles

threads needed to fully utilize each PE

Analysis (2/3)

t [cycles]

ld

1CPIexerm

avgt

ld ld ld ld

1CPIexerm

1exe

avg

m

CPI

r

t

1CPIexerm

11

Cache

Thread Context

Analysis (3/3) Machine utilization:

Performance in Operations Per Seconds [OPS]:

1min 1, threads

avgm

PEexe

rN tCPI

n

Number of available threads

[ ]PEexe

fPerformance N OPS

CPI

Peak Performance

#Threads needed to utilize a single PE

12

Cache

Thread Context

Performance Model

13

$ $ $

,

min , [ ]1 $,

( , ) 1 ( , )

PEexe

max

m reg hit threads

max

ex m hit hit mem

Power

fN

CPI

BWPerformance OPS

r b P n

e r P S n e P S n e

1 av

threads

mPE

exg

e

n

rN

CPIt

min 1 ,Machine Utilization

$ [ ]$, 1 $, hit threads hit threads mavg cyclesAMAT P n tt t P n

PE Utilization

Off-Chip BW

Power

The many-core span Cache-Machines ↔ MT-Machines

A high-level analytical model Performance curves study

Few examples

Summary

14

Outline

14

15

# Threads

3 regions: Cache efficiency region, The Valley, MT efficiency region

Unified Machine PerformanceP

erfo

rman

ce

Ca

ch

e r

egio

n

MT regionThe Valley

0

100

200

300

400

500

600

700

800

900

1000

1100

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

1000

0

1100

0

1200

0

1300

0

1400

0

1500

0

1600

0

1700

0

1800

0

1900

0

2000

0

GO

PS

Number Of Threads

Performance for Different Cache Sizes (Limited BW)

no $

16M

32M

64M

128M

perfect $

Increase in cache size cache suffices for more in-flight threads Extends the $ region

17

Increase in cache size

Cache Size Impact

..AND also Valuable in the MT region Caches reduce off-chip bandwidth delay the BW saturation point

Simulation results from the PARSEC workloads kit Swaptions:

Perfect Valley

Hit Rate Function Impact

Swaptions

0

20

40

60

80

100

120

140

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Analytical Model

Simulation

Cache Hit Rate

19

Simulation results from the PARSEC workloads kit Raytrace:

Monotonically-increasing performance

Hit Rate Function Impact

Raytrace

0

10

20

30

40

50

60

70

80

90

100

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Analytical Model

Simulation

Cache Hit Rate

20

Three applications families based on cache miss rate dependency: A “strong” function of number of threads – f(Nq) when q>1 A “weak” function of number of threads - f(Nq) when q≤1 Not a function of number of threads

Threads

Per

form

ance

Hit Rate Dependency – 3 ClassesP

erfo

rman

ce

# Threads

21

Simulation results from the PARSEC workloads kit Canneal

Not enough parallelism available

Workload Parallelism Impact

Canneal

0

2

4

6

8

10

12

14

16

18

20

22

24

26

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Simulation

Analytical Model

Cache Hit Rate

22

The many-core span Cache-Machines ↔ MT-Machines

A high-level analytical model Performance curves study

Few examples

Summary

23

Outline

23

A high-level model for many-core engines A unified framework for machines and workloads from across the range

A vehicle to derive intuition Qualitative study of the tradeoffs A tool to understand parameters impact Identifies new behaviors and the applications that exhibit them Enables reasoning of complex phenomena

First step towards escaping the valley

24

Summary

24

Thank You!zguz@tx.technion.ac.il

25

Backup

25

26

Model Parameters

26

27

Model Parameters

27

Parameter Description

NPENumber of PEs (in-order processing elements)

S$Cache size [Bytes]

NmaxMaximal number of thread contexts in the register file

CPIexeAverage number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]

f Processor frequency [Hz]

t$Cache latency [cycles]

tmMemory latency [cycles]

BWmaxMaximal off-chip bandwidth [GB/sec]

bregOperands size [Bytes]

Machine parameters:

28

Model Parameters

28

Workload parameters:

Parameter Description

n Number of threads that execute or are in ready state (not blocked) concurrently

rmFraction of instructions accessing memory out of the total number of instructions [0≤rm≤1]

Phit(s, n) Cache hit rate for each thread, when n threads are using a cache of size s

29

Model Parameters

29

Power parameters:

Parameter Description

eexEnergy per operation [j]

e$Energy per cache access [j]

emem Energy per memory access [j]

PowerleakageLeakage power [W]

30

Parsec Workloads

30

Model Validation, PARSEC Workloads

Raytrace

0

10

20

30

40

50

60

70

80

90

100

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Analytical Model

Simulation

Cache Hit Rate

Dedup

0

10

20

30

40

50

60

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of ThreadsP

erf

orm

an

ce

(G

OP

S)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Analytical Model

Simulation

Cache Hit Rate

Canneal

0

2

4

6

8

10

12

14

16

18

20

22

24

26

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Simulation

Analytical Model

Cache Hit Rate

Bodytrack

0

1

2

3

4

5

6

7

8

9

10

0 20 40 60 80 100 120 140 160 180 200

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Analytical Model

Simulation

Cache Hit Rate

Swaptions

0

20

40

60

80

100

120

140

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)

Analytical Model

Simulation

Cache Hit Rate

Blackscholes

0

20

40

60

80

100

120

140

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Pe

rfo

rma

nc

e (

GO

PS

)

0

10

20

30

40

50

60

70

80

90

100

Ca

ch

e H

it R

ate

(%

)Analytical Model

Simulation

Cache Hit Rate

Related Work

32

Similar approach of using high level models: Morad et al., CA-Letters 2005 Hill and Michael, IEEE Computer 2008 Eyerman and Eeckhout, ISCA-2010

Related Work

33

Agrawal, TPDS-1992

Saavedra-Barrera and Culler, Berkeley 1991

Sorin et al., ISCA-1998

Hong and Kim, ISCA-2009

Baghsorkhi et al., PPoPP-2010

Thread Context

Cache

Cache Architecture

Region

MT Architecture

Region

Cache

Core

Multi-Core

Region

Uni-Processor

Region

Cache

cccc

top related