zvika guz 1, oved itzhak 1, idit keidar 1, avinoam kolodny 1, avi mendelson 2, and uri c. weiser 1...

Zvika Guz1, Oved Itzhak1, Idit Keidar1, Avinoam Kolodny1, Avi Mendelson2, and Uri C. Weiser1

Threads vs. Caches: Modeling the Behavior of Parallel Workloads

1Technion – Israel Institute of Technology, 2Microsoft Corporation

Challenges: Single-core performance trend is gloomy

Exploit chip-multiprocessors with multithreaded applications

The memory gap is paramount Latency, bandwidth, power

Chip-Multiprocessor Era

2[Figure: Hennessy and Patterson, Computer Architecture- A Quantitative approach]

Two basic remedies: Cache – Reduce the number of out-of-die memory accesses Multi-threading – Hide memory accesses behind threads execution

How do they play together? How do we make the most out of them?

The many-core span Cache-Machines ↔ MT-Machines

A high-level analytical model Performance curves study

Few examples

Summary

Outline

Few examples

Summary

Outline

Cache-Machines vs. MT-Machines

# of Threads

Cache/Thread

Thread Context

Cache Architecture

Region

Many-Core – CMP with many, simple cores Tens hundreds of Processing Elements (PEs)

MT Architecture

Region

Intel’s Larrabee

Nvidia’s GT200

Nvidia’s Fermi

Multi-Core

Region

Uni-Processor

Region

What are the basic tradeoffs? How will workloads behave across the range?

Predicting performance

Few examples

Summary

Outline

Use both cache and many threads to shield memory access The uniform framework renders the comparison meaningful We derive simple, parameterized equations for performance, power, BW,..

A Unified Machine Model

To Memory

Threads Architectural States

Cache Machines

Many cores (each may have its private L1) behind a shared cache

To Memory

# Threads

Performance

Cache Non Effective point (CNE)

Memory latency shielded by multiple thread execution

Multi-Thread Machines

To Memory

Threads Architectural States

# Threads

PerformanceMax performance

executionMemory access

Analysis (1/3) Given a ratio of memory access instructions rm (0≤rm≤1)

Every 1/rm instruction accesses memory A thread executes 1/rm instructions

Then stalls for tavg cycles

tavg=Average Memory Access Time (AMAT) [cycles]

Thread Context

t [cycles]

1CPIexerm

PE stays idle unless filled with instructions from other threads Each thread occupies the PE for additional cycles

threads needed to fully utilize each PE

Analysis (2/3)

t [cycles]

1CPIexerm

ld ld ld ld

1CPIexerm

Thread Context

Analysis (3/3) Machine utilization:

Performance in Operations Per Seconds [OPS]:

1min 1, threads

rN tCPI

Number of available threads

[ ]PEexe

fPerformance N OPS

Peak Performance

#Threads needed to utilize a single PE

Thread Context

Performance Model

min , [ ]1 $,

( , ) 1 ( , )

m reg hit threads

ex m hit hit mem

BWPerformance OPS

r b P n

e r P S n e P S n e

threads

min 1 ,Machine Utilization

$ [ ]$, 1 $, hit threads hit threads mavg cyclesAMAT P n tt t P n

PE Utilization

Off-Chip BW

Few examples

Summary

Outline

# Threads

3 regions: Cache efficiency region, The Valley, MT efficiency region

Unified Machine PerformanceP

MT regionThe Valley

Number Of Threads

Performance for Different Cache Sizes (Limited BW)

perfect $

Increase in cache size cache suffices for more in-flight threads Extends the $ region

Increase in cache size

Cache Size Impact

..AND also Valuable in the MT region Caches reduce off-chip bandwidth delay the BW saturation point

Simulation results from the PARSEC workloads kit Swaptions:

Perfect Valley

Hit Rate Function Impact

Swaptions

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Analytical Model

Simulation

Cache Hit Rate

Simulation results from the PARSEC workloads kit Raytrace:

Monotonically-increasing performance

Hit Rate Function Impact

Raytrace

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Analytical Model

Simulation

Cache Hit Rate

Three applications families based on cache miss rate dependency: A “strong” function of number of threads – f(Nq) when q>1 A “weak” function of number of threads - f(Nq) when q≤1 Not a function of number of threads

Threads

Hit Rate Dependency – 3 ClassesP

# Threads

Simulation results from the PARSEC workloads kit Canneal

Not enough parallelism available

Workload Parallelism Impact

Canneal

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Simulation

Analytical Model

Cache Hit Rate

Few examples

Summary

Outline

A high-level model for many-core engines A unified framework for machines and workloads from across the range

A vehicle to derive intuition Qualitative study of the tradeoffs A tool to understand parameters impact Identifies new behaviors and the applications that exhibit them Enables reasoning of complex phenomena

First step towards escaping the valley

Summary

Thank You!zguz@tx.technion.ac.il

Backup

Model Parameters

Parameter Description

NPENumber of PEs (in-order processing elements)

S$Cache size [Bytes]

NmaxMaximal number of thread contexts in the register file

CPIexeAverage number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]

f Processor frequency [Hz]

t$Cache latency [cycles]

tmMemory latency [cycles]

BWmaxMaximal off-chip bandwidth [GB/sec]

bregOperands size [Bytes]

Machine parameters:

Model Parameters

Workload parameters:

n Number of threads that execute or are in ready state (not blocked) concurrently

rmFraction of instructions accessing memory out of the total number of instructions [0≤rm≤1]

Phit(s, n) Cache hit rate for each thread, when n threads are using a cache of size s

Model Parameters

Power parameters:

eexEnergy per operation [j]

e$Energy per cache access [j]

emem Energy per memory access [j]

PowerleakageLeakage power [W]

Parsec Workloads

Model Validation, PARSEC Workloads

Raytrace

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Analytical Model

Simulation

Cache Hit Rate

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of ThreadsP

Analytical Model

Simulation

Cache Hit Rate

Canneal

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Simulation

Analytical Model

Cache Hit Rate

Bodytrack

0 20 40 60 80 100 120 140 160 180 200

Number Of Threads

Analytical Model

Simulation

Cache Hit Rate

Swaptions

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

Analytical Model

Simulation

Cache Hit Rate

Blackscholes

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Number Of Threads

)Analytical Model

Simulation

Cache Hit Rate

Related Work

Similar approach of using high level models: Morad et al., CA-Letters 2005 Hill and Michael, IEEE Computer 2008 Eyerman and Eeckhout, ISCA-2010

Related Work

Agrawal, TPDS-1992

Saavedra-Barrera and Culler, Berkeley 1991

Sorin et al., ISCA-1998

Hong and Kim, ISCA-2009

Baghsorkhi et al., PPoPP-2010

Thread Context

Cache Architecture

Region

MT Architecture

Region

Multi-Core

Region

Uni-Processor

Region

zvika guz 1, oved itzhak 1, idit keidar 1, avinoam kolodny 1, avi mendelson 2, and uri c. weiser 1...

cache machines

threads performance

performance slide

memory latency

memory gap

core cmp

singlecore performance

threads execution

Documents

saron paz zvika markfeld tomer daniel oleg imanilov

1 modeling and optimization of vlsi interconnect 049031...

science 2011 avinoam 589 92

hierarchical functional encryption · 2018. 1. 4. ·...

avinoam danin © flora of the shroud of turin avinoam danin

micro-modem reliability solution for noc communications...

programming with xml written by: adam carmi zvika gutterman

journal of economic psychology - zvika neeman

power and energy conservation techniques for disk array...

redis on nvme ssd - zvika guz, samsung

analia schlosser, zvika neeman, yigal attali · analia...

desert vegetation of israel by prof. avinoam danin

slip-2008, newcastle upon tyne, ukapril 5, 2008 parallel vs....

speex encoder project presented by: gruper leetal kamelo...

by : ido shayevitz and yoav shargil supervisor: zvika guz

presented by : moran katz and zvika pery mentor: moshe...

1 1 avinoam kolodny technion – israel institute of...

january 2009 supply chain shifting from an expense source to...

2001 zvika serper - kurosawa's dreams a cinematic reflection...

crypto laboratory winter 2007-2008 alexander grechin and...