
Stream-based Memory Specialization for General Purpose Processors

Zhengrong Wang

Tony Nowatzki


Computation & Memory Specialization

ISA abstraction for certain computation patterns.

New ISA abstraction for memory access patterns?

[Figure: computation structure (SIMD, compound instruction) next to memory structure (stream). A stream b[i] yields b[0], b[1], b[2]; a dependent stream a[b[i]] yields a[b[0]], a[b[1]], a[b[2]].]


Conventional Memory Abstraction

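[Figure: timeline across the O3 core, L1 cache, and L2 cache. Each loop iteration executes addr, load, if, add, br. Every load sends its address (Addr.) to the caches, misses in L1, hits in L2, and the value (Val.) returns through the response (Resp.) path.]

while (i < N) {
  if (cond)
    v += a[i];
  i++;
}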


Opportunity 1: Prefetch with Ctrl. Flow

Overhead 1: Hard to prefetch with control flow.

Opportunity 1: Prefetch with control flow.

[Figure: timeline across the O3 core, L1 cache, and L2 cache. Before the loop, cfg configures the stream engine (SE). The SE prefetches ahead of the core (miss in L1, hit in L2), so the loads issued by the core later hit in the L1 cache.]

cfg(a[i]);
while (i < N) {
  if (cond)
    v += a[i];
  i++;
}


Opportunity 2: Semi-Binding Prefetch

Opportunity 1: Prefetch with control flow.

Overhead 2: Redundant address computation.

Opportunity 2: Semi-binding prefetch.

[Figure: timeline across the O3 core, L1 cache, and L2 cache. Before the loop, cfg configures the stream engine (SE), which prefetches the data. The prefetched data is held in a FIFO, which makes the per-iteration addr and load instructions redundant.]

s_a = cfg();
while (i < N) {
  if (cond)
    v += s_a;
  i++;
}


Opportunity 3: Stream-Aware Policies

Opportunity 1: Prefetch with control flow.

Overhead 2: Repeated address computation/loads.

Opportunity 2: Semi-binding prefetch.

Overhead 3: Assumption on reuse causes extra accesses/conflicts.

Opportunity 3: Better policies, e.g. bypass a cache level if there is no locality.

[Figure: timeline across the O3 core, L1 cache, and L2 cache. Before the loop, cfg configures the stream engine (SE), which prefetches into the FIFO; the requests are shown with a miss at one cache level and a hit at the next.]

s_a = cfg();
while (i < N) {
  if (cond)
    v += s_a;
  i++;
}


Stream: A New ISA Memory Abstraction

• Potential Benefits of Stream Abstractions
– Energy-efficient prefetching.
– Offload memory from the expensive OOO pipeline.
– Leverage stream information in cache policies.

• This Work
– Analyze programs to determine the requirements of the ISA.
– Develop a hardware-neutral ISA extension: decoupled-stream ISA.
– Develop micro-architecture extensions for an OOO processor.

• Evaluation
– 60% of memory accesses → streams.
– 1.37x speedup over a traditional O3 processor.


Outline

• Stream Characteristics.

• Stream ISA Extension.

• Stream-Aware Policies.

• Microarchitecture Extension.

• Evaluation.



Stream Definition

• Stream: a decoupled portion of a program that:
– Accesses memory at most once.
– Has no internal control flow.
– Has no dependences from the main program to the stream besides configuration and termination.

Main Program:

for (int i = 0; i < M; ++i) {
  for (int j = 0; j < N; ++j) {
    … = a[i][j];
  }
}

Stream: a[i][j]
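The a[i][j] stream can be described entirely by parameters fixed at configuration time, which is what allows it to run decoupled from the main program. A minimal C sketch of that idea, assuming a row-major array; `affine_stream_t` and `affine_stream_addr` are hypothetical names for illustration and are not part of the proposed ISA:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a two-level affine stream such as a[i][j]. */
typedef struct {
    uintptr_t base;        /* address of a[0][0]                   */
    size_t    elem_size;   /* size of one element in bytes         */
    size_t    inner_trip;  /* N: iterations of the inner (j) loop  */
    size_t    outer_trip;  /* M: iterations of the outer (i) loop  */
    size_t    row_stride;  /* bytes between a[i][0] and a[i+1][0]  */
} affine_stream_t;

/* Address of the k-th element the stream produces (k counts steps).
 * It depends only on the configuration, never on main-program values. */
static uintptr_t affine_stream_addr(const affine_stream_t *s, size_t k)
{
    assert(s->inner_trip > 0);
    assert(k < s->inner_trip * s->outer_trip);
    size_t i = k / s->inner_trip;   /* outer index */
    size_t j = k % s->inner_trip;   /* inner index */
    return s->base + i * s->row_stride + j * s->elem_size;
}
```

With a 2x3 array of 4-byte elements at 0x400 and a 12-byte row stride, steps 0, 1, and 3 map to 0x400, 0x404, and 0x40c.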


Stream Characteristics – Stream Type

Trace analysis on CortexSuite/SPEC CPU 2017.

• 51.49% affine, 10.19% indirect.
• Indirect streams can be as high as 40%.

[Figure: stacked bar chart (0% to 100%) of memory accesses by category: Affine, Indirect, PC, Unqualified, Outside.]

Support indirect streams.


Stream Characteristics – Stream Length

• 51% of stream accesses come from streams longer than 1k.
• Some benchmarks contain short streams.

[Figure: stacked bar chart (0% to 100%) of stream accesses by stream length (>1k, >100, >50, >0) for pca, rbm, disparity, lbm_s, sphinx, srr, svm, xz_s, and the average.]

• Support longer streams to capture long-term behavior.
• Keep overhead low to support short streams.


Stream Characteristics – Control Flow

• 53% of stream accesses come from loops with control flow.

[Figure: stacked bar chart (0% to 100%) of stream accesses by the number of execution paths within the loop (>3, 3, 2, 1).]

Decouple from control flow.



Decoupled Stream ISA Components

• Setup/Teardown
– stream_config(s_i): sets up the stream parameters.
– stream_end(s_i): deallocates the stream.

• Pseudo Register
– Register mapped to stream data.
– Allows an offset to access different elements of the stream.

• Supporting Control Flow
– Decouple stream advance from the use of the pseudo-register.
– step(s_i): advances the position of a stream by one iteration.
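The three components can be mimicked in plain software to see why separating "read" from "step" matters under control flow. The following is only a sketch: `stream_t`, `stream_read`, and the function signatures are assumptions made for illustration (the slides name the operations but do not define a C interface), and a real implementation would be ISA instructions backed by a stream engine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software stand-in for one affine load stream. */
typedef struct {
    const int *base;   /* first element                     */
    size_t     len;    /* number of elements configured     */
    size_t     iter;   /* current position of the stream    */
    int        live;   /* nonzero between config and end    */
} stream_t;

static void stream_config(stream_t *s, const int *base, size_t len)
{
    s->base = base; s->len = len; s->iter = 0; s->live = 1;
}

/* Read through the "pseudo-register": current element plus an offset.
 * Reading does not advance the stream. */
static int stream_read(const stream_t *s, size_t offset)
{
    assert(s->live && s->iter + offset < s->len);
    return s->base[s->iter + offset];
}

/* Advance by one iteration, independently of any read. */
static void stream_step(stream_t *s)
{
    assert(s->live);
    s->iter++;
}

static void stream_end(stream_t *s) { s->live = 0; }

/* Conditional sum: the element is consumed only when keep[i] is set,
 * but the stream is stepped on every iteration. */
static int conditional_sum(const int *a, const int *keep, size_t n)
{
    stream_t s_a;
    int v = 0;
    stream_config(&s_a, a, n);
    for (size_t i = 0; i < n; i++) {
        if (keep[i])
            v += stream_read(&s_a, 0);
        stream_step(&s_a);
    }
    stream_end(&s_a);
    return v;
}
```

Because the step is unconditional while the read is conditional, the stream position stays in sync with the loop even when an element is never used.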


Stream ISA Extension – Basic Example

Original C Code:

int i = 0;
while (i < N) {
  sum += a[i];
  i++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a);

Stream Dependence Graph:

[Figure: the induction variable stream s_i (i++) feeds the memory stream s_a (a[i]). The pseudo-register for s_a exposes the elements at memory 0x400, 0x404, and 0x408 for iterations 0, 1, and 2; each user step moves it to the next element.]


Stream ISA Extension – Control Flow

Original C Code:

int i = 0, j = 0;
while (cond) {
  if (a[i] < b[j])
    i++;
  else
    j++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a, s_j, s_b);
while (cond) {
  if (s_a < s_b)
    stream_step(s_i);
  else
    stream_step(s_j);
}
stream_end(s_i, s_a, s_j, s_b);

Stream Dependence Graph:

[Figure: s_i feeds s_a and s_j feeds s_b. For the stream a[i], the pseudo-register s_a exposes the elements at memory 0x400, 0x404, and 0x408 for iterations 0, 1, and 2, advancing only on a user step (i++).]


Stream ISA Extension – Indirect Stream

Original C Code:

int i = 0;
while (i < N) {
  sum += a[b[i]];
  i++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a, s_b);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a, s_b);

Stream Dependence Graph:

[Figure: s_i feeds s_b, and s_b feeds s_a. The stream b[i] reads memory 0x400, 0x404, and 0x408 for iterations 0, 1, and 2; its values select the addresses of a[b[i]] (memory 0x86c, 0x888, 0x668). Each stream has its own pseudo-register, and a user step (i++) advances both.]
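The dependence chain s_i → s_b → s_a can be modeled in software as an engine that runs ahead of the core and fills a small FIFO. This is a sketch under stated assumptions: `indirect_stream_t`, `engine_fill`, `core_sum`, and the FIFO depth of 4 are hypothetical, and every value in b[] is assumed to be a valid index into a[].

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 4   /* assumed run-ahead depth, chosen for illustration */

/* Hypothetical stream-engine model for the indirect pattern a[b[i]]. */
typedef struct {
    const int *a;              /* data array (memory stream s_a)       */
    const int *b;              /* index array (memory stream s_b)      */
    size_t     n;              /* stream length                        */
    size_t     head;           /* next iteration the core will consume */
    size_t     tail;           /* next iteration the engine will fetch */
    int        fifo[FIFO_DEPTH];
} indirect_stream_t;

/* Engine side: fetch b[i], then use it as the index into a[]. */
static void engine_fill(indirect_stream_t *s)
{
    while (s->tail < s->n && s->tail - s->head < FIFO_DEPTH) {
        int idx = s->b[s->tail];                    /* s_b element */
        s->fifo[s->tail % FIFO_DEPTH] = s->a[idx];  /* s_a element */
        s->tail++;
    }
}

/* Core side: read the pseudo-register, then step. No address
 * computation for a[b[i]] happens in this loop. */
static int core_sum(const int *a, const int *b, size_t n)
{
    indirect_stream_t s = { a, b, n, 0, 0, {0} };
    int sum = 0;
    while (s.head < n) {
        engine_fill(&s);
        assert(s.head < s.tail);             /* element is ready  */
        sum += s.fifo[s.head % FIFO_DEPTH];  /* use of s_a        */
        s.head++;                            /* stream_step(s_i)  */
    }
    return sum;
}
```

The core loop only consumes values; the index load and the dependent load both live on the engine side, mirroring the dependence graph.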


Implications of Decoupled-Stream Extensions

• New architectural states:
– Stream configuration.
– Current iteration’s data.

• Maintain the memory order:
– The load is ordered at the first use of the pseudo-register after the stream is configured or stepped.

• New speculation in the ISA:
– Stream elements will be used.
– Streams are not too short (overfetch).

• “Vectorize” accesses that were not vectorizable before:
– From the perspective of L1 cache → core.
– Regular parts of irregular program regions.



Microarchitecture

[Figure: a stream over memory 0x400, 0x404, 0x408, 0x40c, and 0x410, with a pseudo-register mapped onto the stream.]


Microarchitecture – Misspeculation

• Control-misspeculated stream_step:
– Decrement the iteration map.
– No need to flush the FIFO and re-fetch data (decoupled)!

• Other misspeculation:
– Revert the stream states, including the stream FIFO.

• Memory faults are delayed until the use of the element.
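The cheap recovery from a control-misspeculated stream_step can be illustrated with a small model in which user instructions look up an iteration map and step instructions increment it. This is a sketch, not the actual hardware: `stream_state_t`, the function names, and the FIFO size of 8 are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_SIZE 8  /* assumed FIFO capacity for the sketch */

typedef struct {
    int    fifo[FIFO_SIZE];  /* prefetched elements, indexed by iteration */
    size_t fetched;          /* elements the engine has already fetched   */
    size_t dispatch_iter;    /* iteration map: speculative position       */
} stream_state_t;

/* A user instruction looks up which element it refers to. */
static size_t dispatch_user(const stream_state_t *s)
{
    return s->dispatch_iter;
}

/* A stream_step instruction increments the map at dispatch. */
static void dispatch_step(stream_state_t *s)
{
    s->dispatch_iter++;
}

/* Squash n wrong-path stream_step instructions: only the map moves back.
 * The FIFO contents and the fetched count are left untouched. */
static void squash_steps(stream_state_t *s, size_t n)
{
    assert(n <= s->dispatch_iter);
    s->dispatch_iter -= n;
}
```

After a squash, the next user instruction simply resolves to an earlier FIFO entry that is still present, so nothing has to be re-fetched.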



Methodology

• Compiler in LLVM:
– Identify stream candidates.
– Generate stream configuration.
– Transform the program.

• Gem5 + McPAT simulation.

• 33 Benchmarks:
– SPEC2017 C/CPP benchmarks.
– CortexSuite.

• SimPoint:
– Simpoints of 10 million instructions each.
– 10 simpoints per benchmark.


Results – Overall Performance

[Figure: bar chart (axis 0 to 7) comparing five configurations.]

• Pf-Stride: traditional stride prefetcher.
• SSP-Non-Bind: stream is just a prefetch.
• SSP-Semi-Bind: stream instructions are decoupled.
• SSP-Cache-Aware: stream-based cache bypassing.
• Pf-Helper: ideal prefetch 1K instructions ahead.


Results – Design Space Interaction

[Figure: two energy vs. speedup plots, one for CortexSuite and one for SPEC CPU 2017 (energy axis 0.5 to 1.1, speedup axis 1 to 3), comparing OOO[2,6,8], Pf-Stride[2,6,8], Pf-Helper[2,6,8], and SSP-Cache-Aware[2,6,8].]


Conclusion

• Stream as a new memory abstraction in the ISA.
– ISA/microarchitecture extension.
– Stream-aware cache bypassing.

• New paradigm of memory specialization.
– New direction for improving cache architectures.
– Combine memory and computation specialization.


Stream-Aware Policies

Rich information (from the compiler (ISA)/hardware):

• Memory footprint
• Reuse distance
• Modified?
• Conditionally used?
• Indirect

Better policies:

• Prefetch throttling
• Cache replacement
• Cache bypassing
• Sub-line transfer


Backup Slides


Results – Semi-Binding Prefetching

[Figure: two charts. The first shows remaining and added instructions (axis 0 to 1). The second shows the speedup of semi-binding prefetch vs. non-binding prefetch (axis 1 to 1.5).]


Related Work

• Decoupled access-execute.
– Outrider [ISCA’11], DeSC [MICRO’15], etc.
– Ours: New ISA abstraction for the access engine.

• Prefetching.
– Stride, IMP [MICRO’15], etc.
– Ours: Explicit access pattern in ISA.

• Cache bypassing policy.
– Counter-based [ICCD’05], LLC bypassing [ISCA’11], etc.
– Ours: Incorporate static stream information.


Stream-Aware Policies – Cache Bypass

• Stream: Access Pattern → Precise Memory Footprint.

while (i < N)
  while (j < N)
    while (k < N)
      sum += a[k][i] * b[k][j];

[Figure: core, L1$, and L2$ with the streams s_a and s_b placed at cache levels according to the reuse distances of a[N][N] and b[N][N].]
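Because an affine stream's pattern is known at configuration time, its footprint can be computed rather than guessed. The sketch below counts the distinct cache lines one affine stream touches; `stream_footprint_lines`, the 64-byte line size, and the line-aligned base are assumptions made for illustration, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64u  /* assumed cache-line size */

/* Distinct cache lines touched by `count` accesses of `elem` bytes each,
 * starting at a line-aligned base and separated by `stride` bytes.
 * Assumes elem <= LINE_BYTES, elem <= stride, and that no element
 * straddles a line boundary. */
static size_t stream_footprint_lines(size_t count, size_t stride, size_t elem)
{
    assert(elem > 0 && elem <= LINE_BYTES && stride >= elem);
    if (count == 0)
        return 0;
    if (stride >= LINE_BYTES)
        return count;  /* every access lands on its own line */
    /* Small-stride case: no line in the touched byte range is skipped. */
    size_t last_byte = (count - 1) * stride + elem - 1;
    return last_byte / LINE_BYTES + 1;
}
```

For a column walk such as a[k][i] over 1024 rows of 4-byte elements with a 4096-byte row stride, this gives 1024 lines, while a contiguous walk over the same number of elements touches only 64.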


Results – Unused Stream Requests

[Figure: bar chart of unused stream requests (axis 0 to 0.7) per benchmark: lda, liblinear, motion-estimation, pca, rbm, sphinx, srr, svd3, disparity, localization, mser, multi_ncut, sift, stitch, svm, texture_synthesis, tracking, namd_r, parest_r, povray_r, blender_r, perlbench_s, gcc_s, mcf_s, lbm_s, omnetpp_s, xalancbmk_s, x264_s, deepsjeng_s, imagick_s, leela_s, nab_s, xz_s, and the average.]


Stream ISA Extension – Compiler in LLVM

Identify Stream Candidates → Select Stream Candidates → Code Generation

• Identify stream candidates.
– Search backwards on operands of candidates.
– Detect patterns and dependences between candidates.

• Select stream candidates.
– Check for decouplable patterns.
– Limited number of pseudo-registers.

• Code generation.
– Stream configuration, e.g. type, parameters, etc.
– Transform the program.


Stream: A New ISA Memory Abstraction

Higher level of abstraction embeds richer semantic info.

Opens up new opportunities.

1.37x speedup over a hardware stride prefetcher.

[Figure: between the core and memory, individual accesses (Address, Value): (0x400, 1), (0x404, 2), (0x408, 3), … are expressed as a stream in the ISA.]

• What stream patterns to support?
• How to deal with control flow?
• ISA/microarchitecture co-design?
• …

• Prefetch with control flow.
• Decouple address computation.
• Better cache policies.
• …


Insight: Stream as a new ISA abstraction

• Higher-level abstraction than single memory requests.
– Access pattern, memory footprint, etc.
– More accurate & efficient than hardware approaches, e.g. hardware prefetching.

• Rich information brings new opportunities:
– More accurate & timely prefetching.
– Removing redundant work, e.g. address computation.
– Better cache policies.


Stream ISA Extension – Basic Principles

• Stream: A decoupled sequence of values/addresses.

• Usage of stream data:
– Referred to by a pseudo-register.
– Explicitly stepped by a stream_step instruction.

[Figure: architectural registers map to physical registers, while the pseudo-register maps to an element of a stream over memory 0x400, 0x404, 0x408, 0x40c, and 0x410; a user step advances it.]


Stream ISA Extension – Nested Loop

• Indirect stream.
• Decoupled from control flow.
• Longer streams: configure the stream in the outer loop.
• Coalesce stream.

Original C Code:

int i = 0;
while (i < M) {
  int j = 0;
  while (j < N) {
    sum += a[i][j];
    j++;
  }
  i++;
}

Stream Decoupled Pseudo Code:

int i = 0;
stream_cfg(s_j, s_a);
while (i < M) {
  while (s_j < N) {
    sum += s_a;
    stream_step(s_j);
  }
  stream_step(s_j);
  i++;
}
stream_end(s_j, s_a);

Stream Dependence Graph:

[Figure: s_j feeds s_a.]


Stream ISA Extension – Coalesce Stream

• Indirect stream.
• Decoupled from control flow.
• Longer streams (nested loop).
• Coalesce stream.

Original C Code:

int i = 0;
while (i < N) {
  b[i] = a[i].x + a[i].y;
  i++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a, s_b);
while (s_i < N) {
  s_b = s_a.x + s_a.y;
  stream_step(s_i);
}
stream_end(s_i, s_a, s_b);

Stream Dependence Graph:

[Figure: s_i feeds both s_a and s_b.]


Microarchitecture – Core View

Goal: What is the current iteration of a stream?

• A stream user instruction looks up the dispatch iteration map.
• A stream_step instruction increments the dispatch iteration map.

[Figure (animation, five frames): instructions are dispatched in the order user, step, user, step, step, user. The iteration map entry for stream 1 starts at 0. The first user is tagged with iteration 0 and the following step raises the map to 1; the second user is tagged with iteration 1 and the next two steps raise the map to 3; the last user is tagged with iteration 3.]


Microarchitecture – Revert stream_step

• Control-misspeculated stream_step:
– Decrement the iteration map.
– No need to flush the FIFO and re-fetch data (decoupled)!

[Figure: a stream over memory 0x400, 0x404, 0x408, 0x40c, and 0x410; reverting a user step moves the pseudo-register back to the previous element.]


Stream ISA Extension – Stepping

• Explicit stepping with the stream_step instruction.

[Figure: a stream over memory 0x400, 0x404, 0x408, 0x40c, and 0x410; each user step moves the pseudo-register to the next element.]


Microarchitecture – Stream View

Managed by the Stream Engine.

• Definition: address pattern parameters (initial address, stride, etc.).
• State: current iteration.
• Operands: index from cache, for indirect streams.


Microarchitecture – Aliasing Streams

• Leverage the LSQ to handle aliasing.
– Extend the LSQ with a PEB to record prefetched stream elements.
– Insert into the LSQ for the user instruction that semantically triggers the stream access.
– Revert and restart on a memory order violation.


Stream-Aware Policies – Throttling

Goal: Prefetch just in time.

• Evenly distributing FIFO entries among streams leads to untimely prefetching.
• A “Late” counter per stream is incremented every time a user instruction is blocked by the stream.
• More FIFO entries are allocated to the stream with a high “Late” count.
• After some iterations, stream 1 has enough entries to hide the latency.

[Figure (animation, four frames): allocated and unallocated FIFO entries of stream 1 and stream 2 over iterations (stream elements), each marked ready or not ready. An instruction blocked on stream 1 raises its Late counter to 1 while stream 2 stays at 0; after stream 1 receives more entries, both counters read 0.]
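The throttling policy can be sketched as a small piece of bookkeeping: count how often each stream makes a user instruction wait, then hand a spare FIFO entry to the stream that is late most often. The slides do not give thresholds, a rebalancing interval, or a reset rule, so `LATE_THRESHOLD`, the free-entry pool, and the counter reset in this sketch are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STREAMS    2
#define LATE_THRESHOLD 1   /* assumed trigger value */

typedef struct {
    unsigned late[NUM_STREAMS];     /* per-stream "Late" counters     */
    unsigned entries[NUM_STREAMS];  /* FIFO entries currently granted */
    unsigned free_entries;          /* unallocated FIFO entries       */
} throttle_t;

/* Called when a user instruction finds its stream element not ready. */
static void on_user_blocked(throttle_t *t, size_t stream)
{
    assert(stream < NUM_STREAMS);
    t->late[stream]++;
}

/* Grant one free entry to the stream with the highest Late count,
 * if that count reached the threshold. Returns the index of the stream
 * that received an entry, or -1 if nothing changed. */
static int rebalance(throttle_t *t)
{
    size_t worst = 0;
    for (size_t i = 1; i < NUM_STREAMS; i++)
        if (t->late[i] > t->late[worst])
            worst = i;
    if (t->late[worst] < LATE_THRESHOLD || t->free_entries == 0)
        return -1;
    t->entries[worst]++;
    t->free_entries--;
    t->late[worst] = 0;   /* start observing the new allocation */
    return (int)worst;
}
```

Repeating this until a stream stops blocking its users gives it just enough entries to hide its latency, without spending FIFO space on streams that are already on time.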


Stream-Aware Policies – Cache Bypass

• Stream: Access Pattern → Precise Memory Footprint.
– Precisely estimate the memory footprint when configuring streams.
– Bypass streams with low reuse.

Without bypassing:
– Overhead 1: Caching low-reuse data.
– Overhead 2: Unused L2 MSHRs.

With L1 bypassed:
– Benefit 1: Do not evict data in L1.
– Benefit 2: Increase memory parallelism.

[Figure (two frames): FIFO entries of stream 1 over iterations (stream elements), marked fetching or not fetching, and the MSHR occupancy at L1, L2, and memory, first with the stream going through L1 and then with L1 bypassed.]


Stream-Aware Policies – Cache Bypass

Goal: Correctly identify streams with low reuse.

• A stream table combines static & dynamic information.
• Tags are augmented with the stream id.
• An FSM bypasses streams with:
✓ High miss count &&
✓ Low reuse count &&
✓ (Large footprint || No footprint info)
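The three bypass conditions translate directly into a predicate over one stream-table entry. This sketch models only that decision, not the FSM's states or transitions; the thresholds, the `stream_entry_t` layout, and the choice to call a footprint "large" when it exceeds the cache capacity are assumptions, since the slides give no concrete values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder thresholds; the slides do not specify values. */
#define MISS_THRESHOLD   16u
#define REUSE_THRESHOLD  2u

/* Hypothetical stream-table entry mixing dynamic and static information. */
typedef struct {
    unsigned miss_count;       /* dynamic: misses caused by this stream   */
    unsigned reuse_count;      /* dynamic: hits on lines tagged with its id */
    bool     has_footprint;    /* static: footprint known at configuration */
    size_t   footprint_bytes;  /* static: estimated memory footprint      */
} stream_entry_t;

/* High miss count && low reuse count && (large footprint || no info). */
static bool should_bypass(const stream_entry_t *e, size_t cache_bytes)
{
    bool high_miss = e->miss_count >= MISS_THRESHOLD;
    bool low_reuse = e->reuse_count < REUSE_THRESHOLD;
    bool large_or_unknown =
        !e->has_footprint || e->footprint_bytes > cache_bytes;
    return high_miss && low_reuse && large_or_unknown;
}
```

A stream whose footprint is known to fit in the cache is never bypassed here, even with many misses, because its data may still be reused later.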


Results – Stream-Aware Cache Bypassing

[Figure: bar chart of the speedup of SSP-Cache-Aware vs. SSP-Semi-Bind (axis 1 to 1.5, with one bar labeled 3.40). The slide also shows the progression Non-Binding Prefetch → Semi-Binding Prefetch → Stream-Aware Cache.]
