Access Map Pattern Matching Prefetch: Optimization Friendly Method

Yasuo Ishii (1), Mary Inaba (2), and Kei Hiraki (2)

(1) NEC Corporation
(2) The University of Tokyo

Background

The speed gap between processor and memory has increased.
To hide long memory latency, many techniques have been proposed.
The importance of hardware data prefetching has increased.
Many hardware prefetchers have been proposed.

Conventional Methods

Prefetchers use:
    1. Instruction address
    2. Memory access order
    3. Memory address
Optimizations scramble this information:
    Out-of-order memory access
    Loop unrolling

Limitation of Stride Prefetch [Chen+95]

Out-of-Order Memory Access

    for (int i=0; i<N; i++) { load A[2*i]; ... }   (A)

Stride table entry for load (A):
    Tag   Address   Stride   State
    A     0xAB04    2        steady

[Figure: memory address space at cache-line granularity, 0xAAFF through 0xABFF, with Access 1 to Access 4 arriving out of order.]

With out-of-order accesses, the prefetcher cannot detect strides.
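
The limitation above can be sketched with a toy stride table entry. This is a minimal illustration, not the design from [Chen+95]: the `StrideEntry` class, the `train` function, the simplified state names, and the particular out-of-order sequence are all assumptions made for the example. Only the stride of 2 and the 0xAB00 to 0xAB06 line addresses come from the slide.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    """One hypothetical table entry: last address, last stride, state."""
    last_addr: int
    stride: int = 0
    state: str = "init"  # simplified: "init", "transient", "steady"


def train(entry, addr):
    """Update the entry with a new access; return a prefetch address or None."""
    new_stride = addr - entry.last_addr
    if new_stride == entry.stride:
        entry.state = "steady"       # same stride seen twice in a row
    else:
        entry.state = "transient"    # stride changed, start over
        entry.stride = new_stride
    entry.last_addr = addr
    return addr + entry.stride if entry.state == "steady" else None


def run(addresses):
    """Train one entry on a sequence; collect the prefetch decisions."""
    entry = StrideEntry(last_addr=addresses[0])
    return [train(entry, a) for a in addresses[1:]]


# load A[2*i] in program order (cache-line addresses, stride 2)
in_order = [0xAB00, 0xAB02, 0xAB04, 0xAB06]
# The same lines in an assumed out-of-order issue order
out_of_order = [0xAB02, 0xAB00, 0xAB06, 0xAB04]

print(run(in_order))       # prefetches appear once the stride repeats
print(run(out_of_order))   # the stride never repeats, so no prefetch
```

In program order the entry reaches the steady state after two equal strides. With the reordered sequence the observed strides are -2, 6, -2, so the entry never settles even though the same lines are touched.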

Weakness of Conventional Methods

Out-of-order memory access
    Scrambles the memory access order
    Prefetcher cannot detect address correlations
Loop unrolling
    Requires additional table entries
    Each entry is trained slowly

An optimization-friendly prefetcher is required.

Access Map Pattern Matching

Pattern matching
    Order-free prefetching
    Optimization-friendly prefetch
Access map
    Map-based history
    2-bit state map
    Each state is attached to a cache block

State Diagram for Each Cache Block

Init: initialized state
Access: already accessed
Prefetch: prefetch request issued
Success: prefetched data accessed

[Figure: state diagram over Init, Access, Prefetch, and Success, with transitions labeled Access and Prefetch.]
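
The four states fit in two bits per cache block. The sketch below is one plausible reading of the diagram, not a confirmed copy of it: the numeric encoding and the exact transition set (Init to Access on a demand access, Init to Prefetch when a prefetch is issued, Prefetch to Success on a demand access, all other events leaving the state unchanged) are assumptions, as are the function names.

```python
from enum import IntEnum


class State(IntEnum):
    """2-bit per-block state; the numeric encoding is an assumption."""
    INIT = 0      # initialized, never touched
    ACCESS = 1    # already accessed by a demand request
    PREFETCH = 2  # prefetch request issued
    SUCCESS = 3   # prefetched data was later accessed


def on_access(state):
    """Assumed transition for a demand access to the block."""
    if state == State.INIT:
        return State.ACCESS
    if state == State.PREFETCH:
        return State.SUCCESS
    return state  # ACCESS and SUCCESS stay where they are


def on_prefetch(state):
    """Assumed transition when a prefetch is issued for the block."""
    return State.PREFETCH if state == State.INIT else state


# A block that is prefetched and then used ends in SUCCESS.
s = State.INIT
s = on_prefetch(s)
s = on_access(s)
print(s.name)
```

Keeping Prefetch and Success as separate states lets a prefetcher count how many issued prefetches were actually used.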

Memory Access Pattern Map

Corresponds to the memory address space
Cache-line granularity

[Figure: a zone of the memory address space divided into cache lines; the state of each line (I, A, P, or S) is held in the memory access pattern map, which feeds the pattern match logic.]

Pattern Matching Logic

Components: Access Map Shifter, Pattern Detector, Pipeline Register, Prefetch Selector

[Figure: the access map and the accessed address (Addr) enter the Access Map Shifter; the Pattern Detector checks offsets +1, +2, +3, ...; the Prefetch Selector uses a Priority Encoder & Adder with a feedback path to produce the prefetch request, Addr+2 in the example.]
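
A software model can make the pattern detection step concrete. The sketch below is an illustration under assumptions: it treats a block at offset +k as a candidate when the blocks at -k and -2k have both been accessed and +k is still in the Init state, and it checks forward offsets only. The function names and the example map are hypothetical, and the loop stands in for what the hardware would evaluate in parallel.

```python
def accessed(state):
    """A block counts as accessed in the Access (A) or Success (S) state."""
    return state in ("A", "S")


def match_candidates(access_map, t, max_stride=None):
    """Return candidate prefetch indices for an access at index t.

    Offset k is a candidate when t-k and t-2k were accessed and t+k is
    still Init ("I"). Candidates come out nearest first, which is the
    order a priority encoder would pick them in.
    """
    n = len(access_map)
    limit = max_stride if max_stride is not None else n
    candidates = []
    for k in range(1, limit + 1):
        if t - 2 * k < 0 or t + k >= n:
            break  # the pattern would fall outside the map
        if (accessed(access_map[t - k])
                and accessed(access_map[t - 2 * k])
                and access_map[t + k] == "I"):
            candidates.append(t + k)
    return candidates


# Blocks 1, 3, and 5 were accessed; the current access is block 5.
example_map = list("IAIAIAIII")
print(match_candidates(example_map, 5))
```

Because the test looks only at which blocks are marked, not at the order in which they were marked, the result is the same however the three accesses were interleaved. For the example map the only match is offset +2, giving index 7.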

Parallel Pattern Matching

Detects patterns from the memory access map
    Detects address correlations in parallel
    Searches candidates effectively

[Figure: example access map with per-line states I, A, and S.]

AMPM Prefetch

The memory address space is divided into zones
Detects hot zones
Memory Access Map Table
    LRU replacement
Pattern matching

[Figure: zones of the memory address space; hot zones are held as access maps (states P, S, A, I) in the Memory Access Map Table; the map of the accessed zone goes to the pattern match logic, which issues prefetch requests.]
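
The zone table with LRU replacement can be sketched as follows. This is a simplified model under assumptions: the 64-byte line size is not stated on the slides, the class and method names are hypothetical, and the per-line states are stored as characters rather than 2-bit fields. The 256 lines per zone and the 52 maps follow the DPC configuration given in this deck.

```python
from collections import OrderedDict

LINE_BYTES = 64        # assumption: cache line size is not given on the slides
LINES_PER_ZONE = 256   # "256 states / map"
MAX_ZONES = 52         # "52 maps"


class AccessMapTable:
    """Hypothetical model: one access map per hot zone, evicted in LRU order."""

    def __init__(self, max_zones=MAX_ZONES):
        self.max_zones = max_zones
        self.zones = OrderedDict()  # zone tag -> list of per-line states

    def lookup(self, addr):
        """Return (access map, line index) for addr, allocating if needed."""
        line = addr // LINE_BYTES
        tag, index = divmod(line, LINES_PER_ZONE)
        if tag in self.zones:
            self.zones.move_to_end(tag)          # mark as most recently used
        else:
            if len(self.zones) >= self.max_zones:
                self.zones.popitem(last=False)   # evict least recently used
            self.zones[tag] = ["I"] * LINES_PER_ZONE
        return self.zones[tag], index

    def record_access(self, addr):
        """Apply a demand access to the line containing addr."""
        amap, index = self.lookup(addr)
        if amap[index] == "I":
            amap[index] = "A"
        elif amap[index] == "P":
            amap[index] = "S"
        return amap, index


table = AccessMapTable(max_zones=2)
zone_bytes = LINE_BYTES * LINES_PER_ZONE
for addr in (0, zone_bytes, 0, 2 * zone_bytes):
    table.record_access(addr)
print(list(table.zones))   # tags of the zones still tracked
```

With room for two zones, re-touching zone 0 before zone 2 arrives makes zone 1 the eviction victim, so the table ends up tracking zones 0 and 2.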

Features of AMPM Prefetcher

Pattern-matching-based prefetching
    Map-based history
    Optimization-friendly prefetching
Parallel pattern matching
    Searches candidates effectively
    Complexity-effective implementation

Configuration for DPC Competition

AMPM Prefetcher
    Fully associative, 52 maps, 256 states / map
Adaptive Stream Prefetcher [Hur+ 2006]
    16 histograms, 8 stream length
MSHR Configuration
    16 entries for demand requests (default)
    32 entries for prefetch requests (additional)

Budget Count

Component: MSHR (default)
    Fields: Valid bit (1 bit), Address bit (26 bit)
    Entries: 16 entries
    Budget: 0 bit

Component: Prefetch MSHR
    Fields: Valid bit (1 bit), Address bit (26 bit), Issue bit (1 bit)
    Entries: 32 entries, 5-bit pointer
    Budget: 901 bit

Component: Memory Access Map Table
    Fields: Address Tag (18 bit), LRU status (6 bit), Access Counter (4 bit), Interval Timer (18 bit), Access Map (256 x 2 bit)
    Entries: 52 entries + mode register (3 bit) + performance counter (32 bit x 4)
    Budget: 29147 bit

Component: Adaptive Stream Filter
    Fields: Valid bit (1 bit), Address bit (26 bit), Lifetime (10 bit), Stream Length (4 bit), Direction (1 bit)
    Entries: 16 entries
    Budget: 672 bit

Component: Stream Length Histogram
    Fields: Counter (16 bit)
    Entries: 16 entries, 2 series, 2 directions
    Budget: 1024 bit

Component: Pipeline Registers
    Budget: 292 bit

Total: 32036 bit
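
The budget figures can be re-derived from the field widths and entry counts listed for each component. The short check below is only arithmetic on the numbers in the table; the variable names are illustrative, and counting the 5-bit pointer, the mode register, and the performance counters once per structure (rather than once per entry) is an assumption that happens to reproduce the listed totals.

```python
# Bits per structure, from the field widths and entry counts in the table.
mshr_default = 0  # the default MSHR is not charged to the budget

# (valid + address + issue) x 32 entries, plus one 5-bit pointer
prefetch_mshr = (1 + 26 + 1) * 32 + 5

# (tag + LRU + access counter + interval timer + 256 two-bit states) x 52,
# plus a 3-bit mode register and four 32-bit performance counters
access_map_table = (18 + 6 + 4 + 18 + 256 * 2) * 52 + 3 + 32 * 4

# (valid + address + lifetime + stream length + direction) x 16 entries
stream_filter = (1 + 26 + 10 + 4 + 1) * 16

# 16-bit counters x 16 entries x 2 series x 2 directions
histogram = 16 * 16 * 2 * 2

pipeline_registers = 292

total = (mshr_default + prefetch_mshr + access_map_table
         + stream_filter + histogram + pipeline_registers)

print(prefetch_mshr, access_map_table, stream_filter, histogram, total)
```

Each computed value matches the corresponding budget entry, and the sum matches the stated total of 32036 bits.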

Methodology

Simulation environment
    DPC framework
    Skips the first 4000M instructions and evaluates the following 100M instructions
Benchmark
    SPEC CPU2006 benchmark suite
    Compile option: "-O3 -fomit-frame-pointer -funroll-all-loops"

IPC Measurement

Improves performance by 53%
Improves performance in all benchmarks

[Figure: instructions per cycle (0.0 to 3.5) for each SPEC CPU2006 benchmark, 400.perlbench through 483.xalancbmk, and the arithmetic mean; series NOPREF and PREFETCH.]

L2 Cache Miss Count

Reduces L2 cache misses by 76%

[Figure: L2 miss count per 100M instructions (0 to 5,000,000) for each SPEC CPU2006 benchmark, 400.perlbench through 483.xalancbmk, and the arithmetic mean; series without prefetch and with prefetch.]

Related Works

Sequence-based prefetching
    Sequential Prefetch [Smith+ 1978]
    Stride Prefetching Table [Fu+ 1992]
    Markov Predictor [Joseph+ 1997]
    Global History Buffer [Nesbit+ 2004]
Adaptive prefetching
    AC/DC [Nesbit+ 2004]
    Feedback Directed Prefetch [Srinath+ 2007]
    Focus Prefetching [Manikantan+ 2008]

Conclusion

Access Map Pattern Matching Prefetch
    Order-free prefetch: optimization-friendly prefetching
    Parallel pattern matching: complexity-effective implementation
Optimized AMPM achieves good performance
    Improves IPC by 53%
    Reduces L2 cache misses by 76%

Q & A

[Figure: map of prior prefetching work. Group labels: Software, Sequence-Base (Order Sensitive), Spatial, Adaptive, Hybrid, Commercial Processors.]

Entries shown in the figure:
    Software Support [Mowry+ 1992]
    HW/SW Integrate [Gornish+ 1994]
    Sequential [Smith+ 1978]
    Stride Prefetch [Fu+ 1992]
    RPT [Chen+ 1995]
    Markov Prefetch [Joseph+ 1997]
    Tag Correlation [Hu+ 2003]
    GHB [Nesbit+ 2004]
    Buffer Block [Gindele 1977]
    Locality Detect [Johnson+ 1998]
    Spatial Pat. [Chen+ 2004]
    SMS [Somogyi 2006]
    AMPM Prefetch [Ishii+ 2009]
    Adaptive Seq. [Dahlgren+ 1993]
    Hybrid [Hsu+ 1998]
    AC/DC [Nesbit+ 2004]
    Adaptive Stream [Hur+ 2006]
    FDP [Srinath+ 2007]
    Feedback based [Honjo 2009]
    Commercial processors: SuperSPARC, R10000, PA7200, Power4, Pentium 4
