Access Map Pattern Matching Prefetch: Optimization Friendly Method
Yasuo Ishii (1), Mary Inaba (2), and Kei Hiraki (2)
(1) NEC Corporation, (2) The University of Tokyo
Background
The speed gap between processor and memory has been increasing.
To hide long memory latency, many techniques have been proposed.
The importance of HW data prefetch has been increasing.
Many HW prefetchers have been proposed.
Conventional Methods
Prefetchers use:
1. Instruction address
2. Memory access order
3. Memory address
Optimizations scramble this information:
  Out-of-order memory access
  Loop unrolling
Limitation of Stride Prefetch [Chen+95]: Out-of-Order Memory Access
for (int i=0; i<N; i++) { load A[2*i]; ・・・ (A) }
[Figure: memory address space at cache-line granularity (0xAAFF, 0xAB00, 0xAB01, ・・・, 0xABFF) with Access 1 to Access 4 issued by load (A) arriving out of order. The stride table has the columns Tag, Address, Stride, and State, and holds the entry (A, 0xAB04, 2, steady).]
When the accesses arrive out of order, the prefetcher cannot detect strides.
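The limitation above can be illustrated with a minimal sketch of one stride-table entry. The class name `StrideEntry`, the training rule (a stride must repeat once before the entry counts as steady), and the sample addresses are illustrative assumptions, not the exact scheme of [Chen+95].

```python
class StrideEntry:
    """One stride-table entry: last address, last stride, steady flag."""

    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.steady = False

    def access(self, addr):
        """Train on one access; return a prefetch address or None."""
        if self.last_addr is not None:
            new_stride = addr - self.last_addr
            # Assumed rule: steady only when the same non-zero stride repeats.
            self.steady = (new_stride == self.stride) and new_stride != 0
            self.stride = new_stride
        self.last_addr = addr
        return addr + self.stride if self.steady else None


def run(addresses):
    """Feed a sequence of addresses to a fresh entry; collect its predictions."""
    entry = StrideEntry()
    return [entry.access(a) for a in addresses]


in_order = [0xAB00, 0xAB02, 0xAB04, 0xAB06]      # A[2*i] in program order
out_of_order = [0xAB02, 0xAB00, 0xAB06, 0xAB04]  # same addresses, reordered

in_order_prefetches = run(in_order)
out_of_order_prefetches = run(out_of_order)
print(in_order_prefetches)
print(out_of_order_prefetches)
```

In program order the entry becomes steady on the third access and starts predicting; with the same addresses reordered, no stride ever repeats and no prefetch is issued.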
Weakness of Conventional Methods
Out-of-order memory access
  Scrambles the memory access order
  The prefetcher cannot detect address correlations
Loop unrolling
  Requires additional table entries
  Each entry is trained slowly
An optimization-friendly prefetcher is required.
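The loop-unrolling weakness can be sketched with a stride table indexed by instruction address. The function `train`, the instruction addresses (0x400 and up), the unroll factor of 4, and the "stride must repeat once" training rule are all hypothetical choices for illustration.

```python
def train(trace):
    """Train a PC-indexed stride table on (pc, addr) pairs.

    Returns (number of table entries, accesses needed until every entry
    is steady), or (entries, None) if that never happens.
    """
    table = {}  # pc -> (last_addr, stride, steady)
    all_pcs = {pc for pc, _ in trace}
    for n, (pc, addr) in enumerate(trace, start=1):
        last, stride, _ = table.get(pc, (None, None, False))
        new_stride = None if last is None else addr - last
        steady = new_stride is not None and new_stride == stride
        table[pc] = (addr, new_stride, steady)
        if len(table) == len(all_pcs) and all(e[2] for e in table.values()):
            return len(table), n
    return len(table), None


# Rolled loop: one load instruction touches A[2*i].
rolled = [(0x400, 2 * i) for i in range(16)]

# Same accesses after unrolling by 4: four load instructions share the work.
unrolled = [(0x400 + 4 * u, 2 * (4 * j + u))
            for j in range(4) for u in range(4)]

rolled_result = train(rolled)
unrolled_result = train(unrolled)
print(rolled_result, unrolled_result)
```

Under these assumptions the rolled loop needs one entry that is trained after 3 accesses, while the unrolled loop needs four entries and 12 accesses before all of them are trained.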
Access Map Pattern Matching
Pattern matching
  Order-free prefetching
  Optimization-friendly prefetch
Access map
  Map-based history
  2-bit state map
  Each state is attached to a cache block
State Diagram for Each Cache Block
Init: initialized state
Access: already accessed
Prefetch: issued prefetch requests
Success: accessed prefetched data
[Figure: state diagram over Init, Access, Prefetch, and Success, with transitions labeled Access and Prefetch.]
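The four states fit in 2 bits, and the transitions can be sketched as two small functions. The numeric encoding and the choice to leave Access and Success unchanged on further events are assumptions inferred from the state descriptions, not details confirmed by the slide.

```python
from enum import IntEnum


class State(IntEnum):
    INIT = 0      # initialized state
    ACCESS = 1    # already accessed
    PREFETCH = 2  # prefetch request issued
    SUCCESS = 3   # prefetched data was accessed


def on_access(state):
    """Next state when the cache block is accessed by a demand request."""
    if state == State.INIT:
        return State.ACCESS
    if state == State.PREFETCH:
        return State.SUCCESS
    return state  # assumed: ACCESS and SUCCESS stay where they are


def on_prefetch(state):
    """Next state when a prefetch is issued for the cache block."""
    # Assumed: only untouched blocks are worth prefetching.
    return State.PREFETCH if state == State.INIT else state


s = State.INIT
s = on_prefetch(s)  # Init -> Prefetch
s = on_access(s)    # Prefetch -> Success
print(s.name)
```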
Memory Access Pattern Map
Corresponds to the memory address space
Cache line granularity
[Figure: one zone of the memory address space, divided into cache lines, is mapped onto a memory access pattern map of per-line states (I, A, P, S), which feeds the pattern match logic.]
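Mapping an address onto the map can be sketched as two integer divisions. The 64-byte line size is an assumption (the slides do not state it); 256 lines per zone follows the "256 states / map" figure in the DPC configuration. The helper name `locate` is hypothetical.

```python
LINE_BYTES = 64        # assumption: cache line size is not given on the slides
LINES_PER_ZONE = 256   # from the DPC configuration: 256 states per map


def locate(addr):
    """Return (zone number, line index inside the zone) for a byte address."""
    line = addr // LINE_BYTES
    return line // LINES_PER_ZONE, line % LINES_PER_ZONE


zone, index = locate(0x12345)
print(zone, index)
```

With these sizes one zone covers 16 KiB, and every access updates exactly one 2-bit state in the map of its zone.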
Pattern Matching Logic
Access Map Shifter
Pattern Detector
Pipeline Register
Prefetch Selector
[Figure: the memory access pattern map (for example I A A A I A I I I A) passes through the access map shifter together with the accessed address (Addr). The pattern detector evaluates the candidates +1, +2, +3, ・・・ and a priority encoder and adder turn the selected candidate (Addr+2 in the example) into a prefetch request. A feedback path leads back through the access map shifter to the priority encoder and adder.]
Parallel Pattern Matching
Detects patterns from the memory access map
  Detects address correlations in parallel
  Searches candidates effectively
[Figure: a memory access pattern map of per-line states (I, A, S).]
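A software sketch of the matching step may help. The slides do not spell out the rule, so this assumes the commonly described one: when line t is accessed, line t+k becomes a candidate if lines t-k and t-2k are already marked as accessed. The function name and the boolean map are illustrative; hardware would test all k at once, while this loop tries them one after another.

```python
def prefetch_candidates(accessed, t):
    """Return candidate line indices for an access to line t.

    accessed: list of bools, one per cache line of the zone.
    Candidates are ordered nearest first, forward before backward,
    as a stand-in for the priority encoder.
    """
    n = len(accessed)
    out = []
    for k in range(1, n):
        for step in (k, -k):
            target, back1, back2 = t + step, t - step, t - 2 * step
            if not all(0 <= i < n for i in (target, back1, back2)):
                continue
            # Assumed rule: two earlier accesses at the same spacing.
            if accessed[back1] and accessed[back2] and not accessed[target]:
                out.append(target)
    return out


def build_map(lines, size=16):
    """Mark the given lines as accessed; arrival order does not matter."""
    m = [False] * size
    for i in lines:
        m[i] = True
    return m


in_order = prefetch_candidates(build_map([2, 4, 6]), 6)
reordered = prefetch_candidates(build_map([4, 6, 2]), 6)
print(in_order, reordered)
```

Because the map records which lines were touched rather than when, both arrival orders yield the same candidate, which is the order-free property the slides emphasize.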
AMPM Prefetch
The memory address space is divided into zones
Detects hot zones
Memory access map table
  LRU replacement
Pattern matching
[Figure: the memory address space is split into zones, some of which are hot zones. An access to a zone selects that zone's map (for example P S A I ・・・) in the memory access map table, and the pattern match logic turns it into a prefetch request.]
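The table of hot zones can be sketched as a small fully associative structure with LRU replacement. The class and method names are hypothetical, and the single-character states are only a stand-in for the 2-bit encoding; the default sizes follow the DPC configuration (52 maps, 256 states per map).

```python
from collections import OrderedDict


class AccessMapTable:
    """Fully associative table of per-zone access maps with LRU replacement."""

    def __init__(self, capacity=52, lines_per_zone=256):
        self.capacity = capacity
        self.lines_per_zone = lines_per_zone
        self.maps = OrderedDict()  # zone number -> list of per-line states

    def lookup(self, zone):
        """Return the map of a zone, allocating one (and evicting) if needed."""
        if zone in self.maps:
            self.maps.move_to_end(zone)        # mark as most recently used
        else:
            if len(self.maps) >= self.capacity:
                self.maps.popitem(last=False)  # evict the least recently used
            self.maps[zone] = ["I"] * self.lines_per_zone
        return self.maps[zone]


table = AccessMapTable(capacity=2, lines_per_zone=4)
table.lookup(0)[1] = "A"   # zone 0 becomes hot, line 1 accessed
table.lookup(1)
table.lookup(0)            # zone 0 is touched again
table.lookup(2)            # table is full: zone 1 is evicted
print(list(table.maps))
```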
Features of AMPM Prefetcher
Pattern-matching-based prefetching
  Map-based history
  Optimization-friendly prefetching
Parallel pattern matching
  Searches candidates effectively
  Complexity-effective implementation
Configuration for DPC Competition
AMPM prefetcher
  Fully associative, 52 maps, 256 states / map
Adaptive stream prefetcher [Hur+ 2006]
  16 histograms, 8 stream length
MSHR configuration
  16 entries for demand requests (default)
  32 entries for prefetch requests (additional)
Budget Count
Component: fields per entry; entries; budget
MSHR: valid bit (1 bit), address bits (26 bit); 16 entries; 0 bit (default)
Prefetch MSHR: valid bit (1 bit), address bits (26 bit), issue bit (1 bit); 32 entries + 5-bit pointer; 901 bit
Memory Access Map Table: address tag (18 bit), LRU status (6 bit), access counter (4 bit), interval timer (18 bit), access map (256 x 2 bit); 52 entries + mode register (3 bit) + performance counters (32 bit x 4); 29147 bit
Adaptive Stream Filter: valid bit (1 bit), address bits (26 bit), lifetime (10 bit), stream length (4 bit), direction (1 bit); 16 entries; 672 bit
Stream Length Histogram: counter (16 bit); 16 entries x 2 series x 2 directions; 1024 bit
Pipeline Registers: 292 bit
Total: 32036 bit
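The budget figures can be re-derived from the field widths and entry counts listed on the slide. The variable names are illustrative; the pipeline register size is taken as given, and the default MSHR counts as 0 bits.

```python
# Bits per entry times entries, plus any shared registers.
prefetch_mshr = (1 + 26 + 1) * 32 + 5                        # + 5-bit pointer
access_map_table = (18 + 6 + 4 + 18 + 256 * 2) * 52 + 3 + 32 * 4
stream_filter = (1 + 26 + 10 + 4 + 1) * 16
histogram = 16 * 16 * 2 * 2        # 16-bit counters, 16 entries, 2 x 2
pipeline_registers = 292           # taken directly from the slide

total = (prefetch_mshr + access_map_table + stream_filter
         + histogram + pipeline_registers)
print(prefetch_mshr, access_map_table, stream_filter, histogram, total)
```

The access maps alone take 52 x 512 = 26624 bits, so they account for most of the 32036-bit total.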
Methodology
Simulation environment
  DPC framework
  Skips the first 4000M instructions and evaluates the following 100M instructions
Benchmark
  SPEC CPU2006 benchmark suite
  Compile options: "-O3 -fomit-frame-pointer -funroll-all-loops"
IPC Measurement
Improves performance by 53%
Improves performance in all benchmarks
[Figure: instructions per cycle (axis from 0.0 to 3.5), NOPREF vs. PREFETCH, for 400.perlbench, 401.bzip2, 403.gcc, 410.bwaves, 416.gamess, 429.mcf, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 445.gobmk, 447.dealII, 450.soplex, 453.povray, 454.calculix, 456.hmmer, 458.sjeng, 459.GemsFDTD, 462.libquantum, 464.h264ref, 465.tonto, 470.lbm, 471.omnetpp, 473.astar, 481.wrf, 482.sphinx3, 483.xalancbmk, and the arithmetic mean.]
L2 Cache Miss Count
Reduces L2 cache misses by 76%
[Figure: L2 miss count per 100M instructions (axis from 0 to 5,000,000), without prefetch vs. with prefetch, for each SPEC CPU2006 benchmark (400.perlbench through 483.xalancbmk) and the arithmetic mean.]
Related Works
Sequence-based prefetching
  Sequential Prefetch [Smith+ 1978]
  Stride Prefetching Table [Fu+ 1992]
  Markov Predictor [Joseph+ 1997]
  Global History Buffer [Nesbit+ 2004]
Adaptive prefetching
  AC/DC [Nesbit+ 2004]
  Feedback Directed Prefetch [Srinath+ 2007]
  Focus Prefetching [Manikantan+ 2008]
Conclusion
Access Map Pattern Matching Prefetch
  Order-free prefetch: optimization-friendly prefetching
  Parallel pattern matching: complexity-effective implementation
Optimized AMPM realizes good performance
  Improves IPC by 53%
  Reduces L2 cache misses by 76%
Q & A
[Figure: map of related prefetching work with AMPM Prefetch [Ishii+ 2009] placed in it. Region labels: Spatial, Sequence-Base (Order Sensitive), Adaptive, Hybrid, Software, Commercial Processors.]
Works shown:
  Stride Prefetch [Fu+ 1992]
  Markov Prefetch [Joseph+ 1997]
  GHB [Nesbit+ 2004]
  Feedback based [Honjo 2009]
  Hybrid [Hsu+ 1998]
  Software Support [Mowry+ 1992]
  AC/DC [Nesbit+ 2004]
  Adaptive Stream [Hur+ 2006]
  FDP [Srinath+ 2007]
  Tag Correlation [Hu+ 2003]
  Buffer Block [Gindele 1977]
  SMS [Somogyi 2006]
  Sequential [Smith+ 1978]
  RPT [Chen+ 1995]
  Locality Detect [Johnson+ 1998]
  Spatial Pat. [Chen+ 2004]
  Adaptive Seq. [Dahlgren+ 1993]
  HW/SW Integrate [Gornish+ 1994]
  AMPM Prefetch [Ishii+ 2009]
Commercial processors: SuperSPARC, R10000, PA7200, Power4, Pentium 4