Post on 20-Jan-2016
Decomposing Memory Performance
Data Structures and Phases
Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley
Department of Computer Sciences
The University of Texas at Austin
Memory hierarchy trends
• Growing latency to main memory
• Growing cache complexity
  – More cache levels
  – New mechanisms, optimizations
• Growing application complexity
  – Lots of abstraction
Application-system interactions are increasingly hard to predict
The solution: More fine-grained metrics
• More insight within an application
• More rigorous comparisons across applications
• Potential applications:
  – Hardware/software tuning
  – Global hints for online phase detection
Our approach: data structure decomposition
High-level, easy to understand
Highlights important access patterns
ammp vs twolf: The tale of two applications
Conventional view: they’re pretty similar
• IPC: 0.57 vs 0.51
• DL1 miss rate: 10% vs 9.5%
• Access patterns
  – Lots of pointer access in both
  – Mostly linked list traversal
ammp vs twolf: Data structure decomposition
[Figure: stacked bars of DL1 misses (%) for ammp and twolf, split into DS1, DS2, DS3, and Rest; y-axis 0 to 100.]
ammp vs twolf: Access patterns
twolf (i = rand()):
t1 = b[c[i]->cblock]
t2 = t1->tileterm
t3 = n[t2->net]
…
ammp:
atom = atom->next
atom[i] -> neighbour[j], advancing with ++j and ++i
twolf has more complex access patterns
ammp vs twolf: Phase behavior
[Figure: total DL1 misses (millions) over time for ammp (y-axis 0 to 18) and twolf (y-axis 0 to 9); the time axis spans 60 billion cycles.]
ammp vs twolf: Phase behavior by data structure
[Figure: DL1 misses (millions) over time, per data structure. ammp (y-axis 0 to 18): total, atoms, nodelist. twolf (y-axis 0 to 9): total, netptr, tmp_rows.]
ammp has more interesting phase behavior
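A per-data-structure phase plot like the one described above needs misses bucketed by both time and data structure. The following is a minimal sketch of that bookkeeping, not the authors' implementation: the function name `misses_per_sample`, the event format, and the example numbers are all assumptions made for illustration.

```python
from collections import defaultdict

def misses_per_sample(miss_events, period):
    """Bucket miss events by data structure and by sample index.

    miss_events: iterable of (cycle, data_structure_name) pairs (hypothetical format).
    period: sampling period in cycles.
    Returns {name: {sample_index: miss_count}}.
    """
    series = defaultdict(lambda: defaultdict(int))
    for cycle, name in miss_events:
        series[name][cycle // period] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {name: dict(buckets) for name, buckets in series.items()}

# Made-up events: three misses on "atoms", one on "nodelist".
events = [(5, "atoms"), (12, "atoms"), (14, "nodelist"), (27, "atoms")]
series = misses_per_sample(events, period=10)
print(series)
```

Each inner dictionary is one curve in the plot; summing the curves per sample index would give the "total" line.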
Outline
• Motivation
• Data structure decomposition
• Phase analysis: selecting sampling period
• Results:
  – Aggregate
  – Phase
Data structure decomposition
Application communicates with simulator
Leave core application oblivious; automatically add simulator-aware instrumentation
[Diagram: Application, Simulator, Resources]
DTrack
Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics
- DTrack’s protocol for application-simulator communication
DTrack’s protocol
1. Application stores mapping at a predetermined shared location
  – (start address, end address) → variable name
2. ...and signals simulator by special opcode
  • Other techniques possible
3. Simulator detects signal, reads shared location
Simulator now knows variable names of address regions
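The three protocol steps can be modeled in a few lines. This is only a sketch under stated assumptions: the `Simulator` class, the `mailbox` dictionary standing in for the shared location, the `announce` helper standing in for the special opcode, and the half-open address ranges are all hypothetical and are not taken from DTrack itself.

```python
from bisect import bisect_right

class Simulator:
    """Toy simulator side of the protocol (all names are hypothetical)."""

    def __init__(self, mailbox):
        self.mailbox = mailbox   # the predetermined shared location
        self.starts = []         # sorted region start addresses
        self.regions = []        # (start, end, name), parallel to self.starts

    def on_signal(self):
        # Step 3: the signal was detected, so read the shared location.
        start, end, name = self.mailbox["mapping"]
        i = bisect_right(self.starts, start)
        self.starts.insert(i, start)
        self.regions.insert(i, (start, end, name))

    def name_of(self, addr):
        # Find the last region starting at or before addr, then check its end.
        i = bisect_right(self.starts, addr) - 1
        if i >= 0:
            start, end, name = self.regions[i]
            if start <= addr < end:   # assumption: end address is exclusive
                return name
        return None

def announce(mailbox, sim, start, end, name):
    mailbox["mapping"] = (start, end, name)  # step 1: store the mapping
    sim.on_signal()                          # step 2: stands in for the special opcode

mailbox = {}
sim = Simulator(mailbox)
announce(mailbox, sim, 0x2000, 0x2800, "nodelist")
announce(mailbox, sim, 0x1000, 0x1400, "atoms")
print(sim.name_of(0x1010), sim.name_of(0x27FF), sim.name_of(0x1800))
```

With a lookup like `name_of`, every miss address seen by the simulator can be charged to a named data structure, or to no structure when it falls outside all registered regions.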
Instrumentation without perturbance
• Global segment: write to file
  – Expensive, but one-time cost during initialization
  – Amortized across all global variables
• Heap: save in special variables after every malloc/free
  – Overhead ∝ frequency of mallocs/frees
  – Special variables always hit in cache
• Stack: no instrumentation
  – Function calls too frequent
  – Causes negligible misses anyway
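For the heap case, the bookkeeping amounts to adding a region on every malloc and dropping it on every free, which is why the overhead scales with allocation frequency. The sketch below models only that bookkeeping; `HeapTracker`, its method names, and the addresses are assumptions for illustration, and it does not model the special variables or the cache behavior mentioned above.

```python
class HeapTracker:
    """Tracks live heap regions by variable name (hypothetical helper)."""

    def __init__(self):
        self.live = {}  # start address -> (end address, variable name)

    def on_malloc(self, addr, size, name):
        # Record the region returned by an allocation.
        self.live[addr] = (addr + size, name)

    def on_free(self, addr):
        # Forget the region; later accesses to it are no longer attributed.
        self.live.pop(addr, None)

    def name_of(self, addr):
        # Linear scan keeps the sketch short; a sorted structure would be faster.
        for start, (end, name) in self.live.items():
            if start <= addr < end:
                return name
        return None

tracker = HeapTracker()
tracker.on_malloc(0x5000, 64, "netptr")
before_free = tracker.name_of(0x5010)
tracker.on_free(0x5000)
after_free = tracker.name_of(0x5010)
tracker.on_malloc(0x5000, 32, "tmp_rows")   # the address is reused by another variable
after_reuse = tracker.name_of(0x5010)
print(before_free, after_free, after_reuse)
```

The reuse case shows why frees have to be reported as well as mallocs: without them, misses on recycled memory would be charged to the wrong variable.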
Measuring perturbance
• Communicate specific start and end points in application to simulator
• Compare instruction counts between them with and without instrumentation
ΔInstruction count < 4%, even with frequent malloc
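The comparison reduces to a relative difference between two instruction counts taken between the same start and end points. A minimal sketch follows, with a hypothetical function name and made-up counts that are not measurements from the talk:

```python
def instruction_overhead(baseline_count, instrumented_count):
    """Relative increase in instructions executed between the two markers."""
    return (instrumented_count - baseline_count) / baseline_count

# Made-up counts, chosen only to illustrate the calculation.
overhead = instruction_overhead(1_000_000, 1_030_000)
print(f"{overhead:.1%}")
```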
The importance of sampling period
Good sampling period ⇒ Low noise
[Figure: DL1 misses (thousands) per sample over time, plotted at two sampling periods: 10M cycles and 230M cycles.]
Volatility: A noise metric for time sequence graphs
Raw data stream → (aggregate for some sampling period) → Miss graph → (point volatilities) → Volatility graph → (sort, extract 90th percentile) → Volatility value
$$\text{Point volatility} = \frac{|X_t - X_{t-1}|}{\max(X_t, X_{t-1})}$$
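The volatility computation can be sketched directly from the definition: compute a point volatility for each pair of neighboring samples, sort them, and take the 90th percentile. Two details below are assumptions, since the slides do not specify them: the nearest-rank percentile rule, and treating a pair of zero samples as volatility 0.

```python
import math

def point_volatilities(samples):
    """abs(X_t - X_{t-1}) / max(X_t, X_{t-1}) for each pair of neighbors."""
    vols = []
    for prev, cur in zip(samples, samples[1:]):
        peak = max(prev, cur)
        # Assumption: two zero samples count as no change.
        vols.append(abs(cur - prev) / peak if peak > 0 else 0.0)
    return vols

def volatility(samples, percentile=0.90):
    """90th-percentile point volatility of a sampled miss series."""
    vols = sorted(point_volatilities(samples))
    if not vols:
        return 0.0
    # Assumption: nearest-rank percentile.
    rank = max(1, math.ceil(percentile * len(vols)))
    return vols[rank - 1]

noisy = volatility([100, 100, 50, 100])
flat = volatility([100, 100, 100])
print(noisy, flat)
```

Because each point volatility is normalized by the larger of the two samples, the metric stays between 0 and 1 for non-negative miss counts.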
Volatility depends on sampling period
Raw data stream → (aggregate by sampling period) → (point volatilities) → Volatility graph → Volatility value
Volatility profile: Volatility vs sampling period
[Figure: volatility (0 to 1) against sampling period (millions of samples, 0 to 480) for 164.gzip.]
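A volatility profile can be produced by re-aggregating the same raw stream at several sampling periods and computing one volatility value per period. The sketch below is an illustration under assumptions: the helper names, the nearest-rank percentile rule, dropping a partial final sample, and the synthetic alternating stream are not from the talk.

```python
import math

def volatility(samples, percentile=0.90):
    """90th-percentile point volatility (nearest-rank rule is an assumption)."""
    vols = sorted(
        abs(b - a) / max(a, b) if max(a, b) > 0 else 0.0
        for a, b in zip(samples, samples[1:])
    )
    if not vols:
        return 0.0
    return vols[max(1, math.ceil(percentile * len(vols))) - 1]

def aggregate(raw, period):
    """Sum consecutive groups of `period` raw samples; drop a partial tail."""
    return [sum(raw[i:i + period]) for i in range(0, len(raw) - period + 1, period)]

def volatility_profile(raw, periods):
    """Map each candidate sampling period to the volatility it produces."""
    return {p: volatility(aggregate(raw, p)) for p in periods}

# Synthetic stream that alternates between 0 and 10 misses per raw sample.
raw = [0, 10] * 8
profile = volatility_profile(raw, [1, 2, 4])
print(profile)
```

In this synthetic case the noise has a period of two raw samples, so any sampling period that is a multiple of two smooths it out completely, while a period of one sees maximum volatility.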
Methodology
• Source translator: C-Breeze
• Compiler: Alpha GEM cc
• Simulator: sim-alpha
  – Validated model of 21264 pipeline
• Simulated machine: Alpha 21264
  – 4-way issue, 64KB 3-cycle DL1
• Benchmarks: 12 C applications from SPEC CPU2000 suite
Major data structures by DL1 misses
[Figure: stacked bars of % DL1 misses for the top three data structures (#1, #2, #3) in 164.gzip, 175.vpr, 176.gcc, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 256.bzip2, and 300.twolf.]
Most misses ≣ Most pipeline stalls?
• Process:
  – Detect stall cycles when no instructions were committed
  – Assign blame to data structure of oldest instruction in pipeline
• Results
  – Stall cycle ranks track miss count ranks
  – Exceptions:
    • tds in 179.art
    • search in 186.crafty
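The blame-assignment process can be sketched as a pass over a per-cycle trace. The trace format, the `blame_stalls` function, and the toy address lookup are all assumptions for illustration; a real simulator would take this information from its pipeline state.

```python
from collections import Counter

def blame_stalls(cycles, name_of):
    """Count stall cycles per data structure.

    cycles: iterable of (instructions_committed, address_accessed_by_oldest_instruction)
            pairs, one per cycle (hypothetical trace format).
    name_of: function mapping an address to a data structure name, or None.
    """
    blame = Counter()
    for committed, oldest_addr in cycles:
        # A stall cycle is one in which no instructions were committed.
        if committed == 0 and oldest_addr is not None:
            blame[name_of(oldest_addr) or "unknown"] += 1
    return blame

def toy_name_of(addr):
    # Made-up address ranges standing in for the simulator's region table.
    if 0x1000 <= addr < 0x2000:
        return "atoms"
    if 0x2000 <= addr < 0x3000:
        return "nodelist"
    return None

trace = [(4, 0x1000), (0, 0x1008), (0, 0x1008), (2, 0x2000), (0, 0x2010), (0, 0x9000)]
blame = blame_stalls(trace, toy_name_of)
print(blame.most_common())
```

Ranking the resulting counts and comparing them with the per-structure miss ranks is one way to check whether the two orderings agree.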
Types of phase behavior
[Figure: DL1 misses (millions) over time in two panels, I. mcf and II. art; the time axis spans 115 billion cycles.]
Types of phase behavior
[Figure: DL1 misses (millions) over time in cycles for panel III. mesa.]
Summary
• More detailed metrics ⇒ richer application comparison
• Low-overhead data structure decomposition
• Determining ideal sampling period
  – A volatility metric inspired by spectral analysis
• Ideal sampling period is application-specific
• Data structures in an application share common phase boundaries