Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization Alex Shye M.S. Thesis Defense Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani University of Colorado at Boulder Department of Electrical and Computer Engineering DRACO Architecture Research Group

Post on 19-Dec-2015


Page 1: Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization Alex Shye M.S. Thesis Defense Committee: Daniel A. Connors,

Exploring the Potential of Performance Monitoring Hardware to

Support Run-time Optimization

Alex Shye

M.S. Thesis Defense

Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani

University of Colorado at Boulder

Department of Electrical and Computer Engineering

DRACO Architecture Research Group

Thesis Statement

• Hardware Performance Monitoring (HPM) can be utilized to provide a low-overhead alternative to current techniques for profiling run-time code behavior.

Introduction

• Profile information is critical to success of profile-based optimizations
– Point Profile: BB count, edge profile, etc.
– Path Profile: correlated points

• Off-line Path Profiling Methods:
– Use static/dynamic instrumentation to gather full path profile

• On-line Path Profiling Method:
– Interpretation and MRET

• Both incur high overhead!!
– Slowdown of 2-3x with Pin for BB counting

[Figure: example control flow graph with blocks A, B, C, D, E, F, G and edge weights 80, 20, 70, 30]

Edge Profile: ABDFG 70-50

Path Profile: ABDFG 60, ACDFG 10, …

Performance Monitoring

• HPM through on-chip Performance Monitoring Units (PMUs)
– Itanium, Pentium 4, PowerPC
– Coarse-grained, fine-grained features

• Obstacles to PMU profiling
– Non-deterministic (sampling)
– Sample aliasing
– Less information

• Compiler analysis can extend PMU information!!!

Itanium-2 PMU Features (Feature: Description)

• Event Counters: Counts of coarse-grained events, e.g. CPU cycles, flushes, etc.
• Branch Trace Buffer (BTB): Records a branch vector of the last 4 branches executed. Filters: T/NT, predicted correct/mispredicted, etc.
• Instruction Event Address Registers (IEAR): Sample Icache/ITLB misses. Addresses and latency.
• Data Event Address Registers (DEAR): Sample Dcache, DTLB, ALAT misses. Addresses and latency.

Goal: Use sampled branch vectors on PMU to derive a path profile comparable to software path profiling techniques.

Contributions

I. Characterize the information provided by PMU sampling of branch vectors

II. Characterize the effect of compiler analysis on PMU information

III. Demonstrate the construction of a PMU-based path profiler

PMU Profiling Framework

[Figure: PMU profiling framework. Online components: PMU, Kernel Buffer, perfmon interface (interrupt on kernel buffer overflow), Branch Vector Hash Table, Intermediate File. Offline Compiler Analysis components: Branch Vectors, Partial Paths, Partial Path Extensions, Dominator Analysis, Path Profile Generation, Address Map, Annotated Binary, Profile Information.]

Terminology

Branch Vector: Series of addresses from BTB

Partial Path: Path of ops in compiler IR

PMU Configuration

• Itanium-2 PMU BTB masks
– Taken Mask (All, T, NT, None)
– Predicted Target Address Mask (All, Correct, Incorrect, None)
– Predicted Predicate Mask (All, Correct, Incorrect, None)
– Branch Type Mask (All, Indirect, Return, IP-relative)

• Configuration depends on goal
– Branch prediction performance? Building call graph?

• PMU configured to sample only taken branches for path information
– Not taken branches can be inferred in control flow graph
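
The inference of not-taken branches can be illustrated with a minimal sketch. It assumes a hypothetical representation that is not taken from the slides: basic blocks are identified by their sorted start addresses, and each branch vector entry is a (source address, target address) pair for one taken branch. Between the target of one taken branch and the source of the next, every block is assumed to have fallen through, so any branch ending those blocks was not taken.

```python
from bisect import bisect_right

def block_index(block_starts, addr):
    """Index of the block containing addr (block_starts must be sorted)."""
    return bisect_right(block_starts, addr) - 1

def partial_path(block_starts, branch_vector):
    """Rebuild the block sequence implied by a vector of taken branches.

    branch_vector is a list of (source, target) address pairs; this layout
    is an assumption made for illustration.
    """
    first_source = branch_vector[0][0]
    path = [block_starts[block_index(block_starts, first_source)]]
    for (_, target), (next_source, _) in zip(branch_vector, branch_vector[1:]):
        lo = block_index(block_starts, target)
        hi = block_index(block_starts, next_source)
        # Blocks lo..hi executed by fall-through: their branches were not taken.
        path.extend(block_starts[lo:hi + 1])
    last_target = branch_vector[-1][1]
    path.append(block_starts[block_index(block_starts, last_target)])
    return path

# Hypothetical layout: five blocks of 16 bytes each.
blocks = [0x100, 0x110, 0x120, 0x130, 0x140]
vector = [(0x10C, 0x120), (0x13C, 0x100)]
path = partial_path(blocks, vector)
```

In this example block 0x110 is skipped by the first taken branch, while 0x120 falls through into 0x130, which is how a not-taken branch shows up in the reconstructed path.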

Partial Path Extensions

• Compiler view of CFG can be used to extend paths

• Extend until point of uncertainty
– Up until Join Point
– Down until Branch Point

[Figure: CFG showing BTB branch vector 1-2-3-4, the partial path from the branch vector, and the extended partial path, bounded by a join point and a branch point]

Dominator Analysis

• Dominator Analysis
– Finds all blocks guaranteed to execute

• Partial Path Extensions
– Subset of dominator analysis
– Constrained to a path

[Figure: CFG showing BTB branch vector 1-2-3-4, the partial path from the branch vector, the join point and branch point, and the basic blocks added with dominator analysis]

Terminology

Dominator: u dominates v if all paths from Entry to v include u

Post Dominate: u post-dominates v if all paths from v to Exit include u
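
The two definitions above can be sketched with the standard iterative dominator computation. This is an illustration under assumptions, not the thesis implementation: the CFG encoding, the function names, and the premise that the program reaches Exit normally (needed for post-dominators to imply execution) are all hypothetical.

```python
def dominators(succs, entry):
    """Map each block to the set of blocks that dominate it."""
    nodes = set(succs) | {v for vs in succs.values() for v in vs}
    preds = {n: set() for n in nodes}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def reverse(succs):
    """Reverse every edge so post-dominators can reuse dominators()."""
    rev = {}
    for u, vs in succs.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    return rev

def guaranteed_blocks(succs, entry, exit_, observed):
    """Blocks that must execute given that the observed blocks executed."""
    dom = dominators(succs, entry)
    pdom = dominators(reverse(succs), exit_)
    result = set()
    for b in observed:
        result |= dom[b] | pdom[b]
    return result

# Hypothetical diamond CFG followed by a tail block.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
covered = guaranteed_blocks(cfg, "A", "E", {"B"})
```

Observing only B yields A (a dominator of B) plus D and E (post-dominators of B), while C stays uncovered because no sample proves it ran.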

Path Profile Generation

• Combine compiler analysis and PMU branch vectors to generate a path profile comparable to software path profiling techniques

• Issues:
– Path of a branch vector inherently different
  • Random start and end of path - path ambiguity
  • Spans boundaries compiler-based paths do not
– Number of paths increases exponentially

• Must map PMU paths to compiler paths
– Region Formation
– Split partial paths
– Path Matching
– Path Crediting

[Figure: CFG showing a hot path and a BTB trace]

Region Formation

• Use region-based paths
– Makes total # paths more manageable

• Functions can be large

• Create loop-based regions
– Programs spend most of time in loops

• Rules for Region R:
– R must be single entry
– R may not cross function boundaries
– R may not cross loop boundaries

[Figure: control flow graph with blocks A through Y partitioned into Region 1, Region 2, and Region 3]

Path Matching and Crediting

• Path Matching
– Find list of all paths that contain partial path

• Path Crediting
– Distribute partial path weight equally among matched paths

• Ex. ABDLMOP, ABDEFHIK, OPRSUVX

Partial Path | Count | Matches | Inc | Total
ABDLMOP | 100 | ABDLMOPRSUVX | +25 | 25
 | | ABDLMOPRSUWX | +25 | 25
 | | ABDLMOPRSUVX | +25 | 25
 | | ABDLMOPRSUWX | +25 | 25
ABD | 160 | ABDLMOPRSUVX | +10 | 35
 | | …(14 more) | |
 | | ABDLNOQRTUWX | +10 | 10
EFHIK | 160 | EFHIK | +160 | 160
OPRSUVX | 280 | ABDLMOPRSUVX | +70 | 105
 | | ABDLNOPRSUVX | +70 | 80
 | | ACDLMOPRSUVX | +70 | 70
 | | ACDLNOPRSUVX | +70 | 70

[Figure: control flow graph with blocks A through Y partitioned into Region 1, Region 2, and Region 3]
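
The matching and crediting steps can be sketched as follows. The set of full paths is an assumption: it is enumerated from a hypothetical CFG shape, A(B|C)DL(M|N)O(P|Q)R(S|T)U(V|W)X plus EFHIK, chosen so that the match counts agree with the table (4, 16, 1, and 4 matches). Function and variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

def credit_paths(full_paths, partial_counts):
    """Split each partial path's count equally among the full paths containing it."""
    totals = defaultdict(float)
    for partial, count in partial_counts.items():
        # Path matching: block names are single letters, so a substring test
        # checks that the partial path is a contiguous sub-path.
        matches = [p for p in full_paths if partial in p]
        if not matches:
            continue  # no compiler path contains this partial path
        # Path crediting: equal share to every matched path.
        share = count / len(matches)
        for p in matches:
            totals[p] += share
    return dict(totals)

# Hypothetical enumeration of compiler paths (an assumption, see lead-in).
full_paths = ["A" + bc + "DL" + mn + "O" + pq + "R" + st + "U" + vw + "X"
              for bc, mn, pq, st, vw in product("BC", "MN", "PQ", "ST", "VW")]
full_paths.append("EFHIK")

# Partial path counts from the table above.
partial_counts = {"ABDLMOP": 100, "ABD": 160, "EFHIK": 160, "OPRSUVX": 280}
totals = credit_paths(full_paths, partial_counts)
```

Under this enumeration ABDLMOPRSUVX accumulates 25 + 10 + 70 = 105 and ABDLNOPRSUVX accumulates 10 + 70 = 80, matching the Total column. Equal splitting is the simplest crediting policy; it spreads weight evenly when a short partial path such as ABD is ambiguous.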

Methodology

• Experiments run on Itanium-2 with 2.6.10 kernel

• Developed tool using perfmon kernel interface and libpfm-3.1 to interface with PMU

• Benchmarks
– Set of SPEC2000 benchmarks
– Compiled with the OpenIMPACT Research Compiler

• Compared to full path profile gathered with a Pin path profiling tool

Effect of Sampling Period

[Figure: Percent Overhead vs. Sampling Period. x-axis: Sampling Period (cpu cycles), 50K to 10M; y-axis: Percent Overhead, 0 to 50]

• Sampling Overhead due to:
– Periodic interrupt, copying between buffers, hash table insertion

PMU vs Actual Instruction Distribution

• Kullback-Leibler Divergence (Entropy)

– $d = \sum_{k \ge 0} p_k \log_2 \left( p_k / q_k \right)$

• Relative measure of distance between two distributions
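
A minimal sketch of the divergence computation, assuming the two distributions arrive as per-instruction counts that are first normalized to probabilities. The helper names are hypothetical, and terms with p_k = 0 are skipped (the usual convention); q_k is assumed to be nonzero wherever p_k is nonzero.

```python
import math

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """d = sum_k p_k * log2(p_k / q_k), skipping terms where p_k == 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Identical distributions have zero divergence.
same = kl_divergence(normalize([10, 30, 60]), normalize([1, 3, 6]))
# A distribution concentrated on one outcome versus a uniform one: 1 bit.
one_bit = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

The measure is not symmetric: swapping p and q generally gives a different value, which is why the slide calls it a relative measure rather than a true distance.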

Code Coverage

• Explore how PMU branch vectors translate to code coverage information

• Code Coverage Types
– Single BB: Simulates PC-sampling
– Branch Vectors
– Branch Vectors w/ Dom. Analysis

• Coverage percentage is percent of actually covered code discovered with compiler-aided analysis of branch vectors

Number of Instructions and Actual Code Covered

Benchmark | # Ops | # Covered Ops
164.gzip | 6,466 | 3,063 (47%)
175.vpr | 23,573 | 12,229 (52%)
177.mesa | 89,006 | 7,390 (8%)
179.art | 2,201 | 1,515 (69%)
181.mcf | 1,973 | 1,401 (71%)
183.equake | 3,033 | 2,265 (75%)
188.ammp | 19,562 | 5,835 (30%)
197.parser | 17,541 | 11,271 (64%)
256.bzip2 | 5,095 | 3,138 (62%)
300.twolf | 40,490 | 15,705 (39%)

Code Coverage

Hot Instruction Thresholds

• For top 10-30% of instructions, code coverage does well (80-100%)

• Drops off at around 40-50% of hot instructions

[Figure: Coverage for Hot Instruction Thresholds. x-axis: Percentage of Instructions (sorted by execution count), 10 to 100; y-axis: Percent Coverage, 0 to 100; series: 164.gzip, 175.vpr, 177.mesa, 181.mcf, 197.parser, 300.twolf]

Stability

• Across 20 runs, PMU code coverage varies ~5-10%

Multiple Runs

• Regular Sampling: 1) gzip, parser, twolf improve greatly

• Randomized Sampling may discover code regular sampling cannot

Partial Path Characteristics

• Partial Path extensions increase length ~20%

• However, splitting drastically decreases lengths
– ~30% on function boundaries, ~20% more on loop back edges

[Figure: Partial Path Lengths. x-axis: Benchmark (gzip, vpr, mesa, art, mcf, equake, ammp, parser, bzip2, twolf); y-axis: Length (number of IR ops), 0 to 80; series: Initial Partial Paths, Extended Partial Paths, Split on Func. and Loop Boundaries]

Accuracy Results

• Accuracy measured similar to Wall’s weight matching scheme [Wall91]
– Threshold = .125%

[Figure: Accuracy vs. Sampling Period. x-axis: Sampling Period, 50K to 500M; y-axis: Accuracy (%), 0 to 100; series: 164.gzip, 175.vpr, 177.mesa, 179.art, 181.mcf, 183.equake, 188.ammp, 197.parser, 256.bzip2, 300.twolf]
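
One common formulation of a weight-matching accuracy metric is sketched below. It is an assumption about how such a metric might be computed, not the exact scheme used in the thesis: actual hot paths are those whose share of total actual flow meets the threshold, the estimated profile nominates the same number of paths by rank, and accuracy is the actual flow captured by the nominees divided by the actual flow of the true hot set. All names are hypothetical.

```python
def weight_match_accuracy(actual, estimated, threshold=0.00125):
    """Fraction of actual hot-path flow captured by the estimated hot paths.

    actual and estimated map path -> weight; threshold is a fraction of
    total actual flow (0.00125 corresponds to .125%).
    """
    total = sum(actual.values())
    hot_actual = [p for p, w in actual.items() if w / total >= threshold]
    if not hot_actual:
        return 1.0  # nothing qualifies as hot, so nothing can be missed
    # Nominate as many paths from the estimate as there are actual hot paths.
    ranked = sorted(estimated, key=estimated.get, reverse=True)
    hot_estimated = ranked[:len(hot_actual)]
    matched = sum(actual.get(p, 0) for p in hot_estimated)
    return matched / sum(actual[p] for p in hot_actual)

# Hypothetical profiles with a coarse threshold to keep the example small.
actual = {"a": 60, "b": 30, "c": 10}
estimated = {"a": 50, "c": 40, "b": 10}
score = weight_match_accuracy(actual, estimated, threshold=0.2)
```

Here the true hot set is {a, b} with flow 90, the estimate nominates {a, c} with actual flow 70, and the score is 70/90.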

Conclusion

• Motivates and presents initial results and rationale for PMU-based profiling

• Characterizes branch vector sampling
– Improves code coverage > 50% over PC-sampling
– Branch vector paths are inter-procedural

• Characterizes effect of compiler analysis
– Partial path extensions increase length by ~20%
– Dominator analysis on branch vectors improves code coverage > 50%

• Demonstrates construction of a PMU-based path profiler
– ~85% accurate at 1% overhead (at sampling period of 5M)

Questions?