
Improving the Speed and Quality of Architectural Performance Evaluation

Vijay S. Pai

with contributions from: Derek Schuff, Milind Kulkarni

Electrical and Computer Engineering, Purdue University

Outline

•Intro to Reuse Distance Analysis
▫Contributions

•Multicore-Aware Reuse Distance Analysis
▫Design
▫Results

•Sampled Parallel Reuse Distance Analysis
▫Design: Sampling, Parallelism
▫Results
▫Application: selection of low-locality code

2

Reuse Distance Analysis

•Reuse Distance Analysis (RDA): an architecture-neutral locality profile
▫Number of distinct data elements referenced between use and reuse of a data element
▫Elements can be memory pages, disk blocks, cache blocks, etc.

•Machine-independent model of locality
▫Predicts hit ratio in any size fully-associative LRU cache
▫Hit ratio in a cache with X blocks = % of references with RD < X
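As a concrete illustration (not from the slides), a minimal Python sketch of this prediction: given a reuse distance histogram, the hit ratio of a fully-associative LRU cache with X blocks is the fraction of references with RD < X. The histogram values below are hypothetical.

```python
# Sketch: predicting LRU hit ratio from a reuse distance histogram.
# hist[d] counts references with reuse distance d; cold (infinite-
# distance) references are tracked separately and always miss.
def hit_ratio(hist, cold_refs, cache_blocks):
    total = sum(hist.values()) + cold_refs
    hits = sum(n for d, n in hist.items() if d < cache_blocks)
    return hits / total

hist = {0: 50, 1: 30, 5: 15}   # hypothetical profile
print(hit_ratio(hist, cold_refs=5, cache_blocks=4))  # 0.8
```

Because the profile is machine-independent, the same histogram answers the question for every cache size at once.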

3

Reuse Distance Analysis

•Applications in performance modeling and optimization
▫Multiprogramming/scheduling interaction, phase prediction
▫Cache hint generation, code restructuring, data layout

4

Reuse Distance Profile Example

5

Reuse Distance Measurement

•Maintain stack of all previous data addresses

•For each reference:
▫Search stack for the referenced address
▫Depth in stack = reuse distance; if not found, distance = ∞
▫Remove from stack, push on top
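The stack algorithm above can be sketched in a few lines of Python (a minimal illustration, not the authors' implementation):

```python
# Sketch: the O(NM) list-based stack algorithm for reuse distance.
INF = float("inf")

def reuse_distances(trace):
    stack = []                    # most recently used address at index 0
    out = []
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)  # depth in stack = reuse distance
            stack.pop(d)
        else:
            d = INF                # first reference to this address
        stack.insert(0, addr)      # push on top
        out.append(d)
    return out

print(reuse_distances(["A", "B", "C", "C", "B", "A"]))
# [inf, inf, inf, 0, 1, 2]
```

The linear stack search per reference is what the tree-based methods on a later slide replace with an O(log M) operation.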

6

Example

7

Address:   A B C C B A
Distance:  ∞ ∞ ∞ 0 1 2

RDA Applications

•VM page locality [Mattson 1970]
•Cache performance prediction [Beyls01, Zhong03]
•Cache hinting [Beyls05]
•Code restructuring [Beyls06], data layout [Zhong04]
•Application performance modeling [Marin04]
•Phase prediction [Shen04]
•Visualization, manual optimization [Beyls04, Beyls05, Marin08]
•Modeling cache contention (multiprogramming) [Chandra05, Suh01, Fedorova05, Kim04]

8

Measurement Methods

•List-based stack algorithm is O(NM)
•Balanced binary trees or splay trees: O(N log M) [Olken81, Sugumar93]
•Approximate analysis (tree compression): O(N log log M) time, O(log M) space [Ding03]
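One way the O(N log M)-class tree methods can be realized (a sketch under my own design choices, not the cited algorithms verbatim) is to record each address's last-access timestamp and count still-live timestamps with a Fenwick (binary indexed) tree instead of searching a stack:

```python
# Sketch: exact reuse distances in O(N log N) via a Fenwick tree over
# access timestamps. A timestamp is "live" if it is the most recent
# access of some address; the reuse distance of a reference is the
# number of live timestamps strictly after its previous access.
INF = float("inf")

class Fenwick:
    def __init__(self, n):
        self.t = [0] * (n + 1)
    def add(self, i, v):           # 1-based index
        while i < len(self.t):
            self.t[i] += v
            i += i & -i
    def prefix(self, i):           # sum over positions 1..i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def reuse_distances(trace):
    n = len(trace)
    bit = Fenwick(n)
    last = {}                      # address -> timestamp of last access
    out = []
    for t, addr in enumerate(trace, start=1):
        if addr in last:
            prev = last[addr]
            out.append(bit.prefix(n) - bit.prefix(prev))
            bit.add(prev, -1)      # prev is no longer addr's latest access
        else:
            out.append(INF)
        bit.add(t, 1)
        last[addr] = t
    return out

print(reuse_distances(["A", "B", "C", "C", "B", "A"]))
# [inf, inf, inf, 0, 1, 2]
```

Each reference costs two O(log N) tree operations plus a hash lookup, replacing the O(M) stack scan of the list-based algorithm.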

9

Contributions

•Multicore-Aware Reuse Distance Analysis
▫First RDA to include sharing and invalidation
▫Study different invalidation timing strategies

•Acceleration of Multicore RDA
▫Sampling, parallelization
▫Demonstration of application: selection of low-locality code
▫Validation against full analysis, hardware

•Prefetching model in RDA
▫Hybrid analysis

10

Outline

•Intro to Reuse Distance Analysis
▫Contributions

•Multicore-Aware Reuse Distance Analysis
▫Design
▫Results

•Sampled Parallel Reuse Distance Analysis
▫Design: Sampling, Parallelism
▫Results
▫Application: selection of low-locality code

11

Extending RDA to Multicore

•RDA defined for a single reference stream
▫No prior work accounts for multithreading

•Multicore-aware RDA accounts for invalidations and data sharing
▫Models locality of multi-threaded programs
▫Targets multicore processors with private or shared caches

12

Multicore Reuse Distance

•Invalidations cause additional misses in private caches
▫2nd-order effect: holes can be filled without eviction

•Sharing affects locality in shared caches
▫Inter-thread data reuse (reduces distance to shared data)
▫Capacity contention (increases distance to unshared data)

13

Invalidations

14

Example: the tracked thread references A B C C B A, but a remote thread writes A before its reuse, invalidating it and leaving a hole in the stack.

Address:             A B C C B A
Distance (unaware):  ∞ ∞ ∞ 0 1 2
Distance (aware):    ∞ ∞ ∞ 0 1 ∞

Invalidation Timing

•Multithreaded interleaving is nondeterministic
▫If no races, invalidations can be propagated anywhere between the write and the next synchronization

•Eager invalidation – immediately at the write
•Lazy invalidation – at the next synchronization
▫Could increase reuse distance
•Oracular invalidation – at the previous synchronization
▫Data-race-free (DRF) → will not be referenced by the invalidated thread
▫Could decrease reuse distance

15

Sharing

16

Example: in a shared cache, a remote thread's write to A interleaves with the tracked thread's A B C C B A stream; inter-thread reuse shortens the distance to the shared A, while capacity contention lengthens the distance to unshared data.

Address:             A B C C B A
Distance (unaware):  ∞ ∞ ∞ 0 1 2
Distance (aware):    ∞ ∞ ∞ 0 2 1

17

MCRD Results

18

19

Impact of Inaccuracy

20

Summary So Far

•Compared unaware and multicore-aware RDA to simulated caches
▫Private caches: unaware 37% error, aware 2.5%
▫Invalidation timing had a minor effect on accuracy
▫Shared caches: unaware 76+% error, aware 4.6%

•Made RDA viable for multithreaded workloads

21

Problems with Multicore RDA

•RDA is slow in general
▫Even efficient implementations require O(log M) time per reference

•Multithreading makes it worse
▫Serialization
▫Synchronization (expensive bus-locked operations on every program reference)

•Goal: fast enough for programmers to use in the development cycle

22

Accelerating Multicore RDA

•Sampling

•Parallelization

23

Reuse Distance Sampling

•Randomly select individual references
▫Select a skip count before each sampled reference: geometric distribution, expect 1 in n references sampled (n = 1,000,000)
▫Fast mode until the target reference is reached
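The geometric skip-count draw can be sketched as follows (a minimal illustration; the inverse-CDF formulation is my own choice, not necessarily the authors' implementation):

```python
# Sketch: drawing the number of references to pass in fast mode before
# the next sampled reference, geometrically distributed so that on
# average 1 in n references is selected (the slides use n = 1,000,000).
import math
import random

def next_skip(n, rng=random):
    # Inverse-CDF draw of a geometric variate with p = 1/n:
    # count of fast-mode references before the next sample (mean ~n).
    u = rng.random()
    return int(math.log1p(-u) / math.log1p(-1.0 / n))

skip = next_skip(1_000_000)   # run fast mode for `skip` references
```

Randomizing the gap avoids systematically hitting the same phase of a loop, which a fixed sampling period would.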

24

[Figure: reference timeline, processed entirely in fast mode]

Reuse Distance Sampling

•Monitor all references until the sampled address is reused (analysis mode)
▫Track unique addresses in a distance set
▫RD of the reuse reference = size of the distance set
▫Return to fast mode until the next sample

25

[Figure: reference timeline: fast mode until the sampled reference, then analysis mode]

Reuse Distance Sampling

•Analysis mode is faster than full RDA
▫Full stack tracking not needed
▫Distance set implemented as a hash table

26


RD Sampling of MT Programs

•Data sharing
•Invalidation
▫Invalidation of the tracked address
▫Invalidation of an address in the distance set

27

RD Sampling of MT Programs

•Data sharing
▫Analysis mode sees references from all threads
▫Reuse reference can occur on any thread

28

[Figure: timeline showing the tracking thread and a remote thread during fast and analysis modes]

RD Sampling of MT programs

•Invalidation of the tracked address
▫Distance = ∞

29


RD Sampling of MT programs

•Invalidation of an address in the distance set
▫Remove it from the set, increment a hole count
▫New addresses "fill" holes (decrement the count)

30

At reuse, RD = distance set size + hole count
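The hole-counting rules above can be sketched as a small data structure (an illustrative design of my own, not the authors' code):

```python
# Sketch: a sampled-analysis distance set with hole counting,
# following the invalidation rules on this slide.
INF = float("inf")

class DistanceSet:
    def __init__(self, tracked):
        self.tracked = tracked    # the sampled address
        self.seen = set()         # unique addresses since the sample
        self.holes = 0

    def reference(self, addr):
        """Returns the reuse distance when addr is the tracked address."""
        if addr == self.tracked:
            return len(self.seen) + self.holes   # RD = set size + holes
        if addr not in self.seen:
            self.seen.add(addr)
            if self.holes:                       # new address fills a hole
                self.holes -= 1
        return None

    def invalidate(self, addr):
        if addr == self.tracked:
            return INF                           # tracked address: RD = ∞
        if addr in self.seen:
            self.seen.discard(addr)
            self.holes += 1                      # leave a hole behind
        return None
```

Filling a hole adds the new address to the set while decrementing the hole count, so the reported distance does not grow: the hole is reoccupied rather than evicted, matching the 2nd-order effect noted earlier.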

Parallel Measurement

•Goals: gain parallelism in the analysis, eliminate per-reference synchronization

•Two properties facilitate this:
▫Sampled analysis tracks only the distance set, not the whole stack (allows separation of state)
▫Exact timing of invalidations is not significant (allows delayed synchronization)

31

Parallel Measurement

•Data sharing
▫Each thread has its own distance set
▫All sets are merged at reuse

32

[Figure: tracking and remote threads each build their own distance set during analysis mode]

At reuse, RD = size of the merged set
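Merging the per-thread distance sets at reuse might look like the following (a minimal sketch; the function name is hypothetical):

```python
# Sketch: per-thread distance sets collected without per-reference
# synchronization, merged only when the sampled address is reused.
def merged_reuse_distance(per_thread_sets):
    """per_thread_sets: one set of addresses per thread, gathered
    independently during analysis mode."""
    merged = set().union(*per_thread_sets)
    return len(merged)            # RD = size of the merged set

# Hypothetical example: two threads observed overlapping addresses.
print(merged_reuse_distance([{"A", "B"}, {"B", "C"}]))  # 3
```

The union de-duplicates addresses seen by multiple threads, so deferring the merge to reuse time gives the same distance as a single shared set would, without contended updates on every reference.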

Parallel Measurement

•Invalidations
▫Other threads record write sets
▫On synchronization, write set contents are invalidated from the distance set

33

[Figure: remote threads' write sets applied to the tracking thread's distance set at synchronization]

Pruning

•Analysis mode stays active until reuse
▫What if the address is never reused?
▫Program locality determines time spent in analysis mode

•Periodically prune (remove & record) the oldest sample
▫If its distance is large enough, e.g. in the top 1% of distances seen so far
▫A size-relative threshold accommodates different input sizes

34

Results

•Comparison with full analysis
▫Histograms
▫Accuracy metric

•Performance
▫Slowdown relative to native execution

35

Example RD Histograms

36

[Figure: two example reuse distance histograms (64-byte blocks)]

Example RD Histograms

37

[Figure: two example reuse distance histograms (64-byte blocks)]

Slowdown of the full analysis perturbs the execution of spin-locks, inflating the 0-distance bin in the histogram

Example RD Histograms

38

[Figure: two example reuse distance histograms (64-byte blocks)]

Results: Private Stacks

•Error metric used by previous work:
▫Normalize histogram bins
▫Error E = ∑i |fi - si|
▫Accuracy = 1 - E/2

•91%-99% accuracy (avg 95.6%)
•177x faster than full analysis
•7.1x-143x slowdown from native (avg 29.6x)
▫Fast mode alone: 5.3x
▫80.4% of references in fast mode
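The accuracy metric can be written out directly (a short sketch; the bin layout is hypothetical):

```python
# Sketch: histogram accuracy metric, E = sum_i |f_i - s_i| over
# normalized bins, accuracy = 1 - E/2. Identical histograms score 1.0,
# completely disjoint ones score 0.0.
def histogram_accuracy(full, sampled):
    """full, sampled: dicts mapping reuse distance bin -> count."""
    nf, ns = sum(full.values()), sum(sampled.values())
    bins = set(full) | set(sampled)
    e = sum(abs(full.get(b, 0) / nf - sampled.get(b, 0) / ns)
            for b in bins)
    return 1.0 - e / 2.0

print(histogram_accuracy({0: 10, 1: 10}, {0: 5, 1: 5}))  # 1.0
```

Dividing E by 2 keeps the score in [0, 1]: total variation between two probability distributions is at most 2.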

39

Results: Shared Stacks

•Shared reuse distances depend on all references by other threads
▫Not just those to shared data
▫Relative execution rate matters
▫More variation in measurements and in real execution

•Compare fully-parallel sampled analysis mode to serialized sampled analysis mode
▫Round-robin ensures threads progress at the same rate as in non-sampled analysis

40

                      Accuracy   Slowdown
Parallel Sampling     74.1%      80x
Sequential Sampling   88.9%      265x

41

FT Histogram

[Figure: reuse distance histogram for FT (64-byte blocks)]

Performance Comparison

•Single-thread sampling [Zhong08]
▫Instrumentation 2x-4x (compiler), 4x-10x (Valgrind)
▫Additional 10x-90x with analysis

•Approximate non-random sampling [Beyls04]
▫15x-25x (single-thread, compiler)

•Valgrind, our benchmarks
▫Instrumentation 4x-75x, avg 23x
▫Memcheck avg 97x

42

Low-locality PC Selection

•Application: find code with poor locality to assist programmer optimization
▫e.g. n PCs account for y% of misses at cache size C

•Select C such that the miss ratio is 10%; find enough PCs to cover 75/80/90/95% of misses
•Use weight-matching to compare the selection against full analysis
•Selection accuracy 91%-92% for private and shared caches
▫In spite of reduced accuracy in the parallel-shared configuration
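A greedy selection of PCs up to a coverage target could be sketched as follows (illustrative only; the PCs and miss counts are made up, and the slides' weight-matching comparison is not reproduced here):

```python
# Sketch: pick the smallest set of PCs that covers a target fraction
# of the misses observed at the chosen cache size C.
def select_pcs(miss_counts, coverage=0.90):
    """miss_counts: dict mapping PC -> misses at cache size C."""
    total = sum(miss_counts.values())
    chosen, covered = [], 0
    for pc, m in sorted(miss_counts.items(), key=lambda kv: -kv[1]):
        if covered >= coverage * total:
            break
        chosen.append(pc)
        covered += m
    return chosen

# Hypothetical per-PC miss counts:
print(select_pcs({"0x400a10": 60, "0x400b20": 25,
                  "0x400c30": 10, "0x400d40": 5}))
# ['0x400a10', '0x400b20', '0x400c30']
```

Sorting by miss count first means the programmer sees the few PCs responsible for most low-locality references.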

43

Smarter Multithreaded Replacement

•Shared cache management is challenging
▫Benefits of demand multiplexing
▫Cost of performance interference

•Most work addresses multi-programming
▫Destructive interference only
▫Per-benchmark performance targets

•Multi-threading presents opportunities and challenges
▫Constructive interference, process-level performance target
▫Reuse distance profiles can help understand needs
▫Work in progress!

44

Conclusion

•Two techniques to accelerate multicore-aware reuse distance analysis
▫Sampled analysis
▫Parallel analysis
▫Private caches: 96% accuracy, 30x native
▫Shared caches: 74/89% accuracy, 80/265x native

•Demonstrated effectiveness for selecting code with low locality
▫91% weight-matched coverage of PCs

•Other applications in progress
•Validated against hardware caches
▫7-16% average error in miss prediction

45

Questions?

46