
ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation

Qisi Wang, Hui-Shun Hung, Chien-Fu Chen

Outline

Data Prefetching (Background)

Existing Data Prefetchers
• Stride Data Prefetcher
• Offset Prefetcher (Best-Offset Prefetcher)
• Look-Ahead Prefetcher (Signature Path Prefetcher)

Experiment Results
• Tool Background
• Simulation Results

Conclusion

Data Prefetching (Background)

Prefetch data before it is needed
• Reduces compulsory misses
• Reduces memory access latency if
  - Prefetching accuracy is high
  - Prefetches are issued early enough

Goal: predict which addresses will be needed in the future

Next-N-Lines Prefetching
• Always prefetch the next N cache lines after a demand access or a demand miss
• Pros
  - Easy to implement
  - Suitable for sequential accesses
• Cons
  - Wastes bandwidth on unwanted data if the access pattern is irregular

Data Prefetching (Background): Offset Prefetching

• Prefetch the address at offset X from the demanded address
• If X = 1, this reduces to next-line prefetching

Prefetcher with offset X: on a demand access to address A, prefetch address A + X
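A minimal sketch of the offset-prefetching idea described above (not the gem5 implementation). The line size, the offset value, and issuePrefetch() are illustrative assumptions:

```cpp
#include <cstdint>
#include <cstdio>

// Offset prefetcher sketch: on a demand access to cache line A, request
// line A + X. With X = 1 this is next-line prefetching.
constexpr uint64_t LINE_SIZE = 64;  // assumed 64-byte cache lines
constexpr int64_t  OFFSET_X  = 1;   // fixed offset, in cache lines

static void issuePrefetch(uint64_t byteAddr)
{
    std::printf("prefetch 0x%llx\n", (unsigned long long)byteAddr);
}

void onDemandAccess(uint64_t byteAddr)
{
    uint64_t line = byteAddr / LINE_SIZE;           // demanded line A
    issuePrefetch((line + OFFSET_X) * LINE_SIZE);   // prefetch line A + X
}

int main()
{
    onDemandAccess(0x1000);  // with X = 1, prefetches address 0x1040
}
```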

Stride Prefetcher

A kind of offset prefetcher with a fixed distance. Two kinds of stride prefetcher:
• Program Counter (PC) based
  - Records the distance between memory accesses made by the same load instruction
  - The next time that load instruction is fetched, prefetch last address + distance (see the sketch after this list)
• Cache block address based
  - Prefetch A + X, A + 2X, A + 3X, ...
  - A stream buffer is a special case of this type of prefetcher
    – Avoids cache pollution
    – On a load miss, check the stream buffer and pop the hit line into the cache
    – If the stream buffer also misses, allocate a new stream buffer
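A minimal sketch of the PC-based variant, under the assumption of an untagged, unbounded table keyed by load PC; the entry layout and confirmation rule (stride must repeat once) are illustrative, not the gem5 source:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// PC-based stride prefetcher sketch: a table keyed by the load PC records
// the last address and the last observed stride; when the same non-zero
// stride repeats, suggest prefetching last address + stride.
struct StrideEntry {
    uint64_t lastAddr = 0;
    int64_t  stride   = 0;
    bool     valid    = false;
};

class PcStridePrefetcher {
    std::unordered_map<uint64_t, StrideEntry> table;  // keyed by load PC
public:
    // Returns the address to prefetch, if the stride is confirmed.
    std::optional<uint64_t> onAccess(uint64_t pc, uint64_t addr) {
        StrideEntry &e = table[pc];
        std::optional<uint64_t> pf;
        if (e.valid) {
            int64_t newStride = (int64_t)addr - (int64_t)e.lastAddr;
            if (newStride != 0 && newStride == e.stride)
                pf = addr + newStride;   // stride seen twice: prefetch A + stride
            e.stride = newStride;
        }
        e.lastAddr = addr;
        e.valid = true;
        return pf;
    }
};
```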

Cons
• The distance (stride) is fixed
• Several variable-offset schemes have been proposed
  - Best-Offset (BO) Prefetcher
  - Signature Path Prefetcher (SPP)

Best-Offset Prefetcher (Idea)

The offset is varied through a learning procedure
• Finds the best offset value for each application
• Several candidate offsets are tested

The recent-requests (RR) table records completed prefetch requests
• When prefetch Y completes and the current offset is O, Y - O is saved into the RR table

Best-Offset Prefetcher (Learning)

In a learning phase, every offset in the candidate list is tested once per round
• Each L2 access tests one offset
• DPC version: 46 offsets; paper version: 52 offsets

If the tested offset hits in the RR table, its score is incremented by 1
• All scores are reset to 0 when a learning phase begins

The phase ends when the round limit is reached (e.g. 100 rounds) or some offset reaches SCORE_MAX (31 in the DPC version)

The offset with the highest score becomes the best offset, as sketched below
• A new learning phase then starts
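A sketch of this learning loop, assuming a toy candidate list, an unbounded set as the RR table, and the parameter values quoted above; the real BO prefetcher uses a small tagged RR table and a longer offset list:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Best-Offset learning sketch: each eligible L2 access tests one candidate
// offset against the RR table; the phase ends after ROUND_MAX rounds or
// when some score reaches SCORE_MAX, and the best-scoring offset wins.
class BestOffsetLearner {
    std::vector<int64_t> candidates{1, 2, 3, 4, 5, 6, 8};  // toy offset list
    std::vector<int>     scores;
    std::unordered_set<uint64_t> rrTable;  // base lines of completed prefetches
    size_t index = 0;                      // candidate tested by the next access
    int    round = 0;
    static constexpr int SCORE_MAX = 31;
    static constexpr int ROUND_MAX = 100;
public:
    int64_t bestOffset = 1;

    BestOffsetLearner() : scores(candidates.size(), 0) {}

    // Prefetch of line Y completed with current offset O: record Y - O.
    void onPrefetchFill(uint64_t lineY, int64_t offsetO) {
        rrTable.insert(lineY - offsetO);
    }

    // Each L2 access to line X tests one candidate offset.
    void onL2Access(uint64_t lineX) {
        int64_t d = candidates[index];
        if (rrTable.count(lineX - d))        // X - d was recently prefetched,
            ++scores[index];                 // so offset d would have been timely
        bool maxed = scores[index] >= SCORE_MAX;
        if (++index == candidates.size()) {  // one round finished
            index = 0;
            ++round;
        }
        if (maxed || round >= ROUND_MAX)
            endPhase();
    }

private:
    void endPhase() {
        size_t best = 0;
        for (size_t i = 1; i < scores.size(); ++i)
            if (scores[i] > scores[best]) best = i;
        bestOffset = candidates[best];
        std::fill(scores.begin(), scores.end(), 0);  // new learning phase
        round = 0;
        index = 0;
    }
};
```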

Best-Offset Prefetcher

Degree-1 prefetcher (prefetches only one address per trigger)
• Prefetching two offsets results in many useless prefetches

The prefetcher is turned off if the best score is too low (a sketch follows)
• BAD_SCORE is the threshold
• The learning procedure keeps running while the prefetcher is off

The MSHR threshold varies depending on the BO score and the L3 access rate
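A small sketch of these two throttling decisions. The BAD_SCORE value and the exact mapping from score and L3 access rate to an MSHR threshold are assumptions for illustration only:

```cpp
// Best-Offset throttling sketch: prefetching is disabled when the best
// score falls at or below BAD_SCORE (learning continues regardless), and
// the number of MSHRs usable by prefetches shrinks when confidence is
// low or the L3 is busy.
constexpr int BAD_SCORE = 1;   // assumed turn-off threshold

struct BoThrottle {
    bool prefetchOn    = true;
    int  mshrThreshold = 16;   // assumed max outstanding prefetches

    void onPhaseEnd(int bestScore, double l3AccessRate) {
        prefetchOn = bestScore > BAD_SCORE;  // off, but learning keeps running
        // Assumed policy: fewer prefetch MSHRs when score is low or L3 is busy.
        mshrThreshold = (bestScore >= 20 && l3AccessRate < 0.5) ? 16 : 8;
    }
};
```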

Signature Path Prefetcher

Path-confidence-based prefetcher
• Lookahead prefetching based on access history
• SPP tables are trained by L2 accesses
• Prefetching depends on
  - The signature and pattern stored in SPP
  - The overall path probability

Signature Path Prefetcher: Table Updating

• When the L2 accesses a page, the corresponding signature table entry is updated
  - The page offset is updated
  - The offset difference (delta) is used to generate the new signature
  - The old signature is used to update the pattern table (PT)

The same access pattern maps to the same signature (see the sketch below)
• This reduces training time and the number of PT entries
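A sketch of the signature update. The shift-and-XOR compression and the 12-bit signature width follow the common SPP formulation; treat these values, and the PageEntry layout, as assumptions rather than the exact parameters used in this project:

```cpp
#include <cstdint>

// SPP signature update sketch: a page's signature compresses its recent
// delta history by shift-and-XOR, so the same delta sequence always
// produces the same signature.
constexpr uint32_t SIG_BITS  = 12;
constexpr uint32_t SIG_MASK  = (1u << SIG_BITS) - 1;
constexpr uint32_t SIG_SHIFT = 3;

uint32_t updateSignature(uint32_t oldSig, int32_t delta)
{
    // Fold the new delta (block-offset difference within the page) in.
    return ((oldSig << SIG_SHIFT) ^ (uint32_t)delta) & SIG_MASK;
}

// Per-page tracking: on an access to block offset `newOffset`, compute the
// delta, train the pattern table under the *old* signature (not shown),
// then advance the signature.
struct PageEntry {
    uint32_t signature  = 0;
    int32_t  lastOffset = 0;
};

void onPageAccess(PageEntry &page, int32_t newOffset)
{
    int32_t delta = newOffset - page.lastOffset;
    // ... pattern table: PT[page.signature] records `delta` here ...
    page.signature  = updateSignature(page.signature, delta);
    page.lastOffset = newOffset;
}
```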

Signature Path Prefetcher: Prefetching

• Look up the signature of the currently accessed page
• At each prefetch depth i, choose the delta with the highest probability P_i = C_delta / C_sig
• If the product of all probabilities so far (P_1 × P_2 × ... × P_i) is larger than the threshold
  - Prefetch current address + delta
  - Use the delta to update the speculative signature and access the pattern table again
• If the path confidence drops below the threshold, the procedure ends (see the sketch below)
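A simplified sketch of this lookahead loop, assuming an in-memory map as the pattern table; the PatternEntry layout, the threshold value, and the counter handling are illustrative, not the DPC-2 code:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// SPP lookahead sketch: starting from the current page signature, pick the
// most likely delta, multiply P_i = C_delta / C_sig into the path
// confidence, and keep going deeper until the confidence drops below the
// threshold.
struct PatternEntry {
    std::unordered_map<int32_t, uint32_t> deltaCount;  // C_delta per delta
    uint32_t totalCount = 0;                           // C_sig
};

constexpr double THRESHOLD = 0.25;  // assumed path-confidence cutoff

std::vector<int64_t> lookaheadPrefetch(
    const std::unordered_map<uint32_t, PatternEntry> &patternTable,
    uint32_t signature, int64_t currentAddr)
{
    std::vector<int64_t> prefetches;
    double pathConfidence = 1.0;
    int64_t addr = currentAddr;

    while (true) {
        auto it = patternTable.find(signature);
        if (it == patternTable.end() || it->second.totalCount == 0)
            break;

        // Pick the delta with the highest probability at this depth.
        int32_t  bestDelta = 0;
        uint32_t bestCount = 0;
        for (const auto &[delta, count] : it->second.deltaCount)
            if (count > bestCount) { bestCount = count; bestDelta = delta; }

        pathConfidence *= (double)bestCount / it->second.totalCount;  // × P_i
        if (pathConfidence < THRESHOLD)
            break;                          // confidence too low: stop

        addr += bestDelta;                  // prefetch current address + delta
        prefetches.push_back(addr);

        // Advance the speculative signature with the chosen delta
        // (same shift-and-XOR update as in training).
        signature = ((signature << 3) ^ (uint32_t)bestDelta) & 0xFFF;
    }
    return prefetches;
}
```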

Gem5 Simulation System

Block diagram: CPU → L1 D-Cache / L1 I-Cache → L2 Cache with prefetcher → memory interface

Gem5 Implementation

Block diagram: same system as above, with the prefetcher attached to the L2 cache

System Setting

CPU: TimingSimpleCPU

Parameter        L1 Caches (Data/Instruction)   L2 Cache
Size             16 KB                          128 KB
Associativity    2                              8
Tag Latency      2 cycles                       20 cycles
Data Latency     2 cycles                       20 cycles
MSHR Size        4 entries                      16 entries
Replacement      LRU                            LRU

Gem5 Implementation

Block diagram: CPU → L1 D-Cache / L1 I-Cache → L2 Cache (write queue, MSHRs, prefetcher with priority queue) → memory interface

L2 Cache-Prefetcher Interface

The L2 cache notifies the prefetcher on each access and fill, passing the hit/miss result, PC, address, set, way, an is-prefetch flag, and the evicted address. The prefetcher computes prefetch addresses and inserts them into the priority queue, which feeds requests toward the memory interface.
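A hypothetical C++ view of this interface; the struct fields mirror the diagram labels, but the class, method names, and signatures are illustrative assumptions and not gem5's actual prefetcher API:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Sketch of the L2 cache / prefetcher interface: the cache calls notify()
// on accesses and fills, and drains the priority queue of generated
// prefetch requests when it has spare MSHRs and bandwidth.
struct AccessInfo {
    bool     hit;
    uint64_t pc;
    uint64_t addr;
    uint32_t set;
    uint32_t way;
    bool     isPrefetch;
    uint64_t evictedAddr;
};

struct PrefetchRequest {
    uint64_t addr;
    int      priority;
    bool operator<(const PrefetchRequest &o) const { return priority < o.priority; }
};

class L2Prefetcher {
    std::priority_queue<PrefetchRequest> pfQueue;  // prefetches awaiting issue
public:
    // Called by the L2 on every demand access and on every fill.
    void notify(const AccessInfo &info) {
        for (uint64_t a : computePrefetch(info))
            pfQueue.push({a, /*priority=*/0});
    }

    // The L2 pops queued prefetches when it can issue them.
    bool nextPrefetch(uint64_t &addr) {
        if (pfQueue.empty()) return false;
        addr = pfQueue.top().addr;
        pfQueue.pop();
        return true;
    }

private:
    // Placeholder policy: a concrete prefetcher (stride, BO, SPP) would
    // generate its candidate addresses here.
    std::vector<uint64_t> computePrefetch(const AccessInfo &info) {
        return {info.addr + 64};  // e.g. next line, assuming 64-byte lines
    }
};
```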

Benchmark Setting

Prefetcher configurations
• Basic PF types: baseline, stride (PC- and address-based)
• DPC-2 PF types: Best-Offset, SPP, AMPM

Benchmarks
• SPEC 2006
  - 450.soplex
  - 454.calculix
  - 456.hmmer
  - 462.libquantum
  - 998.specrand

Sim. Result – Normalized Performance

Sim. Result – L2C Overall Miss Rate

Sim. Result – Miss Rate Improvement

Conclusion

Contribution
• Open-source GitHub repository @ hfsken/gem5-with-DPC-2-prefetcher
  - Includes a DPC-2 wrapper for adding DPC prefetchers
  - Integrated with the following DPC prefetchers: Best-Offset, AMPM, Stride, SPP

Summary
• For a short running time ...
  - The Best-Offset prefetcher performs better on benchmarks with more regular access patterns and a higher overall miss rate
  - The performance gain on random access patterns is negligible

Future Work
• Complete the documentation in the GitHub repository
• Analyze benchmark behavior in more detail in the report

References

[1] Pierre Michaud, "Best-Offset Hardware Prefetching," IEEE HPCA, 2016.
[2] Pierre Michaud, "A Best-Offset Prefetcher," DPC-2, 2015.
[3] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti, "Path Confidence Based Lookahead Prefetching," IEEE/ACM MICRO, 2016.
[4] Jinchun Kim, Paul V. Gratz, and A. L. Narasimha Reddy, "Lookahead Prefetching with Signature Path," DPC-2, 2015.
[5] Course slides of Prof. Onur Mutlu, CMU.
[6] Course slides of Prof. Mikko Lipasti, UW-Madison.