IMPROVING THE PREFETCHING PERFORMANCE
THROUGH CODE REGION PROFILING
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
2
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
3
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
Motivation
• The number of cores on a single chip grows every year
Nehalem: 4–6 cores
Tilera: 64–100 cores
Intel Polaris: 80 cores
Nvidia GeForce: up to 256 cores
4
5
Prefetching
• Reduces memory latency
• Brings the data the CPU will need next into a nearer cache
• Increases the hit ratio
• Implemented in most commercial processors
• Erroneous prefetching may produce:
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption
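To make the mechanism concrete, the bullets above can be sketched as a minimal stride prefetcher (an illustration only, not the design evaluated in this talk; the class name and interface are hypothetical):

```python
# Minimal sketch of a stride prefetcher: after two demand accesses with
# the same stride, it issues a prefetch for the predicted next address.
# Mispredicted strides would produce the "erroneous prefetches" above.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None   # last demand-access address seen
        self.stride = None      # last observed stride
        self.issued = []        # addresses prefetched so far

    def on_access(self, addr):
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                # Confirmed stride: prefetch the predicted next block.
                self.issued.append(addr + stride)
            self.stride = stride
        self.last_addr = addr

pf = StridePrefetcher()
for a in [0, 64, 128, 192]:   # sequential cache-line accesses, stride 64
    pf.on_access(a)
```

After the third access confirms the 64-byte stride, the prefetcher issues requests for the next two predicted lines (192 and 256).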
6
Prefetching in CMPs
• Useful prefetches imply higher performance
– Avoid network latency
– Reduce memory access latency
• Useless prefetches imply lower performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests
7
Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.
8
Prefetching in shared memories
• The prefetcher is distributed
• This entails challenges:
– Distributed memory streams
– Distributed prefetch queue
– Statistics generation and collection points differ
• These complicate the prefetcher's task
• Harder to prefetch accurately
M. Torrents, et al. "Prefetching Challenges in Distributed Memories for CMPs". In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík, Iceland, June 2015.
9
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
10
Objective
• Maximize the prefetching benefit
• Use it only when it is working properly
• Minimize its adverse effects
11
Proposal
• Identify when the prefetcher generates slowdown
– Identify code regions at several granularities
– Analyze the prefetcher performance in these regions
– Tag these code regions with statistics
• Switch the prefetcher off
– Save power
– Avoid network contention
– Avoid cache pollution
• Switch it on again
– When it generates speedup
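The proposal above can be sketched as a table that tags each profiled region with its measured prefetch accuracy and gates the prefetcher at region entry (the region identifiers, accuracy values, and threshold here are hypothetical illustrations, not results from the talk):

```python
# Hypothetical sketch of the overall mechanism: each profiled code
# region is tagged with a verdict, and the hardware toggles the
# prefetcher whenever execution enters a region.
region_table = {
    # region id (e.g. basic-block start PC) -> measured prefetch accuracy
    0x400000: 0.85,   # prefetcher helps here: leave it on
    0x400040: 0.10,   # mostly useless prefetches: switch it off
}

def prefetcher_enabled(region_pc, threshold=0.5):
    """Enable the prefetcher only in regions profiled as accurate.

    Unknown regions default to on, so statistics can still be gathered
    for them.
    """
    return region_table.get(region_pc, 1.0) >= threshold
```

The default-on behavior for unprofiled regions matches the dynamic-tagging idea: a region must first run with the prefetcher active before it can be judged.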
12
Code Region Granularity
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
        mov ebx, 0
        mov eax, 0
        mov ecx, 0
_Label_1:
        mov ecx, [esi + ebx * 4]
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1
Instruction level
13
Code Region Granularity
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
        mov ebx, 0
        mov eax, 0
        mov ecx, 0
_Label_1:
        mov ecx, [esi + ebx * 4]
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1
Basic block level
14
Code Region Granularity
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
        mov ebx, 0
        mov eax, 0
        mov ecx, 0
_Label_1:
        mov ecx, [esi + ebx * 4]
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1
All the code
15
Code Region Granularity
• Regions tagged with statistics
– Accuracy / miss ratio
• Activate or deactivate at every new code region
– According to the statistics of the current code region
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
• Identify and tag the regions
– Statically (profiling execution)
– Dynamically (during warm-up)
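For the basic-block granularity, region boundaries can be found by splitting the instruction stream at branch targets and after branches. A minimal sketch over the example loop (the heuristics here are simplified and the branch list is illustrative, not exhaustive):

```python
# Hypothetical sketch: split a linear instruction list into basic blocks.
# A new block starts at a label (branch target) and after a branch.
def split_basic_blocks(instrs):
    blocks, current = [], []
    for ins in instrs:
        if ins.endswith(':'):            # label: close the current block
            if current:
                blocks.append(current)
            current = [ins]
        else:
            current.append(ins)
            # A taken-or-not branch ends a block (illustrative subset
            # of x86 branch mnemonics).
            if ins.startswith(('jne', 'je', 'jmp')):
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks

code = ["mov ebx, 0", "mov eax, 0", "mov ecx, 0",
        "_Label_1:", "mov ecx, [esi + ebx * 4]", "add eax, ecx",
        "inc ebx", "cmp ebx, 100", "jne _Label_1"]
blocks = split_basic_blocks(code)
```

On the loop from the slide this yields two regions: the initialization block and the loop body starting at `_Label_1`.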
16
Switching off the prefetcher
• Detect when the prefetcher is useless
• Accuracy
– Useful prefetches / total number of prefetches
– Switch off when the accuracy decreases
• Miss ratio
– Based on the number of misses
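The accuracy-based switch-off condition can be sketched in a few lines (the threshold value is an assumption for illustration, not a number from the talk):

```python
# Sketch of the accuracy-based switch-off decision.
ACCURACY_THRESHOLD = 0.5   # hypothetical cutoff

def should_switch_off(useful_prefetches, total_prefetches):
    """Accuracy = useful prefetches / total prefetches issued.

    Switch the prefetcher off when accuracy falls below the cutoff.
    """
    if total_prefetches == 0:
        return False   # no evidence yet: keep the prefetcher on
    accuracy = useful_prefetches / total_prefetches
    return accuracy < ACCURACY_THRESHOLD
```

A region where only 10 of 100 issued prefetches were useful would trip the cutoff; one with 80 of 100 would not.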
17
Switching on the prefetcher
• A switched-off prefetcher does not generate stats
• It cannot be reactivated by an accuracy increase
• When to reactivate?
– Based on the miss ratio
– After a certain timeout
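The two reactivation triggers above can be sketched as a small policy object (the threshold and timeout values are hypothetical placeholders):

```python
# Sketch of the reactivation policy: while the prefetcher is off it
# produces no accuracy stats, so reactivate either when the demand
# miss ratio rises or after a timeout.
class ReactivationPolicy:
    def __init__(self, miss_ratio_threshold=0.2, timeout_intervals=10_000):
        self.miss_ratio_threshold = miss_ratio_threshold  # hypothetical
        self.timeout_intervals = timeout_intervals        # hypothetical
        self.intervals_off = 0

    def tick(self, misses, accesses):
        """Called every sampling interval while the prefetcher is off.

        Returns True when the prefetcher should be switched back on.
        """
        self.intervals_off += 1
        miss_ratio = misses / accesses if accesses else 0.0
        if miss_ratio > self.miss_ratio_threshold:
            return True   # demand misses rising: give prefetching a retry
        if self.intervals_off >= self.timeout_intervals:
            return True   # timeout elapsed: retry regardless
        return False

policy = ReactivationPolicy(timeout_intervals=3)
```

With a low miss ratio the policy waits; a spike in misses or the third silent interval reactivates the prefetcher.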
18
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
19
Experimental framework
• gem5
– 16 x86 CPUs
– Ruby memory system
– L1 prefetchers
– MOESI coherence protocol
– Garnet network simulator
• PARSEC 2.1
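A gem5 syscall-emulation run resembling the setup above might be launched as follows; this is a hedged sketch, not the authors' actual command, since flag names vary between gem5 versions and the Ruby protocol must be compiled into the binary:

```shell
# Hedged sketch: check `--help` of your gem5 build for the exact flags.
# Ruby/Garnet support must be compiled into the X86 binary.
./build/X86/gem5.opt configs/example/se.py \
    --num-cpus=16 \
    --ruby \
    --cmd=./blackscholes   # e.g. a PARSEC 2.1 benchmark binary
```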
20
Simulation environment
21
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
22
Expected Results
• Power savings without losing performance
• Smaller granularity gives more accuracy
– Blocks or super-blocks better than the whole code
– Single instructions more accurate than blocks or super-blocks
• Smaller granularity costs:
– More resources
– More complexity
• Basic-block granularity should provide good results at a realistic complexity
23
Q & A
24
IMPROVING THE PREFETCHING PERFORMANCE
THROUGH CODE REGION PROFILING
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
25
Backup slides
26
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: four CPUs, each with a private L1 cache and prefetcher, connected to a distributed L2 memory]
27
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: a core takes an L1 miss for address @; the request travels to the remote L2 bank that holds @]
28
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: the L1 miss for @ reaches a remote L2 bank, so the access pattern is spread across banks]
Challenge: distributed patterns
29
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: prefetches for @ + 2 and @ + 4 are issued, but the blocks reside in different L2 banks]
30
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: prefetches for @ + 2 and @ + 4 reach different L2 banks]
Challenge: queue filtering
31
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: the core takes an L1 miss for @ + 2 while the prefetches for @ + 2 and @ + 4 are distributed across L2 banks]
32
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: L1 miss for @ + 2 with prefetches spread across L2 banks]
Challenge: dynamic profiling