IMPROVING THE PREFETCHING PERFORMANCE
THROUGH CODE REGION PROFILING
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
2
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
3
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
Motivation
• The number of cores on a single chip grows every year
Nehalem: 4–6 cores
Tilera: 64–100 cores
Intel Polaris: 80 cores
Nvidia GeForce: up to 256 cores
4
5
Prefetching
• Reduces memory latency
• Brings the data the CPU will need next into a nearer cache
• Increases the hit ratio
• Implemented in most commercial processors
• Erroneous prefetching may produce:
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption
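To make the mechanism concrete, the bullets above can be sketched as a minimal stride prefetcher (an illustration only, not the design evaluated in this talk; the class name and interface are hypothetical):

```python
# Minimal sketch of a stride prefetcher: after two demand accesses with
# the same stride, it issues a prefetch for the predicted next address.
# Mispredicted strides would produce the "erroneous prefetches" above.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None   # last demand-access address seen
        self.stride = None      # last observed stride
        self.issued = []        # addresses prefetched so far

    def on_access(self, addr):
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                # Confirmed stride: prefetch the predicted next block.
                self.issued.append(addr + stride)
            self.stride = stride
        self.last_addr = addr

pf = StridePrefetcher()
for a in [0, 64, 128, 192]:   # sequential cache-line accesses, stride 64
    pf.on_access(a)
```

After the third access confirms the 64-byte stride, the prefetcher issues requests for the next two predicted lines (192 and 256).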
6
Prefetching in CMPs
• Useful prefetches imply higher performance
– Avoid network latency
– Reduce memory access latency
• Useless prefetches imply lower performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests
7
Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.
8
Prefetching in shared memories
• The prefetcher is distributed
• This entails challenges:
– Distributed memory streams
– Distributed prefetch queue
– Statistics generation and collection points differ
• These complicate the prefetcher's task
• Harder to prefetch accurately
M. Torrents, et al. "Prefetching Challenges in Distributed Memories for CMPs". In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík, Iceland, June 2015.
9
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
10
Objective
• Maximize the prefetching benefit
• Use it only when it is working properly
• Minimize its adverse effects
11
Proposal
• Identify when the prefetcher generates slowdown
– Identify code regions at several granularities
– Analyze the prefetcher performance in these regions
– Tag these code regions with statistics
• Switch the prefetcher off
– Save power
– Avoid network contention
– Avoid cache pollution
• Switch it on again
– When it generates speedup
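The proposal above can be sketched as a table that tags each profiled region with its measured prefetch accuracy and gates the prefetcher at region entry (the region identifiers, accuracy values, and threshold here are hypothetical illustrations, not results from the talk):

```python
# Hypothetical sketch of the overall mechanism: each profiled code
# region is tagged with a verdict, and the hardware toggles the
# prefetcher whenever execution enters a region.
region_table = {
    # region id (e.g. basic-block start PC) -> measured prefetch accuracy
    0x400000: 0.85,   # prefetcher helps here: leave it on
    0x400040: 0.10,   # mostly useless prefetches: switch it off
}

def prefetcher_enabled(region_pc, threshold=0.5):
    """Enable the prefetcher only in regions profiled as accurate.

    Unknown regions default to on, so statistics can still be gathered
    for them.
    """
    return region_table.get(region_pc, 1.0) >= threshold
```

The default-on behavior for unprofiled regions matches the dynamic-tagging idea: a region must first run with the prefetcher active before it can be judged.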
12
Code Region Granularity
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
        mov ebx, 0
        mov eax, 0
        mov ecx, 0
_Label_1:
        mov ecx, [esi + ebx * 4]
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1
Instruction level
13
Code Region Granularity
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
        mov ebx, 0
        mov eax, 0
        mov ecx, 0
_Label_1:
        mov ecx, [esi + ebx * 4]
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1
Basic block level
14
Code Region Granularity
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
        mov ebx, 0
        mov eax, 0
        mov ecx, 0
_Label_1:
        mov ecx, [esi + ebx * 4]
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1
All the code
15
Code Region Granularity
• Regions tagged with statistics
– Accuracy / miss ratio
• Activate or deactivate at every new code region
– According to the statistics of the current code region
• Divide the code into regions
– Single instructions, basic blocks, etc., or all the code
• Identify and tag the regions
– Statically (profiling execution)
– Dynamically (during warm-up)
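For the basic-block granularity, region boundaries can be found by splitting the instruction stream at branch targets and after branches. A minimal sketch over the example loop (the heuristics here are simplified and the branch list is illustrative, not exhaustive):

```python
# Hypothetical sketch: split a linear instruction list into basic blocks.
# A new block starts at a label (branch target) and after a branch.
def split_basic_blocks(instrs):
    blocks, current = [], []
    for ins in instrs:
        if ins.endswith(':'):            # label: close the current block
            if current:
                blocks.append(current)
            current = [ins]
        else:
            current.append(ins)
            # A taken-or-not branch ends a block (illustrative subset
            # of x86 branch mnemonics).
            if ins.startswith(('jne', 'je', 'jmp')):
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks

code = ["mov ebx, 0", "mov eax, 0", "mov ecx, 0",
        "_Label_1:", "mov ecx, [esi + ebx * 4]", "add eax, ecx",
        "inc ebx", "cmp ebx, 100", "jne _Label_1"]
blocks = split_basic_blocks(code)
```

On the loop from the slide this yields two regions: the initialization block and the loop body starting at `_Label_1`.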
16
Switching off the prefetcher
• Detect when the prefetcher is useless
• Accuracy
– Useful prefetches / total number of prefetches
– Switch off when the accuracy decreases
• Miss ratio
– Based on the number of misses
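The accuracy-based switch-off condition can be sketched in a few lines (the threshold value is an assumption for illustration, not a number from the talk):

```python
# Sketch of the accuracy-based switch-off decision.
ACCURACY_THRESHOLD = 0.5   # hypothetical cutoff

def should_switch_off(useful_prefetches, total_prefetches):
    """Accuracy = useful prefetches / total prefetches issued.

    Switch the prefetcher off when accuracy falls below the cutoff.
    """
    if total_prefetches == 0:
        return False   # no evidence yet: keep the prefetcher on
    accuracy = useful_prefetches / total_prefetches
    return accuracy < ACCURACY_THRESHOLD
```

A region where only 10 of 100 issued prefetches were useful would trip the cutoff; one with 80 of 100 would not.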
17
Switching on the prefetcher
• A switched-off prefetcher does not generate stats
• It cannot be reactivated by an accuracy increase
• When to reactivate?
– Based on the miss ratio
– After a certain timeout
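The two reactivation triggers above can be sketched as a small policy object (the threshold and timeout values are hypothetical placeholders):

```python
# Sketch of the reactivation policy: while the prefetcher is off it
# produces no accuracy stats, so reactivate either when the demand
# miss ratio rises or after a timeout.
class ReactivationPolicy:
    def __init__(self, miss_ratio_threshold=0.2, timeout_intervals=10_000):
        self.miss_ratio_threshold = miss_ratio_threshold  # hypothetical
        self.timeout_intervals = timeout_intervals        # hypothetical
        self.intervals_off = 0

    def tick(self, misses, accesses):
        """Called every sampling interval while the prefetcher is off.

        Returns True when the prefetcher should be switched back on.
        """
        self.intervals_off += 1
        miss_ratio = misses / accesses if accesses else 0.0
        if miss_ratio > self.miss_ratio_threshold:
            return True   # demand misses rising: give prefetching a retry
        if self.intervals_off >= self.timeout_intervals:
            return True   # timeout elapsed: retry regardless
        return False

policy = ReactivationPolicy(timeout_intervals=3)
```

With a low miss ratio the policy waits; a spike in misses or the third silent interval reactivates the prefetcher.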
18
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
19
Experimental framework
• gem5
– 16 x86 CPUs
– Ruby memory system
– L1 prefetchers
– MOESI coherence protocol
– Garnet network simulator
• PARSEC 2.1
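A gem5 syscall-emulation run resembling the setup above might be launched as follows; this is a hedged sketch, not the authors' actual command, since flag names vary between gem5 versions and the Ruby protocol must be compiled into the binary:

```shell
# Hedged sketch: check `--help` of your gem5 build for the exact flags.
# Ruby/Garnet support must be compiled into the X86 binary.
./build/X86/gem5.opt configs/example/se.py \
    --num-cpus=16 \
    --ruby \
    --cmd=./blackscholes   # e.g. a PARSEC 2.1 benchmark binary
```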
20
Simulation environment
21
Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results
22
Expected Results
• Power savings without losing performance
• Smaller granularity gives more accuracy
– Blocks or super-blocks better than the whole code
– Single instructions more accurate than blocks or super-blocks
• Smaller granularity costs:
– More resources
– More complexity
• Basic-block granularity should provide good results at a realistic complexity
23
Q & A
24
IMPROVING THE PREFETCHING PERFORMANCE
THROUGH CODE REGION PROFILING
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
25
Backup slides
26
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: four CPUs, each with a private L1 cache and prefetcher, connected to a distributed L2 memory]
27
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: a core takes an L1 miss for address @; the request travels to the remote L2 bank that holds @]
28
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: the L1 miss for @ reaches a remote L2 bank, so the access pattern is spread across banks]
Challenge: distributed patterns
29
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: prefetches for @ + 2 and @ + 4 are issued, but the blocks reside in different L2 banks]
30
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: prefetches for @ + 2 and @ + 4 reach different L2 banks]
Challenge: queue filtering
31
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: the core takes an L1 miss for @ + 2 while the prefetches for @ + 2 and @ + 4 are distributed across L2 banks]
32
Prefetch Distributed Memory Systems
• Increases the complexity of prefetching
• Challenges without trivial solutions
[Diagram: L1 miss for @ + 2 with prefetches spread across L2 banks]
Challenge: dynamic profiling