Prefetching Challenges in
Distributed Memories for CMPs
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
2
Outline
Introduction
Naming the challenges
Challenge evaluation methodology
Experimental framework
Challenge Quantification
Facing the Challenges
Conclusions
4
Prefetching
• Reduce memory latency
• Bring the data the CPU will need next into a nearer cache
• Increase the hit ratio
• Implemented in most commercial processors
• Erroneous prefetching may produce
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption
5
Motivation
• The number of cores on a single chip grows every year
– Nehalem: 4~6 cores
– Tilera: 64~100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores
6
Prefetch in CMPs
• Useful prefetches imply more performance
– Avoid network latency
– Reduce memory access latency
• Useless prefetches imply less performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests
7
Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.
8
Distributed memories
• Distribution of the memory access pattern:
@ @+2 @+4 @+6 @+8 @+10
@
@ + 2
@ + 4
@ + 6
@ + 8
@ + 10
9
Distributed memories
• Distribution of the memory access pattern:
@ @+2 @+4 @+6 @+8 @+10 @+12 @+14
TILE 00: @      TILE 01: @+2    TILE 02: @+4    TILE 03: @+6
TILE 04: @+8    TILE 05: @+10   TILE 06: @+12   TILE 07: @+14
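A minimal sketch of this distribution, assuming eight tiles, blocks that are 2 address units apart, and a plain modulo interleaving. The slide does not state the real mapping function, so `home_tile`, `split_stream`, `NUM_TILES` and `BLOCK_STRIDE` are hypothetical names chosen for illustration.

```python
NUM_TILES = 8       # assumption: the eight tiles drawn on the slide
BLOCK_STRIDE = 2    # assumption: consecutive blocks are 2 address units apart


def home_tile(addr, base=0):
    """Return the tile whose L2 slice holds addr (assumed modulo interleaving)."""
    return ((addr - base) // BLOCK_STRIDE) % NUM_TILES


def split_stream(stream, base=0):
    """Group an access stream by home tile, keeping the order inside each tile."""
    per_tile = {tile: [] for tile in range(NUM_TILES)}
    for addr in stream:
        per_tile[home_tile(addr, base)].append(addr)
    return per_tile


# @, @+2, ..., @+30 with @ = 0
stream = list(range(0, 32, 2))
per_tile = split_stream(stream)
for tile, seen in per_tile.items():
    print(f"TILE {tile:02d} sees {seen}")
```

Under these assumptions every tile observes only one out of every eight accesses of the original stream.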
11
Prefetch Distributed Memory Systems
• Analysis phase
[Figure: four tiles, each with a CPU, an L1 cache and a prefetcher, connected to a distributed L2 memory; an L1 miss for @ is shown. Label: Distributed patterns]
12
Pattern Detection Challenge
• The memory stream is distributed across tiles
• Each prefetcher is aware of only part of the stream
• Harder to detect access patterns or correlations
• Not all prefetchers are affected
– Correlation prefetchers are affected: GHB
– One Block Lookahead prefetchers are not affected: Tagged
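The effect can be sketched with a toy stride detector. This is a simplified stand-in for a correlation engine, not the GHB algorithm itself; the eight tiles, the modulo interleaving and all function names are assumptions made for the example.

```python
NUM_TILES = 8  # assumption: eight tiles with modulo interleaving of 2-unit blocks


def home_tile(addr):
    """Tile whose L2 slice is assumed to hold addr."""
    return (addr // 2) % NUM_TILES


def detect_stride(history):
    """Return the stride if the last two deltas agree, else None."""
    if len(history) < 3:
        return None
    d1 = history[-2] - history[-3]
    d2 = history[-1] - history[-2]
    return d2 if d1 == d2 else None


def accesses_until_detection(stream, key):
    """Count accesses consumed before any observer selected by key() sees a stride."""
    histories = {}
    for n, addr in enumerate(stream, start=1):
        history = histories.setdefault(key(addr), [])
        history.append(addr)
        if detect_stride(history) is not None:
            return n
    return None


stream = list(range(0, 64, 2))  # constant stride of 2
global_n = accesses_until_detection(stream, key=lambda addr: 0)  # one observer sees all
tile_n = accesses_until_detection(stream, key=home_tile)         # one observer per tile
print(global_n, tile_n)
```

In this toy setup the single observer needs three accesses to confirm the stride, while the per-tile observers need seventeen, and the stride they eventually see (16) is not the one the core is following.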
13
Prefetch Distributed Memory Systems
• Request generation phase
[Figure: four tiles, each with a CPU, an L1 cache and a prefetcher, connected to a distributed L2 memory; addresses @, @+2 and @+4 are shown. Label: Queue filtering]
14
Prefetch Queue Filtering Challenge
• Prefetch requests queued in distributed queues
• Independent engines generating requests
• Repeated requests can be queued
• In a centralized queue those would be merged
• Adverse effects:
– Power consumption
– Network contention
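A small sketch of the difference between per-tile queues and a single merged queue. The request lists, the tile numbering and both function names are made up for illustration; real prefetch queues also have finite capacity, which is ignored here.

```python
from collections import deque


def issue_distributed(requests_by_tile):
    """Each tile filters duplicates only inside its own queue."""
    issued = []
    for tile, requests in requests_by_tile.items():
        queue, pending = deque(), set()
        for addr in requests:
            if addr not in pending:      # local filtering only
                pending.add(addr)
                queue.append(addr)
        issued.extend((tile, addr) for addr in queue)
    return issued


def issue_centralized(requests_by_tile):
    """A single queue merges repeated requests coming from all tiles."""
    pending, issued = set(), []
    for tile, requests in requests_by_tile.items():
        for addr in requests:
            if addr not in pending:
                pending.add(addr)
                issued.append((tile, addr))
    return issued


# Hypothetical prefetch requests generated independently by three tiles.
requests = {0: [0x100, 0x140], 1: [0x140, 0x180], 2: [0x180, 0x100]}
distributed = issue_distributed(requests)
centralized = issue_centralized(requests)
print(len(distributed), len(centralized))
```

Here the distributed queues issue six requests for three distinct blocks, whereas the centralized queue issues three; the extra three are the repeated requests that cost power and network bandwidth.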
15
Prefetch Distributed Memory Systems
• Evaluation phase
[Figure: four tiles, each with a CPU, an L1 cache and a prefetcher, connected to a distributed L2 memory; addresses @, @+2 and @+4 and an L1 miss for @+2 are shown. Label: Dynamic profiling]
16
Dynamic Profiling Challenge
• Prefetch requests generated in one tile
• Dynamic profiling information in another tile
• Erroneous profiling in the local tile
• Techniques using this information may work erroneously
– Filtering
– Throttling
– Concrete prefetching engines
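The misattribution can be sketched as follows, assuming four tiles, modulo interleaving and a hypothetical engine that prefetches the next two blocks on every demand access. None of these parameters come from the slides; `useful_origin` is a reference counter added only to show what the local counters miss.

```python
NUM_TILES = 4  # assumption: four tiles, as in the diagram


def home_tile(addr):
    """Assumed modulo interleaving of 2-unit blocks over the tiles."""
    return (addr // 2) % NUM_TILES


issued = [0] * NUM_TILES         # prefetches generated by each tile
useful_local = [0] * NUM_TILES   # useful prefetches, counted where the hit is observed
useful_origin = [0] * NUM_TILES  # useful prefetches, credited to the generating tile
origin_of = {}                   # prefetched address -> tile that generated it


def demand_access(addr):
    tile = home_tile(addr)
    if addr in origin_of:                      # the block had been prefetched
        useful_local[tile] += 1
        useful_origin[origin_of.pop(addr)] += 1
    for target in (addr + 2, addr + 4):        # hypothetical degree-2 sequential engine
        if target not in origin_of:
            origin_of[target] = tile
            issued[tile] += 1


for addr in (0, 2, 4):
    demand_access(addr)
print(issued, useful_local, useful_origin)
```

In this run tile 0 issued two prefetches and both were useful, yet its local counter reports none, because the hits were observed in tiles 1 and 2. A throttling or filtering mechanism reading tile 0's local counters would conclude that its prefetches are useless.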
18
Challenge evaluation methodology
• Three environments to test the challenges
• Pattern Detection Challenge: Ideal Prefetcher
– Prefetcher that is aware of the whole memory stream
– No extra network contention added to the system
– No extra power consumed
– Requests classified by core identifier, to preserve the original stream of each core
• Prefetcher used to test: Global History Buffer
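Why the classification by core identifier matters can be shown with a short sketch. The request format `(core_id, address)` and the two example streams are assumptions for illustration only.

```python
from collections import defaultdict


def deltas(addresses):
    """Differences between consecutive addresses."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


def classify_by_core(requests):
    """Rebuild each core's original stream from an interleaved request list."""
    streams = defaultdict(list)
    for core_id, addr in requests:
        streams[core_id].append(addr)
    return dict(streams)


# Two cores with strides 2 and 6, interleaved as a global observer would see them.
requests = [(0, 0), (1, 100), (0, 2), (1, 106), (0, 4), (1, 112)]
mixed = deltas([addr for _, addr in requests])
per_core = {core: deltas(s) for core, s in classify_by_core(requests).items()}
print(mixed)
print(per_core)
```

The mixed stream shows no repeating delta, while the per-core streams recover the constant strides of each core.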
19
Pattern Detection Challenge
20
Challenge evaluation methodology
• Three environments to test the challenges
• Prefetch Queue Filtering: Centralized queue
– All the requests sent to a centralized queue
– Repeated requests are merged
– No extra network contention added to the system
– No extra power consumed
– Repeated requests are not issued
• Prefetcher used to test: Tagged prefetcher
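For reference, a minimal sketch of tagged prefetching as it is commonly described: the next block is prefetched on a demand miss and on the first demand hit to a prefetched block. It assumes an unbounded cache with no evictions and is not necessarily the variant simulated by the authors; the class and method names are hypothetical.

```python
class TaggedPrefetcher:
    """Toy one-block-lookahead prefetcher with a tag bit per block."""

    def __init__(self):
        self.cache = {}        # block -> tag bit (True = prefetched, not yet referenced)
        self.prefetches = []   # blocks prefetched, in order

    def _prefetch(self, block):
        if block not in self.cache:
            self.cache[block] = True
            self.prefetches.append(block)

    def access(self, block):
        if block not in self.cache:    # demand miss: fetch block, prefetch the next one
            self.cache[block] = False
            self._prefetch(block + 1)
            return "miss"
        if self.cache[block]:          # first demand hit on a prefetched block
            self.cache[block] = False
            self._prefetch(block + 1)
        return "hit"


prefetcher = TaggedPrefetcher()
results = [prefetcher.access(block) for block in (10, 11, 12, 13)]
print(results, prefetcher.prefetches)
```

Because each decision depends only on the block being accessed, this kind of engine does not need to see the history of the stream.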
21
Prefetch Queue Filtering Challenge
22
Challenge evaluation methodology
• Three environments to test the challenges
• Dynamic Profiling Challenge: Hardware counters
– For each statistic and core, add a hardware counter
– Statistics: useful prefetches and useless prefetches
– Use the ID of the origin core to classify the statistic
– Quantify the error for each core by:
* Where statistic is useful or useless prefetches
• Prefetcher used to test: Tagged Prefetcher
23
Dynamic Profiling Challenge
25
Experimental framework
• gem5
– 64 x86 CPUs
– Ruby memory system
– L2 prefetchers
– MOESI coherence protocol
– Garnet network simulator
• PARSEC 2.1
26
Simulation environment
28
Pattern Detection Challenge
29
Prefetch Queue Filtering Challenge
30
Dynamic Profiling Challenge
32
Facing the challenges
• There are two main options
– Redesign the entire prefetch philosophy
– Adapt the current techniques to work with DSMs
• Moreover, there are two main directions
– Centralize the information (handicap: increased communication)
– Distribute the prefetcher (handicap: distributing the prefetcher smartly)
34
Conclusions
• Three challenges when prefetching in DSMs
– Pattern Detection Challenge
– Prefetch Queue Filtering Challenge
– Dynamic Profiling Challenge
• Challenge evaluation methodology
• Directions for future investigators
• There are no evident solutions for them
• Not solving them -> limited prefetch performance
35
Q & A