Prefetching Challenges in Distributed Memories for CMPs Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech

Post on 03-Jan-2016

Outline

Introduction

Naming the challenges

Challenge evaluation methodology

Experimental framework

Challenge Quantification

Facing the Challenges

Conclusions

Introduction

Prefetching

• Reduces memory latency

• Brings the data the CPU will need next into a closer cache

• Increases the hit ratio

• Implemented in most commercial processors

• Erroneous prefetching may cause
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption

Motivation

• The number of cores on a single chip grows every year
– Nehalem: 4~6 cores
– Tilera: 64~100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores

Prefetch in CMPs

• Useful prefetches improve performance
– Avoid network latency
– Reduce memory access latency

• Useless prefetches degrade performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests

Prefetch adverse behaviors

M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modelling Practice and Theory (SIMPAT), 2014.

Distributed memories

• Distribution of the memory access pattern:

@ @+2 @+4 @+6 @+8 @+10 @+12 @+14

TILE 00: @ | TILE 01: @+2 | TILE 02: @+4 | TILE 03: @+6
TILE 04: @+8 | TILE 05: @+10 | TILE 06: @+12 | TILE 07: @+14
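The spreading of one core's access stream over the tiles can be sketched in a few lines. This is only an illustration: the modulo interleaving, the interleaving step of 2 and the names `home_tile` and `split_stream` are assumptions chosen to reproduce the tile layout shown on the slide, not the mapping used by the authors.

```python
NUM_TILES = 8      # assumed: the eight tiles shown on the slide
INTERLEAVE = 2     # assumed: address step that moves to the next tile


def home_tile(addr, base=0):
    """Return the tile that owns `addr` under a simple modulo interleaving."""
    return ((addr - base) // INTERLEAVE) % NUM_TILES


def split_stream(stream, base=0):
    """Group a core's access stream by the tile that receives each access."""
    per_tile = {tile: [] for tile in range(NUM_TILES)}
    for addr in stream:
        per_tile[home_tile(addr, base)].append(addr)
    return per_tile


# One core walking @, @+2, @+4, ... with @ = 0, for two full rounds of tiles.
stream = [2 * i for i in range(2 * NUM_TILES)]
per_tile = split_stream(stream)
for tile, seen in per_tile.items():
    print(f"TILE {tile:02d} sees {seen}")
```

Under these assumptions every tile receives only one access out of eight, so no single tile observes the stride-2 walk that the core actually performs.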

Naming the challenges

Prefetch Distributed Memory Systems

• Analysis phase

[Figure: four tiles (CPU, L1, PREFETCH) on top of a DISTRIBUTED L2 MEMORY. Labels: "@", "L1 MISS for @", "Distributed patterns"]

Pattern Detection Challenge

• The memory stream is distributed across the tiles

• Each prefetcher is aware of only part of the stream

• Access patterns and correlations are harder to detect

• Not all prefetchers are affected
– Correlation prefetchers are affected: GHB
– One Block Lookahead prefetchers are not affected: Tagged
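A toy example may help show why a correlation prefetcher suffers when it sees only part of the stream. The sketch below uses a plain successor table as a stand-in for a correlation prefetcher; it is far simpler than a real GHB, and the block sequence, the tile count and the `block % NUM_TILES` home mapping are all hypothetical.

```python
def train_successors(stream):
    """Learn, for each block, the block that followed it most recently."""
    table = {}
    for prev, nxt in zip(stream, stream[1:]):
        table[prev] = nxt
    return table


NUM_TILES = 4                       # hypothetical tile count
pattern = [10, 37, 52, 81]          # hypothetical repeating block sequence
global_stream = pattern * 3

# Ground truth: what really follows each block in the core's own stream.
true_next = train_successors(global_stream)

# Each tile only observes the blocks it is home for (block % NUM_TILES).
tile_tables = {}
for tile in range(NUM_TILES):
    local_view = [blk for blk in global_stream if blk % NUM_TILES == tile]
    tile_tables[tile] = train_successors(local_view)

# Count how many blocks a tile-local table predicts correctly.
correct = sum(
    1
    for blk in pattern
    if tile_tables[blk % NUM_TILES].get(blk) == true_next[blk]
)
print(f"correct tile-local predictions: {correct} of {len(pattern)}")
```

A table trained on the whole stream recovers the repeating sequence, while each tile-local table learns successors that exist only in its own partial view.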

Prefetch Distributed Memory Systems

• Request generation phase

[Figure: four tiles (CPU, L1, PREFETCH) on top of a DISTRIBUTED L2 MEMORY. Labels: "@", "@+2", "@+4", "Queue filtering"]

Prefetch Queue Filtering Challenge

• Prefetch requests are queued in distributed queues

• Independent engines generate the requests

• Repeated requests can be queued

• In a centralized queue those would be merged

• Adverse effects:
– Power consumption
– Network contention
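The difference between distributed and centralized filtering can be illustrated with a small count of issued requests. This is a sketch under simplifying assumptions: queues are modelled as plain lists, each tile filters duplicates only within its own queue, and the addresses and function names are invented for the example.

```python
def issued_distributed(queues):
    """Each tile filters duplicates only inside its own queue."""
    return sum(len(set(queue)) for queue in queues.values())


def issued_centralized(queues):
    """A single shared queue merges duplicates across all tiles."""
    merged = set()
    for queue in queues.values():
        merged.update(queue)
    return len(merged)


# Hypothetical prefetch requests generated by three independent engines.
queues = {
    0: [100, 102, 102],
    1: [102, 104],
    2: [104, 102],
}

distributed = issued_distributed(queues)
centralized = issued_centralized(queues)
print(f"distributed queues issue {distributed} requests")
print(f"a centralized queue would issue {centralized} requests")
print(f"redundant requests: {distributed - centralized}")
```

Every redundant request in this model corresponds to extra network traffic and energy that a centralized queue would have avoided.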

Prefetch Distributed Memory Systems

• Evaluation phase

[Figure: four tiles (CPU, L1, PREFETCH) on top of a DISTRIBUTED L2 MEMORY. Labels: "@", "@+2", "@+4", "L1 MISS for @ + 2", "Dynamic profiling"]

Dynamic Profiling Challenge

• Prefetch requests are generated in one tile

• The dynamic profiling information is collected in another tile

• The profiling each tile holds about itself is therefore erroneous

• Techniques that use this information may misbehave
– Filtering
– Throttling
– Specific prefetching engines

Challenge evaluation methodology

• Three environments to test the challenges

• Pattern Detection Challenge: ideal prefetcher
– A prefetcher that is aware of the whole memory stream
– No extra network contention added to the system
– No extra power consumed
– Requests are classified by core identifier, to preserve the original stream of each core

• Prefetcher used for the test: Global History Buffer (GHB)
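Classifying requests by core identifier can be pictured as regrouping an interleaved request list into one ordered stream per core. The sketch below assumes requests arrive as `(core_id, address)` pairs; that format, the addresses and the function name are hypothetical, not taken from the authors' simulator.

```python
from collections import defaultdict


def per_core_streams(requests):
    """Rebuild each core's ordered stream from (core_id, address) requests."""
    streams = defaultdict(list)
    for core_id, addr in requests:
        streams[core_id].append(addr)
    return dict(streams)


# Hypothetical interleaving of two cores' requests as seen by the memory.
requests = [(0, 0), (1, 500), (0, 2), (1, 508), (0, 4), (1, 516)]
streams = per_core_streams(requests)

for core_id, stream in streams.items():
    strides = [b - a for a, b in zip(stream, stream[1:])]
    print(f"core {core_id}: stream {stream}, strides {strides}")
```

Once the streams are separated, the constant stride of each core becomes visible again, which is the property the ideal prefetcher relies on.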

Pattern Detection Challenge

Challenge evaluation methodology

• Three environments to test the challenges

• Prefetch Queue Filtering Challenge: centralized queue
– All requests are sent to a centralized queue
– Repeated requests are merged
– No extra network contention added to the system
– No extra power consumed
– Repeated requests are not issued

• Prefetcher used for the test: Tagged prefetcher

Prefetch Queue Filtering Challenge

Challenge evaluation methodology

• Three environments to test the challenges

• Dynamic Profiling Challenge: hardware counters
– One hardware counter is added for each statistic and core
– Statistics: useful prefetches and useless prefetches
– The id of the origin core is used to classify each statistic
– The error is quantified for each core by:

* where statistic is the count of useful or useless prefetches

• Prefetcher used for the test: Tagged prefetcher
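The effect of classifying statistics by origin core can be sketched with two sets of counters over the same events. The event format, the field names `origin` and `observer`, and the sample events are assumptions made for this illustration; the sketch only compares raw counts and does not attempt the error metric from the slide.

```python
from collections import Counter


def count_by(events, key):
    """Count events per (tile, statistic), attributing them by `key`."""
    counts = Counter()
    for event in events:
        counts[(event[key], event["stat"])] += 1
    return counts


# Hypothetical prefetch outcomes: issued by `origin`, observed at `observer`.
events = [
    {"origin": 0, "observer": 1, "stat": "useful"},
    {"origin": 0, "observer": 2, "stat": "useful"},
    {"origin": 0, "observer": 0, "stat": "useless"},
    {"origin": 1, "observer": 0, "stat": "useful"},
]

by_origin = count_by(events, "origin")      # counters keyed by origin core
by_observer = count_by(events, "observer")  # what each tile sees locally

print("core 0, useful, by origin:  ", by_origin[(0, "useful")])
print("core 0, useful, seen locally:", by_observer[(0, "useful")])
```

In this toy trace core 0 issued two useful prefetches, yet its own tile records only one useful prefetch, and that one belongs to core 1.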

Dynamic Profiling Challenge

Experimental framework

• gem5
– 64 x86 CPUs
– Ruby memory system
– L2 prefetchers
– MOESI coherence protocol
– Garnet network simulator

• PARSEC 2.1

Simulation environment

Challenge Quantification

Pattern Detection Challenge

Prefetch Queue Filtering Challenge

Dynamic Profiling Challenge

Facing the challenges

• There are two main options
– Redesign the entire prefetch philosophy
– Adapt the current techniques to work with DSMs

• Moreover, there are two main directions
– Centralize the information (drawback: increased communication)
– Distribute the prefetcher (drawback: the prefetcher must be distributed smartly)

Conclusions

• Three challenges when prefetching in DSMs
– Pattern Detection Challenge
– Prefetch Queue Filtering Challenge
– Dynamic Profiling Challenge

• Challenge evaluation methodology

• Directions for future researchers

• There are no evident solutions for them

• Leaving them unsolved limits prefetch performance

Q & A
