
Page 1: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Rajeev Balasubramonian

School of Computing, University of Utah

July 1st 2004

Page 2: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Billion-Transistor Chips

• Partitioned architectures: small computational units connected by a communication fabric

Small computational units with limited functionality → fast clocks, low design effort, low power

Numerous computational units → high parallelism

Page 3: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


The Communication Bottleneck

• Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA’00][Ho, Proc. IEEE’01]

A 30-cycle delay to cross the chip is projected within 10 years

1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA’04]

Page 4: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Cache Design

[Figure: centralized L1D cache — address transfer to the cache: 6 cycles; RAM access: 6 cycles; data transfer back to the cluster: 6 cycles.]

Centralized cache: 18-cycle access (12 cycles for communication)

Page 5: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Cache Design

[Figure: the centralized cache from the previous slide — 18-cycle access (12 cycles for communication) — contrasted with a decentralized design that replicates L1D banks across the clusters.]

Page 6: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Research Goals

• Identify bottlenecks in cache access

• Design cluster prefetch, a latency hiding mechanism

• Evaluate and compare centralized and decentralized designs

Page 7: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Outline

• Motivation

• Evaluation platform

• Cluster prefetch

• Centralized vs. decentralized caches

• Conclusions

Page 8: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Clustered Microarchitectures

• Centralized front-end

• Dynamically steered (dependences & load; see the sketch below)

• Out-of-order issue and 1-cycle bypass within a cluster

• Hierarchical interconnect

[Figure: clustered microarchitecture — a centralized instruction fetch unit feeds the clusters over a crossbar; the clusters are connected by a ring interconnect; the L1D and LSQ are centralized.]
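The slide names a steering policy driven by dependences and cluster load but not its details, so the following is only a rough sketch: the threshold and tie-breaking are illustrative assumptions, while the cluster count and issue queue size follow the simulation parameters on slide 9.

```c
/* Rough sketch of a dependence- and load-based steering heuristic.
 * The threshold and fallback policy are illustrative assumptions,
 * not the evaluated processor's exact heuristic. */
#define NUM_CLUSTERS 16
#define IQ_CAPACITY  30   /* issue queue entries per cluster */

typedef struct {
    int iq_occupancy;     /* instructions waiting to issue */
} cluster_t;

/* Prefer the cluster that produces a source operand, to avoid an
 * inter-cluster transfer (2-10 cycles); fall back to the least
 * loaded cluster when that queue is nearly full (or no producer). */
int steer(const cluster_t clusters[], int producer /* -1 if none */)
{
    if (producer >= 0 && clusters[producer].iq_occupancy < IQ_CAPACITY - 2)
        return producer;

    int best = 0;
    for (int c = 1; c < NUM_CLUSTERS; c++)
        if (clusters[c].iq_occupancy < clusters[best].iq_occupancy)
            best = c;
    return best;
}
```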

Page 9: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

University of Utah 9

Simulation Parameters

• Simplescalar-based simulator

• In-flight instruction window of 480

• 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind

• Inter-cluster latencies between 2 and 10 cycles

• Primary focus on SPEC-FP programs

Page 10: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Steps Involved in Cache Access

[Figure: steps involved in a centralized cache access — instr dispatch → effective address computation (at the cluster) → effective address transfer (to the LSQ) → memory disambiguation → RAM access → data transfer (back to the cluster).]

Page 11: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Lifetime of a Load

[Bar chart: average cycles per load, broken down by stage — transfer of instr to cluster, eff. addr. computation, addr. transfer to LSQ, memory dependence resolution, cache access, and data transfer from LSQ to cluster (roughly 2, 25, 7, 34, 26, and 5 cycles, respectively) — totaling 98 cycles on average. Y-axis: average cycles per load, 0-120.]

Page 12: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Load Address Prediction

[Figure: lifetime of a load through the centralized LSQ/L1D — dispatch at cycle 0; effective address transfer reaches the LSQ at cycle 27; cache access at cycle 68; data transfer back to the cluster completes at cycle 94.]

Page 13: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Load Address Prediction

[Figure: the same load with an address predictor — dispatch at cycle 0; the predicted address lets the cache access begin at cycle 0, and the data transfer to the cluster completes by cycle 26; the actual effective address transfer still reaches the LSQ at cycle 27 (to verify the prediction), compared to a cycle-68 access and cycle-94 data arrival without prediction.]
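The figures above amount to simple cycle accounting; a tiny worked example using the slides' numbers (the stage boundaries are the slides'; this is not a simulator trace):

```c
/* Worked cycle accounting for the load in the two figures above.
 * Cycle numbers are taken directly from the slides. */
#include <stdio.h>

int main(void)
{
    /* Without address prediction (slide 12). */
    int dispatch        = 0;
    int addr_at_lsq     = 27;  /* eff. addr. computed and transferred */
    int cache_access    = 68;  /* after memory disambiguation */
    int data_at_cluster = 94;  /* data transferred back */

    /* With cluster prefetch (slide 13): the predicted address lets the
     * cache access begin at dispatch, leaving only the access and the
     * return transfer on the critical path. */
    int pred_access = 0;
    int pred_data   = 26;

    printf("addr at LSQ: cycle %d, cache access: cycle %d\n",
           addr_at_lsq, cache_access);
    printf("baseline load latency:  %d cycles\n", data_at_cluster - dispatch);
    printf("predicted load latency: %d cycles\n", pred_data - pred_access);
    printf("cycles hidden by cluster prefetch: %d\n",
           data_at_cluster - pred_data);
    return 0;
}
```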

Page 14: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Memory Dependence Speculation

• To allow early cache access, loads must issue before the addresses of earlier stores have been resolved

• High-confidence store address predictions are employed for disambiguation

• Stores that have never forwarded results within the LSQ are ignored

Cluster Prefetch: Combination of Load Address Prediction and Memory Dependence Speculation
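A minimal sketch of the dependence filter just described — a high-confidence store-address check plus a per-store "has ever forwarded" bit. The table size, indexing, and function names are illustrative assumptions; the slides specify only the mechanisms, not this organization.

```c
/* Sketch of the memory dependence filter described above. */
#include <stdbool.h>
#include <stdint.h>

#define MDT_ENTRIES 4096   /* hypothetical table size */

/* One bit per store: has this store ever forwarded its result to a
 * later load within the LSQ? */
static bool has_forwarded[MDT_ENTRIES];

static unsigned mdt_index(uint64_t store_pc)
{
    return (unsigned)(store_pc >> 2) % MDT_ENTRIES;
}

/* Applied to each unresolved earlier store: the load's early cache
 * access may proceed past this store if either (a) the store's
 * address is known or predicted with high confidence and does not
 * overlap the load's, or (b) the store has never forwarded within
 * the LSQ and can be ignored. */
bool may_bypass_store(uint64_t store_pc,
                      bool addr_available,   /* known or high-confidence */
                      bool addrs_overlap)
{
    if (addr_available)
        return !addrs_overlap;
    return !has_forwarded[mdt_index(store_pc)];
}

/* Record an actual store-to-load forwarding event in the LSQ. */
void record_forwarding(uint64_t store_pc)
{
    has_forwarded[mdt_index(store_pc)] = true;
}
```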

Page 15: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Implementation Details

• Centralized table that maintains a stride and last address per load; the stride is confirmed by five consecutive accesses and cleared after five mispredicts (see the sketch below)

• Separate centralized table that maintains a single bit per entry to indicate stores that pose conflicts

• Each mispredict flushes all subsequent instrs

• Storage overhead: 18KB
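A minimal sketch of the predictor table just described: per-PC last address and stride, trusted after five consecutive matching accesses and cleared after five mispredicts. The table size, indexing, and field names are illustrative assumptions; the slide gives only the two five-counters and the 18KB total storage.

```c
/* Sketch of the load-address predictor described above. */
#include <stdbool.h>
#include <stdint.h>

#define APT_ENTRIES 2048   /* hypothetical; slide reports 18KB total */
#define CONF_MAX    5      /* stride trusted after 5 consecutive matches */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;   /* consecutive stride matches, 0..CONF_MAX */
    int      mispredicts;  /* entry cleared once this reaches 5 */
} apt_entry_t;

static apt_entry_t apt[APT_ENTRIES];

static apt_entry_t *entry_for(uint64_t pc)
{
    return &apt[(pc >> 2) % APT_ENTRIES];
}

/* At dispatch: predict only when the stride is fully confirmed, so
 * the cache access can begin before the real address is computed. */
bool predict_addr(uint64_t load_pc, uint64_t *pred)
{
    apt_entry_t *e = entry_for(load_pc);
    if (e->confidence < CONF_MAX)
        return false;
    *pred = e->last_addr + e->stride;
    return true;
}

/* When the real effective address arrives: train the entry (a
 * mispredicted access flushes all subsequent instructions). */
void train(uint64_t load_pc, uint64_t addr)
{
    apt_entry_t *e = entry_for(load_pc);
    int64_t observed = (int64_t)(addr - e->last_addr);

    if (observed == e->stride) {
        if (e->confidence < CONF_MAX)
            e->confidence++;
        e->mispredicts = 0;
    } else {
        e->confidence = 0;
        e->stride = observed;
        if (++e->mispredicts >= 5) {   /* clear the entry */
            e->stride = 0;
            e->mispredicts = 0;
        }
    }
    e->last_addr = addr;
}
```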

Page 16: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Performance Results

[Bar chart: IPC for applu, apsi, art, equake, fma3d, galgel, lucas, mesa, mgrid, swim, wupwise, and the harmonic mean (HM), under four configurations — base case; ld-addr pred only; st-addr and mem-dep pred only; and ld-addr, st-addr, and mem-dep pred combined. Y-axis: instructions per cycle (IPC), 0-3.]

Overall IPC improvement: 21%

Page 17: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Results Analysis

• Roughly half the programs improved IPC by >8%

• Load address prediction rate: 65%

• Store address prediction rate: 79%

• Stores likely to not pose conflicts: 59%

• Avg. number of mispredicts: 12K per 100M instrs

Page 18: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Decentralized Cache

Replicated Cache Banks

• Loads do not travel far

• Stores & cache refills are broadcast (see the sketch after the figure)

• Memory disambiguation is not accelerated

• Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc.

[Figure: decentralized cache — each cluster group has its own replicated L1D bank and LSQ.]
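A minimal sketch of the replicated-bank policy above — loads read only the local replica, while stores and refills are broadcast to every replica. The bank count matches the figure; sizes and names are illustrative.

```c
/* Sketch of replicated L1D banks: local reads, broadcast writes. */
#include <stdint.h>

#define NUM_BANKS  4      /* replicas shown in the figure */
#define BANK_WORDS 4096   /* hypothetical bank capacity in words */

static uint32_t bank[NUM_BANKS][BANK_WORDS];

static unsigned word_index(uint32_t addr)
{
    return (addr / 4) % BANK_WORDS;
}

/* A load only travels to its local replica — this is the latency win. */
uint32_t cache_load(int local_bank, uint32_t addr)
{
    return bank[local_bank][word_index(addr)];
}

/* A store (or a cache refill) must be broadcast so the replicas stay
 * identical; the broadcast interconnect and the redundant writes are
 * exactly the overheads listed above. */
void cache_store(uint32_t addr, uint32_t v)
{
    for (int b = 0; b < NUM_BANKS; b++)
        bank[b][word_index(addr)] = v;
}
```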

Page 19: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Comparing Centralized & Decentralized

[Figure: a centralized cache (single L1D/LSQ) beside a decentralized cache (replicated L1D/LSQ per cluster group).]

IPCs without cluster prefetch: 1.43 (centralized) vs. 1.52 (decentralized)

IPCs with cluster prefetch: 1.73 (centralized) vs. 1.79 (decentralized)

Page 20: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Sensitivity Analysis

• Results verified for processor models with varying resources and interconnect latencies

• Evaluations on SPEC-Int: the address prediction rate is only 38% → modest speedups:

twolf (7%), parser (9%); crafty, gcc, vpr (3-4%); rest (<2%)

Page 21: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Related Work

• Modest speedups with decentralized caches: Racunas and Patt [ICS ’03], for dynamic clustered processors; Gibert et al. [MICRO ’02], for VLIW clustered processors

• Gibert et al. [MICRO ’03]: compiler-managed L0 buffers for critical data

Page 22: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures


Conclusions

• Address prediction and memory dependence speculation can hide latency to cache banks; prediction rate of 66% for SPEC-FP and IPC improvement of 21%

• Additional benefits from decentralization are modest

• Future work: build better predictors; study the impact on power consumption [WCED ’04]
