stream register files with indexed access

Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally


Page 1: Stream Register Files with Indexed Access

Stream Register Files with Indexed Access

Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally

Page 2: Stream Register Files with Indexed Access

HPCA-10 NSJ 2

Scaling Trends

• ILP increasingly harder and more expensive to extract

• Graphics processors exploit data parallelism

[Figure: CPUs - SPECint2000 per MHz (0.0 to 0.6), Jan-85 to Jun-01: 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4]

[Figure: Graphics Processors - Vertices per MHz (0 to 0.8), Jan-85 to Jun-01: nVidia NV10, NV35]

CPU data courtesy of Francois Labonte, Stanford University

Page 3: Stream Register Files with Indexed Access

Renewed Interest in Data Parallelism

• Data parallel application classes
– Media, signal, network processing, scientific simulations, encryption etc.

• High-end vector machines
– Have always been data parallel

• Academic research
– Stanford Imagine, Berkeley V-IRAM, programming GPUs etc.

• “Main-stream” industry
– Sony Emotion Engine, Tarantula etc.

Page 4: Stream Register Files with Indexed Access

Storage Hierarchy

• Bandwidth taper

• Only supports sequential streams/vectors

• But many data parallel apps have
– Data reorderings
– Irregular data structures
– Conditional accesses

[Figure: storage hierarchy with DRAM, cache, stream/vector storage, and arithmetic units (+, x)]

Page 5: Stream Register Files with Indexed Access

Sequential Streams/Vectors Inefficient

[Figure: timeline across Memory/cache, Stream/vector storage, and Compute units for 4x4 matrices a, b, and c. Streams a and b appear in row-major order (a00, a01, a02, a03, a10, ...); after a Reorder step, streams b and c appear in column-major order (b00, b10, b20, b30, b01, ...).]

Evaluate arbitrary order access to streams

Page 6: Stream Register Files with Indexed Access


Outline

• Stream processing overview

• Applications

• Implementation

• Results

• Conclusion

Page 7: Stream Register Files with Indexed Access

Stream Programming

• Streams of records passing through compute kernels

• Parallelism
– Across stream elements
– Across kernels

• Locality
– Within kernels
– Between kernels

[Figure: three chained FFT_stage kernels with input streams in1 and in2 and an output stream]
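The idea of records flowing through chained kernels can be sketched in a few lines of Python. This is only an illustration, not the Imagine programming model: the `Record` type, the `butterfly` body, and the `pipeline` helper are all hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[float, float]  # hypothetical record type: a pair of values


def kernel(fn: Callable[[Record], Record], stream: Iterable[Record]) -> Iterator[Record]:
    """Apply one compute kernel to every record of a stream, in order."""
    for record in stream:
        yield fn(record)


def butterfly(record: Record) -> Record:
    """Stand-in for one stage body (an assumption): sum and difference of a pair."""
    a, b = record
    return (a + b, a - b)


def pipeline(in1: Iterable[float], in2: Iterable[float]) -> list:
    """Chain three kernels; each consumes the stream the previous one produces."""
    stream = zip(in1, in2)  # two input streams form one stream of records
    for _ in range(3):
        stream = kernel(butterfly, stream)
    return list(stream)
```

Each record is independent of the others (parallelism across stream elements), and the intermediate streams never need to leave the producer/consumer pair (locality between kernels).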

Page 8: Stream Register Files with Indexed Access

Bandwidth Hierarchy

• Stream programming is well matched to the bandwidth hierarchy

[Figure: timeline of three FFT_stage kernels across Memory, Stream register file (SRF), and Compute units]

Page 9: Stream Register Files with Indexed Access

Stream Processors

• Several lanes
– Execute in SIMD
– Operate on records

• Inter-cluster network

[Figure: lanes 0 to N-1, each pairing a compute cluster with an SRF bank; also shown are the inter-cluster network, the memory switch, and the memory system]


Page 11: Stream Register Files with Indexed Access

Stream-Level Data Reuse

• Sequential streams only capture in-order reuse

• Arbitrary access patterns in SRF capture more of available temporal locality

[Figure: taxonomy of stream data reuse]
Stream data reuse
– Sequential (in-order) reuse, e.g.: linear streams
– Non-sequential reuse
  – Reordered reuse, e.g.: 2-D, 3-D accesses, multi-grid
  – Intra-stream reuse, e.g.: irregular neighborhoods, table lookups

Page 12: Stream Register Files with Indexed Access

Reordered Reuse

[Figure (build 1 of 2): timeline across Memory/cache, Stream register file (SRF), and Compute clusters for 4x4 matrices a, b, and c. Streams a and b appear in row-major order, streams b and c in column-major order; labels: 1D FFT (twice), Reorder.]

• Indexed SRF access eliminates reordering through memory

Page 13: Stream Register Files with Indexed Access

Reordered Reuse

[Figure (build 2 of 2): timeline across Memory/cache, Stream register file (SRF), and Compute clusters for 4x4 matrices a, b, and c. Streams a and b appear in row-major order, streams b and c in column-major order; labels: 1D FFT (twice), Reorder (twice).]

• Indexed SRF access eliminates reordering through memory
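A minimal sketch of the reordering idea, assuming the SRF can be modeled as a flat Python list holding a row-major matrix. The helper names (`column_major_indices`, `indexed_read`) are hypothetical; the point is only that a column-order read becomes an index computation rather than a round trip through memory.

```python
def column_major_indices(n_rows: int, n_cols: int) -> list:
    """Addresses that walk a row-major n_rows x n_cols array column by column."""
    return [r * n_cols + c for c in range(n_cols) for r in range(n_rows)]


def indexed_read(srf: list, indices: list) -> list:
    """Model of indexed SRF access: fetch words in the order the indices arrive."""
    return [srf[i] for i in indices]


# Row-major contents as left in the SRF by the first pass (names match the slide).
N = 4
srf = [f"b{r}{c}" for r in range(N) for c in range(N)]

# The second pass reads the same data in column order without rewriting it.
column_order = indexed_read(srf, column_major_indices(N, N))
```

Here `column_order` begins `b00, b10, b20, b30, b01`, the column-major order shown in the figure, while `srf` itself is never rearranged.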

Page 14: Stream Register Files with Indexed Access

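Intra-stream Reuse

• Indexed SRF access eliminates
– Replication in SRF
– Redundant memory transfers

[Figure (build 1 of 2): timeline across Memory/cache, Stream register file (SRF), and Compute clusters. Records A to H are held in memory; a Replicate step expands them into a longer stream in which records such as A, B, and D appear more than once before Compute.]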

Page 15: Stream Register Files with Indexed Access

Intra-stream Reuse

[Figure (build 2 of 2): timeline across Memory/cache, Stream register file (SRF), and Compute clusters. Records A to H are held in memory; Replicate steps expand them into longer streams in which records such as A, B, and D appear more than once before Compute.]

• Indexed SRF access eliminates
– Replication in SRF
– Redundant memory transfers
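The saving can be illustrated with a small sketch. The graph below is invented for the example (node names echo the slide's A to D labels, but the edges and values are assumptions), and the two helpers are hypothetical: one builds the replicated stream a sequential SRF would need, the other stores each record once and gathers through indices.

```python
# Hypothetical 4-node neighborhood graph; edges and values are illustrative only.
VALUES = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
NEIGHBORS = {"A": ["B", "D"], "B": ["A", "C", "D"], "C": ["B"], "D": ["A", "B"]}


def replicated_stream(values: dict, neighbors: dict) -> list:
    """Sequential SRF: every neighbor value is copied into the stream in use order."""
    return [values[n] for node in neighbors for n in neighbors[node]]


def indexed_gather(values: dict, neighbors: dict):
    """Indexed SRF: each value is stored once and read through computed indices."""
    names = list(values)
    srf = [values[name] for name in names]            # one copy per record
    slot = {name: i for i, name in enumerate(names)}  # record name -> SRF address
    indices = [slot[n] for node in neighbors for n in neighbors[node]]
    return srf, [srf[i] for i in indices]


replicated = replicated_stream(VALUES, NEIGHBORS)
srf, gathered = indexed_gather(VALUES, NEIGHBORS)
```

Both paths deliver the same eight operands to the kernel, but the sequential version holds eight words of data while the indexed version holds four.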

Page 16: Stream Register Files with Indexed Access

Conditional Accesses

• Fine-grain conditional accesses
– Expensive in SIMD architectures
– Translate to conditional address computation
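As a rough illustration of "conditional address computation", here is a merge step (merge-sort is one of the benchmarks later in the talk) written so that the comparison result only decides which read index advances. This is a plain Python sketch with hypothetical names, not the kernel used in the evaluation.

```python
def merge(a: list, b: list) -> list:
    """Merge two sorted lists; the condition feeds index arithmetic, not control flow."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        take_a = a[i] <= b[j]
        out.append(a[i] if take_a else b[j])  # select between the two candidates
        i += int(take_a)                      # conditional address update for stream a
        j += int(not take_a)                  # conditional address update for stream b
    out.extend(a[i:])                         # drain whichever input remains
    out.extend(b[j:])
    return out
```

With sequential streams each lane would have to consume its inputs conditionally; with indexed access every lane runs the same instructions and differs only in the addresses it computes.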


Page 18: Stream Register Files with Indexed Access

Base Architecture

• Each SRF bank accesses a block of b contiguous words

[Figure: compute clusters 0 to N-1, each with its own SRF bank; the path between cluster and bank is labeled b*W; inter-cluster network]

Page 19: Stream Register Files with Indexed Access

Indexed SRF Architecture

• Address path from clusters

• Lower indexed access bandwidth

[Figure: compute clusters 0 to N-1 and SRF banks 0 to N-1 with address FIFOs; inter-cluster network]

Page 20: Stream Register Files with Indexed Access

Base SRF Bank

• Several SRAM sub-arrays

• Each access is to one sub-array

[Figure: SRF bank made of sub-arrays 0 to 3 with local word-line drivers, connected to the compute cluster]

Page 21: Stream Register Files with Indexed Access

Indexed SRF Bank

• Extra 8:1 mux at sub-array output
– Allows 4x 1-word accesses

[Figure: SRF bank with sub-arrays 0 to 3, each with its own pre-decode and row decoder, plus an output mux, connected to the compute cluster]
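A toy model can show why four independent 1-word accesses are not always served together. It rests on assumptions the slides do not state: that each sub-array serves one word per cycle, and that word addresses map to sub-arrays by `addr % 4`. Both the mapping and the helper names are hypothetical.

```python
from collections import Counter

NUM_SUBARRAYS = 4  # the slide's figure shows sub-arrays 0 to 3


def subarray_of(word_addr: int) -> int:
    """Assumed word-interleaved mapping; the real mapping is not given in the slides."""
    return word_addr % NUM_SUBARRAYS


def cycles_needed(word_addrs: list) -> int:
    """Cycles to serve a group of indexed reads if each sub-array serves one per cycle."""
    if not word_addrs:
        return 0
    per_subarray = Counter(subarray_of(a) for a in word_addrs)
    return max(per_subarray.values())
```

Under these assumptions, addresses 0, 1, 2, 3 are served in one cycle, while 0, 4, 8, 12 all land in the same sub-array and take four.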

Page 22: Stream Register Files with Indexed Access

Cross-lane Indexed SRF

• Address switch added

• Inter-cluster network used for cross-lane SRF data

[Figure: compute clusters and SRF banks with address FIFOs, an SRF address network, and the inter-cluster network]

Page 23: Stream Register Files with Indexed Access

Overhead - Area

• In-lane indexing overheads
– 11% over sequential SRF

• Per-sub-array independent addressing overheads

• Cross-lane indexing overheads
– 22% over sequential SRF

• Address switch

• 1.5% to 3% increase in die area (Imagine processor)

Page 24: Stream Register Files with Indexed Access

Overhead - Energy

• 0.1nJ (0.13µm) per indexed SRF access

• ~4x sequential SRF access

• > order of magnitude lower than DRAM access

• 0.25nJ per cache access

• Each indexed access replaces many SRF and DRAM/cache accesses
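The figures on this slide allow a quick back-of-the-envelope comparison. The sketch below takes "~4x" literally, so a sequential SRF access is assumed to cost about 0.025nJ; DRAM is left out because the slide gives only a bound, not a value. The function names and the example access mix are hypothetical.

```python
INDEXED_NJ = 0.1                 # per indexed SRF access (from the slide)
SEQUENTIAL_NJ = INDEXED_NJ / 4   # assumption: "~4x" taken as exactly 4x
CACHE_NJ = 0.25                  # per cache access (from the slide)


def energy_nj(indexed: int = 0, sequential: int = 0, cache: int = 0) -> float:
    """Total access energy in nJ for a given mix of accesses."""
    return indexed * INDEXED_NJ + sequential * SEQUENTIAL_NJ + cache * CACHE_NJ


def break_even_sequential_accesses() -> float:
    """Number of sequential SRF accesses that cost as much as one indexed access."""
    return INDEXED_NJ / SEQUENTIAL_NJ


# Hypothetical mix: one indexed access replacing one cache access plus one
# sequential SRF access for the same operand.
before = energy_nj(sequential=1, cache=1)
after = energy_nj(indexed=1)
```

In this example `before` is about 0.275nJ and `after` is 0.1nJ, so the indexed access already wins when it replaces a single cache access; against sequential SRF accesses alone it needs to replace roughly four.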


Page 26: Stream Register Files with Indexed Access

Benchmarks

• 64x64 2D FFT
– 2D accesses

• Rijndael (AES)
– Table lookups

• Merge-sort
– Fine-grain conditionals

• 5x5 convolution filter
– Regular neighborhood

• Irregular graph
– Irregular neighborhood access
– Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips

Page 27: Stream Register Files with Indexed Access

Machine Organizations

[Figure: three machine organizations]
– Base (Sequential SRF): compute clusters, SRF banks, inter-cluster net, memory switch, DRAM
– Base + Cache: the Base organization with a cache added
– Indexed SRF: the Base organization with an SRF address net added

Page 28: Stream Register Files with Indexed Access

Machine Parameters

Parameter   | Base                | Base + cache        | Indexed SRF
Technology  | 0.13µm, 1GHz (all organizations)
Compute     | 8 compute clusters, 32GFLOPs peak (all organizations)
SRF         | 128KB, 128GB/s seq. | 128KB, 128GB/s seq. | 128KB, 128GB/s seq., 128GB/s in-lane, 32GB/s x-lane
Cache       | -                   | 128KB, 16GB/s       | -
DRAM        | 9.14GB/s (all organizations)
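The bandwidth figures are easier to compare as words per lane per cycle. The conversion below is a sketch under stated assumptions: 1GB is taken as 10^9 bytes, words are taken as 32 bits (the backup slides label paths 32b and 128b), and the 8 lanes share bandwidth equally. The helper name is hypothetical.

```python
CLOCK_HZ = 1e9   # 1GHz, from the table
LANES = 8        # 8 compute clusters, from the table
WORD_BYTES = 4   # assumption: 32-bit words


def words_per_lane_per_cycle(gb_per_s: float) -> float:
    """Convert an aggregate bandwidth in GB/s to 32-bit words per lane per cycle."""
    bytes_per_cycle = gb_per_s * 1e9 / CLOCK_HZ
    return bytes_per_cycle / LANES / WORD_BYTES


srf_sequential = words_per_lane_per_cycle(128)   # sequential and in-lane indexed
srf_cross_lane = words_per_lane_per_cycle(32)    # cross-lane indexed
cache = words_per_lane_per_cycle(16)
dram = words_per_lane_per_cycle(9.14)
flops_per_cluster_per_cycle = 32e9 / CLOCK_HZ / LANES
```

Under these assumptions the SRF delivers 4 words per lane per cycle sequentially or in-lane and 1 word per lane per cycle across lanes, against 0.5 for the cache and roughly 0.29 for DRAM, which is the bandwidth taper the talk describes.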

Page 29: Stream Register Files with Indexed Access

Off-chip Memory Bandwidth

[Figure (build 1 of 2): normalized memory traffic (0 to 1) for ISRF on FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, IG_SCL]

Page 30: Stream Register Files with Indexed Access

Off-chip Memory Bandwidth

[Figure (build 2 of 2): normalized memory traffic (0 to 1) for ISRF and Cache on FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, IG_SCL]

Page 31: Stream Register Files with Indexed Access

Execution Time

[Figure: normalized execution time (0 to 1) for Base, Cache, and ISRF on FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, IG_SCL, broken down into kernel loop body, memory stall, SRF stall, and kernel overheads]


Page 33: Stream Register Files with Indexed Access

Conclusions

• Data parallelism increasingly important

• Current data parallel architectures inefficient for some application classes
– Irregular accesses

• Indexed SRF accesses
– Reduce memory traffic
– Reduce SRF data replication
– Efficiently support complex/conditional stream accesses

• Performance improvements
– 3% to 410% for target application classes

• Low implementation overhead
– 1.5% to 3% die area

Page 34: Stream Register Files with Indexed Access


Backups

Page 35: Stream Register Files with Indexed Access

Indexed Access Instruction Overhead

[Figure: Relative Instruction Counts of SRF Indexed Kernels (0 to 1.2) for FFT2D, Rijndael, Sort, IG_1, IG_2]

• Excludes address issue instructions

Page 36: Stream Register Files with Indexed Access

Kernel C API

    while(!eos(in)) {
        in >> a;
        LUT[a] >> b;
        c = foo(a, b);
        out << c;
    }

    LUT.index << a;
    Indep. instructions;
    LUT >> b;

• 2 separate instructions
– Address issue
– Data read

• Address-data separation
– May require loop unrolling, software pipelining etc.
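The split between address issue and data read can be mimicked with a small Python model. This is only an analogy to the `LUT.index << a` / `LUT >> b` pair above, not the Kernel C API itself: the `IndexedStream` class, its `issue` and `read` methods, and the `foo` body are hypothetical.

```python
from collections import deque


class IndexedStream:
    """Toy model of an indexed stream with separate address issue and data read."""

    def __init__(self, table):
        self._table = list(table)
        self._pending = deque()  # stands in for the address FIFO

    def issue(self, index: int) -> None:
        """Address issue, analogous to `LUT.index << a`."""
        self._pending.append(index)

    def read(self):
        """Data read, analogous to `LUT >> b`; returns data for the oldest address."""
        return self._table[self._pending.popleft()]


def foo(a, b):
    """Placeholder for the kernel's computation (an assumption)."""
    return a + b


def kernel(inputs, lut: IndexedStream) -> list:
    """Software-pipelined loop: the next address is issued before the current read."""
    inputs = list(inputs)
    out = []
    if not inputs:
        return out
    lut.issue(inputs[0])                  # prologue: first address goes out early
    for i, a in enumerate(inputs):
        if i + 1 < len(inputs):
            lut.issue(inputs[i + 1])      # overlap next address with current work
        b = lut.read()                    # data for the address issued one step ago
        out.append(foo(a, b))
    return out
```

Issuing one iteration ahead is the simplest form of the software pipelining the slide mentions; a deeper separation would issue several addresses before the first read.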

Page 37: Stream Register Files with Indexed Access

Sensitivity to SRF Access Latency (1)

[Figure: loop length (0.9 to 1.6) vs. index and data separation (0 to 24 cycles) for FFT2D, Rijndael, Sort1, Sort2, Filter, IGraph1, IGraph2]

[Figure: average kernel execution time (0.5 to 1.1) vs. index and data separation (0 to 12 cycles) for FFT2D, Rijndael, Filter, Sort1, Sort2]

Page 38: Stream Register Files with Indexed Access

Sensitivity to SRF Access Latency (2)

[Figure: average kernel execution time (0.88 to 1.02) vs. index and data separation (4 to 28 cycles) for IGraph1, IGraph2]

Page 39: Stream Register Files with Indexed Access

Why Graphics Hardware?

Pentium 4 SSE theoretical*

3GHz * 4 wide * .5 inst / cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed:

MULR R0, R0, R0: 20 GFLOPS

equivalent to a 10 GHz P4

and getting faster: 3x improvement over NV30 (6 months)

*from Intel P4 Optimization Manual

[Figure: GFLOPS (0 to 25) from Jun-01 to Jul-03 for Pentium 4, NV30, and NV35]

Slide from Ian Buck, Stanford University

Page 40: Stream Register Files with Indexed Access

NVIDIA Graphics growth (225%/yr)

Essentially Moore’s Law Cubed.

Season | Product   | Process | # Trans | Gflops | 32-bit AA Fill | Mpolys  | Notes
2H97   | Riva 128  | .35     | 3M      | 5      | 20M            | 3M      | Integrated 2D/3D
1H98   | Riva ZX   | .25     | 5M      | 7      | 31M            | 3M      | AGP2x
2H98   | Riva TNT  | .25     | 7M      | 10     | 50M            | 6M      | 32-bit
1H99   | TNT2      | .22     | 9M      | 15     | 75M            | 9M      | AGP4x
2H99   | GeForce   | .22     | 23M     | 25     | 120M           | 15M     | HW T&L
1H00   | GF2 GTS   | .18     | 25M     | 35     | 200M (1)       | 25M     | Per-Pixel Shading
2H00   | GF2 Ultra | .18     | 25M     | 45     | 250M (1)       | 31M     | 230 MHz DDR
1H01   | GeForce3  | .15     | 57M     | 80     | 500M (1)       | 30M (2) | Programmable

• (1): Dual textured

• (2): Programmable

Slide from Pat Hanrahan, Kurt Akeley

Page 41: Stream Register Files with Indexed Access

NVIDIA Historicals

Season | Product        | MT/s | Yr rate | MF/s | Yr rate
2H97   | Riva 128       | 5    | -       | 100  | -
1H98   | Riva ZX        | 5    | 1.0     | 100  | 1.0
2H98   | Riva TNT       | 5    | 1.0     | 180  | 3.2
1H99   | Riva TNT2      | 8    | 1.0     | 333  | 3.4
2H99   | GeForce        | 15   | 3.5     | 480  | 2.1
1H00   | GeForce2 GTS   | 25   | 2.8     | 666  | 1.9
2H00   | GeForce2 Ultra | 31   | 1.5     | 1000 | 2.3
1H01   | GeForce3       | 40   | 1.7     | 3200 | 10.2
1H02   | GeForce4       | 65   | 1.6     | 4800 | 1.5
       |                |      | 1.8     |      | 2.4

Slide from Pat Hanrahan, Kurt Akeley

Page 42: Stream Register Files with Indexed Access

Base Architecture

• Stream buffers match SRF bandwidth to compute needs

[Figure: compute clusters 0 to 7 and SRF banks 0 to 7 with stream buffers between them; paths labeled 32b and 128b; inter-cluster network]

Page 43: Stream Register Files with Indexed Access

Indexed SRF Architecture

• Address path from clusters

• Lower indexed access bandwidth

[Figure: compute clusters 0 to 7 and SRF banks 0 to 7 with stream buffers and address FIFOs; paths labeled 32b and 128b; inter-cluster network]

Page 44: Stream Register Files with Indexed Access

Base SRF Bank

• Several SRAM sub-arrays

[Figure: SRF bank made of sub-arrays 0 to 3 with local WL drivers, connected to the compute cluster; sub-array 0 is annotated with 256 and 128]

Page 45: Stream Register Files with Indexed Access

Indexed SRF Bank

• Extra 8:1 mux at sub-array output
– Allows 4x 1-word accesses

[Figure: SRF bank with sub-arrays 0 to 3, each with its own pre-decode and row decoder, plus an 8:1 mux, connected to the compute cluster; sub-array 0 is annotated with 256 and 128]

Page 46: Stream Register Files with Indexed Access

Cross-lane Indexed SRF

• Address switch added

• Inter-cluster network used for cross-lane SRF data

[Figure: compute clusters 0 to 7 and SRF banks 0 to 7 with stream buffers and address FIFOs; a path labeled 32b; SRF address network and inter-cluster network]