Stream Register Files with Indexed Access

Nuwan Jayasena, Mattan Erez, Jung Ho Ahn

William J. Dally

HPCA-10 NSJ 2

Scaling Trends

• ILP increasingly harder and more expensive to extract

• Graphics processors exploit data parallelism

[Figure: CPUs, SPECint2000 per MHz (y-axis 0.0 to 0.6), Jan-85 to Jun-01: 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4.]

[Figure: Graphics processors, vertices per MHz (y-axis 0 to 0.8), Jan-85 to Jun-01: nVidia NV10, NV35.]

CPU data courtesy of Francois Labonte, Stanford University

Renewed Interest in Data Parallelism

• Data parallel application classes

– Media, signal, network processing, scientific simulations, encryption, etc.

• High-end vector machines

– Have always been data parallel

• Academic research

– Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.

• “Main-stream” industry

– Sony Emotion Engine, Tarantula, etc.

Storage Hierarchy

• Bandwidth taper

• Only supports sequential streams/vectors

• But many data parallel apps with

– Data reorderings

– Irregular data structures

– Conditional accesses

[Figure: storage hierarchy with DRAM, cache, stream/vector storage, and compute units (+, x).]

Sequential Streams/Vectors Inefficient

Evaluate arbitrary order access to streams

[Figure: timeline across memory/cache, stream/vector storage, and compute units. A 4x4 array a00..a33 is streamed to the compute units to produce b00..b33, which is then streamed again to produce c00..c33. The streams are shown in row-major and column-major order, with a reorder step between the two passes.]

Outline

• Stream processing overview

• Applications

• Implementation

• Results

• Conclusion


Stream Programming

• Streams of records passing through compute kernels

• Parallelism

– Across stream elements

– Across kernels

• Locality

– Within kernels

– Between kernels

[Figure: input streams in1 and in2 pass through three chained FFT_stage kernels to produce the output stream.]
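The kernel-chaining idea on this slide can be sketched in ordinary Python. This is only an illustration: `butterfly_stage`, `run_pipeline`, and the `Record` type are hypothetical names, and the stage is a placeholder add/subtract step rather than a real FFT stage.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[float, float]  # hypothetical record type: a pair of values


def butterfly_stage(stream: Iterable[Record]) -> Iterator[Record]:
    """Placeholder kernel: one add/subtract butterfly per record.

    It stands in for an FFT_stage kernel and only shows that a kernel
    consumes one stream of records and produces another.
    """
    for a, b in stream:
        yield (a + b, a - b)


def run_pipeline(
    stream: Iterable[Record],
    kernels: Iterable[Callable[[Iterable[Record]], Iterator[Record]]],
) -> list:
    """Chain kernels so each kernel's output stream feeds the next kernel."""
    for kernel in kernels:
        stream = kernel(stream)
    return list(stream)


records = [(1.0, 2.0), (3.0, 4.0)]
out = run_pipeline(records, [butterfly_stage] * 3)
print(out)
```

Each record is processed independently (parallelism across stream elements), and intermediate streams are handed directly from one kernel to the next (locality between kernels).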


Bandwidth Hierarchy

• Stream programming is well matched to bandwidth hierarchy

[Figure: timeline of three FFT_stage kernels across memory, stream register file (SRF), and compute units.]

Stream Processors

• Several lanes

– Execute in SIMD

– Operate on records

• Inter-cluster network

[Figure: lane 0 through lane N-1, each with a compute cluster and an SRF bank, together with the inter-cluster network, the memory switch, and the memory system.]


Stream-Level Data Reuse

• Sequential streams only capture in-order reuse

• Arbitrary access patterns in SRF capture more of available temporal locality

Stream data reuse

– Sequential (in-order) reuse, e.g.: linear streams

– Non-sequential reuse

  – Reordered reuse, e.g.: 2-D, 3-D accesses, multi-grid

  – Intra-stream reuse, e.g.: irregular neighborhoods, table lookups

Reordered Reuse

[Figure: timeline across memory/cache, stream register file (SRF), and compute clusters. A 1D FFT pass over the 4x4 array a00..a33 produces b00..b33, a reorder step changes the order in which b is streamed, and a second 1D FFT pass produces c00..c33.]

• Indexed SRF access eliminates reordering through memory
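A minimal sketch of the reordering point above, assuming a 4x4 array held row-major in a flat buffer that stands in for the SRF. The function names and the "two transfers per word" cost model for the memory round trip are assumptions made for illustration, not figures from the talk.

```python
ROWS, COLS = 4, 4

# Flat buffer standing in for one array held in the SRF, stored row-major.
srf = [f"b{r}{c}" for r in range(ROWS) for c in range(COLS)]


def column_major_via_memory(buf):
    """Baseline: write the array out, reorder it, and reload it.

    Returns the reordered stream and the number of words moved to and from
    memory (one store plus one load per word in this simple model).
    """
    reordered = [buf[r * COLS + c] for c in range(COLS) for r in range(ROWS)]
    words_moved = 2 * len(buf)
    return reordered, words_moved


def column_major_indexed(buf):
    """Indexed access: compute the address of each element and read in place."""
    stream = []
    for c in range(COLS):
        for r in range(ROWS):
            index = r * COLS + c  # address computed by the kernel
            stream.append(buf[index])
    return stream, 0  # no memory round trip in this model


via_memory, moved = column_major_via_memory(srf)
indexed, moved_indexed = column_major_indexed(srf)
print(indexed[:4], moved, moved_indexed)
```

Both routines deliver the same column-order stream to the second pass; the indexed version gets there with address arithmetic alone.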


Intra-stream Reuse

[Figure: records A to H over time, across memory/cache, stream register file (SRF), and compute clusters. With sequential access, reused records such as A, B, and D are replicated in the stream before the compute step.]

• Indexed SRF access eliminates

– Replication in SRF

– Redundant memory transfers
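The replication cost can be illustrated with a small sketch. The node values and neighbor lists below are hypothetical (they are not the graph from the slide), and `replicated_stream`, `indexed_layout`, and `neighbor_sums` are illustrative names.

```python
# Hypothetical node records and neighbor lists (not taken from the slides).
node_data = {"A": 1, "B": 2, "C": 3, "D": 4}
neighbors = {"A": ["B", "D"], "B": ["A", "C", "D"], "C": ["B"], "D": ["A", "B"]}


def replicated_stream(data, nbrs):
    """Sequential-only SRF: every neighbor reference becomes its own copy."""
    return [data[n] for node in nbrs for n in nbrs[node]]


def indexed_layout(data, nbrs):
    """Indexed SRF: store each record once and keep a stream of indices."""
    order = sorted(data)
    slot = {name: i for i, name in enumerate(order)}
    records = [data[name] for name in order]
    indices = [[slot[n] for n in nbrs[node]] for node in order]
    return records, indices


def neighbor_sums(records, indices):
    """Kernel body: sum each node's neighbors through indexed reads."""
    return [sum(records[i] for i in idx) for idx in indices]


copies = replicated_stream(node_data, neighbors)
records, indices = indexed_layout(node_data, neighbors)
sums = neighbor_sums(records, indices)
print(len(copies), len(records), sums)
```

In this toy case the replicated stream holds 8 record copies where the indexed layout holds 4 records. The index stream still takes space, so the saving grows with record size.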


Conditional Accesses

• Fine-grain conditional accesses

– Expensive in SIMD architectures

– Translate to conditional address computation
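One way to read "conditional address computation" is shown below for a two-way merge, the core step of the merge-sort benchmark. This is a sketch under that reading, not the authors' kernel; `merge_indexed` is a hypothetical name.

```python
def merge_indexed(a, b):
    """Merge two sorted lists using only indexed reads.

    The choice "read from a or from b" is folded into address arithmetic:
    each step reads both candidates and advances exactly one index.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        x, y = a[i], b[j]          # two indexed reads
        take_a = x <= y            # comparison result, used as 1 or 0 below
        out.append(x if take_a else y)
        i += int(take_a)           # conditional address update, no conditional read
        j += 1 - int(take_a)
    out.extend(a[i:])              # drain whichever input remains
    out.extend(b[j:])
    return out


merged = merge_indexed([1, 4, 9], [2, 3, 10, 11])
print(merged)
```

Every iteration performs the same reads regardless of the comparison outcome, which is the property that suits SIMD execution; only the next addresses differ.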


Base Architecture

• Each SRF bank accesses block of b contiguous words

[Figure: compute cluster 0 with SRF bank 0 through SRF bank (N-1); the bank data path is labeled b*W.]

Indexed SRF Architecture

• Address path from clusters

• Lower indexed access bandwidth


[Figure: compute cluster 0 and SRF banks 0 through (N-1), with address FIFOs on the address path from the clusters.]

Base SRF Bank

• Several SRAM sub-arrays

• Each access is to one sub-array

[Figure: a compute cluster attached to an SRF bank made up of sub-arrays 0 to 3 with local word-line drivers.]

Indexed SRF Bank

• Extra 8:1 mux at sub-array output

– Allows 4x 1-word accesses

[Figure: a compute cluster attached to an SRF bank of sub-arrays 0 to 3, each with its own pre-decode and row decoder, and an output mux.]
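A rough model of what "4x 1-word accesses" implies for throughput, under two assumptions that the slides do not state: consecutive word addresses rotate across the four sub-arrays, and each sub-array returns at most one word per cycle with no ordering constraints. The function names are hypothetical.

```python
from collections import Counter

NUM_SUB_ARRAYS = 4  # from the slide: four sub-arrays per SRF bank


def sub_array_of(address, num_sub_arrays=NUM_SUB_ARRAYS):
    """Assumed mapping: consecutive words rotate across sub-arrays."""
    return address % num_sub_arrays


def cycles_for_batch(addresses):
    """Cycles to serve a batch if each sub-array returns one word per cycle.

    With no ordering constraints, the busiest sub-array sets the time.
    """
    if not addresses:
        return 0
    load = Counter(sub_array_of(a) for a in addresses)
    return max(load.values())


spread = cycles_for_batch([0, 5, 10, 15])    # one access per sub-array
clustered = cycles_for_batch([0, 4, 8, 12])  # all four map to sub-array 0
print(spread, clustered)
```

Under these assumptions, four indexed reads finish in one cycle when they fall in different sub-arrays and take four cycles when they collide in one.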


Cross-lane Indexed SRF

• Address switch added

• Inter-cluster network used for cross-lane SRF data


[Figure: compute clusters and SRF banks with address FIFOs, connected by the SRF address network.]

Overhead - Area

• In-lane indexing overheads

– 11% over sequential SRF

  • Per-sub-array independent addressing overheads

• Cross-lane indexing overheads

– 22% over sequential SRF

  • Address switch

• 1.5% to 3% increase in die area (Imagine processor)

Overhead - Energy

• 0.1 nJ (0.13 µm) per indexed SRF access

• ~4x sequential SRF access

• More than an order of magnitude lower than DRAM access

• 0.25 nJ per cache access

• Each indexed access replaces many SRF and DRAM/cache accesses
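The figures on this slide can be combined in a short calculation. The per-access energies come from the slide; the sequential value is derived from the "~4x" ratio, the DRAM value is only the lower bound implied by "more than an order of magnitude", and the workload size is hypothetical.

```python
E_INDEXED_NJ = 0.1                  # per indexed SRF access (from the slide)
E_SEQUENTIAL_NJ = E_INDEXED_NJ / 4  # derived: indexed is ~4x sequential
E_CACHE_NJ = 0.25                   # per cache access (from the slide)
E_DRAM_MIN_NJ = 10 * E_INDEXED_NJ   # lower bound only: DRAM is more than 10x indexed


def lookup_energy_nj(num_lookups, energy_per_access_nj):
    """Total energy for a number of accesses at a fixed per-access cost."""
    return num_lookups * energy_per_access_nj


lookups = 1000  # hypothetical workload size
indexed_total = lookup_energy_nj(lookups, E_INDEXED_NJ)
cache_total = lookup_energy_nj(lookups, E_CACHE_NJ)
print(indexed_total, cache_total, cache_total / indexed_total)
```

For the same number of lookups, serving them from the indexed SRF costs 2.5x less energy than serving them from the cache in this model, before counting any accesses that indexing removes altogether.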


Benchmarks

• 64x64 2D FFT

– 2D accesses

• Rijndael (AES)

– Table lookups

• Merge-sort

– Fine-grain conditionals

• 5x5 convolution filter

– Regular neighborhood

• Irregular graph

– Irregular neighborhood access

– Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips

Machine Organizations

[Figure: three machine organizations. Base (sequential SRF): compute clusters, SRF banks, inter-cluster net, memory switch, DRAM. Base + Cache: the Base components plus a cache. Indexed SRF: the Base components plus an SRF address net.]

Machine Parameters

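Parameter    Base                     Base + cache             Indexed SRF
Technology   0.13 µm, 1 GHz           0.13 µm, 1 GHz           0.13 µm, 1 GHz
Compute      8 compute clusters,      8 compute clusters,      8 compute clusters,
             32 GFLOPS (peak)         32 GFLOPS (peak)         32 GFLOPS (peak)
SRF          128 KB, 128 GB/s seq.    128 KB, 128 GB/s seq.    128 KB, 128 GB/s seq.,
                                                               128 GB/s in-lane,
                                                               32 GB/s x-lane
Cache        -                        128 KB, 16 GB/s          -
DRAM         9.14 GB/s                9.14 GB/s                9.14 GB/s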
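A few per-cycle figures follow directly from these parameters. The sketch below assumes 1 GB/s = 10^9 bytes/s and 32-bit (4-byte) words; both are assumptions, and the variable names are illustrative.

```python
CLOCK_HZ = 1e9          # 1 GHz
CLUSTERS = 8            # compute clusters (lanes)
PEAK_FLOPS = 32e9       # 32 GFLOPS peak
SRF_SEQ_BPS = 128e9     # sequential SRF bandwidth
SRF_XLANE_BPS = 32e9    # cross-lane indexed SRF bandwidth
CACHE_BPS = 16e9
DRAM_BPS = 9.14e9
WORD_BYTES = 4          # assumption: 32-bit words

# Peak floating-point operations per cluster per cycle.
flops_per_cluster_per_cycle = PEAK_FLOPS / (CLUSTERS * CLOCK_HZ)

# Words delivered per lane per cycle for sequential and cross-lane access.
seq_words_per_lane_per_cycle = SRF_SEQ_BPS / CLOCK_HZ / CLUSTERS / WORD_BYTES
xlane_words_per_lane_per_cycle = SRF_XLANE_BPS / CLOCK_HZ / CLUSTERS / WORD_BYTES

# Bandwidth taper relative to DRAM.
srf_to_dram = SRF_SEQ_BPS / DRAM_BPS
cache_to_dram = CACHE_BPS / DRAM_BPS

print(flops_per_cluster_per_cycle, seq_words_per_lane_per_cycle,
      xlane_words_per_lane_per_cycle, round(srf_to_dram, 2), round(cache_to_dram, 2))
```

Under these assumptions each lane can read four words per cycle sequentially or in-lane and one word per cycle across lanes, and the SRF offers roughly 14x the DRAM bandwidth while the cache offers roughly 1.75x.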


Off-chip Memory Bandwidth

[Figure: normalized memory traffic (y-axis 0 to 1) for ISRF and Cache on FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, and IG_SCL.]

Execution Time

[Figure: normalized execution time (y-axis 0 to 1) for Base, Cache, and ISRF on FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, and IG_SCL, broken down into kernel loop body, memory stall, SRF stall, and kernel overheads.]


Conclusions

• Data parallelism increasingly important

• Current data parallel architectures inefficient for some application classes

– Irregular accesses

• Indexed SRF accesses

– Reduce memory traffic

– Reduce SRF data replication

– Efficiently support complex/conditional stream accesses

• Performance improvements

– 3% to 410% for target application classes

• Low implementation overhead

– 1.5% to 3% die area

Backups


Indexed Access Instruction Overhead

[Figure: relative instruction counts of SRF indexed kernels (y-axis 0 to 1.2) for FFT2D, Rijndael, Sort, IG_1, and IG_2.]

• Excludes address issue instructions

Kernel C API

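Kernel loop with an indexed lookup:

    while (!eos(in)) {
        in >> a;
        LUT[a] >> b;
        c = foo(a, b);
        out << c;
    }

The indexed read LUT[a] >> b as separate address issue and data read:

    LUT.index << a;
    // independent instructions
    LUT >> b;

• 2 separate instructions

– Address issue

– Data read

• Address-data separation

– May require loop unrolling, software pipelining, etc.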
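The split between address issue and data read can be mimicked in Python to show why software pipelining helps. This is a toy model, not the Kernel C API: the `IndexedStream` class, its `issue` and `read` methods, and the body of `foo` are all hypothetical.

```python
from collections import deque


class IndexedStream:
    """Toy model of an indexed stream with split address issue and data read."""

    def __init__(self, table):
        self.table = table
        self.pending = deque()  # stands in for the address FIFO

    def issue(self, address):
        """Address issue: queue the index; no data is returned yet."""
        self.pending.append(address)

    def read(self):
        """Data read: return the word for the oldest outstanding address."""
        return self.table[self.pending.popleft()]


def foo(a, b):
    """Hypothetical kernel computation (the slide does not define foo)."""
    return a + b


def kernel(inputs, table):
    """Software-pipelined loop: issue the next address before reading current data."""
    lut = IndexedStream(table)
    out = []
    if not inputs:
        return out
    lut.issue(inputs[0])                  # prologue: first address
    for i, a in enumerate(inputs):
        if i + 1 < len(inputs):
            lut.issue(inputs[i + 1])      # address for the next iteration
        b = lut.read()                    # data for this iteration
        out.append(foo(a, b))
    return out


result = kernel([2, 0, 1], [10, 20, 30])
print(result)
```

Issuing the address one iteration ahead leaves a full loop body of independent work between the address issue and the matching data read, which is the separation the slide refers to.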


Sensitivity to SRF Access Latency (1)

[Figure: loop length (y-axis 0.9 to 1.6) versus index and data separation in cycles (x-axis 0 to 24) for FFT2D, Rijndael, Sort1, Sort2, Filter, IGraph1, and IGraph2.]

[Figure: average kernel execution time (y-axis 0.5 to 1.1, x-axis 0 to 12) for FFT2D, Rijndael, Filter, Sort1, and Sort2.]

Sensitivity to SRF Access Latency (2)

[Figure: average kernel execution time (y-axis 0.88 to 1.02, x-axis 4 to 28) for IGraph1 and IGraph2.]

Why Graphics Hardware?

Pentium 4 SSE theoretical*

3GHz * 4 wide * .5 inst / cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed:

MULR R0, R0, R0: 20 GFLOPS

equivalent to a 10 GHz P4

and getting faster: 3x improvement over NV30 (6 months)

*from Intel P4 Optimization Manual

[Figure: GFLOPS (y-axis 0 to 25) from Jun-01 to Jul-03 for Pentium 4, NV30, and NV35.]

Slide from Ian Buck, Stanford University


NVIDIA Graphics growth (225%/yr)

Essentially Moore’s Law Cubed.

Season  Product    Process  # Trans  Gflops  32-bit AA Fill  Mpolys   Notes
2H97    Riva 128   .35      3M       5       20M             3M       Integrated 2D/3D
1H98    Riva ZX    .25      5M       7       31M             3M       AGP2x
2H98    Riva TNT   .25      7M       10      50M             6M       32-bit
1H99    TNT2       .22      9M       15      75M             9M       AGP4x
2H99    GeForce    .22      23M      25      120M            15M      HW T&L
1H00    GF2 GTS    .18      25M      35      200M [1]        25M      Per-Pixel Shading
2H00    GF2 Ultra  .18      25M      45      250M [1]        31M      230 MHz DDR
1H01    GeForce3   .15      57M      80      500M [1]        30M [2]  Programmable

[1] Dual textured

[2] Programmable

Slide from Pat Hanrahan, Kurt Akeley


NVIDIA Historicals

Season  Product          MT/s  Yr rate  MF/s  Yr rate
2H97    Riva 128         5     -        100   -
1H98    Riva ZX          5     1.0      100   1.0
2H98    Riva TNT         5     1.0      180   3.2
1H99    Riva TNT2        8     1.0      333   3.4
2H99    GeForce          15    3.5      480   2.1
1H00    GeForce2 GTS     25    2.8      666   1.9
2H00    GeForce2 Ultra   31    1.5      1000  2.3
1H01    GeForce3         40    1.7      3200  10.2
1H02    GeForce4         65    1.6      4800  1.5
                               1.8            2.4

Slide from Pat Hanrahan, Kurt Akeley

Base Architecture

• Stream buffers match SRF bandwidth to compute needs

[Figure: compute clusters 0 to 7, each connected to its SRF bank (0 to 7) through stream buffers; the paths are labeled 32b and 128b.]

Indexed SRF Architecture

• Address path from clusters

• Lower indexed access bandwidth

[Figure: compute clusters 0 to 7 and SRF banks 0 to 7 with stream buffers and address FIFOs; the paths are labeled 32b and 128b.]

Base SRF Bank

• Several SRAM sub-arrays

[Figure: a compute cluster attached to an SRF bank of sub-arrays 0 to 3 with local word-line (WL) drivers; sub-array 0 carries the labels 256 and 128.]

Indexed SRF Bank

• Extra 8:1 mux at sub-array output

– Allows 4x 1-word accesses

[Figure: a compute cluster attached to an SRF bank of sub-arrays 0 to 3, each with its own pre-decode and row decoder, and an 8:1 mux at the output; sub-array 0 carries the labels 256 and 128.]

Cross-lane Indexed SRF

• Address switch added

• Inter-cluster network used for cross-lane SRF data

[Figure: compute clusters 0 to 7 and SRF banks 0 to 7 with stream buffers and address FIFOs, connected by the SRF address network; a path is labeled 32b.]
