
The Imagine Stream Processor: Flexibility with Performance

March 30, 2001

William J. Dally
Computer Systems Laboratory

Stanford [email protected]

Convergence Workshop, March 30, 2001

Outline

• Motivation
  – We need low-power, programmable TeraOps

• The problem is bandwidth
  – Growing gap between special-purpose and general-purpose hardware
  – It's easy to make ALUs, hard to keep them fed

• A stream processor gives programmable bandwidth
  – Streams expose locality and concurrency in the application
  – A bandwidth hierarchy exploits this

• Imagine is a 20GFLOPS prototype stream processor

• Many opportunities to do better
  – Scaling up
  – Simplifying programming


Motivation

• Some things I’d like to do with a few TeraOps
  – Have a realistic face-to-face meeting with someone in Boston without riding an airplane
    • 4-8 cameras, extract depth, fit model, compress, render to several screens
  – High-quality rendering at video rates
    • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s


The good news – FLOPS are cheap, OPS are cheaper

• 32-bit FPU – 2GFLOPS/mm² – 400GFLOPS/chip

• 16-bit add – 40GOPS/mm² – 8TOPS/chip

[Figure: layout of an integer adder next to its local register file (Local RF), with labeled dimensions of 460 µm and 146.7 µm]
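The per-chip figures follow from the per-area figures if the same silicon area is assumed in both cases. A quick check of that arithmetic is sketched below; the 200 mm² result is inferred from the quoted numbers and is not stated on the slide.

```python
# Densities and per-chip totals quoted on the slide.
fpu_gflops_per_mm2 = 2.0
fpu_gflops_per_chip = 400.0
add_gops_per_mm2 = 40.0
add_gops_per_chip = 8000.0  # 8 TOPS expressed in GOPS

# Implied arithmetic area per chip (an inference, not a slide figure).
fpu_area_mm2 = fpu_gflops_per_chip / fpu_gflops_per_mm2
add_area_mm2 = add_gops_per_chip / add_gops_per_mm2

print(fpu_area_mm2, add_area_mm2)  # prints "200.0 200.0"
```

Both rows imply the same area, which suggests the two per-chip numbers were derived from one die-area assumption.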


The bad news – General purpose processors can’t harness this

[Figure: FLOPS (log scale, 1e+8 to 1e+15) versus year (2001 to 2011), with curves labeled FLOPS, GP-Peak, and GP-Useful]


Why do Special-Purpose Processors Perform Well?

Fed by dedicated wires/memories

Lots (100s) of ALUs


Care and Feeding of ALUs

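[Figure: an ALU and the structures that feed it: registers (Regs) supplying data bandwidth; instruction cache (Instr. Cache), IR, and IP supplying instruction bandwidth]

‘Feeding’ structure dwarfs ALU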


The problem is bandwidth

• Can we solve this bandwidth problem without sacrificing programmability?


Streams expose locality and concurrency

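[Figure: stream dataflow for depth extraction: Image 0 and Image 1 each pass through two convolve kernels, and both results feed a SAD kernel that produces the Depth Map]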

Operations within a kernel operate on local data

Streams expose data parallelism

Kernels can be partitioned across chips to exploit control parallelism
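The kernel-and-stream structure can be sketched in ordinary Python, with generators standing in for streams. This is only an illustration of the programming model, not Imagine's actual stream language: the function names, the 1-D per-row convolution, and the tiny test images are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence

def convolve(stream: Iterable[Sequence[int]], taps: Sequence[int]) -> Iterator[List[int]]:
    """Kernel: filter each record (row) of the stream with a 1-D tap vector.

    The taps used below are symmetric, so correlation and convolution agree.
    Edge pixels are handled by clamping the index.
    """
    half = len(taps) // 2
    for row in stream:
        out = []
        for i in range(len(row)):
            acc = 0
            for k, t in enumerate(taps):
                j = min(max(i + k - half, 0), len(row) - 1)
                acc += t * row[j]
            out.append(acc)
        yield out  # all intermediate values stayed local to the kernel

def sad(left: Iterable[Sequence[int]], right: Iterable[Sequence[int]]) -> Iterator[int]:
    """Kernel: sum of absolute differences between paired rows of two streams."""
    for a, b in zip(left, right):
        yield sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical 2-row images; each row is one stream record.
image0 = [[1, 2, 3, 4], [4, 3, 2, 1]]
image1 = [[1, 2, 3, 5], [4, 3, 2, 1]]
blur = [1, 2, 1]
identity = [0, 1, 0]

# Same shape as the figure: two convolve kernels per image, then SAD.
scores = list(sad(convolve(convolve(image0, blur), identity),
                  convolve(convolve(image1, blur), identity)))
print(scores)  # prints "[4, 0]"
```

Each kernel touches only its own record and local accumulator, and records are independent of one another, which is the locality and data parallelism the slide refers to.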


A Bandwidth Hierarchy exploits locality and concurrency

• VLIW clusters with shared control

• 41.2 32-bit operations per word of memory bandwidth

[Figure: bandwidth hierarchy: four SDRAMs (2GB/s) feed the Stream Register File (32GB/s), which feeds the ALU clusters (544GB/s)]
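The steepness of the hierarchy is easier to see as ratios. The sketch below reads the figure labels as 2 GB/s at memory, 32 GB/s at the stream register file, and 544 GB/s inside the ALU clusters; that assignment is an interpretation of the figure.

```python
# Peak bandwidth at each level, in GB/s (assignment to levels is assumed).
peak_gb_per_s = {
    "memory (SDRAM)": 2.0,
    "stream register file": 32.0,
    "ALU clusters": 544.0,
}

base = peak_gb_per_s["memory (SDRAM)"]
ratios = {level: bw / base for level, bw in peak_gb_per_s.items()}

for level, ratio in ratios.items():
    print(f"{level}: {ratio:g}x memory bandwidth")
```

Under this reading each step down the hierarchy multiplies the available bandwidth by 16 or more, so an application has to keep most of its traffic in the lower levels.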


Bandwidth Usage

Application          Memory BW    Global RF BW    Local RF BW
Depth Extractor      0.80 GB/s    18.45 GB/s      210.85 GB/s
MPEG Encoder         0.47 GB/s     2.46 GB/s      121.05 GB/s
Polygon Rendering    0.78 GB/s     4.06 GB/s      102.46 GB/s
QR Decomposition     0.46 GB/s     3.67 GB/s      234.57 GB/s
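One way to read the table is to compare each measured bandwidth with the peak at that level, and to look at how much more traffic the local register files carry than memory does. The sketch below assumes the peaks are 2, 32, and 544 GB/s for memory, global RF, and local RF respectively, taken from the hierarchy figure.

```python
# Peak bandwidths in GB/s (assumed mapping from the hierarchy figure).
PEAK = {"memory": 2.0, "global_rf": 32.0, "local_rf": 544.0}

# Measured bandwidths in GB/s, copied from the table: (memory, global RF, local RF).
USAGE = {
    "Depth Extractor": (0.80, 18.45, 210.85),
    "MPEG Encoder": (0.47, 2.46, 121.05),
    "Polygon Rendering": (0.78, 4.06, 102.46),
    "QR Decomposition": (0.46, 3.67, 234.57),
}

def utilization(app: str) -> tuple:
    """Fraction of peak bandwidth used at each level."""
    mem, grf, lrf = USAGE[app]
    return (mem / PEAK["memory"], grf / PEAK["global_rf"], lrf / PEAK["local_rf"])

def local_to_memory_ratio(app: str) -> float:
    """Local RF bytes moved per byte of memory traffic."""
    mem, _, lrf = USAGE[app]
    return lrf / mem

for app in USAGE:
    u = utilization(app)
    print(f"{app}: mem {u[0]:.0%}, global RF {u[1]:.0%}, local RF {u[2]:.0%}, "
          f"local/mem {local_to_memory_ratio(app):.0f}x")
```

In every row the local register files carry over a hundred times the memory traffic, which is the locality the hierarchy is built to capture.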

[Figure: the bandwidth hierarchy: four SDRAMs (2GB/s), Stream Register File (32GB/s), ALU clusters (544GB/s)]


The Imagine Stream Processor

[Figure: block diagram of the Imagine Stream Processor: Host Processor, Stream Controller, Microcontroller, Network Interface (to the Network), Stream Register File, ALU Clusters 0 to 7, and a Streaming Memory System connected to four SDRAMs]


Arithmetic Clusters

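[Figure: one arithmetic cluster: function units + + + * * / and a CU, each with a Local Register File, connected through a Cross Point switch, with paths From SRF, To SRF, and to the Intercluster Network]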


Performance

[Figure: bar chart of performance in GOPS (axis 0 to 30)]

Benchmark    Type                          GOPS
depth        16-bit application            12.1
mpeg         16-bit application            19.8
qrd          floating-point application    11.0
dct          16-bit kernel                 23.9
convolve     16-bit kernel                 25.6
fft          floating-point kernel          7.0


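Power

[Figure: stacked bar chart of power in Watts (axis 0.0 to 3.0) per benchmark, broken down into Clock, Clust, SRF, Pins, Mem Sys, and Other; an accompanying breakdown shows shares of 23%, 63%, 6%, 5%, 1%, and 2%]

Benchmark    GOPS/W
depth         4.6
mpeg         10.7
qrd           4.1
dct          10.2
convolve      9.6
fft           2.4
average       6.9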
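Dividing each benchmark's GOPS by its GOPS/W gives the implied power draw. The sketch below assumes the performance values pair with the benchmarks in chart order (depth, mpeg, qrd, dct, convolve, fft); that pairing is inferred from the chart layout.

```python
# Performance in GOPS (assumed pairing with benchmarks, in chart order).
gops = {"depth": 12.1, "mpeg": 19.8, "qrd": 11.0,
        "dct": 23.9, "convolve": 25.6, "fft": 7.0}

# Efficiency in GOPS/W, from the power slide.
gops_per_watt = {"depth": 4.6, "mpeg": 10.7, "qrd": 4.1,
                 "dct": 10.2, "convolve": 9.6, "fft": 2.4}

# Implied power in Watts for each benchmark.
watts = {name: gops[name] / gops_per_watt[name] for name in gops}

# Arithmetic mean of the six efficiencies, to compare with the "average" column.
mean_efficiency = sum(gops_per_watt.values()) / len(gops_per_watt)

for name, w in watts.items():
    print(f"{name}: {w:.2f} W")
print(f"mean efficiency: {mean_efficiency:.1f} GOPS/W")
```

All six implied values fall inside the chart's 0 to 3 W axis, and the mean of the six efficiencies rounds to the 6.9 GOPS/W shown as the average.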


A Look Inside an Application: Stereo Depth Extraction

• 320x240 8-bit grayscale images

• 30 disparity search

• 220 frames/second

• 12.7 GOPS

• 5.7 GOPS/W
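These figures can be combined into per-pixel and power estimates. The derived quantities below are back-of-the-envelope values, not numbers from the slide.

```python
# Figures quoted on the slide.
width, height = 320, 240
frames_per_s = 220
gops = 12.7
gops_per_watt = 5.7

# Derived quantities (estimates only).
pixels_per_s = width * height * frames_per_s   # output pixels processed per second
ops_per_pixel = gops * 1e9 / pixels_per_s      # operations spent per output pixel
implied_watts = gops / gops_per_watt           # power implied by the efficiency

print(pixels_per_s)             # prints "16896000"
print(round(ops_per_pixel))     # prints "752"
print(round(implied_watts, 2))  # prints "2.23"
```

Several hundred operations per pixel is consistent with an application that filters each image and then evaluates 30 candidate disparities for every pixel.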

[Figure: two execution timelines with columns Clusters, Mem_0, and Mem_1. Convolution phase, cycles 21300 to 23600: LOAD, UNPACK, CONV 7x7, CONV 3x3, and STORE entries. Disparity search phase, cycles 501300 to 503300: Load, BlockSAD, and Store entries.]

Stereo Depth Extractor

Convolutions
• Load original packed row
• Unpack (8bit -> 16 bit)
• 7x7 Convolve
• 3x3 Convolve
• Store convolved row

Disparity Search
• Load convolved rows
• Calculate BlockSADs at different disparities
• Store best disparity values
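Two of the steps above, the unpack and the disparity search, can be sketched in a few lines. Everything here is illustrative: the byte order within a packed word, the 1-D blocks, and the function names are assumptions, and the real kernels operate on 2-D blocks across eight clusters.

```python
def unpack_8_to_16(words):
    """Split each packed 32-bit word into four 8-bit pixels, widened to plain ints.

    Assumed byte order: lowest byte first.
    """
    pixels = []
    for w in words:
        for shift in (0, 8, 16, 24):
            pixels.append((w >> shift) & 0xFF)
    return pixels

def best_disparities(left, right, max_disp, block):
    """For each block of `left`, return the shift into `right` with the smallest SAD."""
    best = []
    for start in range(0, len(left) - block + 1, block):
        scores = []
        for d in range(max_disp + 1):
            if start + d + block > len(right):
                break  # candidate block would run off the end of the row
            scores.append(sum(abs(left[start + i] - right[start + d + i])
                              for i in range(block)))
        # Index of the minimum score; ties go to the smallest disparity.
        best.append(min(range(len(scores)), key=scores.__getitem__))
    return best

row = unpack_8_to_16([0x04030201, 0x08070605])
print(row)  # prints "[1, 2, 3, 4, 5, 6, 7, 8]"

left = [10, 20, 30, 40]
right = [0, 10, 20, 30, 40, 0]   # same content shifted right by one pixel
print(best_disparities(left, right, max_disp=2, block=2))  # prints "[1, 1]"
```

Each block's candidate SADs are independent of the others, which is why the disparity search spreads naturally across parallel clusters.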


7x7 Convolve Kernel

[Figure: two VLIW schedules for the 7x7 convolve kernel. Columns are the function-unit slots ADD0, ADD1, ADD2, MUL0, MUL1, DIV0, INP0 to INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, and VAL0; each row is one instruction word. Operations shown include IMULRND16, IADDS16, SELECT, NSELECT, SHUFFLE, PASS, SHIFTA16, COMMUCPERM, COMMUCDATA, DATA_IN, DATA_OUT, SPCREAD_WT, SPCWRITE, GEN_CISTATE, GEN_CCEND, COND_IN_D, CHK_ANY, and LOOP.]
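For reference, the computation being scheduled is a multiply-accumulate over a 7x7 window. The sketch below is a plain scalar version with a saturating 16-bit accumulator. The saturation is an assumption suggested by the 16-bit mnemonics, and nothing here reproduces the actual semantics of the Imagine operations.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def sat16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))

def convolve2d(image, kernel):
    """Same-size 2-D filter with zero padding and a saturating accumulator.

    The kernel is applied without flipping, so for symmetric kernels
    (as in the test data) this matches convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    yy = y + i - kh // 2
                    xx = x + j - kw // 2
                    if 0 <= yy < h and 0 <= xx < w:
                        acc = sat16(acc + image[yy][xx] * kernel[i][j])
            out[y][x] = acc
    return out

ones7 = [[1] * 7 for _ in range(7)]
result = convolve2d(ones7, ones7)
print(result[3][3], result[0][0])  # prints "49 16"
```

Each output pixel needs 49 multiplies and 48 adds, which is why the schedules are dominated by multiply and add operations.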


Imagine gives high performance with low power and flexible programming

• Matches capabilities of communication-limited technology to demands of signal and image processing applications

• Performance
  – compound stream operations realize >10GOPS on key applications
  – can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)

• Power
  – three-level register hierarchy gives 2-10GOPS/W

• Flexibility
  – programmed in “C”
  – streaming model
  – conditional stream operations enable applications like sort
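To see why conditional stream operations matter for something like sort, consider a kernel that routes each record to one of two output streams depending on a data-dependent test. The sketch below uses that idea to build a partition-based sort. It illustrates the concept only; the function names are hypothetical and this is not how Imagine implements conditional streams.

```python
def conditional_split(stream, predicate):
    """Kernel with two conditional outputs: each record goes to exactly one stream."""
    taken, not_taken = [], []
    for record in stream:
        (taken if predicate(record) else not_taken).append(record)
    return taken, not_taken

def stream_sort(stream):
    """Sort by repeatedly splitting the stream around a pivot record."""
    data = list(stream)
    if len(data) <= 1:
        return data
    pivot, rest = data[0], data[1:]
    lower, upper = conditional_split(rest, lambda r: r < pivot)
    return stream_sort(lower) + [pivot] + stream_sort(upper)

print(stream_sort([5, 2, 9, 1, 5, 6]))  # prints "[1, 2, 5, 5, 6, 9]"
```

The output stream lengths depend on the data, which is what a purely fixed-rate stream model cannot express.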


A look forward

• Next steps
  – Build some Imagine prototypes
    • Dual-processor 40GFLOPS systems, 64-processor TeraFLOPS systems

• Longer term
  – ‘Industrial Strength’ Imagine – 100-200GFLOPS/chip
    • Multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth
  – Graphics extensions
    • Texture cache, raster unit – as SRF clients
  – A streaming supercomputer
    • 64-bit FP, high-bandwidth global memory, MIMD extensions
  – Simplified stream programming
    • Automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, staging of data.
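The system-level numbers under "Next steps" follow from the 20GFLOPS per-chip peak if peak performance is assumed to scale linearly with processor count. That is an idealization which ignores inter-chip communication.

```python
chip_peak_gflops = 20  # per-chip peak quoted for the Imagine prototype

# Processor counts for the two planned systems.
systems = {"dual-processor": 2, "64-processor": 64}

# Ideal aggregate peak, assuming perfectly linear scaling.
peak_gflops = {name: n * chip_peak_gflops for name, n in systems.items()}

for name, gflops in peak_gflops.items():
    print(f"{name}: {gflops} GFLOPS peak")
```

The 64-processor case comes to 1280 GFLOPS, which is where the TeraFLOPS label comes from.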


Take home message

• VLSI technology enables us to put TeraOPS on a chip

• Conventional general-purpose architecture cannot exploit this
  – The problem is bandwidth

• Casting an application as kernels operating on streams exposes locality and concurrency

• A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth
  – Bandwidth hierarchy, compound stream operations

• Imagine is a prototype stream processor
  – One chip – 20GFLOPS peak, 10GFLOPS sustained, 4W
  – Systems scale to TeraFLOPS and more.
