Stream Architecture: Rethinking Media Processor Design

Stream Architecture: Rethinking Media Processor Design Rice University Computer Systems Laboratory Scott Rixner April 9, 2001


Page 1: Stream Architecture: Rethinking Media Processor Design

Stream Architecture: Rethinking Media Processor Design

Rice University

Computer Systems Laboratory

Scott Rixner

April 9, 2001

Page 2: Stream Architecture: Rethinking Media Processor Design

Scott Rixner Stream Architecture 2

Media Processing

Video/image compression & decompression
– MPEG, JPEG, ...

Signal Processing
– DSL modems, cellular base stations, ...

Image synthesis
– Polygon rendering, image-based rendering, ...

Image understanding
– Face recognition, depth extraction, ...

Page 3: Stream Architecture: Rethinking Media Processor Design

Stereo Depth Extraction

[Figure: left camera image, right camera image, and the resulting depth map]

640x480 @ 30 fps

Requirements
– 11 GOPS

Imagine stream processor
– 12.1 GOPS, 4.6 GOPS/W
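As a rough check on the requirement above, the 11 GOPS figure can be divided by the pixel rate to get an operation budget per pixel. This is only arithmetic on the slide's numbers, and the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic using only the numbers on this slide.
WIDTH, HEIGHT, FPS = 640, 480, 30
REQUIRED_OPS_PER_SECOND = 11e9  # the 11 GOPS requirement

pixels_per_second = WIDTH * HEIGHT * FPS
ops_per_pixel = REQUIRED_OPS_PER_SECOND / pixels_per_second

print(f"{pixels_per_second:,} pixels per second")
print(f"about {ops_per_pixel:.0f} operations per pixel")
```

The budget of roughly a thousand operations per pixel is an average over the whole application, not a statement about any single kernel.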

Page 4: Stream Architecture: Rethinking Media Processor Design

Outline

Stream Processing
VLSI Constraints
Register Organization
Imagine
Conclusions

Page 5: Stream Architecture: Rethinking Media Processor Design

Media Processing Characteristics

Low-precision data
– 24% 8-bit integer operations
– 29% 16-bit integer operations

Abundant data-parallelism

Little global data reuse
– Average of 1.5 references per global data word

Numerous computations per global reference
– 50-500 operations per global data reference

Page 6: Stream Architecture: Rethinking Media Processor Design

Stream Processing

[Figure: stereo depth extraction drawn as a stream program. Image 0 and Image 1 (input data) each pass through two convolve kernels; the resulting streams feed the SAD kernel, which produces the depth map (output data). Legend: kernel, stream.]

Little data reuse (pixels never revisited)

Highly data parallel (output pixels not dependent on other output pixels)

Compute intensive (>60 operations per memory reference)
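The convolve and SAD pipeline on this slide can be sketched in plain Python to show what "kernels connected by streams" means. This is a minimal illustration and not the Imagine programming model: the function names, the filter taps, the window size, and the disparity search are all assumptions, and SAD is taken to mean sum of absolute differences.

```python
# Hypothetical sketch: a stream is a list of pixel rows, a kernel is a function
# that consumes input rows and produces an output row.

def convolve_row(row, taps):
    """Kernel: 1-D filter over one row, with edge pixels clamped."""
    half = len(taps) // 2
    last = len(row) - 1
    out = []
    for x in range(len(row)):
        acc = 0
        for k, tap in enumerate(taps):
            xi = min(max(x + k - half, 0), last)
            acc += tap * row[xi]
        out.append(acc)
    return out

def sad_row(row0, row1, max_disparity, window=1):
    """Kernel: per pixel, pick the shift with the smallest sum of absolute differences."""
    last = len(row0) - 1
    depth = []
    for x in range(len(row0)):
        best_d, best_cost = 0, None
        for d in range(max_disparity + 1):
            cost = 0
            for w in range(-window, window + 1):
                x0 = min(max(x + w, 0), last)
                x1 = min(max(x + w - d, 0), last)
                cost += abs(row0[x0] - row1[x1])
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        depth.append(best_d)
    return depth

def depth_map(image0, image1, taps=(1, 2, 1), max_disparity=4):
    """Stream program: convolve -> convolve on each image, then SAD, row by row."""
    rows = []
    for row0, row1 in zip(image0, image1):
        f0 = convolve_row(convolve_row(row0, taps), taps)
        f1 = convolve_row(convolve_row(row1, taps), taps)
        rows.append(sad_row(f0, f1, max_disparity))
    return rows

# One-row example: image1 is image0 shifted left by two pixels.
row = [0] * 8 + [9, 3, 7, 1, 8] + [0] * 7
shifted = row[2:] + [0, 0]
print(depth_map([row], [shifted])[0])
```

Each output pixel depends only on a small neighborhood of input pixels, so rows (and pixels within a row) can be processed independently, which is the data parallelism the slide refers to.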

Page 7: Stream Architecture: Rethinking Media Processor Design

Locality and Concurrency

[Figure: the stereo depth extraction stream graph. Image 0 and Image 1 each pass through two convolve kernels into the SAD kernel, which produces the depth map.]

Operations within a kernel operate on local data

Streams expose data parallelism

Kernels can be partitioned across chips to exploit control parallelism

Page 8: Stream Architecture: Rethinking Media Processor Design

Sony PlayStation2

[Figure: block diagram. The Emotion Engine contains the MIPS core, FPU, VPU0, VPU1, IPU, and a block for RDRAM, I/O, DMAC, etc.; it feeds the Graphics Synthesizer, which drives the display.]

Page 9: Stream Architecture: Rethinking Media Processor Design

Special vs. General Purpose

Special Purpose
– Fixed function
– High performance

General Purpose
– Programmable
– Insufficient performance

[Figure: general-purpose processor datapath with instruction cache, IR, IP, and registers]

Page 10: Stream Architecture: Rethinking Media Processor Design

Register Files Dwarf ALUs

[Figure: scale drawing against a 1 cm reference comparing the size of one ALU with the size of the register file (RF) needed to support 1, 4, 16, and 32 ALUs, for N arithmetic units sharing one register file]

Page 11: Stream Architecture: Rethinking Media Processor Design

Register File Area

Each cell requires:
– 1 word line per port
– 1 bit line per port

Each cell grows as $p^2$

R registers in the file

Area $\propto p^2 R \propto N^3$

[Figure: register bit cell on a 1-wire grid, crossed by p word lines and p bit lines; dimension labels p, w, and h]

Page 12: Stream Architecture: Rethinking Media Processor Design

Register File Access Delay

Signal must traverse:
– Word line to access cell
– Bit line to transfer data

Wire capacitance dominates

Delay $\propto p R^{1/2} \propto N^{3/2}$

[Figure: register file drawn as a square array with one word line and one bit line highlighted; side lengths labeled in terms of p and R]

Page 13: Stream Architecture: Rethinking Media Processor Design

Register File Power Dissipation

100% utilization requires driving all $p R^{1/2}$ bit lines

Wire capacitance dominates

Power $\propto p^2 R \propto N^3$

[Figure: register file drawn as a square array; side lengths and the bit-line count labeled in terms of p and R]
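The area, delay, and power results for a single central register file can be reproduced with a small model. This sketch assumes, as the $N^3$ and $N^{3/2}$ results imply, that both the port count p and the register count R grow linearly with the number of arithmetic units N. The constants `ports_per_alu` and `regs_per_alu` are hypothetical and only set the scale.

```python
def central_register_file(n_alus, ports_per_alu=3, regs_per_alu=16):
    """Relative area, delay, and power of one register file shared by n_alus units.

    Units are arbitrary; only the growth with n_alus is meaningful.
    """
    p = ports_per_alu * n_alus   # ports grow with N
    r = regs_per_alu * n_alus    # registers grow with N
    area = p ** 2 * r            # area  ~ p^2 R
    delay = p * r ** 0.5         # delay ~ p R^(1/2)
    power = p ** 2 * r           # power ~ p^2 R
    return area, delay, power

small = central_register_file(8)
large = central_register_file(16)
print("area grows by", large[0] / small[0])
print("delay grows by", round(large[1] / small[1], 2))
print("power grows by", large[2] / small[2])
```

Doubling the number of arithmetic units multiplies area and power by 8 and delay by about 2.83, which is the cubic and three-halves power growth stated on the slides.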

Page 14: Stream Architecture: Rethinking Media Processor Design

Centralized Register Organization

– Area, Power $\propto N^3$, Delay $\propto N^{3/2}$

[Figure: N arithmetic units sharing one register file; log-log plot versus number of arithmetic units (1 to 1000, vertical axis 0.1 to 1000) with curves labeled T=1 and T=40]

Page 15: Stream Architecture: Rethinking Media Processor Design

Partitioned Organizations

SIMD
– Data-parallel axis

Distributed Register Files (DRF)
– Instruction-level parallel axis

Hierarchical
– Memory hierarchy axis

Stream
– Optimizing for streams

[Figure: diagrams of the organizations, drawn either with N arithmetic units or with C SIMD clusters of N/C arithmetic units each]

Page 16: Stream Architecture: Rethinking Media Processor Design

SIMD Register Organization

– Area, Power $\propto N^3/C^2$, Delay $\propto (N/C)^{3/2}$

[Figure: C SIMD clusters of N/C arithmetic units each; log-log plot versus number of arithmetic units (1 to 1000, vertical axis 0.1 to 1000) with curves for SIMD (8 clusters) and Central]

Page 17: Stream Architecture: Rethinking Media Processor Design

Distributed Register Organization

– Area, Power $\propto N^2$, Delay $\propto N$

[Figure: N arithmetic units with distributed register files; log-log plot versus number of arithmetic units (1 to 1000, vertical axis 0.1 to 1000) with curves for Central, DRF, and SIMD/DRF]

Page 18: Stream Architecture: Rethinking Media Processor Design

Combining SIMD and DRF

[Figure: grid of register organizations with columns Scalar and SIMD and rows Central and DRF. The scalar column shows N arithmetic units; the SIMD column shows C SIMD clusters of N/C arithmetic units each.]

Page 19: Stream Architecture: Rethinking Media Processor Design

Hierarchical Register Organization

– Area, Power $\propto N^3$, Delay $\propto N^{3/2}$

[Figure: N arithmetic units with a hierarchical register organization; log-log plot versus number of arithmetic units (1 to 1000, vertical axis 0.1 to 1000) with curve labels including Central (T=1, T=40) and Hierarchical (T=40)]

Page 20: Stream Architecture: Rethinking Media Processor Design

Hierarchical Organizations

[Figure: grid of hierarchical register organizations with columns Scalar and SIMD and rows Central and DRF. The scalar column shows N arithmetic units; the SIMD column shows C SIMD clusters of N/C arithmetic units each.]

Page 21: Stream Architecture: Rethinking Media Processor Design

Stream Register Organization

– Area, Power $\propto N^2/C$, Delay $\propto N/C$

[Figure: C SIMD clusters of N/C arithmetic units each; log-log plot versus number of arithmetic units (1 to 1000, vertical axis 0.1 to 1000) with curves for Stream, Hierarchical, and Central]

Page 22: Stream Architecture: Rethinking Media Processor Design

Stream Organizations

[Figure: grid of stream register organizations with columns Scalar and SIMD and rows Central and DRF. The scalar column shows N arithmetic units; the SIMD column shows C SIMD clusters of N/C arithmetic units each.]

Page 23: Stream Architecture: Rethinking Media Processor Design

Comparison of Organizations

[Figure: two log-log plots versus number of arithmetic units (1 to 1000, vertical axis 0.1 to 1000), each marked at 48 arithmetic units. First plot: curves for Central, SIMD, SIMD/DRF, Hier/SIMD/DRF, and Stream/SIMD/DRF. Second plot: curves for Central, SIMD, SIMD/DRF, and one shared curve for Hier/SIMD/DRF and Stream/SIMD/DRF.]

48 ALUs (32-bit), 500 MHz

Stream organization improves on the central organization by
– Area: 195x, Delay: 20x, Power: 430x
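The scaling laws quoted on the individual organization slides can be collected in one place to see how each organization responds to more arithmetic units. The sketch below keeps only the asymptotic terms and drops every constant factor, so it shows growth rates only; it does not reproduce the 195x, 20x, and 430x figures, which depend on constants that are not in this transcript. The dictionary and function names are illustrative.

```python
# Asymptotic scaling laws as stated on the slides (constant factors dropped).
# Each entry maps (n, c) -> (area or power, delay) in arbitrary units.
SCALING = {
    "central":      lambda n, c: (n ** 3,          n ** 1.5),
    "simd":         lambda n, c: (n ** 3 / c ** 2, (n / c) ** 1.5),
    "drf":          lambda n, c: (n ** 2,          n),
    "hierarchical": lambda n, c: (n ** 3,          n ** 1.5),
    "stream":       lambda n, c: (n ** 2 / c,      n / c),
}

def growth_when_doubling(name, n=48, c=8):
    """Return (area/power factor, delay factor) when n doubles and c stays fixed."""
    law = SCALING[name]
    area_1, delay_1 = law(n, c)
    area_2, delay_2 = law(2 * n, c)
    return area_2 / area_1, delay_2 / delay_1

for name in SCALING:
    area_factor, delay_factor = growth_when_doubling(name)
    print(f"{name:13s} area/power x{area_factor:.1f}  delay x{delay_factor:.2f}")
```

With the cluster count held fixed, doubling N costs 8x area in the central, SIMD, and hierarchical laws but only 4x in the DRF and stream laws.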

Page 24: Stream Architecture: Rethinking Media Processor Design

Performance

[Figure: two bar charts over the CENTRAL, SIMD, SIMD/DRF, HIER., and STREAM organizations, with bars for Convolve, DCT, Transform, Shader, FIR, FFT, and Mean. First chart: speedup (axis 0.0 to 1.2), annotated "16% performance drop (8% with latency constraints)". Second chart: performance/area (axis 0 to 250), annotated "180x improvement".]

Page 25: Stream Architecture: Rethinking Media Processor Design

Stream Architecture

Stream Processing
– Matched to media processing
– Exposes locality and concurrency

Stream Register Organization
– Efficiency of special-purpose hardware
– Optimized for streaming applications

Data bandwidth
– Bandwidth hierarchy
– Memory access scheduling
– Conditional streams

[Figure: C SIMD clusters of N/C arithmetic units each]

Page 26: Stream Architecture: Rethinking Media Processor Design

The Imagine Stream Processor

[Figure: block diagram of the Imagine stream processor. Labels: host processor, stream controller, network interface (to the network), stream register file, microcontroller, ALU clusters 0 through 7, and a streaming memory system connected to four SDRAMs.]

Page 27: Stream Architecture: Rethinking Media Processor Design

Arithmetic Clusters

[Figure: one arithmetic cluster. Labels: functional units marked +, +, +, *, *, and /; scratch-pad register file; communication unit (CU) connected to the intercluster network; local register file; cross point; connections from the SRF and to the SRF.]

Page 28: Stream Architecture: Rethinking Media Processor Design

Bandwidth Hierarchy

[Figure: four SDRAMs feed the stream register file at 2 GB/s; the stream register file feeds the ALU clusters at 32 GB/s; 544 GB/s is available inside the ALU clusters.]

41.2 32-bit operations per word of memory bandwidth
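The three bandwidth figures on this slide can be compared directly. The sketch assumes the usual reading of the diagram (2 GB/s at the SDRAM interface, 32 GB/s at the stream register file, 544 GB/s inside the ALU clusters); the level names are labels chosen here for illustration.

```python
# Bandwidth at each level of the hierarchy, in GB/s, as read from the slide.
LEVELS_GB_PER_S = [
    ("memory (SDRAM)", 2.0),
    ("stream register file", 32.0),
    ("ALU clusters", 544.0),
]

memory_bw = LEVELS_GB_PER_S[0][1]
ratios = [bw / memory_bw for _, bw in LEVELS_GB_PER_S]

for (name, bw), ratio in zip(LEVELS_GB_PER_S, ratios):
    print(f"{name:22s} {bw:6.0f} GB/s  ({ratio:.0f}x memory bandwidth)")
```

Each step toward the arithmetic units multiplies the available bandwidth by roughly 16 to 17, so an application keeps up only if most of its references are satisfied at the inner levels.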

Page 29: Stream Architecture: Rethinking Media Processor Design

Stream Recirculation

[Figure: image encoder dataflow laid out across three columns: memory (or I/O), stream register file, and arithmetic clusters. Kernels: color convert, DCT (two instances), IDCT (two instances), run-level encoding, variable length coding. Streams: input image, RGB pixels, luminance pixels, chrominance pixels, transformed luminance, transformed chrominance, luminance reference, chrominance reference, reference luminance image, reference chrominance image, RLE stream, bitstream, encoded bitstream.]

Data Referenced: 835KB, 4.8MB, 154.4MB

Page 30: Stream Architecture: Rethinking Media Processor Design

Bandwidth Demands of FIR Filter

References (bytes)   Stream     DSP               MMX
Memory               ≤ 4.03     36.0 (8.9x)       49.9 (12.4x)
Global RF            4.03       664.1 (164.8x)    296.7 (73.6x)
Local RF             420.02     N/A               N/A
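The multipliers in parentheses are the DSP and MMX reference counts divided by the stream processor's count at the same level. A short check, using only the table's numbers (the dictionary layout is illustrative):

```python
# Bytes referenced at each level, copied from the table above.
STREAM = {"memory": 4.03, "global_rf": 4.03}
OTHERS = {
    "DSP": {"memory": 36.0, "global_rf": 664.1},
    "MMX": {"memory": 49.9, "global_rf": 296.7},
}

ratios = {
    name: {level: round(refs[level] / STREAM[level], 1) for level in STREAM}
    for name, refs in OTHERS.items()
}

for name, by_level in ratios.items():
    print(name, by_level)
```

Because the stream processor's memory figure is an upper bound (≤ 4.03), the memory ratios are lower bounds on the actual advantage.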

Page 31: Stream Architecture: Rethinking Media Processor Design

Bandwidth Utilization of FIR Filter

                     Stream     DSP      MMX
Memory (GB/s)        ≤ 2.62     1.42     2.73
Global RF (GB/s)     2.62       24.88    16.20
Local RF (GB/s)      273.25     N/A      N/A
Performance (GOPS)   17.57      1.01     1.47

Page 32: Stream Architecture: Rethinking Media Processor Design

Performance

[Figure: bar chart of performance in GOPS (axis 0 to 30); bars are grouped as 16-bit applications, floating-point application, 16-bit kernels, and floating-point kernel]

            GOPS
depth       12.1
mpeg        17.9
qrd         12.5
dct         23.9
convolve    25.6
fft         7.0

Page 33: Stream Architecture: Rethinking Media Processor Design

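Power

[Figure: stacked bar chart of power in watts (axis 0.0 to 3.0) for depth, mpeg, qrd, dct, convolve, fft, and the average, broken down into Clock, Clust, SRF, Pins, Mem Sys, and Other; segment labels: 24%, 63%, 5%, 5%, 1%, 2%]

            GOPS/W
depth       4.6
mpeg        6.9
qrd         4.1
dct         10.2
convolve    9.6
fft         2.4
average     6.3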
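The GOPS/W figures here can be combined with the GOPS figures on the preceding Performance slide to estimate the power behind each bar, since watts = GOPS / (GOPS/W). This is a consistency check on the deck's own numbers, and the variable names are illustrative.

```python
# Per-benchmark figures from the Performance and Power slides.
GOPS = {"depth": 12.1, "mpeg": 17.9, "qrd": 12.5,
        "dct": 23.9, "convolve": 25.6, "fft": 7.0}
GOPS_PER_WATT = {"depth": 4.6, "mpeg": 6.9, "qrd": 4.1,
                 "dct": 10.2, "convolve": 9.6, "fft": 2.4}

# Implied power draw for each benchmark.
watts = {name: GOPS[name] / GOPS_PER_WATT[name] for name in GOPS}

# Arithmetic mean of the efficiency figures, for comparison with the slide's average.
mean_gops_per_watt = sum(GOPS_PER_WATT.values()) / len(GOPS_PER_WATT)

for name, w in watts.items():
    print(f"{name:9s} {w:.2f} W")
print(f"mean efficiency: {mean_gops_per_watt:.1f} GOPS/W")
```

The implied power sits between roughly 2.3 W and 3.1 W for every benchmark, and the mean of the six efficiency figures matches the 6.3 GOPS/W average listed above.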

Page 34: Stream Architecture: Rethinking Media Processor Design

Relative Performance and Power Efficiency

[Figure: two bar charts; the legend distinguishes Programmable, Special-Purpose, and Imagine]

Power Efficiency (GOPS/W; legend entries: Dhrystone, FFT)
Imagine      2.4
AD 21160     0.28
TI 'C6701    0.22
SA-1100      1.2

FFT Performance (axis labeled GOPS)
Jaguar II    7640
Imagine      7000
DSP-224      5120
PULSAR       1830
'C67 DSP     412

Page 35: Stream Architecture: Rethinking Media Processor Design

Imagine Floorplan

Tapeout ~Q2 ’01

21 million T’s
– 6M SRF SRAM
– 6M UC SRAM
– 6M Clusters
– 3M Other

Target: 32 FO4
– 300 MHz at SSSS
– 500 MHz at TTSS

TI GS30KA:
– 0.15 μm Ldrawn

457 Signal Pins

[Figure: 12 mm x 12 mm floorplan. Blocks: ALU Clusters 0 through 7, Micro-Controller, Stream Register File, Mem Banks 0 through 3, Addr Gen, Stream Ctrl, Host Int, Network Interface, JTAG/BIST.]

Page 36: Stream Architecture: Rethinking Media Processor Design


Imagine Team

William J. Dally

Ujval Kapasi

Brucek Khailany

Peter Mattson

Jinyung Namkoong

John Owens

Ben Serebrin

Brian Towles

Scott Rixner

Don Alpert (Intel)

Ghazi Ben Amor

Chris Buehler (MIT)

JP Grossman (MIT)

Brad Johanson

Abelardo Lopez-Lagunas

Ben Mowery

Manman Ren

Page 37: Stream Architecture: Rethinking Media Processor Design

Conclusions

Media Processing
– Little data reuse
– Highly data parallel
– Compute intensive

VLSI
– Stream register organization
– Bandwidth hierarchy

Imagine
– Stream architecture
– 10 GOPS sustained application performance
– 5 GOPS/W application power efficiency

[Figure: C SIMD clusters of N/C arithmetic units each]