
Motorola General Business Information, saahpc2009, v1.0

A patent is pending that claims aspects of items and methods described in this paper and presentation.

MOTOROLA and the Stylized M Logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners. © Motorola, Inc. 2008

Extending a Stream Programming Paradigm to Hardware Accelerator Platforms

Sek Chai, Motorola, Schaumburg, [email protected]

Abelardo López-Lagunas, Instituto Tecnológico y de Estudios Superiores de Monterrey, Campus Toluca, México, [email protected]

Nikos Bellas, University of Thessaly, [email protected]

Contributors

Nikos Bellas, Sek Chai, Silviu Chiricescu, Malcolm Dwyer, Ray Essick, Dan Linzmeier, Abelardo Lopez-Lagunas, Brian Lucas, Phil May, Kent Moat, Jim Norris, Michael Schuette, Ali Saidi

Agenda

• Separation of concerns between computation and memory access

• RSVP and Proteus streaming accelerators

• Results and summary

Hardware Acceleration

[Figure: Loops 1, 2, and 3 are identified by profiling the C code and recoded for acceleration; the host cycles they consumed are replaced by host calls to the accelerator.]

In embedded applications:
• Loops take most of the execution time
• Access patterns are usually uniform and static

• Use hardware acceleration for compute-intensive loops

• Keep a single-processor programming flow

Stream Processing

[Figure: Example processing chain as stream kernels: Feature Detection, Modeling & Tracking, Classification, Video Output.]

Characteristics
• Large (possibly infinite) amount of data
• Limited lifetime of each datum
• Compute graph is mostly constant; static computation patterns
• Computation is repetitive on localized data regions
• Kernels are independent and self-contained

Architecture implications
• Explicit parallelism. Overlap data movements to the accelerator.
• Low temporal locality for data. Traditional caches are not effective.
• Memory access patterns are deterministic.

Kernel Parallelism

Kernels read and write streams (no global variables)

Optimal exploitation of instruction- and data-level parallelism via loop unrolling, modulo scheduling, etc.

Lack of a dynamic memory schedule allows the compiler to produce optimal code for the given hardware resources

Chain functional units based on stream consumption and production rate

[Figure: Input Stream 1 and Input Stream 2 feed the chained functional units ALU1, ALU2, MUL1, and ACC1, which produce the Output Stream.]

Communication Locality

Inter-kernel communication through a producer-consumer model.

System-level task scheduling is easier because the communications are explicit in the program

Memory wall problem can be significantly reduced

Chain hardware accelerators based on stream consumption and production rate

[Figure: Kernels A, B, and C chained together, with a buffer in front of Kernel C. The buffer is needed if Kernel C reads data at a different rate or in a different pattern.]
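The buffer between chained kernels can be pictured as a small FIFO that decouples a producer kernel from a consumer that drains at its own rate. The sketch below is only an illustration of that producer-consumer idea; the names `stream_fifo`, `fifo_push`, `fifo_pop`, and the depth of 8 are hypothetical and do not come from RSVP or Proteus.

```c
#include <stdint.h>

/* Hypothetical depth. It must be a power of two so that the free-running
 * counters below keep indexing correctly when they wrap around. */
#define STREAM_FIFO_CAPACITY 8u

typedef struct {
    int16_t  data[STREAM_FIFO_CAPACITY];
    unsigned head;  /* total number of elements consumed so far */
    unsigned tail;  /* total number of elements produced so far */
} stream_fifo;

/* Number of elements currently waiting in the buffer. */
static unsigned fifo_count(const stream_fifo *f) {
    return f->tail - f->head;
}

/* Producer side: returns 1 on success, 0 if the buffer is full and the
 * producer kernel has to stall until the consumer drains an element. */
static int fifo_push(stream_fifo *f, int16_t value) {
    if (fifo_count(f) == STREAM_FIFO_CAPACITY)
        return 0;
    f->data[f->tail % STREAM_FIFO_CAPACITY] = value;
    f->tail++;
    return 1;
}

/* Consumer side: returns 1 and writes *out on success, 0 if the buffer is
 * empty and the consumer kernel has to stall. */
static int fifo_pop(stream_fifo *f, int16_t *out) {
    if (fifo_count(f) == 0)
        return 0;
    *out = f->data[f->head % STREAM_FIFO_CAPACITY];
    f->head++;
    return 1;
}
```

A deeper buffer lets the producer run further ahead of a consumer that reads in bursts, at the cost of more storage.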

Separation of Concerns

[Figure: Data access is described by its location (address) and its shape (stride, span, skip), separately from the computation.]

Separation of Concerns

[Figure: Annotated C is separated into stream descriptors and a DFG of stream kernels.]

Decoupled memory accesses and computations:
• enable better optimization of hardware
• simplify compiler tasks


Hardware Acceleration of Stream Kernels

[Figure: System block diagram. C code is partitioned into stream descriptors and stream kernels. A scalar processor, a memory controller with DRAM, application-specific peripherals, and streaming accelerators with stream units (SU) and buffers are connected by a bus network.]

Accelerate compute-intensive functions in streaming hardware accelerators.

Stream descriptors help define the memory subsystem structure.
Stream kernels define the hardware accelerators.

Streaming Accelerator Template

[Figure: Streaming accelerator template. The stream unit (AGU, address queue, arbiter & bridge to the system bus, bus line buffer, stream buffer, data alignment) delivers the input stream to the data path (control registers, multiplexer tree, functional units with control, registers, accumulators, line buffers, constants, streaming data), which feeds the output stream and other input streams.]

Programmer Visible Architectural Elements

S. Chai, et al., “Streaming Processors for Next Generation Mobile Imaging Applications”, in IEEE Communications Magazine, Circuits for Communication Series, vol 43, no 12, Dec 2005, pp. 81-89

S. Chiricescu, et al., “The Reconfigurable Streaming Vector Processor (RSVP™)”, Micro, December 2003.

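[Figure: Programmer-visible elements: input and output stream units (SU), memory subsystem, interconnect, functional units, constants, accumulators, control, and scheduler.]

[Figure: Speedup. Loops identified by profiling the C code (Loop 1, Loop 2) move from host cycles to accelerator calls; the host cycles are compared against ARM cycles plus accelerator calls and accelerator cycles.]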

Describing Computation

[Figure: Example sDFG, from algorithm to dataflow. Input streams V1, V2, and V3, the constants 8414, 16519, 3208, 15, and 16, three MUL nodes, three ADD nodes, one SRND node, and a store (STO) to output stream V0, arranged over schedule levels 0 to 6.]

Components of sDFG (stream data flow graph)

• Nodes in graph represent computation (performed by functional units)

• Edges in graph represent data movement between functional units

• Multiple computation elements arranged across the data streams
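As a rough scalar equivalent of the example graph, the sketch below assumes one plausible reading of the figure: each input stream is multiplied by one constant, the three products are summed, the sum is shift-rounded (SRND) by 15, and 16 is added before the store. The operation order and the function names are assumptions. The constants do match the BT.601 luma coefficients scaled by 2^15, which suggests (but does not prove) that the slide shows an RGB-to-luma kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* One element of the assumed dataflow: 3 x MUL, 2 x ADD, SRND by 15, ADD 16. */
static uint8_t sdfg_example(uint8_t v1, uint8_t v2, uint8_t v3) {
    int32_t m1 = 8414 * (int32_t)v1;            /* MUL */
    int32_t m2 = 16519 * (int32_t)v2;           /* MUL */
    int32_t m3 = 3208 * (int32_t)v3;            /* MUL */
    int32_t sum = m1 + m2;                      /* ADD */
    sum += m3;                                  /* ADD */
    int32_t rounded = (sum + (1 << 14)) >> 15;  /* SRND: shift right by 15 with rounding */
    return (uint8_t)(rounded + 16);             /* ADD 16, then STO */
}

/* Apply the kernel across three input streams and one output stream. */
static void sdfg_kernel(const uint8_t *v1, const uint8_t *v2, const uint8_t *v3,
                        uint8_t *v0, size_t n) {
    for (size_t i = 0; i < n; i++)
        v0[i] = sdfg_example(v1[i], v2[i], v3[i]);
}
```

In the streaming hardware each statement in `sdfg_example` would be a node mapped to a functional unit, and the loop in `sdfg_kernel` would be replaced by the stream units feeding one element per iteration.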

Datapath Construction

[Figure: An sDFG that loads stream V1 (vld), passes it through a chain of vmin and vmax nodes with operands Store z0 to Store z7 and the constants 0 and 255, and stores to stream V0 (vst). The graph is mapped onto min and max functional units behind a multiplexer tree, between the input stream interface and the output stream interface.]

Resource constraints:
• 192 ALU bits
• 128 MUL bits
• 64 SHIFTER bits
• 128 NAMED REG bits
• 4, 32-bit INSTREAM PORTS
• 1, 32-bit OUTSTREAM PORT

Map sDFG to functional units

Streaming DFG – Modulo Scheduling

Steady state schedule

Prologue

Epilogue

Hardware mechanisms ensure that the prologue and the epilogue are executed correctly
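One common way to run the prologue and epilogue without separate code is to replay the steady-state schedule and mask the stages that hold no valid iteration. The slides do not say which mechanism the hardware uses, so the sketch below only illustrates that general idea; `stage_active`, `total_passes`, and the other names are hypothetical.

```c
/* Sketch of a modulo-scheduled loop: the steady-state schedule is replayed
 * n_iter + stages - 1 times, and a per-stage predicate masks the stages
 * that have no valid iteration during the prologue and epilogue. */

/* Returns 1 if pipeline stage `s` holds a valid loop iteration in pass `k`. */
static int stage_active(int k, int s, int n_iter) {
    int iteration = k - s;  /* iteration that occupies stage s during pass k */
    return iteration >= 0 && iteration < n_iter;
}

/* Number of times the steady-state schedule is replayed. */
static int total_passes(int n_iter, int stages) {
    return n_iter > 0 ? n_iter + stages - 1 : 0;
}

/* Each pass takes II (initiation interval) cycles. */
static long total_cycles(int n_iter, int stages, int ii) {
    return (long)total_passes(n_iter, stages) * ii;
}

/* Counts the unmasked stage executions over the whole loop. */
static int executed_stage_slots(int n_iter, int stages) {
    int count = 0;
    for (int k = 0; k < total_passes(n_iter, stages); k++)
        for (int s = 0; s < stages; s++)
            count += stage_active(k, s, n_iter);
    return count;
}
```

Passes with `k < stages - 1` form the prologue and passes with `k >= n_iter` form the epilogue; every iteration still visits every stage exactly once, so `executed_stage_slots` equals `n_iter * stages`.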

Describing Memory Access

A method to move data efficiently using known shape of data

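[Figure: A linear array of elements 0 to 18 accessed from start address SA with STRIDE = 3, SPAN = 4, and SKIP = 5.]

[Figure: 2-D subarrays (row). A 4 x 4 block of elements 0 to 15 inside a larger array (figure dimensions labeled 4 and 92), described by (SA, stride, span, skip) = (4, 1, 4, 97).]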

For data prefetch, staging, and reuse

• Efficient data movement
• Utilize unused bandwidth
• Less sensitive to peak bandwidth

Mapped to “stream unit” or smart DMA in hardware accelerators
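A descriptor of this kind can be expanded into an address sequence with a few lines of code. The sketch below assumes that the skip is added on top of the stride after every `span` elements; that reading is inferred from the `_vishape` calls in the cfir32 example, where a skip of `-2*nh` has to bring the coefficient stream back to `h[0]`. If the skip instead replaced the last stride, each span boundary would differ by one stride. The type and function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    long start;   /* SA: index of the first element */
    long stride;  /* step between consecutive elements */
    long span;    /* number of elements fetched before the skip is applied */
    long skip;    /* extra offset added after every `span` elements (assumed) */
} stream_desc;

/* Writes the first n element indices of the stream into out[]. */
static void stream_indices(const stream_desc *d, long *out, size_t n) {
    long addr = d->start;
    long in_span = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = addr;
        addr += d->stride;            /* normal step to the next element */
        if (++in_span == d->span) {   /* end of a span: apply the skip */
            addr += d->skip;
            in_span = 0;
        }
    }
}
```

With `nh = 2`, the coefficient shape `{0, 1, 4, -4}` yields 0, 1, 2, 3, 0, 1, 2, 3, and the input shape `{3, -1, 4, 6}` yields 3, 2, 1, 0, 5, 4, 3, 2, so the input window slides forward by one complex sample per output.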

Stream Unit

• Keep the bus busy with requests.
• Align and order stream elements for the data path.
• Prefetch data using stream descriptors.
• Store bus transfers in transit.

Larger queues and buffers allow more aggressive prefetching of stream elements

[Figure: Stream unit block diagram: AGU, address queue, arbiter & bridge to the system bus, bus line buffer, stream buffer, and data alignment, producing the input stream.]

Stream Unit

[Figure: Stream unit (AGU, address queue, arbiter & bridge to the system bus, bus line buffer, stream buffer, data alignment) fetching two 4 x 4 subarrays (elements 0 to 15, figure dimensions labeled 4 and 92).]

• Stream shape & access pattern
• Required stream bandwidth
• Bus latency to memory

Reconfigurable Streaming Vector Processor

[Figure: RSVP-I development board and ARM946 + RSVP™-I SoC. An application processor (ARM946ES™) and the RSVP™ coprocessor share memory; the SoC also contains a tile buffer, sequencer, RAM, PLLs, peripherals, and SIFs.]

The SoC (ARM946 + RSVP) is built in 0.18 µm and contains 9.5M transistors on a 5.04 x 9.03 mm² die. Power consumption is 587 mW (1.8 V, 120 MHz core, 60 MHz bus).

A software programmable vector accelerator based on a “streaming dataflow” programming model


Proteus Streaming Accelerator Design Flow

Input (.dfg): Streaming DFG + Stream Descriptors + Resource Constraints + System Constraints

1. FU allocation: FU instantiation + iteration interval determination (example FUs: vld, vadd, vshl, vmul, vsub, vst)
2. Modulo scheduling over control steps 0, 1, .., II-1 (example: vld, vadd; vmul; vsub, vshl), written to the .hw file
3. Build stream interface and build data path
4. Generate Verilog (.v)
5. 3rd party synthesis & P&R (FPGA)

[Figure: Stream interface template (AGU, address buffer with entries Addr 1 to Addr 4 and valid bits, address merge, arbiter to the system bus, e.g. PLB, bus line buffer, stream buffer, data alignment) and data path template (streaming data, multiplexer tree, FUs with control, registers, constants, ACC), with connections to other input stream interfaces and to the output stream interface.]

DFG and Scheduler example

Linear DFG code

    // DFG for the complex FIR
    vname d_cfir32
    vbegin L_end-L_start,0

    L_start:
    L0:  vconst 0            // clear the accumulators
    L1:  vputa L0, a0
    L2:  vputa L0, a1

    vinner
    L3:  vld.u16 (v1)        // load the input data
    L4:  vld.u16 (v1)        // load the input data
    L5:  vld.u16 (v2)        // load the coefficients
    L6:  vld.u16 (v2)        // load the coefficients
    L7:  vmul.s32 L4,L5      // first multiply
    L8:  vmul.s32 L3,L6      // second multiply
    L9:  vsub.s32 L7,L8      // real part
    L10: vadd.s32 L7,L8      // imaginary part

         vadda L9,a0         // accumulate real
         vadda L10,a1        // accumulate imaginary

    vpost
    L11: vgeta.s16 a0
    L12: vgeta.s16 a1
    L13: vst L11, (v0)       // store the real
    L14: vst L12, (v0)       // store the imaginary
    L_end: vend

In-line C code

    // In-lined C code
    void cfir32(const short *x, const short *h, short *r,
                const short nh, const short nr) {

        extern void d_cfir32;     // entry point to the DFG
        _vload(&d_cfir32);        // Load DFG

        // Set up stream descriptors
        _vihalf(1, (unsigned short *) &x[2*nh - 1]);
        _vishape(1, -1, 2*nh, 2*nh + 2);
        _vihalf(2, (unsigned short *) &h[0]);
        _vishape(2, 1, 2*nh, -2*nh);
        _vohalf(0, r);

        // Start RSVP™ co-processor
        _vloop2(&d_cfir32, nh, nr);   // execute DFG

    }

[Figure: Graphical DFG of the complex FIR, divided into a pre-outer section (vconst, vputa(a0), vputa(a1)), an inner loop (vld(v1), vld(v2), vmul, vsub, vadd, vadda(a0), vadda(a1)), and a post-outer section (vgeta(a0), vgeta(a1), vst).]

Proteus Scheduler Features
• Does not require separate prolog and epilog code
• Can handle nested loop constructs without having to create different schedules for different parts of the nested loop construct
• Supports tightly coupled memories (e.g. double-buffered LUT)
• Handles resources (FUs, fabric, queues) for different application optimizations

Hardware File Template (.hw)

    .DataPathBegin
    // Functional Units
    fu(adder0, adder, 1);
    fu(logic0, logic, 1);
    fu(sin1, InStream, 1);
    ... more ...
    // Functional Unit Slices
    sfu(adder0.0, {vsub, vabs, vnop}, {32,32,16}, 1, 1);
    sfu(logic0.0, {vmin, vmax, vif, vnop}, {16,16,32,16}, 1, 1);
    sfu(sin1.0, {vld, vnop}, {16}, 0, 1);
    ... more ...
    // Line Queues
    queue(Qadder0_0_0, adder0.0, [31..0], 1);
    queue(Qlogic0_0_0, logic0.0, [15..0], 1);
    queue(Qsin1_0_0, sin1.0, [15..0], 1);
    ... more ...
    .DataPathEnd
    .ControlPathBegin
    cstep(3)
    //
    // Operations for each Functional Unit
    ctl_ops(adder0, {{vsub.s32.u32.u16}, {vnop}, {vabs.u32.s32.u16}});
    ctl_ops(logic0, {{vmax.u16.u16.u32.u16}, {vnop}, {vmin.u16.u16.u32.u16}});
    ctl_ops(sin1, {{vld.u16}, {vnop}, {vnop}});
    ... more ...
    // Operands for each Functional Unit Slice Input
    ctl_opnds(adder0.0.A, {{Qlogic2_0_0.0}, {vnop}, {Qadder0_0_0.0}});
    ctl_opnds(adder0.0.B, {{Qlogic3_0_0.0}, {vnop}, {vnop}});
    ctl_opnds(logic0.0.A, {{Qsin1_0_0.0}, {vnop}, {Qlogic0_0_0.0}});
    ctl_opnds(logic0.0.B, {{Qsca_z15_0_0.0}, {vnop}, {Qsca_z0_0_0.0}});
    ctl_opnds(logic0.0.C, {{vnop}, {vnop}, {vnop}});
    ... more ...
    // Controls for each Queue
    ctl_queue(Qadder0_0_0, {{1}, {0}, {1}});
    ctl_queue(Qlogic0_0_0, {{1}, {0}, {1}});
    ... more ...
    .ControlPathEnd

Proteus HW File features

• Structured intermediate format (IF) for streaming data path

• Facilitates debugging with consistent naming/labels

• Describes resources such as FUs, fabric, queues, LUTs.

• Scalable with new ISA updates

Security & Surveillance

Sek Chai, et al., “Reconfigurable Streaming Architectures for Embedded Smart Camera Applications”, Embedded Computer Vision Workshop, New York, June 2006.

N. Bellas, Sek Chai, M. Dwyer, D. Linzmeier, “FPGA implementation of a license plate recognition SoC using automatically generated streaming accelerators”, Reconfigurable Architecture Workshop (RAW), April 2006.

Smart cameras read license plates and compare them against a database.

Automotive

Smart cameras detect and track lanes, and warn drivers of hazards and unexpected lane departures.

[Figure: Lane detection pipeline. Stages: Capture Images, Convert to gray scale, Noise Filter, Perspective Mapping, Edge Detection, Line Detection (Hough Transform), Lane Extraction, Road modeling, Driver Warning. The stages are grouped into Scene Calibration, Scene Analysis (to find lines on the road), and Lane Extraction (to find lane and road markings).]

S. Chiricescu, S. Chai, K. Moat, B. Lucas, P. May, J. Norris, R. Essick, M. Schuette, “RSVP II: A Next generation Automotive Vector Processor,” in IEEE Intelligent Vehicles Symposium, June 2005

Lens correction

Addresses issues related to filtering and lens-distortion correction for visual communications

N. Bellas, Sek Chai, M. Dwyer, D. Linzmeier, “Real-Time Fisheye Lens Distortion Correction Using Automatically Generated Streaming Accelerators”, IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2009

[Figure: Inverse mapping from perspective space to fisheye space, shown for a grid of perspective-space pixels from (x-2,y-1) to (x+1,y+1).]

Stereoscopic geometry of Fisheye Lenses

[Figure: Rectilinear projection versus fisheye projection.]

• Fisheye lenses refract the incident light rays towards the central perspective point

Algorithmic steps (A)

From rectilinear to fisheye space coordinates (inverse mapping)

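$$x_d = \frac{2R}{\pi} \cdot \frac{\operatorname{atan}\left(\dfrac{\sqrt{X_c^2 + Y_c^2}}{Z_c}\right)}{\sqrt{\left(\dfrac{Y_c}{X_c}\right)^2 + 1}} + x_h$$

$$y_d = \frac{2R}{\pi} \cdot \frac{\operatorname{atan}\left(\dfrac{\sqrt{X_c^2 + Y_c^2}}{Z_c}\right)}{\sqrt{\left(\dfrac{X_c}{Y_c}\right)^2 + 1}} + y_h$$

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} i \\ j \\ 1 \end{bmatrix}$$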

Algorithmic steps (B)

• Approximate pixel values at fractional positions in fisheye space
• Complex memory access pattern due to the non-linear projection trajectory

Algorithmic steps (B) Bicubic interpolation

• Bicubic interpolation uses a 4x4 window of pixels to approximate intermediate points

• Interpolation weights depend on the relative position of the intermediate point
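A minimal sketch of the weight computation follows. The slides do not say which cubic kernel is used, so this example assumes the common cubic convolution kernel with a = -0.5 (Catmull-Rom); the function names and the window layout, with `win[1][1]` as the pixel just above and to the left of the sample point, are also assumptions.

```c
/* Cubic convolution weights (assumed a = -0.5) for the four taps around a
 * sample that lies a fraction t in [0, 1) past the second tap. */
static void bicubic_weights(double t, double w[4]) {
    double t2 = t * t;
    double t3 = t2 * t;
    w[0] = 0.5 * (-t3 + 2.0 * t2 - t);
    w[1] = 0.5 * (3.0 * t3 - 5.0 * t2 + 2.0);
    w[2] = 0.5 * (-3.0 * t3 + 4.0 * t2 + t);
    w[3] = 0.5 * (t3 - t2);
}

/* Interpolates inside a 4x4 window at fractional offset (fx, fy) measured
 * from win[1][1]; rows are indexed by y, columns by x. */
static double bicubic_sample(const double win[4][4], double fx, double fy) {
    double wx[4], wy[4];
    bicubic_weights(fx, wx);
    bicubic_weights(fy, wy);
    double acc = 0.0;
    for (int row = 0; row < 4; row++) {
        double row_acc = 0.0;
        for (int col = 0; col < 4; col++)
            row_acc += wx[col] * win[row][col];   /* horizontal pass */
        acc += wy[row] * row_acc;                 /* vertical pass */
    }
    return acc;
}
```

The four weights always sum to 1, and at zero offset they reduce to (0, 1, 0, 0), so the sample equals `win[1][1]`.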

Architectural Optimization Strategies

Use block tiling to correct a block of pixels at a time

Streaming Accelerator

Parallelism extracted by Proteus
• Instruction Level Parallelism (ILP) naturally expressed in streaming
  – About 400 executed instructions/cycle
  – Modulo scheduler with II=2
• Task level parallelism
  – Concurrent
  – Pipeline

Hardware details: Virtex-4 LX-80 FPGA

SPEED
  Clock freq.     62.5 MHz (single clock)
  Throughput      22 frames/sec

AREA
  Logic Slices    11082 (30%)
  DSP48 units     71 (88%)
  BRAMs           109 (54.5%)

BRAM types (number per type):
  4096x8 (15)
  13728x50 (1)
  256x16, 512x7, 256x17, 256x7, 256x17, 256x7
  6864x8 (2)
  6864x16 (2)
  3432x8 (2)
  3432x16 (2)

Performance

[Figure: Processing speed (frames/sec, axis 0 to 30) for Cell (only PPE; 1, 2, 4, and 8 SPE), Core 2 Quad (1T, 2T, 4T), and FPGA (Virtex-4 LX80). Series: CL optimizations, CL+FL optimizations, Inverse Mapping Amortization. Annotation: bounded by available FPGA SRAM.]

(unpublished results)

Performance

[Figure: Module runtime breakdown (0% to 100%) into Inverse Mapping, Bicubic Interpolation, and Low Pass Filter. Configurations: Cell (only PPE; CL, CL+FL, and IMA with 1, 2, 4, and 8 SPE), Core 2 Quad (CL and CL+FL with 1T, 2T, and 4T), and FPGA (Virtex-4 LX80). Annotation: floating point intensive.]

(unpublished results)

Research summary

Separation of concerns
– Memory access patterns are defined explicitly by the programmer for RSVP and Proteus

Extend stream descriptors
– Compiler-generated descriptors that match the transfer mechanisms at each level of the memory hierarchy

Plan to open source the Proteus tool

[Figure: Memory hierarchy: external memory, memory controller, cache, stream register file, local register file, and datapaths, with stream descriptors at each level.]
