Motorola General Business Information, saahpc2009, v1.0
A patent is pending that claims aspects of items and methods described in this paper and presentation.
MOTOROLA and the Stylized M Logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners. © Motorola, Inc. 2008
Extending a Stream Programming Paradigm to Hardware Accelerator Platforms
Sek Chai, Motorola, Schaumburg, [email protected]
Abelardo López-Lagunas, Instituto Tecnológico y de Estudios Superiores de Monterrey, Campus Toluca, Mé[email protected]
Nikos Bellas, University of Thessaly, [email protected]
Contributors
Nikos Bellas, Sek Chai, Silviu Chiricescu, Malcolm Dwyer, Ray Essick, Dan Linzmeier, Abelardo Lopez-Lagunas,
Brian Lucas, Phil May, Kent Moat, Jim Norris, Michael Schuette, Ali Saidi
Agenda
• Separation of concerns between computation and memory access
• RSVP and Proteus streaming accelerators
• Results and summary
Hardware Acceleration
[Figure: Identify loops by profiling C code. Host cycles spent in Loop 1, Loop 2, and Loop 3 are replaced by host calls to the accelerated loops.]
Recode loops for acceleration
In embedded applications:
• Loops take most of the execution time
• Access patterns are usually uniform and static
• Use hardware acceleration for compute intensive loops
• Keep single processor programming flow
Stream Processing
[Figure: example processing chain as stream kernels: Feature Detection, Modeling & Tracking, Classification, Video Output.]
Characteristics
• Large (possibly infinite) amount of data
• Limited lifetime of datum
• Compute graph is mostly constant; static computation patterns
• Computation is repetitive on localized data regions
• Kernels are independent and self-contained
Architecture implications
• Explicit parallelism. Overlap data movements to accelerator.
• Low temporal locality for data. Traditional caches are not effective.
• Memory access patterns are deterministic.
Kernel Parallelism
Kernels read and write streams (no global variables)
Optimal exploitation of instruction- and data-level parallelism via loop unrolling, modulo scheduling, etc.
Lack of dynamic memory schedule allows the compiler to produce optimal code for the given hardware resources
Chain functional units based on stream consumption and production rate
[Figure: InputStream1, InputStream2, and OutputStream connected through functional units ALU1, ALU2, MUL1, and ACC1.]
Communication Locality
Inter-kernel communication through a producer-consumer model.
System-level task scheduling is easier because the communications are explicit in the program
Memory wall problem can be significantly reduced
Chain hardware accelerators based on stream consumption and production rate
[Figure: Kernel A, Kernel B, and Kernel C with a Buffer. The buffer is needed if Kernel C reads data at a different rate or in a different pattern.]
Separation of Concerns
Data Access
• Location: address
• Shape: stride, span, skip
Computation
Separation of Concerns
Stream Descriptors
DFG
Stream kernels
Annotated C
Decoupled memory accesses and computations:
• enable better optimization of hardware
• simplify compiler tasks
[Figure: example processing chain as stream kernels: Feature Detection, Modeling & Tracking, Classification, Video Output.]
Hardware Acceleration of Stream Kernels
Accelerate compute-intensive functions in streaming hardware accelerators.
Stream Descriptors help define memory subsystem structure.
Stream Kernels define hardware accelerators.
[Figure: system block diagram with a scalar processor (C code), memory controller and DRAM, application-specific peripherals, and streaming accelerators with stream units (SU) and buffers, connected by a bus network. Stream descriptors and stream kernels are derived from the C code.]
Streaming Accelerator Template
[Figure: streaming accelerator template. Stream unit: AGU, address queue, data alignment, bus line buffer, stream buffer, and an arbiter & bridge to the system bus, serving the input stream, other input streams, and the output stream. Data path: multiplexer tree, functional units (FU) with control, registers, accumulator, line buffers, constants, control registers, and streaming data.]
Programmer Visible Architectural Elements
S. Chai, et al., "Streaming Processors for Next Generation Mobile Imaging Applications", in IEEE Communications Magazine, Circuits for Communication Series, vol 43, no 12, Dec 2005, pp. 81-89
S. Chiricescu, et al., "The Reconfigurable Streaming Vector Processor (RSVP™)", Micro, December 2003.
[Figure: programmer-visible elements: input stream units (Input SU), output stream unit (Output SU), memory subsystem, interconnect, functional units, constants, accumulators, control, and scheduler.]
[Figure: Identify loops by profiling C code. Host cycles for Loop 1 and Loop 2 become ARM cycles plus accelerator calls, with the loops running in accelerator cycles; the difference is labeled Speedup.]
Describing Computation
[Figure: example sDFG. Input streams V1, V2, V3 and output stream V0; constants 8414, 16519, 3208, 15, and 16; nodes MUL (x3), ADD (x3), and SRND; rows labeled 0 to 6.]
Components of sDFG(stream data flow graph)
• Nodes in graph represent computation (performed by functional units)
• Edges in graph represent data movement between functional units
• Multiple computation elements arranged across the data streams
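The node-and-edge view above can be illustrated with a small C sketch. It is not the RSVP or Proteus representation: the `DfgNode` layout, the `NodeKind` names, and `dfg_eval` are hypothetical, chosen only to show nodes as operations and edges as references to the producing node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds; illustrative names, not RSVP opcodes. */
typedef enum { NODE_INPUT, NODE_CONST, NODE_MUL, NODE_ADD } NodeKind;

typedef struct {
    NodeKind kind;
    int src_a;   /* edge: index of the node producing operand A (-1 if unused) */
    int src_b;   /* edge: index of the node producing operand B (-1 if unused) */
    long value;  /* constant value, or input slot number for NODE_INPUT */
} DfgNode;

/* Evaluate the graph for one set of stream elements.
 * Nodes must be listed so that every edge points to an earlier node.
 * Returns the value produced by the last node. */
long dfg_eval(const DfgNode *nodes, size_t n, const long *inputs)
{
    long out[32];
    assert(n > 0 && n <= 32);
    for (size_t i = 0; i < n; i++) {
        const DfgNode *nd = &nodes[i];
        switch (nd->kind) {
        case NODE_INPUT:
            out[i] = inputs[nd->value];
            break;
        case NODE_CONST:
            out[i] = nd->value;
            break;
        case NODE_MUL:
        case NODE_ADD:
            /* Both operands must come from nodes already evaluated. */
            assert(nd->src_a >= 0 && (size_t)nd->src_a < i);
            assert(nd->src_b >= 0 && (size_t)nd->src_b < i);
            out[i] = (nd->kind == NODE_MUL)
                         ? out[nd->src_a] * out[nd->src_b]
                         : out[nd->src_a] + out[nd->src_b];
            break;
        }
    }
    return out[n - 1];
}
```

A graph for `in0 * 3 + in1` would list five nodes (input, constant, multiply, input, add); calling `dfg_eval` once per stream element mimics how the same graph is applied repeatedly across the data streams.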
[Figure labels: STO node; Algorithm; Dataflow.]
Datapath Construction
[Figure: an sDFG of vld (v1), vmin, vmax, and vst operations on streams V1 and V0, with constants 255 and 0 and stores to z0 through z7, mapped through the multiplexer tree onto min and max functional units between the input stream interface and the output stream interface.]
Resource constraints:
• 192 ALU bits
• 128 MUL bits
• 64 SHIFTER bits
• 128 NAMED REG bits
• 4, 32-bit INSTREAM PORTS
• 1, 32-bit OUTSTREAM PORT
Map sDFG to functional units
Streaming DFG – Modulo Scheduling
Steady state schedule
Prologue
Epilogue
Hardware mechanisms ensure that the prologue and the epilogue are executed correctly
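The relation between prologue, steady state, and epilogue can be sketched with a little arithmetic in C. This is a generic model of a modulo-scheduled loop, not the Proteus scheduler: the function names are hypothetical, and it assumes one iteration starts every `ii` cycles and each iteration occupies `sched_len` cycles.

```c
#include <assert.h>

typedef enum { PHASE_PROLOGUE, PHASE_STEADY, PHASE_EPILOGUE } Phase;

/* Count the loop iterations that are executing at a given cycle. */
int iterations_in_flight(int cycle, int n_iter, int ii, int sched_len)
{
    int count = 0;
    assert(ii > 0 && sched_len > 0 && n_iter >= 0);
    for (int i = 0; i < n_iter; i++) {
        int age = cycle - i * ii;  /* cycles since iteration i started */
        if (age >= 0 && age < sched_len)
            count++;
    }
    return count;
}

/* Total cycles needed to run n_iter overlapped iterations. */
int total_cycles(int n_iter, int ii, int sched_len)
{
    assert(ii > 0 && sched_len > 0 && n_iter >= 0);
    return n_iter == 0 ? 0 : (n_iter - 1) * ii + sched_len;
}

/* Classify a cycle: the pipeline fills during the prologue, runs with a
 * full set of overlapped stages in steady state, and drains in the
 * epilogue once no new iteration is started. */
Phase schedule_phase(int cycle, int n_iter, int ii, int sched_len)
{
    int stages = (sched_len + ii - 1) / ii;  /* ceil(sched_len / ii) */
    assert(ii > 0 && sched_len > 0 && n_iter >= 0);
    if (cycle < (stages - 1) * ii)
        return PHASE_PROLOGUE;
    if (cycle >= n_iter * ii)
        return PHASE_EPILOGUE;
    return PHASE_STEADY;
}
```

With `ii = 2`, `sched_len = 6`, and five iterations, the loop takes 14 cycles; cycles 0 to 3 fill the pipeline, cycles 4 to 9 keep three iterations in flight, and cycles 10 to 13 drain it.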
Describing Memory Access
A method to move data efficiently using known shape of data
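[Figure: a linear array of elements 0 to 18 annotated with SA (start address), STRIDE = 3, SPAN = 4, and SKIP = 5.]
2-D subarrays (row): (SA, stride, span, skip) = (4, 1, 4, 97)
[Figure: a 4x4 subarray (elements 0 to 15) inside a wider 2-D array, with dimension labels 4 and 92.]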
For data prefetch, staging, and reuse
Efficient data movement
Utilize unused bandwidth
Less sensitive to peak bandwidth
Mapped to “stream unit” or smart DMA in hardware accelerators
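How a stream unit might expand an (SA, stride, span, skip) descriptor into element addresses can be sketched in C. The `StreamDesc` struct and `stream_indices` function are hypothetical. The sketch assumes that skip is added on top of the stride after every group of span elements; that reading makes the cfir32 descriptors shown later in this deck advance by one complex sample per output, but the transcript does not confirm that every example on the slides follows the same convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor layout following the (SA, stride, span, skip)
 * tuple; all fields are in units of stream elements. */
typedef struct {
    long start;   /* SA: index of the first element */
    long stride;  /* step between consecutive elements */
    long span;    /* number of elements in a group */
    long skip;    /* extra step applied after each group */
} StreamDesc;

/* Write the first `count` element indices visited by the descriptor. */
void stream_indices(const StreamDesc *d, long *out, size_t count)
{
    long addr = d->start;
    long in_group = 0;
    assert(d->span > 0);
    for (size_t k = 0; k < count; k++) {
        out[k] = addr;
        addr += d->stride;
        if (++in_group == d->span) {
            /* ASSUMPTION: skip is applied in addition to the stride.
             * If skip replaces the final stride instead, use
             * addr += d->skip - d->stride; here. */
            addr += d->skip;
            in_group = 0;
        }
    }
}
```

Under this convention a descriptor with start 3, stride -1, span 4, and skip 6 visits 3, 2, 1, 0 and then 5, 4, 3, 2, which is a reversed window sliding forward by two elements.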
Stream Unit
Keep bus busy with requests.
Align and order stream elements for data path.
Prefetch data using Stream Descriptors.
Store bus transfers in transit.
Larger queues and buffers allow more aggressive prefetching of stream elements
[Figure: stream unit for an input stream: AGU, address queue, data alignment, bus line buffer, stream buffer, and an arbiter & bridge to the system bus.]
Stream Unit
[Figure: the same stream unit block diagram, shown with the 4x4 subarray example (elements 0 to 15, dimension labels 4 and 92).]
Stream shape & access pattern
Required stream bandwidth
Bus latency to memory
Reconfigurable Streaming Vector Processor
[Figure: application processor and coprocessor sharing memory; the RSVP-I development board; the ARM946+RSVP™-I SoC with labeled blocks ARM946ES™, tile buffer, sequencer, RAM, PLLs, RSVP™ architecture, peripherals, and SIFs.]
SoC (ARM946+RSVP) in 0.18 µm, containing 9.5M transistors in a 5.04 x 9.03 mm² die. Power consumption is 587 mW (1.8 V, 120 MHz core, 60 MHz bus).
A software programmable vector accelerator based on a “streaming dataflow” programming model
Proteus Streaming Accelerator Design Flow
Input (.dfg): Streaming DFG + Stream Descriptors + Resource Constraints + System Constraints
1. FU allocation: FU instantiation and iteration interval (II) determination
2. Modulo scheduling over slots 0, 1, .., II-1 (for example vld, vadd; vmul; vsub, vshl), producing the .hw file
3. Build stream interface and build data path from the Stream Interface Template and the Data Path Template
4. Generate Verilog (.v)
5. 3rd party synthesis and P&R (FPGA)
[Figure: Stream Interface Template with AGU, address merge, address buffer (Addr 1 to Addr 4), data alignment, bus line buffer, stream buffers, and an arbiter on the system bus (e.g. PLB). Data Path Template with multiplexer tree, functional units (FU) with control, registers, accumulator (ACC), constants, and streaming data, connected to the other input stream interfaces and the output stream interface.]
DFG and Scheduler example

// DFG for the complex FIR
vname d_cfir32
vbegin L_end-L_start,0
L_start:
L0: vconst 0                // clear the accumulators
L1: vputa L0, a0
L2: vputa L0, a1
vinner
L3: vld.u16 (v1)            // load the input data
L4: vld.u16 (v1)            // load the input data
L5: vld.u16 (v2)            // load the coefficients
L6: vld.u16 (v2)            // load the coefficients
L7: vmul.s32 L4,L5          // first multiply
L8: vmul.s32 L3,L6          // second multiply
L9: vsub.s32 L7,L8          // real part
L10: vadd.s32 L7,L8         // imaginary part
vadda L9,a0                 // accumulate real
vadda L10,a1                // accumulate imaginary
vpost
L11: vgeta.s16 a0
L12: vgeta.s16 a1
L13: vst L11, (v0)          // store the real
L14: vst L12, (v0)          // store the imaginary
L_end: vend

// In-lined C code
void cfir32(const short *x, const short *h, short *r,
            const short nh, const short nr) {
    extern void d_cfir32;       // entry point to the DFG
    _vload(&d_cfir32);          // Load DFG

    // Set up stream descriptors
    _vihalf(1, (unsigned short *) &x[2*nh - 1]);
    _vishape(1, -1, 2*nh, 2*nh + 2);
    _vihalf(2, (unsigned short *) &h[0]);
    _vishape(2, 1, 2*nh, -2*nh);
    _vohalf(0, r);

    // Start RSVP™ co-processor
    _vloop2(&d_cfir32, nh, nr); // execute DFG
}
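For readers without the RSVP tools, the listing can be approximated by a scalar C emulation that mirrors the DFG one operation at a time. This is a sketch, not Motorola's reference code: `cfir32_ref` is a hypothetical name, and it assumes that `_vloop2` runs the inner section `nh` times for each of `nr` outputs, that each stream advances by its stride and additionally by its skip after every span, that loads can be treated as signed `short` values, and that `vgeta.s16` behaves like a plain cast without saturation. It follows the listing literally and does not try to be a textbook complex FIR.

```c
#include <assert.h>

/* Scalar emulation of the d_cfir32 listing (all semantics assumed, see text).
 * x must hold at least 2*(nh + nr - 1) values, h holds 2*nh values,
 * and r receives 2*nr values. */
void cfir32_ref(const short *x, const short *h, short *r, short nh, short nr)
{
    long xi = 2L * nh - 1;  /* v1 starts at &x[2*nh - 1], stride -1 */
    long hi = 0;            /* v2 starts at &h[0], stride +1 */
    long ri = 0;            /* v0 writes r sequentially */
    assert(nh > 0 && nr >= 0);
    for (int out = 0; out < nr; out++) {
        long a0 = 0, a1 = 0;              /* L0-L2: clear accumulators */
        for (int k = 0; k < nh; k++) {
            long l3 = x[xi]; xi -= 1;     /* L3: vld (v1) */
            long l4 = x[xi]; xi -= 1;     /* L4: vld (v1) */
            long l5 = h[hi]; hi += 1;     /* L5: vld (v2) */
            long l6 = h[hi]; hi += 1;     /* L6: vld (v2) */
            long l7 = l4 * l5;            /* L7: first multiply */
            long l8 = l3 * l6;            /* L8: second multiply */
            a0 += l7 - l8;                /* L9 + vadda a0 */
            a1 += l7 + l8;                /* L10 + vadda a1 */
        }
        xi += 2L * nh + 2;  /* ASSUMED skip for v1 after a span of 2*nh */
        hi -= 2L * nh;      /* ASSUMED skip for v2: rewind the coefficients */
        r[ri++] = (short)a0;              /* L11, L13: store the real */
        r[ri++] = (short)a1;              /* L12, L14: store the imaginary */
    }
}
```

Under these assumptions the input window moves forward by two `short` values per output while the coefficient stream restarts at `h[0]`, which is the behavior the two `_vishape` calls appear to describe.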
[Figure: three panels: in-line C code, linear DFG code, and graphical DFG. The graphical DFG has three sections. Pre-outer: vconst, vputa(a0), vputa(a1). Inner loop: vld(v1) x2, vld(v2) x2, vmul x2, vsub, vadd, vadda(a0), vadda(a1). Post-outer: vgeta(a0), vgeta(a1), vst x2.]
Proteus Scheduler Features
• Does not require separate prologue and epilogue code
• Handles nested loop constructs without creating different schedules for different parts of the nested loop construct
• Supports tightly coupled memories (e.g. double-buffered LUT)
• Handles resources (FUs, fabric, queues) for different application optimizations
Hardware File Template

.DataPathBegin
// Functional Units
fu(adder0, adder, 1);
fu(logic0, logic, 1);
fu(sin1, InStream, 1);
... more ...
// Functional Unit Slices
sfu(adder0.0, {vsub, vabs, vnop}, {32,32,16}, 1, 1);
sfu(logic0.0, {vmin, vmax, vif, vnop}, {16,16,32,16}, 1, 1);
sfu(sin1.0, {vld, vnop}, {16}, 0, 1);
... more ...
// Line Queues
queue(Qadder0_0_0, adder0.0, [31..0], 1);
queue(Qlogic0_0_0, logic0.0, [15..0], 1);
queue(Qsin1_0_0, sin1.0, [15..0], 1);
... more ...
.DataPathEnd
.ControlPathBegin
cstep(3)
//
// Operations for each Functional Unit
ctl_ops(adder0, {{vsub.s32.u32.u16}, {vnop}, {vabs.u32.s32.u16}});
ctl_ops(logic0, {{vmax.u16.u16.u32.u16}, {vnop}, {vmin.u16.u16.u32.u16}});
ctl_ops(sin1, {{vld.u16}, {vnop}, {vnop}});
... more ...
// Operands for each Functional Unit Slice Input
ctl_opnds(adder0.0.A, {{Qlogic2_0_0.0}, {vnop}, {Qadder0_0_0.0}});
ctl_opnds(adder0.0.B, {{Qlogic3_0_0.0}, {vnop}, {vnop}});
ctl_opnds(logic0.0.A, {{Qsin1_0_0.0}, {vnop}, {Qlogic0_0_0.0}});
ctl_opnds(logic0.0.B, {{Qsca_z15_0_0.0}, {vnop}, {Qsca_z0_0_0.0}});
ctl_opnds(logic0.0.C, {{vnop}, {vnop}, {vnop}});
... more ...
// Controls for each Queue
ctl_queue(Qadder0_0_0, {{1}, {0}, {1}});
ctl_queue(Qlogic0_0_0, {{1}, {0}, {1}});
... more ...
.ControlPathEnd

[Figure labels: .dfg input, .hw output.]
Proteus HW File features
• Structured intermediate format (IF) for streaming data path
• Facilitates debugging with consistent naming/labels
• Describes resources such as FUs, fabric, queues, LUTs.
• Scalable with new ISA updates
Security & Surveillance
Sek Chai, et al., "Reconfigurable Streaming Architectures for Embedded Smart Camera Applications", Embedded Computer Vision Workshop, New York, June 2006.
N. Bellas, Sek Chai, M. Dwyer, D. Linzmeier, "FPGA implementation of a license plate recognition SoC using automatically generated streaming accelerators", Reconfigurable Architecture Workshop (RAW), April 2006.
Smart Cameras read license plates and compare against database.
Automotive
Smart Cameras detect and track lanes. Warn drivers of hazards and unexpected lane departures.
[Figure: lane detection processing chain in three groups: Scene Calibration, Scene Analysis (to find lines on the road), and Lane Extraction (to find lane and road markings). Blocks: Capture Images, Convert to gray scale, Noise Filter, Perspective Mapping, Edge Detection, Line Detection (Hough Transform), Lane Extraction, Road modeling, Driver Warning.]
S. Chiricescu, S. Chai, K. Moat, B. Lucas, P. May, J. Norris, R. Essick, M. Schuette, “RSVP II: A Next generation Automotive Vector Processor,” in IEEE Intelligent Vehicles Symposium, June 2005
Lens correction
Addresses issues related to filtering and lens-distortion correction for visual communications
N. Bellas, Sek Chai, M. Dwyer, D. Linzmeier, “Real-Time Fisheye Lens Distortion Correction Using Automatically Generated Streaming Accelerators”, IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2009
[Figure: inverse mapping between perspective space and fisheye space, shown on a pixel grid spanning (x-2, y-1) to (x+1, y+1).]
Stereoscopic geometry of Fisheye Lenses
[Figure: rectilinear projection compared with fisheye projection.]
• Fisheye lenses refract the incident light rays towards the central perspective point
Algorithmic steps (A)
From rectilinear to fisheye space coordinates (inverse mapping)
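\[
x_d = \frac{2R}{\pi}\,\frac{\tan^{-1}\!\left(\sqrt{X_c^2 + Y_c^2}\,/\,Z_c\right)}{\sqrt{(Y_c/X_c)^2 + 1}} + x_h
\]
\[
y_d = \frac{2R}{\pi}\,\frac{\tan^{-1}\!\left(\sqrt{X_c^2 + Y_c^2}\,/\,Z_c\right)}{\sqrt{(X_c/Y_c)^2 + 1}} + y_h
\]
\[
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
=
\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}
\begin{bmatrix} i \\ j \\ 1 \end{bmatrix}
\]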
Algorithmic steps (B)
• Approximate pixel values in fractional positions in Fisheye space
• Complex memory access pattern due to non-linear projection trajectory
Algorithmic steps (B): Bicubic interpolation
• Bicubic interpolation uses a 4x4 window of pixels to approximate intermediate points
• Interpolation weights depend on the relative position of the intermediate point
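A minimal C sketch of bicubic interpolation over a 4x4 window follows. The slides do not say which cubic kernel is used; the sketch assumes the common cubic convolution kernel with a = -0.5 (Catmull-Rom), and the names `cubic_weight` and `bicubic_sample` are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Cubic convolution weight for a pixel at distance t from the sample.
 * ASSUMPTION: a = -0.5; the deck does not specify the kernel. */
static double cubic_weight(double t)
{
    const double a = -0.5;
    t = fabs(t);
    if (t <= 1.0)
        return (a + 2.0) * t * t * t - (a + 3.0) * t * t + 1.0;
    if (t < 2.0)
        return a * t * t * t - 5.0 * a * t * t + 8.0 * a * t - 4.0 * a;
    return 0.0;
}

/* Interpolate inside a 4x4 window. win[m][n] holds the pixel at row
 * offset m-1 and column offset n-1 from the pixel just above and to the
 * left of the sample; (fx, fy) is the fractional position in [0, 1). */
double bicubic_sample(const double win[4][4], double fx, double fy)
{
    double sum = 0.0;
    assert(fx >= 0.0 && fx < 1.0 && fy >= 0.0 && fy < 1.0);
    for (int m = 0; m < 4; m++) {
        double wy = cubic_weight(fy - (m - 1));
        for (int n = 0; n < 4; n++)
            sum += win[m][n] * wy * cubic_weight(fx - (n - 1));
    }
    return sum;
}
```

The 16 weights depend only on (fx, fy), so an implementation can look them up or compute them once per output pixel and then apply 16 multiply-accumulates to the window.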
Architectural Optimization Strategies
Use block tiling to correct a block of pixels at a time
Streaming Accelerator
Parallelism extracted by Proteus
• Instruction Level Parallelism (ILP) naturally expressed in streaming
  – About 400 executed instructions/cycle
  – Modulo scheduler with II=2
• Task level parallelism
  – Concurrent
  – Pipeline
Hardware details: Virtex-4, LX-80 FPGA

SPEED
  Clock freq.      62.5 MHz (single clock)
  Throughput       22 frames/sec
AREA
  Logic Slices     11082 (30%)
  DSP48 units      71 (88%)
  BRAMs            109 (54.5%)
  BRAM types (number per type):
    4096x8 (15), 13728x50 (1), 256x16, 512x7, 256x17, 256x7, 256x17, 256x7,
    6864x8 (2), 6864x16 (2), 3432x8 (2), 3432x16 (2)
Performance
[Figure: processing speed (frames/sec, axis 0 to 30) for Cell (only PPE, 1 SPE, 2 SPE, 4 SPE, 8 SPE), Core 2 Quad (1T, 2T, 4T), and FPGA (Virtex-4 LX80). Series: CL optimizations, CL+FL optimizations, Inverse Mapping Amortization. Annotation: bounded by available FPGA SRAM.]
(unpublished results)
Performance
[Figure: module runtime breakdown (0% to 100%) into Inverse Mapping, Bicubic Interpolation, and Low Pass Filter for Cell (only PPE; CL, CL+FL, and IMA with 1, 2, 4, and 8 SPEs), Core 2 Quad (CL and CL+FL with 1T, 2T, and 4T), and FPGA (Virtex-4 LX80). Annotation: floating point intensive.]
(unpublished results)
Research summary
Separation of concerns
– Memory access patterns are defined explicitly by programmer for RSVP and Proteus
Extend stream descriptors
– Compiler generated descriptors that match transfer mechanisms at each level of the memory hierarchy
Plan to open source Proteus tool
[Figure: memory hierarchy: datapaths, local register file, stream register file, cache, memory controller, and external memory, with stream descriptors at each level.]