Data Parallel FPGA Workloads: Software Versus Hardware
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
FPL 2009
2
FPGA Systems and Soft Processors

[Diagram: the computation of a digital system can be implemented either as custom HW (HDL + CAD; weeks to months of effort; faster, smaller, less power) or as software on a configurable soft processor (software + compiler; easier). The two approaches compete.]

Simplify FPGA design: customize the soft processor architecture.
Soft processors are used in 25% of designs [source: Altera, 2009].
Target: data-level parallelism → vector processors.
3
Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
  c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

Each vector instruction holds many units of independent operations:
vr2[0] = vr0[0] + vr1[0], vr2[1] = vr0[1] + vr1[1], …, vr2[15] = vr0[15] + vr1[15]

[Diagram: with 1 vector lane, the vadd executes its 16 element operations one at a time.]
4
Vector Processing Primer (cont.)

[Same code as the previous slide; with 16 vector lanes, all 16 element operations of the vadd execute in parallel.]

16x speedup

Previous work (on soft vector processors):
1. Scalability
2. Flexibility
3. Portability
5
Soft Vector Processors vs HW

[Diagram: custom HW (HDL + CAD; weeks to months of effort; faster, smaller, less power) versus a soft vector processor (software + compiler + vectorizer; easier). The soft vector processor is scalable, fine-tunable, and customizable, with 1 to 16 (or more) vector lanes.]

How much? What is the gap between a soft vector processor and FPGA custom HW? (And how does a scalar soft processor compare?)
6
Measuring the Gap

[Flow: EEMBC benchmarks are implemented on a scalar soft processor, on a soft vector processor, and as custom HW circuits. Each implementation is evaluated for speed and area; the processor results are compared against the HW results to draw conclusions.]
7
VESPA Architecture Design
(Vector Extended Soft Processor Architecture)

[Block diagram: a 3-stage scalar pipeline (Icache, decode, register file, ALU, MUX, writeback) is coupled to a 3-stage vector control pipeline (decode, replicate, hazard check) and a 6-stage vector pipeline (vector register files and writeback stages, 32-bit lanes). Lane 1 has an ALU and memory unit; lane 2 adds a multiplier. All lanes share the Dcache. Legend: pipe stage / logic / storage.]

Supports integer and fixed-point operations [VIRAM].
8
VESPA Parameters

Description                 Symbol  Values      Category
Number of Lanes             L       1,2,4,8,…   Compute Architecture
Memory Crossbar Lanes       M       1,2,…,L     Compute Architecture
Multiplier Lanes            X       1,2,…,L     Compute Architecture
Maximum Vector Length       MVL     2,4,8,…     Instruction Set Architecture
Width of Lanes (in bits)    W       1-32        Instruction Set Architecture
Instruction Enable (each)   -       on/off      Instruction Set Architecture
Data Cache Capacity         DD      any         Memory Hierarchy
Data Cache Line Size        DW      any         Memory Hierarchy
Data Prefetch Size          DPK     < DD        Memory Hierarchy
Vector Data Prefetch Size   DPV     < DD/MVL    Memory Hierarchy
9
VESPA Evaluation Infrastructure

[Toolflow — software side: EEMBC C benchmarks are compiled with GCC and linked (ld) with vectorized assembly subroutines assembled by GNU as into an ELF binary, which runs on an instruction set simulation (scalar μP + VPU). Hardware side: the VESPA Verilog runs through RTL simulation (for cycle counts) and Altera Quartus II v8.1 (for area and clock frequency), and is verified on the TM4 board against the simulator.]

Realistic and detailed evaluation.
10
Measuring the Gap (outline slide, repeated from slide 6)
11
Designing HW Circuits (with simplifying assumptions)

[Diagram: each HW circuit consists of a datapath, control, and a memory request interface to a DDR core.]

Idealized: the cycle count is modelled, the datapath is assumed to be fed at full DDR bandwidth, and execution time is calculated from the data size. Area and clock frequency come from Altera Quartus II v8.1.

These are optimistic HW implementations compared against real processors.
12
Benchmarks Converted to HW

Benchmark     ALMs  DSPs  M9Ks  Clock (MHz)  Cycles
autcor         592    32     1          323    1057
conven          46     0     0          476     226
rgbcmyk        527     0     0          447  237784
rgbyiq         706   108     0          274  144741
ip_checksum    158     0     0          457    2567
imgblend       302    32     0          443   14414

(EEMBC benchmarks, VIRAM ISA, Stratix III 3S200C2.)

VESPA clock: 120-140 MHz. HW clock: 275-475 MHz.
HW advantage: ~3x faster clock frequency.
13
Performance/Area Space (vs HW)

[Scatter plot: HW speed advantage (slowdown vs HW) against HW area advantage (area vs HW), with the optimistic HW at (1,1).]

Scalar soft processor: 432x slower, 7x larger than HW.
Fastest VESPA: 17x slower, 64x larger than HW.

Soft vector processors can significantly close the performance gap.
14
Area-Delay Product

Commonly used to measure efficiency in silicon: it considers both performance and area, and is the inverse of performance-per-area.

Calculated as: (Area) × (Wall Clock Execution Time)
15
Area-Delay Space (vs HW)

[Plot: HW area-delay advantage against HW area advantage, for the scalar soft processor and VESPA with 1, 2, 4, 8, and 16 lanes.]

Scalar soft processor: 2900x worse area-delay than HW.
Best VESPA: 900x worse area-delay than HW.

VESPA has up to 3 times better silicon usage than the scalar soft processor.
16
Reducing the Performance Gap

Previously, VESPA was 50x slower than HW.

Reducing loop overhead — VESPA: decoupled pipelines (+7% speed).
Improving data delivery — VESPA: parameterized cache (2x speed, 2x area); data prefetching (+42% speed).

These enhancements were key to reducing the gap: combined, a 3x performance improvement.
17
Vector Memory Crossbar — Wider Cache Line Size

[Diagram: the scalar core and vector coprocessor (VESPA, 16 lanes, lanes 0-15) connect through the vector memory crossbar to a 4KB Dcache with 16-byte lines.]

vld.w (load 16 sequential 32-bit words)
18
Vector Memory Crossbar — Wider Cache Line Size (cont.)

[Diagram: the same system with a 4x larger Dcache (16KB) and a 4x wider line (64 bytes).]

vld.w (load 16 sequential 32-bit words)

Result: 2x speed for 2x area (reduced cache accesses, plus some prefetching from the wider line).
19
Hardware Prefetching Example

[Diagram — no prefetching: two consecutive vld.w instructions each miss in the Dcache and each pay a 10-cycle penalty to DDR. Prefetching 3 blocks: the first vld.w misses (10-cycle penalty) but fetches 3 extra blocks, so the next vld.w hits.]

42% speed improvement from reduced miss cycles.
20
Reducing the Area Gap (by Customizing the Instruction Set)

FPGAs can be reconfigured between applications.

Observations — not all applications:
1. Operate on 32-bit data types
2. Use the entire vector instruction set

→ Eliminate unused hardware.
21
VESPA Parameters (revisited)

[The parameter table from slide 8, annotated: the per-instruction enables allow subsetting the instruction set, and the lane width W allows reducing the datapath width.]
22
Customized VESPA vs HW

[Plot: HW speed advantage (slowdown vs HW) against HW area advantage, for full, subsetted, and subsetted + width-reduced VESPA configurations.]

Up to 45% area saved with width reduction and instruction subsetting.
23
Summary

VESPA is more competitive with HW design:
  - The fastest VESPA is only 17x slower than HW; the scalar soft processor was 432x slower.
  - Attacking loop overhead and data delivery was key: decoupled pipelines, cache tuning, data prefetching.
  - Further enhancements can reduce the gap more.

VESPA improves the efficiency of silicon usage:
  - 900x worse area-delay than HW, versus 2900x for the scalar soft processor.
  - Subsetting/width reduction can further reduce this to 561x.

Together these enable software implementation of non-critical data-parallel computation.
24
Thank You!

Stay tuned for public release:
1. GNU assembler ported for VIRAM (integer only)
2. VESPA hardware design (DE3-ready)
25
Breaking Down Performance

[Diagram: the same loop (Loop: <work>; goto Loop) shown three times, illustrating the components of performance: (a) iteration-level parallelism across loop iterations, (b) cycles per iteration, and (c) clock period — execution time is (cycles per iteration) × (clock period), spread across the iteration-level parallelism.]

We measure the HW advantage in each of these components.
26
Breakdown of Performance Loss (16-lane VESPA vs HW)

Benchmark     Clock Frequency  Iteration-Level Parallelism  Cycles Per Iteration
autcor                   2.6x                           1x                  9.1x
conven                   3.9x                           1x                  6.1x
rgbcmyk                  3.7x                       0.375x                 13.8x
rgbyiq                   2.2x                       0.375x                 19.0x
ip_checksum              3.7x                         0.5x                  4.8x
imgblend                 3.6x                           1x                  4.4x
GEOMEAN                  3.2x                        0.64x                  8.2x

Cycles per iteration is the largest factor; it was previously worse and was recently improved.
Total: 17x (≈ 3.2x × 0.64x × 8.2x).
27
1-Lane VESPA vs Scalar

Advantages of a 1-lane VESPA over the scalar soft processor:
1. Efficient pipeline execution
2. Large vector register file for storage
3. Amortization of loop control instructions
4. More powerful ISA (VIRAM vs MIPS):
   a. Support for fixed-point operations
   b. Predication
   c. Built-in min/max/absolute instructions
5. Execution in both the scalar core and the vector co-processor
6. Manual vectorization in assembly versus scalar GCC
28
Measuring the Gap

From the EEMBC C benchmarks, three implementations are compared:
  - Scalar: MIPS soft processor, from C (complete & real)
  - VESPA: VIRAM soft vector processor, from vectorized assembly (complete & real)
  - HW: custom circuit for each benchmark, in Verilog (simplified & idealized)
29
Reporting Comparison Results

Performance is wall clock time; area is actual silicon area.

HW Speed Advantage = (Execution Time of Processor) / (Execution Time of Hardware)
HW Area Advantage  = (Area of Processor) / (Area of Hardware)

Comparisons made:
1. Scalar (C) vs HW (Verilog)
2. VESPA (vector assembly) vs HW (Verilog)
30
Cache Design Space — Performance (Wall Clock Time)

[Chart: speedup over the original 4KB cache with 16-byte lines, across cache capacities (4KB-64KB) and line sizes (16B-128B). Selected speedup points: 1.13, 1.37, 1.50, 1.55, 1.68, 1.77, 1.93. Clock frequencies range from 122 MHz to 129 MHz across designs.]

The best cache design almost doubles the performance of the original VESPA.
Cache line size matters more than cache depth (the benchmarks do a lot of streaming).
More pipelining/retiming could reduce the clock frequency penalty.
31
Vector Length Prefetching — Performance

[Chart: speedup versus prefetch amount (none, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, and filt3x3, with the geometric mean. Peak mean speedup is 29%; 1*VL already gives 21%; the best individual benchmark reaches 2.2x; some benchmarks are not receptive; there is no cache pollution.]

1*VL prefetching provides good speedup without tuning; 8*VL is best.
32
Overall Memory System Performance (16-lane VESPA)

[Chart: fraction of total cycles spent in memory unit stalls and in cache misses, for three memory systems: 16-byte line (4KB cache), 64-byte line (16KB), and 64-byte line + prefetching. Annotated values include 67%, 48%, 31%, 15, and 4%.]

The wider line plus prefetching reduces memory unit stall cycles significantly, and eliminates all but 4% of the miss cycles.