Data Parallel FPGA Workloads: Software Versus Hardware
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
FPL 2009
2
FPGA Systems and Soft Processors

[Diagram: the computation of a digital system can be implemented either as custom HW (HDL + CAD; weeks to months of effort; faster, smaller, less power) or as software on a configurable soft processor (software + compiler; easier). The two approaches compete.]

Simplify FPGA design: customize the soft processor architecture.
Soft processors are used in 25% of designs [source: Altera, 2009].
Target: data-level parallelism → vector processors.
3
Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
  c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

Each vector instruction holds many units of independent operations:
vr2[0] = vr0[0] + vr1[0], vr2[1] = vr0[1] + vr1[1], …, vr2[15] = vr0[15] + vr1[15]

[Diagram: with 1 vector lane, the vadd executes its 16 element operations one at a time.]
4
Vector Processing Primer (cont.)

[Same code as the previous slide; with 16 vector lanes, all 16 element operations of the vadd execute in parallel.]

16x speedup

Previous work (on soft vector processors):
1. Scalability
2. Flexibility
3. Portability
5
Soft Vector Processors vs HW

[Diagram: custom HW (HDL + CAD; weeks to months of effort; faster, smaller, less power) versus a soft vector processor (software + compiler + vectorizer; easier). The soft vector processor is scalable, fine-tunable, and customizable, with 1 to 16 (or more) vector lanes.]

How much? What is the gap between a soft vector processor and FPGA custom HW? (And how does a scalar soft processor compare?)
6
Measuring the Gap

[Flow: EEMBC benchmarks are implemented on a scalar soft processor, on a soft vector processor, and as custom HW circuits. Each implementation is evaluated for speed and area; the processor results are compared against the HW results to draw conclusions.]
7
VESPA Architecture Design
(Vector Extended Soft Processor Architecture)

[Block diagram: a 3-stage scalar pipeline (Icache, decode, register file, ALU, MUX, writeback) is coupled to a 3-stage vector control pipeline (decode, replicate, hazard check) and a 6-stage vector pipeline (vector register files and writeback stages, 32-bit lanes). Lane 1 has an ALU and memory unit; lane 2 adds a multiplier. All lanes share the Dcache. Legend: pipe stage / logic / storage.]

Supports integer and fixed-point operations [VIRAM].
8
VESPA Parameters

Description                 Symbol  Values      Category
Number of Lanes             L       1,2,4,8,…   Compute Architecture
Memory Crossbar Lanes       M       1,2,…,L     Compute Architecture
Multiplier Lanes            X       1,2,…,L     Compute Architecture
Maximum Vector Length       MVL     2,4,8,…     Instruction Set Architecture
Width of Lanes (in bits)    W       1-32        Instruction Set Architecture
Instruction Enable (each)   -       on/off      Instruction Set Architecture
Data Cache Capacity         DD      any         Memory Hierarchy
Data Cache Line Size        DW      any         Memory Hierarchy
Data Prefetch Size          DPK     < DD        Memory Hierarchy
Vector Data Prefetch Size   DPV     < DD/MVL    Memory Hierarchy
9
VESPA Evaluation Infrastructure

[Toolflow — software side: EEMBC C benchmarks are compiled with GCC and linked (ld) with vectorized assembly subroutines assembled by GNU as into an ELF binary, which runs on an instruction set simulation (scalar μP + VPU). Hardware side: the VESPA Verilog runs through RTL simulation (for cycle counts) and Altera Quartus II v8.1 (for area and clock frequency), and is verified on the TM4 board against the simulator.]

Realistic and detailed evaluation.
10
Measuring the Gap (outline slide, repeated from slide 6)
11
Designing HW Circuits (with simplifying assumptions)

[Diagram: each HW circuit consists of a datapath, control, and a memory request interface to a DDR core.]

Idealized: the cycle count is modelled, the datapath is assumed to be fed at full DDR bandwidth, and execution time is calculated from the data size. Area and clock frequency come from Altera Quartus II v8.1.

These are optimistic HW implementations compared against real processors.
12
Benchmarks Converted to HW

Benchmark     ALMs  DSPs  M9Ks  Clock (MHz)  Cycles
autcor         592    32     1          323    1057
conven          46     0     0          476     226
rgbcmyk        527     0     0          447  237784
rgbyiq         706   108     0          274  144741
ip_checksum    158     0     0          457    2567
imgblend       302    32     0          443   14414

(EEMBC benchmarks, VIRAM ISA, Stratix III 3S200C2.)

VESPA clock: 120-140 MHz. HW clock: 275-475 MHz.
HW advantage: ~3x faster clock frequency.
13
Performance/Area Space (vs HW)

[Scatter plot: HW speed advantage (slowdown vs HW) against HW area advantage (area vs HW), with the optimistic HW at (1,1).]

Scalar soft processor: 432x slower, 7x larger than HW.
Fastest VESPA: 17x slower, 64x larger than HW.

Soft vector processors can significantly close the performance gap.
14
Area-Delay Product

Commonly used to measure efficiency in silicon: it considers both performance and area, and is the inverse of performance-per-area.

Calculated as: (Area) × (Wall Clock Execution Time)
15
Area-Delay Space (vs HW)

[Plot: HW area-delay advantage against HW area advantage, for the scalar soft processor and VESPA with 1, 2, 4, 8, and 16 lanes.]

Scalar soft processor: 2900x worse area-delay than HW.
Best VESPA: 900x worse area-delay than HW.

VESPA has up to 3 times better silicon usage than the scalar soft processor.
16
Reducing the Performance Gap

Previously, VESPA was 50x slower than HW.

Reducing loop overhead — VESPA: decoupled pipelines (+7% speed).
Improving data delivery — VESPA: parameterized cache (2x speed, 2x area); data prefetching (+42% speed).

These enhancements were key to reducing the gap: combined, a 3x performance improvement.
17
Vector Memory Crossbar — Wider Cache Line Size

[Diagram: the scalar core and vector coprocessor (VESPA, 16 lanes, lanes 0-15) connect through the vector memory crossbar to a 4KB Dcache with 16-byte lines.]

vld.w (load 16 sequential 32-bit words)
18
Vector Memory Crossbar — Wider Cache Line Size (cont.)

[Diagram: the same system with a 4x larger Dcache (16KB) and a 4x wider line (64 bytes).]

vld.w (load 16 sequential 32-bit words)

Result: 2x speed for 2x area (reduced cache accesses, plus some prefetching from the wider line).
19
Hardware Prefetching Example

[Diagram — no prefetching: two consecutive vld.w instructions each miss in the Dcache and each pay a 10-cycle penalty to DDR. Prefetching 3 blocks: the first vld.w misses (10-cycle penalty) but fetches 3 extra blocks, so the next vld.w hits.]

42% speed improvement from reduced miss cycles.
20
Reducing the Area Gap (by Customizing the Instruction Set)

FPGAs can be reconfigured between applications.

Observations — not all applications:
1. Operate on 32-bit data types
2. Use the entire vector instruction set

→ Eliminate unused hardware.
21
VESPA Parameters (revisited)

[The parameter table from slide 8, annotated: the per-instruction enables allow subsetting the instruction set, and the lane width W allows reducing the datapath width.]
22
Customized VESPA vs HW

[Plot: HW speed advantage (slowdown vs HW) against HW area advantage, for full, subsetted, and subsetted + width-reduced VESPA configurations.]

Up to 45% area saved with width reduction and instruction subsetting.
23
Summary

VESPA is more competitive with HW design:
  - The fastest VESPA is only 17x slower than HW; the scalar soft processor was 432x slower.
  - Attacking loop overhead and data delivery was key: decoupled pipelines, cache tuning, data prefetching.
  - Further enhancements can reduce the gap more.

VESPA improves the efficiency of silicon usage:
  - 900x worse area-delay than HW, versus 2900x for the scalar soft processor.
  - Subsetting/width reduction can further reduce this to 561x.

Together these enable software implementation of non-critical data-parallel computation.
24
Thank You!

Stay tuned for public release:
1. GNU assembler ported for VIRAM (integer only)
2. VESPA hardware design (DE3-ready)
25
Breaking Down Performance

[Diagram: the same loop (Loop: <work>; goto Loop) shown three times, illustrating the components of performance: (a) iteration-level parallelism across loop iterations, (b) cycles per iteration, and (c) clock period — execution time is (cycles per iteration) × (clock period), spread across the iteration-level parallelism.]

We measure the HW advantage in each of these components.
26
Breakdown of Performance Loss (16-lane VESPA vs HW)

Benchmark     Clock Frequency  Iteration-Level Parallelism  Cycles Per Iteration
autcor                   2.6x                           1x                  9.1x
conven                   3.9x                           1x                  6.1x
rgbcmyk                  3.7x                       0.375x                 13.8x
rgbyiq                   2.2x                       0.375x                 19.0x
ip_checksum              3.7x                         0.5x                  4.8x
imgblend                 3.6x                           1x                  4.4x
GEOMEAN                  3.2x                        0.64x                  8.2x

Cycles per iteration is the largest factor; it was previously worse and was recently improved.
Total: 17x (≈ 3.2x × 0.64x × 8.2x).
27
1-Lane VESPA vs Scalar

Advantages of a 1-lane VESPA over the scalar soft processor:
1. Efficient pipeline execution
2. Large vector register file for storage
3. Amortization of loop control instructions
4. More powerful ISA (VIRAM vs MIPS):
   a. Support for fixed-point operations
   b. Predication
   c. Built-in min/max/absolute instructions
5. Execution in both the scalar core and the vector co-processor
6. Manual vectorization in assembly versus scalar GCC
28
Measuring the Gap

From the EEMBC C benchmarks, three implementations are compared:
  - Scalar: MIPS soft processor, from C (complete & real)
  - VESPA: VIRAM soft vector processor, from vectorized assembly (complete & real)
  - HW: custom circuit for each benchmark, in Verilog (simplified & idealized)
29
Reporting Comparison Results

Performance is wall clock time; area is actual silicon area.

HW Speed Advantage = (Execution Time of Processor) / (Execution Time of Hardware)
HW Area Advantage  = (Area of Processor) / (Area of Hardware)

Comparisons made:
1. Scalar (C) vs HW (Verilog)
2. VESPA (vector assembly) vs HW (Verilog)
30
Cache Design Space — Performance (Wall Clock Time)

[Chart: speedup over the original 4KB cache with 16-byte lines, across cache capacities (4KB-64KB) and line sizes (16B-128B). Selected speedup points: 1.13, 1.37, 1.50, 1.55, 1.68, 1.77, 1.93. Clock frequencies range from 122 MHz to 129 MHz across designs.]

The best cache design almost doubles the performance of the original VESPA.
Cache line size matters more than cache depth (the benchmarks do a lot of streaming).
More pipelining/retiming could reduce the clock frequency penalty.
31
Vector Length Prefetching — Performance

[Chart: speedup versus prefetch amount (none, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, and filt3x3, with the geometric mean. Peak mean speedup is 29%; 1*VL already gives 21%; the best individual benchmark reaches 2.2x; some benchmarks are not receptive; there is no cache pollution.]

1*VL prefetching provides good speedup without tuning; 8*VL is best.
32
Overall Memory System Performance (16-lane VESPA)

[Chart: fraction of total cycles spent in memory unit stalls and in cache misses, for three memory systems: 16-byte line (4KB cache), 64-byte line (16KB), and 64-byte line + prefetching. Annotated values include 67%, 48%, 31%, 15, and 4%.]

The wider line plus prefetching reduces memory unit stall cycles significantly, and eliminates all but 4% of the miss cycles.