The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005


Page 1

The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras

CARG - June 14, 2005

Page 2

FPGA vs ASIC Flows

[Figure: a circuit design entering the ASIC flow and the FPGA flow]

Advantages of the FPGA flow:
• Reduced cost for low volumes
• Reduced time-to-market
• Programmability affords customization

Designers use FPGAs!

Page 3

Processors and FPGAs

□ Option 1: Off-chip processor
    Increased board area, cost, and latency

□ Option 2: On-chip “hard” processor
    Specialized part, lack of flexibility

□ Option 3: On-chip “soft” processor
    Can implement any number of processors
    Tune each one to meet design constraints

[Figure: for each option, where the Custom Logic and the Processor sit relative to the FPGA]

Page 4

Tuning Processors

Application, design constraints:
• $3, 4 MHz, 800 mW, 2-stage pipeline
• $300, 3.8 GHz, 80 W, 31-stage pipeline

Tuning Soft Processors

Application, design constraints:
• 500 LEs, 40 MHz, 2-stage pipeline
• 1700 LEs, 160 MHz, 6-stage pipeline
• your area, speed, power tradeoff

Automatically Tuning Soft Processors

Page 5

Understanding Soft Processors

• Tuning requires understanding of the soft processor design space
• We implement many processors and study the design space

[Figure: Architecture Description -> Synthesized Processor -> Area, Performance, Power]

Page 6

Don’t we already understand architecture?

• Not completely
    We can evaluate area, power, performance
• Not accurately (rules of thumb)
    FPGA CAD tools are very accurate
• Not in the FPGA domain
    LUTs vs transistors
    Relative speed of RAM & multipliers

Page 7

Goals

1. Develop measurement methodology
2. Populate the design space
3. Compare against industrial soft processor(s)

Page 8

Measurement Methodology

Require a set of metrics:
• Area
• Performance
• Power

[Figure: Circuit Design (RTL) -> FPGA Flow -> Resource Usage, Clock Frequency, Power Estimate]

Page 9

Area

FPGA resources counted:
• Logic Elements (LEs: LUT & flip-flop)
• Multipliers
• Big RAM
• Medium RAM
• Little RAM

Measure physical area in Equivalent LEs (e.g., a 9-bit multiplier is equivalent to 23 LEs in area)
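
The equivalent-LE metric can be sketched as a weighted sum over resource counts. This is only an illustration: the 23-LE weight for a 9-bit multiplier is the one figure given on the slide, the RAM weights are not stated in the talk and must be supplied by the caller, and the function and parameter names are invented here.

```python
# Sketch: fold every FPGA resource into one area number, in equivalent LEs.
# Only the 9-bit multiplier weight (23 LEs) comes from the slide.
# RAM weights are NOT given in the slides, so the caller must supply them.

MULT9_EQUIV_LES = 23  # from the slide: one 9-bit multiplier ~ 23 LEs


def equivalent_les(les, mult9=0, rams=None, ram_weights=None):
    """Return total area in equivalent LEs.

    les         -- number of logic elements used
    mult9       -- number of 9-bit multipliers used
    rams        -- dict: RAM block kind -> blocks used (hypothetical kinds)
    ram_weights -- dict: RAM block kind -> equivalent LEs per block
                   (assumption: supplied by the caller, not by the talk)
    """
    rams = rams or {}
    ram_weights = ram_weights or {}
    missing = set(rams) - set(ram_weights)
    if missing:
        raise ValueError(f"no equivalent-LE weight for: {sorted(missing)}")
    ram_area = sum(count * ram_weights[kind] for kind, count in rams.items())
    return les + mult9 * MULT9_EQUIV_LES + ram_area


if __name__ == "__main__":
    # 1000 LEs plus four 9-bit multipliers: 1000 + 4 * 23 = 1092
    print(equivalent_les(1000, mult9=4))
```

Refusing to guess a RAM weight keeps the sketch from silently under-reporting area when a design uses a block type that has no known conversion.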

Page 10

Performance

Wall Clock Time = #Cycles * Clock Period
• #Cycles: from RTL simulation, averaged over 20 benchmarks
• Clock Period: from the CAD tool

Benchmarks, by source:
• MiBench: bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
• Freescale: Dhrystone 2.1
• Xirisc: bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
• RATEs: dct, gol
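
A minimal sketch of the performance metric, under two assumptions: that "averaged" means the arithmetic mean across benchmarks (the slide does not say which mean), and that one clock frequency applies to every benchmark on a given processor. The benchmark names and cycle counts below are placeholders, not measurements from the talk.

```python
# Sketch: wall clock time = cycle count * clock period, averaged over
# benchmarks. The arithmetic mean is an assumption; the slide only says
# "averaged".


def wall_clock_time_us(cycles, fmax_mhz):
    """Cycle count (from RTL simulation) divided by clock frequency in MHz
    (from the CAD tool) gives execution time in microseconds."""
    if fmax_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / fmax_mhz  # 1 MHz = 1 cycle per microsecond


def average_wall_clock_time_us(cycle_counts, fmax_mhz):
    """Arithmetic mean of the per-benchmark wall clock times."""
    times = [wall_clock_time_us(c, fmax_mhz) for c in cycle_counts.values()]
    return sum(times) / len(times)


if __name__ == "__main__":
    # Hypothetical cycle counts, not data from the talk.
    counts = {"bench_a": 100_000, "bench_b": 300_000}
    print(average_wall_clock_time_us(counts, fmax_mhz=50.0))  # 4000.0
```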

Page 11

Power

CAD tool can estimate power from an assumed toggle ratio (derived experimentally)

Total Dynamic Power (mW) ÷ Clock Frequency (MHz) = Dynamic Energy per cycle, excluding I/O (nJ/cycle)
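
The unit arithmetic behind this metric can be checked with a short sketch: milliwatts divided by megahertz gives nanojoules per cycle. The function names and the sample numbers are illustrative assumptions, not values from the talk.

```python
# Sketch: dynamic energy per cycle from a CAD-tool power estimate.
# mW / MHz = (1e-3 J/s) / (1e6 cycles/s) = 1e-9 J/cycle = nJ/cycle.


def energy_per_cycle_nj(dynamic_power_mw, fmax_mhz):
    """Dynamic energy per cycle in nJ, given power in mW and clock in MHz."""
    if fmax_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dynamic_power_mw / fmax_mhz


def total_energy_uj(dynamic_power_mw, fmax_mhz, cycles):
    """Dynamic energy of a whole run in microjoules (1 uJ = 1000 nJ)."""
    return energy_per_cycle_nj(dynamic_power_mw, fmax_mhz) * cycles / 1000.0


if __name__ == "__main__":
    # Hypothetical: 20 mW at 80 MHz is 0.25 nJ/cycle.
    print(energy_per_cycle_nj(20.0, 80.0))
```

Dividing by frequency removes the clock rate from the comparison, so a faster design is not penalised merely for doing the same work in less time.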

Page 12

Metrics summary

Require the following information:
1. Resource Usage (area, from the CAD tool)
2. Clock Frequency (wall clock time, from the CAD tool)
3. Power Estimate (energy/cycle, from the CAD tool)
4. Cycle Count (wall clock time, from the RTL simulator)

Page 13

RTL-based Design Space Exploration

Complete and accurate understanding of design space

[Figure: the Circuit Design (RTL) and the Benchmarks feed the RTL Simulator, which reports 1. Correctness and 2. Cycle Count; the Circuit Design (RTL) also feeds the CAD Tool, which reports 3. Area, 4. Clock Frequency, and 5. Power]

Page 14

Goals

1. Develop measurement methodology
2. Populate the design space
3. Compare against industrial soft processor(s)

Page 15

Microarchitectural Design Space Exploration

Need fast route to RTL from architectural idea

[Figure: the Circuit Design (RTL) and the Benchmarks feed the RTL Simulator, which reports 1. Correctness and 2. Cycle Count; the Circuit Design (RTL) also feeds the CAD Tool, which reports 3. Area, 4. Clock Frequency, and 5. Power]

Page 16

SPREE (Soft Processor Rapid Exploration Environment)

[Figure: the SPREE RTL Generator produces the RTL; the RTL and the Benchmarks feed the RTL Simulator (1. Correctness, 2. Cycle Count), and the RTL feeds the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]

Page 17

Goals

1. Develop measurement methodology
2. Populate the design space (SPREE)
    1. Rapidly
    2. With interesting designs
    3. Accurately (minimize overhead)
3. Compare against industrial soft processor(s)

Page 18

Related Work

• Parametrized Cores
    Narrow design space, laborious changes to control
• Architecture Description Languages (ADLs)
    Too robust, inaccurate (simulator based, or behavioural RTL)
• PEAS-III/ASIPMeister [Itoh2000]
    Non-FPGA specific, ISA design focus

Page 19

SPREE RTL Generator Overview

[Figure: ISA Description, Datapath Description, and Component Library -> SPREE RTL Generator -> Efficiently Synthesizable RTL]

• Interesting: allows for interesting architectures
• Rapidly: simple descriptions
• Accurately: efficient component implementations

Page 20

Some current limitations

• No caches (use fast on-chip RAM)
• Simple in-order issue pipelines
• No dynamic branch prediction
• No OS or exceptions support
• No ISA changes!
    Need compiler generation to support
    Use subset of MIPS-I

Page 21

Architecture Input

[Figure: the Component Library, shown with the datapath components Ifetch, Reg File, ALU, Mul, Data Mem, and WriteBack]

Page 22

Architecture Input

[Figure: components from the Component Library (Ifetch, Regfile, ALU, Mul, Data Mem, WriteBack) connected to form the Datapath Description]

Page 23

Architecture Input

[Figure: the ISA Description, Datapath Description, and Component Library feed the SPREE RTL Generator, which adds Decode logic to the datapath (IF, Regfile, ALU, Mul, Data Mem, WriteBack)]

• Control generation saves time and is non-critical

Page 24

Architecture Input: ISA Description

• Generic Operations (GENOPs)
• MIPS instructions made of GENOPs

GENOPs shown: FETCH, RFREAD, ADD, RFWRITE

MIPS ADD (add rd, rs, rt):
[Figure: FETCH -> RFREAD, RFREAD -> ADD -> RFWRITE]
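
One way to picture an instruction built from GENOPs is as a small dataflow graph. The sketch below is only an illustration of that idea: the dictionary encoding, the `_rs`/`_rt` node suffixes, and the helper names are hypothetical and are not SPREE's actual input format.

```python
# Sketch: a MIPS instruction described as a dataflow graph of GENOPs.
# The encoding (node -> list of predecessor nodes) is hypothetical and is
# not SPREE's real ISA description syntax.
from graphlib import TopologicalSorter

# add rd, rs, rt: fetch, read rs and rt, add them, write the result to rd.
MIPS_ADD = {
    "FETCH": [],
    "RFREAD_rs": ["FETCH"],
    "RFREAD_rt": ["FETCH"],
    "ADD": ["RFREAD_rs", "RFREAD_rt"],
    "RFWRITE": ["ADD"],
}


def genop_kind(node):
    """Strip the hypothetical operand suffix: 'RFREAD_rs' -> 'RFREAD'."""
    return node.split("_")[0]


def genops_used(instruction):
    """Set of GENOP kinds the datapath must provide for this instruction."""
    return {genop_kind(node) for node in instruction}


def execution_order(instruction):
    """One valid order in which the GENOPs can be carried out."""
    return list(TopologicalSorter(instruction).static_order())


if __name__ == "__main__":
    print(sorted(genops_used(MIPS_ADD)))
    print(execution_order(MIPS_ADD))
```

Keeping the description as a graph rather than a fixed sequence leaves the generator free to place each GENOP in whatever pipeline stage the datapath description dictates.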

Page 25

Complete Experimental Framework Using SPREE

[Figure: the Component Library, ISA Description, and Datapath Description feed the SPREE RTL Generator; its RTL and the Benchmarks feed the RTL Simulator (1. Correctness, 2. Cycle Count), and the RTL feeds the CAD Tool (3. Area, 4. Clock Frequency, 5. Power). One input carries the label FIXED.]

Page 26

Goals

1. Develop measurement methodology (Area, Performance, Power)
2. Populate the design space (SPREE)
3. Compare against industrial soft processor(s)

Page 27

Altera’s NiosII

• Second generation soft processor
• Has three variations:
    NiosIIe: unpipelined, no hardware multiply
    NiosIIs: 5 stages, no branch prediction
    NiosIIf: 6 stages, dynamic branch prediction
• Caveats
    Supports exceptions, OS, and caches
    Very similar but tweaked ISA

Page 28

Design Space vs NiosII Variations

[Figure: scatter plot of Average Wall Clock Time (us) versus Area (Equivalent LEs). Series: Generated Designs, Altera NiosIIe, Altera NiosIIs, Altera NiosIIf]

Page 29

Summary

1. We span the design space
2. Remain competitive
    Achieved 9% faster and 11% smaller than NiosIIs
    => don’t suffer from prohibitive overhead

Let’s explore some architecture!

Page 30

Architectural Axes

1. Hardware vs Software Multiplication
2. Shifter implementation
3. Pipeline
    Depth
    Organization
    Forwarding

Page 31

Hardware vs Software Multiplication

• Hardware multiplication
    Increases area & power consumption
    Speeds up execution
• BUT …
    Not all applications care about speed
    Not all applications use multiplication (significantly)

Page 32

Cycle Count Speedup of Hardware Multiplication

[Figure: bar chart of Cycle Count Speedup per benchmark. Benchmarks in axis order: dijkstra, dhry, qsort, fir, FFT, dct, quant, fft, iquant. Bar labels in the same order: 1.01, 1.03, 1.04, 1.39, 2.72, 3.00, 4.53, 6.94, 7.87]

Must understand its cost/benefit to decide when to use

Page 33

Cost of Hardware Multiply

[Figure: scatter plot of Average Wall Clock Time (us) versus Area (Equivalent LEs). Series: Multiply Full Hardware Support, Multiply Software Routine, Altera NiosIIe, Altera NiosIIs, Altera NiosIIf]

~250 LEs (20%), 35% more Energy/cycle
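
The energy side of this cost/benefit question can be worked through with a short sketch. It assumes, beyond what the slides state, that the 35% energy-per-cycle increase applies uniformly to every benchmark and that only dynamic energy matters. Under those assumptions hardware multiply lowers total energy only when the cycle-count speedup exceeds 1.35. The speedup values are the bar labels from the Cycle Count Speedup chart.

```python
# Sketch: when does hardware multiply pay for itself in energy?
# Assumptions (not from the talk): the 35% energy/cycle increase is the
# same for every benchmark, and only dynamic energy is counted.

ENERGY_PER_CYCLE_FACTOR = 1.35  # from the slide: 35% more energy/cycle

# Bar labels read off the Cycle Count Speedup chart.
SPEEDUPS = [1.01, 1.03, 1.04, 1.39, 2.72, 3.00, 4.53, 6.94, 7.87]


def relative_energy(speedup, factor=ENERGY_PER_CYCLE_FACTOR):
    """Energy with hardware multiply divided by energy with software
    multiply. Energy = energy-per-cycle * cycles, and the cycle count
    shrinks by `speedup`."""
    return factor / speedup


def saves_energy(speedup):
    """True when hardware multiply lowers total dynamic energy."""
    return relative_energy(speedup) < 1.0


if __name__ == "__main__":
    winners = [s for s in SPEEDUPS if saves_energy(s)]
    print(len(winners), "of", len(SPEEDUPS), "speedups clear the 1.35 bar")
```

By this estimate the three benchmarks with speedups near 1.0 would spend more energy with the hardware multiplier than without it, which is the "not all applications use multiplication" point in numeric form.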

Page 34

Shifter Implementations

• Shifters (multiplexers) are big in FPGAs
• Consider 3 implementations:
    Serial shifter
    LUT-based barrel shifter
    Multiplier-based barrel shifter

Page 35

Impact of Shifter Implementation

[Figure: Average Wall Clock Time (us) versus Area (Equivalent LEs) for the Serial, Multiplier-based, and LUT-based shifters. Series: 2-stage, 3-stage, 4-stage, 5-stage, 7-stage]

Consistent across different pipe depths

Page 36

Shifter Implementation Tradeoffs

Shifter                    Area (LEs)   Wall Clock Time (us)   Energy per Cycle (nJ/cycle)
Serial                     1035         3458                   0.2114
Multiplier-based barrel    1102         1945                   0.2174
LUT-based barrel           1297         1916                   0.2409

Averaged over all pipeline depths:
• Smallest: Serial
• Fastest: LUT-based barrel
• Most energy efficient: Serial

The multiplier-based barrel shifter is a very nice sweet spot
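
The "sweet spot" claim can be checked directly from the three rows of figures on this slide. The sketch below copies those numbers; the dictionary keys and the helper name are chosen here for illustration.

```python
# Sketch: relative area cost and speedup between shifter implementations,
# using the figures from this slide (averaged over all pipeline depths).

SHIFTERS = {
    "serial":     {"area_les": 1035, "time_us": 3458, "nj_per_cycle": 0.2114},
    "multiplier": {"area_les": 1102, "time_us": 1945, "nj_per_cycle": 0.2174},
    "lut":        {"area_les": 1297, "time_us": 1916, "nj_per_cycle": 0.2409},
}


def compare(base, other):
    """Return (area increase in percent, speedup) of `other` over `base`."""
    b, o = SHIFTERS[base], SHIFTERS[other]
    area_pct = 100.0 * (o["area_les"] / b["area_les"] - 1.0)
    speedup = b["time_us"] / o["time_us"]
    return area_pct, speedup


if __name__ == "__main__":
    for base, other in [("serial", "multiplier"), ("multiplier", "lut")]:
        area_pct, speedup = compare(base, other)
        print(f"{base} -> {other}: +{area_pct:.1f}% area, {speedup:.3f}x speed")
```

Moving from the serial to the multiplier-based shifter costs about 6.5% more area for roughly a 1.78x speedup, while going on to the LUT-based shifter costs about 17.7% more area for a further speedup of only about 1.5%.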

Page 37

Pipelines - Depth

• Study different pipeline depths
    Over 3 shifters
• Arrows = possible forwarding lines (not used)
• All use predict not-taken branches

Page 38

Pipelining & clock frequency

[Figure: bar chart of Frequency (MHz) for the Serial, Mul-based, and LUT-based shifters and their AVERAGE. Series: 2-stage, 3-stage, 4-stage, 5-stage, 7-stage]

Page 39

Impact of Pipelining

[Figure: Average Wall Clock Time (us) versus Area (Equivalent LEs) for the Serial, Multiplier-based, and LUT-based shifters. Series: 2-stage, 3-stage, 4-stage, 5-stage, 7-stage]

Adds area, can increase speed (2 to 3 stage?)

Page 40

FPGA Nuance: Synchronous RAMs

2-stage Pipeline

[Figure: 2-stage pipeline over Ifetch, Regfile, ALU, Mul, Data Mem, WriteBack]

Stall on all loads, and any operand fetches

Page 41

3-stage Pipeline

[Figure: 3-stage pipeline over Ifetch, Regfile, ALU, Mul, Data Mem, WriteBack]

Fewer stalls, increased frequency => Big speedup (1.7x)
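
A speedup like this has two independent sources, and a short sketch shows how they multiply. The function name and the sample numbers are hypothetical; the talk does not break the 1.7x figure into its cycle-count and frequency parts, and the numbers below are not meant to reproduce it.

```python
# Sketch: wall-clock speedup as the product of a cycle-count ratio
# (fewer stalls) and a clock-frequency ratio (shorter critical path).


def wall_clock_speedup(cycles_old, fmax_old_mhz, cycles_new, fmax_new_mhz):
    """Speedup of the new design over the old one.

    time = cycles / fmax, so
    speedup = (cycles_old / cycles_new) * (fmax_new / fmax_old).
    """
    return (cycles_old / cycles_new) * (fmax_new_mhz / fmax_old_mhz)


if __name__ == "__main__":
    # Hypothetical numbers only: 1.25x fewer cycles and 1.25x faster clock.
    print(wall_clock_speedup(1000, 40.0, 800, 50.0))  # 1.5625
```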

Page 42

3, 4 and 5 stage pipelines

Increased area, small change in performance
=> Deeper pipelines have potential for better speedups

[Figure: Average Wall Clock Time (us) versus Area (Equivalent LEs) for the Serial, Multiplier-based, and LUT-based shifters. Series: 2-stage, 3-stage, 4-stage, 5-stage, 7-stage]

Page 43

The 7-stage Pipeline

Where Branch Delay Slots break down

The ideal case:
[Figure: pipeline holding BEQ, OR, JR, ADD, X, X; annotation: never squash this stage]

Page 44

Problem: Separation of Branch and Branch Delay Slot

[Figure: pipeline holding BEQ, ADD, JR; annotation: stalls on RAW hazard]

Page 45

Problem: Separation of Branch and Branch Delay Slot

[Figure: pipeline holding BEQ, ADD, JR, NOP, X]

Must track and protect delay slots

Page 46

Multiple Delay Slots

• Must detect separation of branch from delay slot
• OR prevent multiple delay slots
    Stall branch if a delay slot exists in the pipe
    We did this one (+30 LEs, -15% clock frequency)

[Figure: pipeline holding BEQ, OR, JR, ADD]

Can’t guard all delay slots

Better off eliminating delay slots (currently researching)

Page 47

Pipeline organization

• Where stages are placed is important
• Pipe stage placement can
    Result in an all-around “win/loss”
    Present a tradeoff

[Figure: Wall Clock Time (us) versus Area (LEs) for the Serial, Mul-based, and LUT-based shifters. Series: 4-Stage (H), 4-Stage (B)]

Page 48

Forwarding

SPREE supports stage to stage forwarding

[Figure: pipeline over Ifetch, RegFile, ALU, Mul, Data Mem, Write Back with two forward lines, one for rs and one for rt]
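
The decision a single forward line makes can be sketched as follows. This is a generic textbook model of operand forwarding in an in-order pipeline, not SPREE's generated control logic; the function and parameter names are invented for illustration. One such check would exist per forwarded operand (rs, rt, or both).

```python
# Sketch: what one forward line decides for one source operand.
# Generic in-order pipeline model; not SPREE's actual control logic.


def select_operand(src_reg, regfile_value, producer_dest, producer_value,
                   producer_writes, forwarding_enabled=True):
    """Return (value, stall) for one source operand (rs or rt).

    producer_* describe an older instruction still in flight.
    MIPS register 0 is hard-wired to zero, so it never causes a hazard.
    """
    hazard = producer_writes and producer_dest == src_reg and src_reg != 0
    if not hazard:
        return regfile_value, False      # register file copy is current
    if forwarding_enabled:
        return producer_value, False     # bypass the register file
    return None, True                    # no forward line: must stall


if __name__ == "__main__":
    # rs = $5, and the instruction ahead is about to write 99 into $5.
    print(select_operand(5, 10, 5, 99, True))                            # (99, False)
    print(select_operand(5, 10, 5, 99, True, forwarding_enabled=False))  # (None, True)
```

Treating rs and rt as separately switchable lines matches the four configurations compared in the talk: no forwarding, forward rs, forward rt, and forward both.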

Page 49

Effect of Forwarding

[Figure: Average Wall Clock Time (us) versus Area (Equivalent LEs) for four configurations: no forwarding, forward rt, forward rs, forward rs&rt. Series: 3-stage, 4-stage, 5-stage]

20% speed increase

Page 50

An Aside: ISA Subsetting

Applications don’t generally use all instructions

[Figure: "ISA Usage In Each Benchmark", bar chart from 0% to 100%. Benchmarks: bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT, dijkstra, patricia, gol, dct, dhry, AVERAGE]

Page 51

Processor reduction

• Can strip away unused components/control
• Generator supports instruction disabling
    Automatically strips away unused components
    Create an Application Specific processor
    Do this for each benchmark

FPGAs are a good platform for this!
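
The stripping step can be pictured as a set computation: keep every component that some enabled instruction needs and drop the rest. The instruction-to-component map below is a made-up example for illustration, not SPREE's component library, and the function name is hypothetical.

```python
# Sketch: derive an application-specific processor by disabling unused
# instructions. The instruction -> component map is hypothetical.

COMPONENTS_NEEDED = {
    "add":  {"ifetch", "regfile", "alu"},
    "lw":   {"ifetch", "regfile", "alu", "data_mem"},
    "sll":  {"ifetch", "regfile", "shifter"},
    "mult": {"ifetch", "regfile", "mul"},
}


def subset_processor(used_instructions, components_needed=COMPONENTS_NEEDED):
    """Return (kept, removed) component sets for one application."""
    unknown = set(used_instructions) - set(components_needed)
    if unknown:
        raise ValueError(f"unknown instructions: {sorted(unknown)}")
    all_components = set().union(*components_needed.values())
    kept = set().union(*(components_needed[i] for i in used_instructions))
    return kept, all_components - kept


if __name__ == "__main__":
    # A benchmark that never shifts or multiplies loses both units.
    kept, removed = subset_processor({"add", "lw"})
    print(sorted(kept), sorted(removed))
```

Running this once per benchmark mirrors the "do this for each benchmark" step: each application gets its own kept set, and therefore its own processor.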

Page 52

Area of a Subsetted Processor

[Figure: "Area Measurements for a Processor Subsetted Over Benchmark Set", bar chart of Area (LEs) per Processor. Processors: ORIGINAL, bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT, dijkstra, patricia, gol, dct, dhry, AVERAGE]

Page 53

Speed of a Subsetted Processor

[Figure: "Fmax Measurements for a Processor Subsetted Over Benchmark Set", bar chart of Fmax (MHz) per Processor. Axis labels as extracted: cycles, bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT, dijkstra, patricia, gol, dct, dhry, AVERAGE]

Page 54

Conclusion

• Understanding architectural trade-offs
    => Maximize efficiency
• Developed SPREE & measurement methodology
• Performed preliminary architectural study
    Quantified cost of hardware multiplication
    Explored shift unit implementations
    Explored pipelines: depth, organization, forwarding