Overlays: a solution paradigm for FPGA high-level design?


Tarek S. Abdelrahman

The Edward S. Rogers Department of Electrical and Computer Engineering

University of Toronto

tsa@ece.utoronto.ca

Reconfigurable Systems on the Rise

•  FPGAs are increasingly integrated in computing systems
  –  Massive parallelism can lead to high performance
  –  Lower power
  –  Customizability
•  Newer generation of high-performance systems integrate FPGAs with multicores, targeting data centers
  –  Example systems from Intel, IBM and Xilinx
  –  Used mainly by software developers

16-07-11 2


FPGA Programmability Burdens

•  FPGAs are programmed using a hardware design abstraction, which is foreign to the bulk of software developers
  –  HDL, timing, fitting, seed sweeps, etc.
•  FPGA development tools lead to extremely long development cycles compared to their software counterparts
  –  A large circuit can take days to compile (synthesis, place, route, time, etc.) and may need several compiles
•  There is a pressing need to alleviate these burdens and make FPGA design accessible to software developers

Tackling the Burden

•  High-Level Synthesis (HLS)
  –  Generated hardware increasingly competitive with HDL design
•  High-level programming models
  –  Dataflow model from Maxeler
•  Nonetheless:
  –  Developer remains exposed to various aspects of hardware design
  –  Use of FPGA design tools is still required! ⇒ long development cycles

Overlays

•  Pre-compiled FPGA circuits that are in themselves configurable/programmable, i.e., run-time configurable
  –  Examples: soft processors, GPU-on-FPGA, mesh-of-FUs, etc.

[Figure: three example overlays: a soft processor; a soft GPGPU (source: Andryc et al., "FlexGrip: A Soft GPGPU for FPGAs", FPT13); and a 3x3 mesh of PEs.]

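FPGA vs. Overlay Design Flows

[Figure: two design flows side by side. FPGA flow: Application (HDL) → FPGA design tools → bitstream → FPGA. Overlay flow: Application (C, CUDA, DFG, etc.) → application-to-overlay tools → configuration stream → pre-compiled overlay on the FPGA. Annotations on the flows: hours/days, seconds, µseconds; harder vs. simpler.]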

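Mesh-of-FUs Overlays [FPL2013]

[Figure: the overlay is a 4-NN connected array of cells; each cell contains a functional unit (e.g., ADD, SUB, MUL, DIV, EXP, SHF) and routing logic. Beside it, an example data flow graph with inputs I1 to I6, operation nodes ADD, SUB, SUB, MUL and DIV (labeled A to E), and outputs O1 and O2.]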

Mapping DFGs to Overlay – Place

[Figure: the data flow graph next to the overlay. Each DFG operation node (A to E) is placed on an overlay cell with a matching functional unit; the inputs I1 to I6 and the outputs O1 and O2 are shown around the overlay.]
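The placement step can be illustrated with a minimal Python sketch. It is not the tool from [FPL2013]: the DFG edges are hypothetical (the transcript does not preserve the figure's connectivity), the 3x3 functional layout is the one listed on the routing slide, and exhaustive search stands in for whatever heuristic a real placer would use on a large mesh.

```python
from itertools import product

# Functional layout of a 3x3 overlay (rows of FU types), as listed on the slides.
MESH = [
    ["ADD", "SUB", "SUB"],
    ["ADD", "MUL", "ADD"],
    ["DIV", "EXP", "SHF"],
]

# Hypothetical DFG: node name -> operation type (connectivity is assumed).
DFG_OPS = {"A": "ADD", "B": "SUB", "C": "SUB", "D": "MUL", "E": "DIV"}
# Hypothetical producer -> consumer edges between operation nodes.
DFG_EDGES = [("A", "D"), ("B", "D"), ("C", "E"), ("D", "E")]


def manhattan(p, q):
    """Distance between two cells in a 4-nearest-neighbor mesh."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def place(mesh, ops, edges):
    """Try every legal assignment of nodes to cells; return the cheapest."""
    cells_by_type = {}
    for r, row in enumerate(mesh):
        for c, fu in enumerate(row):
            cells_by_type.setdefault(fu, []).append((r, c))

    nodes = list(ops)
    candidates = [cells_by_type.get(ops[n], []) for n in nodes]
    best, best_cost = None, None
    for combo in product(*candidates):
        if len(set(combo)) != len(combo):
            continue  # two nodes would share one cell
        placement = dict(zip(nodes, combo))
        cost = sum(manhattan(placement[u], placement[v]) for u, v in edges)
        if best_cost is None or cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost


placement, cost = place(MESH, DFG_OPS, DFG_EDGES)
print(placement, cost)
```

The cost function (total Manhattan wire length) is one plausible objective; a real placer may also weigh congestion or pipeline balance.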

Mapping DFGs to Overlay – Route

[Figure: the data flow graph next to a 3x3 overlay with the functional layout ADD SUB SUB / ADD MUL ADD / DIV EXP SHF. The DFG edges are routed through the overlay between the placed nodes (A to E), the inputs I1 to I6 and the outputs O1 and O2. Legend: pipeline register/FIFO.]
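As a rough illustration of the routing step, the sketch below routes each DFG edge over the 4-NN mesh with a breadth-first search. Several things here are assumptions rather than facts from the slides: the placement and edge list are hypothetical, every cell is assumed to be able to pass a signal through its routing logic, and each directed link between neighboring cells is assumed to carry at most one signal.

```python
from collections import deque

# Hypothetical placement of DFG nodes on a 3x3 mesh: node -> (row, col).
PLACEMENT = {"A": (1, 0), "B": (0, 1), "C": (0, 2), "D": (1, 1), "E": (2, 0)}
# Hypothetical producer -> consumer edges.
EDGES = [("A", "D"), ("B", "D"), ("C", "E"), ("D", "E")]


def route(rows, cols, placement, edges):
    """Route edges one at a time; a directed link is used by one signal only."""
    used = set()   # directed links (cell, cell) already carrying a signal
    routes = {}
    for src, dst in edges:
        start, goal = placement[src], placement[dst]
        prev = {start: None}          # BFS predecessor map
        queue = deque([start])
        while queue and goal not in prev:
            r, c = queue.popleft()
            for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                if inside and nxt not in prev and ((r, c), nxt) not in used:
                    prev[nxt] = (r, c)
                    queue.append(nxt)
        if goal not in prev:
            raise RuntimeError(f"no free path for {src}->{dst}")
        # Walk the predecessor map back from the goal to recover the path.
        path = [goal]
        while prev[path[-1]] is not None:
            path.append(prev[path[-1]])
        path.reverse()
        used.update(zip(path, path[1:]))
        routes[(src, dst)] = path
    return routes


routes = route(3, 3, PLACEMENT, EDGES)
for edge, path in routes.items():
    print(edge, path)
```

Routing edges greedily in a fixed order is the simplest choice; it can fail on congested meshes where a negotiated or rip-up-and-reroute scheme would succeed.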

Pipelined Execution

[Figure: the same placed and routed DFG on the 3x3 overlay (ADD SUB SUB / ADD MUL ADD / DIV EXP SHF), with inputs I1 to I6 and outputs O1 and O2, executing in pipelined fashion. Legend: pipeline register/FIFO.]

Mesh-of-FUs Tools

•  Application-to-overlay tool chain that:
  –  Extracts DFG of bodies of parallel loops in C code
  –  Places and routes the DFG nodes onto the overlay
    •  Configures the switches to establish DFG connectivity
    •  Generates the configuration stream of the overlay

High Performance with no Hardware Design

•  Example mesh-of-FUs overlay on a Stratix IV [FPL2013]
  –  Single precision floating point operations
  –  288 cells implemented as an 18x16 mesh
  –  fMAX of 312 MHz and 32.4 GFLOPS peak (integer at 415 MHz)

                             Overlay                       HDL
DFG           Size (nodes)   GFLOPS   Compile Time (sec)   GFLOPS   Compile Time (sec)
n-Body        125            18.72    0.44                 21.52    2724
BlackScholes  131            21.22    1.33                 22.10    2508
MatMul        96             19.66    1.05                 25.21    2045
MatMulAdd     114            22.46    3.80                 28.79    919

•  Others also report high performance results
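The trade-off in these results can be made explicit with a few lines of Python. The sketch assumes that the sub-second to few-second compile times belong to the overlay and the 919 to 2724 second compile times to the HDL designs; the transcript lost the column grouping, so this assignment is an inference.

```python
# Per benchmark: (overlay GFLOPS, overlay compile s, HDL GFLOPS, HDL compile s).
# The overlay/HDL column assignment is an assumption (see the lead-in).
RESULTS = {
    "n-Body":       (18.72, 0.44, 21.52, 2724),
    "BlackScholes": (21.22, 1.33, 22.10, 2508),
    "MatMul":       (19.66, 1.05, 25.21, 2045),
    "MatMulAdd":    (22.46, 3.80, 28.79, 919),
}


def compare(results):
    """Return {name: (overlay/HDL throughput ratio, HDL/overlay compile ratio)}."""
    summary = {}
    for name, (ov_gflops, ov_sec, hdl_gflops, hdl_sec) in results.items():
        summary[name] = (ov_gflops / hdl_gflops, hdl_sec / ov_sec)
    return summary


summary = compare(RESULTS)
for name, (perf_ratio, compile_speedup) in summary.items():
    print(f"{name:12s} {perf_ratio:6.1%} of HDL GFLOPS, "
          f"{compile_speedup:8.0f}x faster compile")
```

Under that reading, the overlay reaches roughly 78% to 96% of the HDL throughput while compiling about 240 to 6000 times faster.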

Software-Friendly Target

•  Overlays raise the level of abstraction of using FPGAs to one that is more familiar to software designers
  –  C programming for a soft processor
  –  CUDA/OpenCL for GPU overlays
  –  Dataflow graphs for mesh-of-FUs
•  This opens up opportunities for “standard” software tools to target FPGAs

JIT Compilation to Hardware

•  Profile code
•  Identify hot segments of code
•  Extract DFG and configure the overlay
•  Re-write the code
•  Transfer execution to the overlay

[Figure, repeated for each step: a CPU running the code below next to an FPGA overlay (a mesh of FUs).]

        :
        ADD  R9,R7,R10
        BEQZ end
    L1: ADD  R1,R3,R7
        MULT R11,R12,R13
        ADD  R8,R1,R11
        SUB  R9,R8,#8
        SLT  R8,R9,R7
        BNZ  R8,L1
        ADD  R7,R6,R1
        :

[The DFG of the loop at L1 (ADD, MULT, ADD, SUB, SLT) is extracted and configured onto the overlay. The code on the CPU is then re-written to:]

        :
        ADD  R9,R7,R10
        BEQZ end
    L1: ADD  R7,R6,R1
        :

User-Transparent Dynamic Program Acceleration
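The "extract DFG" step can be sketched for the loop body shown on these slides. This is an illustrative Python sketch, not the prototype's implementation: it assumes a MIPS-like `OP dst,src1,src2` operand order, treats `BNZ` as control flow that stays out of the DFG, and ignores loop-carried dependences.

```python
# Loop body from the slides (the instructions between L1 and the branch back).
LOOP_BODY = """
ADD R1,R3,R7
MULT R11,R12,R13
ADD R8,R1,R11
SUB R9,R8,#8
SLT R8,R9,R7
BNZ R8,L1
"""

BRANCHES = {"BNZ", "BEQZ"}  # control-flow opcodes, excluded from the DFG


def extract_dfg(text):
    """Build (nodes, edges, live_ins) from straight-line three-operand code."""
    nodes, edges, live_ins = [], [], set()
    last_writer = {}  # register -> index of the node that last wrote it
    for line in text.strip().splitlines():
        op, operands = line.split(None, 1)
        if op in BRANCHES:
            continue
        dst, *srcs = [x.strip() for x in operands.split(",")]
        node = len(nodes)
        nodes.append(op)
        for src in srcs:
            if src.startswith("#"):
                continue                      # immediate operand, not a register
            if src in last_writer:
                edges.append((last_writer[src], node))
            else:
                live_ins.add(src)             # value produced outside the loop body
        last_writer[dst] = node
    return nodes, edges, live_ins


nodes, edges, live_ins = extract_dfg(LOOP_BODY)
print(nodes)      # operation of each DFG node, in program order
print(edges)      # producer -> consumer pairs (node indices)
print(sorted(live_ins))
```

The five resulting nodes match the ADD, MULT, ADD, SUB, SLT graph drawn on the overlay in the slides; the live-in registers would become the overlay's inputs.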

A Prototype JIT Compiler

•  Target: Intel QuickAssist Platform
•  The compiler prototype:
  –  Built around LLVM, targets innermost loops of scientific code
  –  Mitigates run-time overhead by moving much of it to compile time
•  Overlay currently being integrated into the target platform

[Figure, after Intel literature: a Xeon multicore processor (CPU, CPU) and system memory, linked over the QPI coherent interconnect to a Stratix FPGA containing the QPI IP and the AFU.]

Acceleration Potential

[Figure: bar chart of speedup (y-axis, 0 to 7) per application (x-axis).]

FPGA simulation results based on measured system parameters

Customizability

•  One of the key advantages of FPGAs is that they can be customized for applications
•  Overlays can also be “user” customizable
  –  With minimal usage of FPGA design tools
•  In the context of our mesh-of-FUs, we can vary the choice of the FU at each location of the mesh, i.e., the functional layout, to make the overlay more efficient for an application

A Library-Based Approach

[Figure: a desired overlay with the functional layout

    A M D D
    A M S S
    A M A M
    A M A M

is assembled by stitching blocks taken from a library of pre-placed and pre-routed overlays, giving the stitched overlay.]

•  Bottom-up flow allows (restricted) relocation of pre-placed and pre-routed groups of cells [FPL2014]
•  Example 12x15 overlay: 35 minutes vs. 15 hours
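The stitching idea can be sketched as a lookup problem: cut the desired functional layout into fixed-size blocks and find each block in a library of pre-built ones. The sketch below is only an illustration under assumptions. The 2x2 block size, the library contents and the block names are hypothetical, and the actual flow in [FPL2014] also has to respect relocation restrictions on the device, which are not modeled here.

```python
# Desired functional layout from the slide, one string per mesh row.
DESIRED = ["AMDD", "AMSS", "AMAM", "AMAM"]

# Hypothetical library of pre-placed, pre-routed 2x2 blocks: name -> rows.
LIBRARY = {
    "add-mul": ("AM", "AM"),
    "div-sub": ("DD", "SS"),
    "mul-add": ("MM", "AA"),
}


def stitch_plan(desired, library, tile=2):
    """Map each tile origin (row, col) to the library block that fills it."""
    by_layout = {layout: name for name, layout in library.items()}
    plan = {}
    for r in range(0, len(desired), tile):
        for c in range(0, len(desired[0]), tile):
            block = tuple(row[c:c + tile] for row in desired[r:r + tile])
            if block not in by_layout:
                raise LookupError(f"no library block for {block} at {(r, c)}")
            plan[(r, c)] = by_layout[block]
    return plan


plan = stitch_plan(DESIRED, LIBRARY)
for origin, name in sorted(plan.items()):
    print(origin, name)
```

A layout that needs a block missing from the library raises `LookupError`; in practice that case would fall back to compiling the missing block with the FPGA design tools once and adding it to the library.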

Program Analysis for Customization

[Figure: a program DFG (nodes of types A, M, S and D) is fed to program analysis, which selects among candidate overlays such as

    A M D D
    A M S S
    A M A M
    A M A M
]

Work-to-be-done
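Since the slide marks this analysis as work to be done, the following Python sketch is purely speculative. It shows one simple check such an analysis might start from: comparing the operation mix of a program DFG against the FU mix of each candidate overlay. The operation list is read off the figure residue, the second candidate is made up, and A, M, S, D presumably stand for add, multiply, subtract and divide.

```python
from collections import Counter

# Operation types of the program DFG nodes (as far as the figure shows them).
PROGRAM_DFG_OPS = ["A", "M", "A", "M", "S", "S", "M", "D", "A", "D", "M", "A"]

# Candidate functional layouts; "add-mul-only" is a made-up second candidate.
CANDIDATES = {
    "mixed": ["AMDD", "AMSS", "AMAM", "AMAM"],
    "add-mul-only": ["AMAM", "AMAM", "AMAM", "AMAM"],
}


def fu_mix(layout):
    """Count how many FUs of each type a layout provides."""
    return Counter("".join(layout))


def feasible(ops, candidates):
    """Names of candidates with at least as many FUs of each type as needed."""
    need = Counter(ops)
    return [
        name
        for name, layout in candidates.items()
        if all(fu_mix(layout)[op] >= count for op, count in need.items())
    ]


print(feasible(PROGRAM_DFG_OPS, CANDIDATES))
```

Having enough FUs of each type is only a necessary condition; whether the DFG can actually be placed and routed on a candidate still has to be checked by the mapping tools.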

System Integration

•  Must be able to virtualize the FPGA
  –  Take snapshots
  –  Migrate
  –  Share and manage as a resource

[Figure: applications built on frameworks such as Spark, Hadoop, GraphLab and TensorFlow run in VMs on servers that each combine CPUs, GPUs and FPGAs.]

Overlays Facilitate Virtualization

•  FPGA virtualization only now being explored
  –  Requires specialized hardware
  –  A very large “state”
•  Overlays naturally have a much smaller state, facilitating snapshots and context switching
  –  We would like to explore this support in our mesh-of-FUs overlay

Challenges to Overlays

•  Resource overhead
  –  That is, the FPGA resources used by the overlay compared to a dedicated circuit (HDL) that implements the same application
  –  ~4X for our FP overlay and can be higher
  –  Difficult to quantify design effort

  –  FPGAs are increasing in size
  –  Hard floating point units
  –  Hardening the overlay once design is over?

Challenges to Overlays – Cont’d

•  Overlay architectures need more exploration: which architecture for a given application domain
  –  How to ensure scalability?
  –  Taking into account the underlying FPGA device constraints
  –  How to implement well (e.g., data-driven execution, FIFOs, etc.)?
  –  Fixed function vs. multi-function FUs?
  –  How to reduce resource overhead?
  –  Time multiplexed?
  –  Multiple devices?

Challenges to Overlays – Cont’d

•  Evolving the FPGA design tools
  –  Modular architectures do not lead to modular circuits
    •  The tools do not understand the modularity
    •  At present we must “fight with them” [FPL2014]
  –  The tools must evolve to let developers express the modularity of the architecture, and to recognize it themselves
    •  Scalable circuits from scalable architectures

Concluding Remarks

•  A case for overlays
  –  Performance, software-friendliness, customizability and system integration
•  They can serve as “middle ground” between hardware design and software programming
  –  Either for production or for debugging and prototyping
•  Challenges to architecture, programming models, implementation and resource overhead

Questions?
