The RISC-V Berkeley Out-of-Order Machine
Post on 14-Oct-2020
UC Berkeley
The RISC-V Berkeley Out-of-Order Machine: An update on BOOM and the wider ecosystem (Chisel, FIRRTL, & Rocket-chip)
Christopher Celio
ORCONF 2016, October
celio@eecs.berkeley.edu
http://ucb-bar.github.io/riscv-boom

An Update on the Berkeley Architecture Research Infrastructure (Oh, and BOOM is cool too.)
Christopher Celio

What is BOOM?
◾ superscalar, out-of-order RISC-V processor written in Berkeley’s Chisel hardware construction language (HCL)
◾ It is synthesizable
◾ It is parameterizable
◾ It is open-source
[Figure: floorplan of the 2-wide BOOM (16kB/16kB), 1.2 mm² @ 45nm. Labeled blocks: RegFile, ICache, Uncore, LSU, Rename Table, FPU, ROB, Free List, Issue Window, Branch Predictor, ALUs, Fetch Buffer, DCache, DCache Control, IDIV, IMUL, Busy Table, Bypasses]
[Figure: 2-wide pipeline diagram. Labels: Fetch, Decode & Rename, Issue Window, Physical Register File (PRF) (5x3), ALU, FPU, LSU, Data cache, ROB, Commit, wakeup, uop tags, inst]

How do you make an OoO processor?
◾ Start with a new hardware construction language
  - Verilog is awful and will just get in the way.
◾ Start with a working processor.
  - Way easier than writing everything from scratch!
  - PTWs, FPUs, uncore, devices, off-chip IOs are unglamorous.

It takes a village.
◾ RISC-V ISA
  - very out-of-order friendly!
◾ Chisel hardware construction language
  - object-oriented, functional programming
◾ FIRRTL (brand new!)
  - exposed RTL intermediate representation (IR)
◾ Rocket-chip
  - A full working SoC platform built around the Rocket in-order core
◾ Thanks to:
  - Krste Asanović, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Palmer Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Ben Keller, Donggyu Kim, Jack Koenig, Jim Lawson, Yunsup Lee, Richard Lin, Eric Love, Martin Maas, Chick Markley, Albert Magyar, Howard Mao, Miquel Moreto, Quan Nguyen, Albert Ou, David A. Patterson, Brian Richards, Colin Schmidt, Wenyu Tang, Stephen Twigg, Huy Vo, Andrew Waterman, Angie Wang, and more...

The Rocket-Chip SoC Generator Ecosystem
◾ BOOM is a piece of the Rocket-chip ecosystem
◾ Started in 2011
◾ taped out 10 (12?) times by Berkeley
◾ runs at 1.65 GHz in IBM 45nm
◾ MMU supports page-based virtual memory
◾ IEEE 754-2008 compliant FPU
  - supports SP, DP FMA with hw support for subnormals
◾ cache coherent, non-blocking L1 data cache, L2 cache, and more
https://github.com/ucb-bar/rocket-chip
[Figure: Rocket 5-stage pipeline]
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html

BOOM fits into Rocket-chip SoC
[Figure: Rocket-chip SoC diagram with blocks labeled BOOM, Rocket, Rocket]

Rocket-chip Updates
◾ The devs graduated!
  - started the SiFive start-up to support RISC-V and rocket-chip, and to design custom chips around them
◾ work in progress on a new Rocket-chip Foundation
◾ new generation of Berkeley student devs!
  - Berkeley is committed to open-source Rocket-chip
  - (our research depends on it!)
◾ 37 contributors
  - SiFive, Berkeley, LowRISC, Boston U., and more
[Figure: "the vision..." Rocket-chip platform diagram, from https://dev.sifive.com/documentation/freedom-u500-platform-guide/]

Rocket-chip Updates
◾ committed to open-source since Oct 2011
  - 3588 commits!!!
  - 310 commits in the last four weeks!
◾ Better documentation
  - https://dev.sifive.com/documentation/u5-coreplex-series-manual/
◾ removed git submodules
  - speeds up development (reduces merge headaches)
◾ RISC-V Privileged Spec v1.9
◾ uncached loads/stores (memory-mapped IO)
◾ RISC-V External Debug
  - breakpoints
  - stand-alone boot!
  - non-standard HTIF (host/target interface) has been removed!
◾ More configurations
  - RV32 support + M/A/F as options
  - RISC-V Compressed ISA support
  - blocking data cache (MSHRs=0)
◾ Updated to Chisel3
  - No more C++ backend, only Verilog is emitted.
  - we use Verilator in the build system for free and fast simulation.
◾ And a lot more...
  - TileLink updates, multi-clock domains, I/Os, devices, ...
  - I'm having trouble keeping up!

Chisel
◾ Hardware Construction Language embedded in Scala
◾ not a high-level synthesis language
◾ hardware module is a data structure in Scala
◾ Full power of Scala for writing generators
  - object-oriented programming
    - factory objects, traits, overloading
  - functional programming
    - higher-order functions, anonymous functions, currying
[Figure: Chisel tool flow. A Chisel program runs on Scala/JVM ("magic!") and emits Verilator Verilog, FPGA Verilog, or ASIC Verilog; these feed the C++ compiler/Verilator, FPGA tools, and ASIC tools, producing a Verilator/software simulator, FPGA emulation, and GDS layout respectively]

Chisel3 (frontend) and FIRRTL (backend)
◾ Goal
  - turn research-ware into a quality compiler platform
  - open up the IR for hardware designers to write their own IR transform passes
◾ Chisel3
  - embedded in Scala
  - generates FIRRTL RTL code
◾ FIRRTL (Flexible IR for RTL)
  - serves as an IR for hardware (i.e., LLVM for hw!)
  - generates Verilog
◾ Success!
  - can add your own transformations
  - you can throw away Chisel and write your own front-end
  - https://github.com/ucb-bar/chisel3 (alpha version)
  - https://github.com/ucb-bar/firrtl (alpha version)
  - Spec: https://github.com/ucb-bar/firrtl/tree/master/spec

FIRRTL IR Passes
◾ Code coverage
  - how much of my circuit is being exercised?
◾ Scan chain insertion
◾ SRAM
  - FPGAs vs ASIC memories
  - re-sizing (transforming skinny/tall to rectangular)
◾ early statistic gathering (e.g., gate count)
  - synthesis tools take a long time...
◾ Decoupling target time from host time
  - e.g., adding a "stop-the-world" button
◾ assertion support (TBA)
  - turning simulator assertions into a real HW assertion
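
To make the idea of an analysis pass concrete, here is a minimal sketch of "early statistic gathering" over a toy netlist. The `Node` type, the op names, and `count_ops` are hypothetical and invented for illustration; this is not FIRRTL's data model or its pass API, only the general shape of walking an IR and tallying what it contains.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """One primitive operation in a toy IR (hypothetical, not FIRRTL's format)."""
    name: str
    op: str                 # e.g. "and", "add", "mux", "reg"
    args: Tuple[str, ...]   # names of the nodes or ports feeding this one


def count_ops(circuit: List[Node]) -> Counter:
    """Analysis pass: tally how many nodes of each kind the circuit contains."""
    return Counter(node.op for node in circuit)


circuit = [
    Node("t0", "and", ("a", "b")),
    Node("t1", "and", ("c", "d")),
    Node("t2", "mux", ("sel", "t0", "t1")),
    Node("r0", "reg", ("t2",)),
]
stats = count_ops(circuit)
print(dict(stats))
```

A pass like this runs in well under a second on the IR, which is the appeal when the alternative is waiting for a synthesis tool to report a gate count.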

New & Upcoming Chisel Features
◾ parameterized Verilog blackbox support
◾ Analog types
  - represents outside wires that are "not digital"
  - e.g., connecting inout I/Os to blackboxes
◾ multi-clock domain support
  - fixed-ratio has first-class citizen support
  - you provide your own clock domain crossings*
  - e.g., async FIFOs
◾ *async reset
  - TBD, but with an eye towards multi-clock support
◾ DSP support
  - FixedPoint type
  - can provide value ranges, not bitwidths
◾ Annotations
  - allow FIRRTL back-end to get additional information from Chisel

Chisel2
◾ provides compatibility warnings to help migration to Chisel3
◾ migration documentation is available
◾ End of life...?
◾ if you use Chisel2 and depend on it, let us know!

Questions on BAR?

What is BOOM?
◾ superscalar, out-of-order RISC-V processor written in Berkeley’s Chisel RTL
◾ It is synthesizable
◾ It is parameterizable
◾ It is open-source
◾ started in 2012
◾ 10k lines of code
◾ implements RV64G
[Figure: floorplan of the 2-wide BOOM (16kB/16kB), 1.2 mm² @ 45nm]
http://ucb-bar.github.io/riscv-boom

BOOM
◾ PRF
  - explicit renaming
  - holds speculative and committed data
  - holds both x-regs, f-regs
◾ Unified Issue Window
  - holds all instructions
◾ split ROB/issue window design
[Figure: pipeline diagram. Labels: Fetch, Decode & Rename, Rename Map Tables & Freelist, Unified Issue Window, Physical Register File (PRF), FPU, ALU, ROB, Commit]
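
Explicit renaming with a unified physical register file can be sketched as a map table plus a free list. The model below is a simplified illustration, not BOOM's implementation: the `Renamer` class and its method names are hypothetical, it renames one instruction at a time, and it ignores branch snapshots, the busy table, and any special handling of x0.

```python
from collections import deque


class Renamer:
    """Toy explicit-renaming model: map table + free list over a unified PRF."""

    def __init__(self, num_logical, num_physical):
        # Initially logical register i lives in physical register i.
        self.map_table = list(range(num_logical))
        self.free_list = deque(range(num_logical, num_physical))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (pdst, psrcs, stale_pdst)."""
        psrcs = [self.map_table[s] for s in srcs]  # read sources before remapping dst
        stale = self.map_table[dst]                # old mapping, reclaimed at commit
        pdst = self.free_list.popleft()            # IndexError here models "must stall"
        self.map_table[dst] = pdst
        return pdst, psrcs, stale

    def commit(self, stale):
        """Once the instruction commits, the previous mapping of dst is dead."""
        self.free_list.append(stale)


r = Renamer(num_logical=4, num_physical=8)
first = r.rename(dst=1, srcs=[2, 3])    # e.g. add r1, r2, r3
second = r.rename(dst=3, srcs=[1, 1])   # reads the newly renamed r1
print(first, second)
```

Because the old mapping is only returned to the free list at commit, the same physical file can hold both speculative and committed values, which is the property the slide attributes to the PRF.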

Benefits of using Chisel & Rocket-chip
◾ ~10,000 loc in BOOM github repo
◾ + additional ~12,000 loc instantiated from other libraries
  - ~5,000 loc from Rocket core repository
    - 90 (integer ALU)
    - 150 (unpipelined mul/div)
    - 550 (floating point units)
    - 1,000 (non-blocking data cache)
    - 300 (icache)
    - 300 (next line predictor/BTB/RAS)
    - 200 (decoder minimization logic)
    - 200 (page-table walker)
    - 200 (TLB)
    - 400 (control/status register file)
    - 300 (instruction definitions + constants)
  - ~4,500 loc from uncore
    - coherence hubs, L2 caches, networks
  - ~2,000 loc from hardfloat
    - floating point hard units

Parameterized Superscalar
Dual-issue (5r,3w):
    val exe_units = ArrayBuffer[ExecutionUnit]()
    exe_units += Module(new ALUExeUnit(is_branch_unit = true, has_fpu = true, has_mul = true))
    exe_units += Module(new ALUMemExeUnit(fp_mem_support = true, has_div = true))
[Figure: dual-issue datapath. Issue Select, Regfile Read, bypass network, one pipe with ALU, imul, and FPU, one pipe with ALU, div, LSU Agen, and D$, bypassing, Regfile Writeback]
OR
Quad-issue (9r,4w):
    exe_units += Module(new ALUExeUnit(is_branch_unit = true))
    exe_units += Module(new ALUExeUnit(has_fpu = true, has_mul = true))
    exe_units += Module(new ALUExeUnit(has_div = true))
    exe_units += Module(new MemExeUnit())
[Figure: quad-issue datapath. Issue Select, Regfile Read, bypass network, four pipes (ALU; ALU, imul, FPU; ALU, div; LSU Agen, D$), bypassing, Regfile Writeback]
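
One way to see where the (5r,3w) and (9r,4w) figures could come from is to add up per-unit register file ports. The sketch below rests on assumptions that are not stated on the slide: a unit reads two operands unless it contains an FPU (a fused multiply-add needs three), and a unit that combines an ALU with the memory pipe needs two write ports. The `ExeUnit` class and its field names are hypothetical; the rule merely happens to reproduce the slide's numbers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExeUnit:
    """Hypothetical summary of one execution unit's register file needs."""
    name: str
    has_fpu: bool = False      # assumption: FMA needs a third source operand
    shares_mem: bool = False   # assumption: ALU and memory results write back separately

    @property
    def read_ports(self) -> int:
        return 3 if self.has_fpu else 2

    @property
    def write_ports(self) -> int:
        return 2 if self.shares_mem else 1


def regfile_ports(units: List[ExeUnit]) -> Tuple[int, int]:
    """Total (read, write) ports the register file must provide."""
    return (sum(u.read_ports for u in units), sum(u.write_ports for u in units))


dual_issue = [
    ExeUnit("alu+branch+fpu+mul", has_fpu=True),
    ExeUnit("alu+mem+div", shares_mem=True),
]
quad_issue = [
    ExeUnit("alu+branch"),
    ExeUnit("alu+fpu+mul", has_fpu=True),
    ExeUnit("alu+div"),
    ExeUnit("mem"),
]
print(regfile_ports(dual_issue), regfile_ports(quad_issue))
```

The point of the exercise is that the port count falls out of the list of units, so changing the machine width is a matter of editing that list rather than rewiring the register file by hand.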

ARM Cortex-A9 vs. RISC-V BOOM

Category               ARM Cortex-A9                              RISC-V BOOM-2w
ISA                    32-bit ARM v7                              64-bit RISC-V v2 (RV64G)
Architecture           2 wide, 3+1 issue, Out-of-Order, 8-stage   2 wide, 3 issue, Out-of-Order, 6-stage
Performance            3.59 CoreMarks/MHz                         4.61 CoreMarks/MHz
Process                TSMC 40GPLUS                               TSMC 40GPLUS
Area with 32K caches   2.5 mm²                                    1.00 mm²
Area efficiency        1.4 CoreMarks/MHz/mm²                      4.6 CoreMarks/MHz/mm²
Frequency              1.4 GHz                                    1.5 GHz

Caveats: A9 includes NEON; BOOM is 64-bit, has IEEE-2008 fused mul-add
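
The area-efficiency row is simply the performance row divided by the area row. A short check, using only the numbers in the table (the function name is made up for this sketch):

```python
def area_efficiency(coremarks_per_mhz, area_mm2):
    """CoreMarks/MHz/mm^2: performance per MHz divided by area."""
    return coremarks_per_mhz / area_mm2


a9 = area_efficiency(3.59, 2.5)     # ARM Cortex-A9
boom = area_efficiency(4.61, 1.00)  # RISC-V BOOM-2w
print(f"A9: {a9:.1f}  BOOM: {boom:.1f}  ratio: {boom / a9:.1f}x")
```

By this metric the table puts BOOM at roughly three times the A9, subject to the caveats listed with it.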

BOOM Updates
◾ open sourced in Jan 2016
  - http://ucb-bar.github.io/riscv-boom/
◾ ~60 page BOOM Design Specification
  - https://ccelio.github.io/riscv-boom-doc/
◾ used for case study in ISCA 2016 paper (strober.org)
◾ first external contribution (visualizer)
◾ ported to Chisel3/FIRRTL
◾ supports uncached loads, stores (allows for memory-mapped IO)
◾ updated to Privileged Spec v1.9
◾ supports RISC-V External Debug Spec
◾ added High Performance Monitor counters (HPM)
◾ branch predictor improvements
◾ beginning tape-out

Zynq FPGA Repository
◾ BOOM runs on a Xilinx Zynq zc706
◾ https://github.com/ucb-bar/fpga-zynq
◾ Updated to handle latest Privileged Spec v1.9, External Debug spec, self-booting
◾ was "tethered" via the Debug Transport Module, but...
◾ ...just finished up a "Tether Serial Interface" update

Visualization
[Figure: pipeline visualization of dhrystone]
◾ Above shows:
  - SD-to-LBU hazard in dhrystone
  - instructions dependent on LBU wait for LBU to execute
  - instructions not dependent on it issue out-of-order, before LBU executes

I accept contributions!
◾ Visualizer is BOOM's first external contribution
◾ Happy to accept more!
  - code that crashes
  - fixes to code that crashes
  - performance analysis tools
  - debugging or visualization tools
  - performance improvements
  - new features

Branch Predictors
[Figure: pipeline (Fetch, Decode & Rename, Issue Window, Unified Physical Register File, Functional Unit) fetching the instructions below; after the beq at 0x10c, which path should be fetched next?]
Instructions:
    0x100: add r1, r2, r3
    0x104: add r3, r0, r4
    0x108: lw  r2, 0(r3)
    0x10c: beq r1, r2, 0x200
Fall-through path:
    0x110: ld  r2, 0(r1)
    0x114: ld  r3, 4(r1)
    0x118: ld  r4, 8(r1)
    0x11c: ld  r5, 12(r1)
    0x120: ...
Branch target path:
    0x200: add r1, r2, r3
    0x204: add r3, r2, r1
    0x208: sw  r1, 9(r1)
    0x20c: sw  r3, 0(r1)
    ...
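
GShare Predictor
◾ global history
  - track outcome of last N branches
◾ 2-bit saturating counter table
[Figure: gshare lookup. The inst address (0x1000) is hashed with the global history (110110) to index a table of 2-bit counters; the selected entry (11) supplies a "predict taken?" bit and a hysteresis bit. State diagram: counter states 11, 10, 01, 00, split into "predict taken" and "predict not taken"]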
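
A behavioral model makes the lookup concrete. The sketch below is a generic gshare, not BOOM's hardware: the XOR hash, the table size, the weakly-not-taken initial state, and the assumption of 4-byte instructions are all choices made for illustration.

```python
class GShare:
    """Behavioral gshare: global history XOR PC indexes a table of 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # outcomes of the last N branches
        self.counters = [1] * (1 << index_bits)   # start at weakly not-taken (01)

    def _index(self, pc):
        # Drop the low 2 bits (4-byte instructions assumed), then hash with history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # upper bit set means "taken"

    def update(self, pc, taken):
        i = self._index(pc)                       # same index the prediction used
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


bp = GShare(index_bits=4)
mispredicts = 0
for _ in range(20):                 # a branch at 0x1000 that is always taken
    if not bp.predict(0x1000):
        mispredicts += 1
    bp.update(0x1000, True)
print(mispredicts)
```

In this run the predictor is wrong only while the history register is still filling with ones, because each new history value selects a counter that has not been trained yet.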

BOOM's Branch Prediction
◾ next-line predictor (NLP)
  - BTB, BHT, RAS
  - combinational
◾ backing predictor (BPD)
  - global history predictor
  - SRAM (1r/w port)
[Figure: front-end branch prediction datapath. Labels: Fetch1, Fetch2, PC1, PC2, NPC, I$, NLP, BPD, BHT, Target, μDec, Fetch Buffer, ExeBrTarget, TakePC]

GShare Predictor - Fitting it in SRAM
◾ change "2-bit counter" state machine
  - on misprediction, jump from weak to strong
  - 00 -> 01 -> 11
  - this will allow us to reduce the read/write requirements
  - described in Hennessy & Patterson computer architecture textbook
[Figure: 2-bit counter state diagram with states 11, 10, 01, 00, split into "predict taken" and "predict not taken"]
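
The difference from a textbook saturating counter is easiest to see side by side. The slide only spells out the 00 -> 01 -> 11 path; the mirrored transitions on the taken side in the table below are an assumption made by symmetry, and the names are invented for this sketch.

```python
STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0b00, 0b01, 0b10, 0b11


def classic_next(state, taken):
    """Textbook saturating counter: step one state toward the outcome."""
    return min(3, state + 1) if taken else max(0, state - 1)


# Modified machine: a mispredicted weak state jumps to the opposite strong state.
# Only 00 -> 01 -> 11 is given on the slide; the rest is filled in by symmetry.
MODIFIED = {
    (STRONG_NT, False): STRONG_NT, (STRONG_NT, True): WEAK_NT,
    (WEAK_NT, False): STRONG_NT,   (WEAK_NT, True): STRONG_T,
    (WEAK_T, False): STRONG_NT,    (WEAK_T, True): STRONG_T,
    (STRONG_T, False): WEAK_T,     (STRONG_T, True): STRONG_T,
}


def modified_next(state, taken):
    return MODIFIED[(state, taken)]


def run(next_fn, state, outcomes):
    """Apply a sequence of branch outcomes and return the final state."""
    for taken in outcomes:
        state = next_fn(state, taken)
    return state


classic_end = run(classic_next, STRONG_NT, [True, True])     # 00 -> 01 -> 10
modified_end = run(modified_next, STRONG_NT, [True, True])   # 00 -> 01 -> 11
print(format(classic_end, "02b"), format(modified_end, "02b"))
```

In the modified table the next low bit always equals the latest outcome, and the high bit changes only after a misprediction, which is what lets the two bits be stored and accessed separately.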

GShare Predictor - Fitting it in SRAM
◾ p-bit (prediction)
  - read every cycle
  - write on mispredict (value of h-bit)
◾ h-bit (hysteresis)
  - write every branch
  - read on mispredict
[Figure: gshare lookup with the counter entry split into a "predict taken?" bit and a hysteresis bit; inst address 0x1000 hashed with global history 110110]
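
The read/write rules above can be modeled with two separate arrays and a few access counters. This is a sketch under assumptions: the class and field names are hypothetical, and the h-bit is taken to record the branch outcome, which the slide implies but does not state outright.

```python
class SplitCounterTable:
    """p-bits and h-bits kept in separate arrays, with access counters."""

    def __init__(self, entries):
        self.p = [0] * entries
        self.h = [0] * entries
        self.p_reads = self.p_writes = 0
        self.h_reads = self.h_writes = 0

    def predict(self, idx):
        self.p_reads += 1                # the only access on the fetch path
        return bool(self.p[idx])

    def update(self, idx, taken, mispredicted):
        if mispredicted:
            self.h_reads += 1            # h is consulted only after a wrong prediction
            self.p[idx] = self.h[idx]    # p takes the old value of h
            self.p_writes += 1
        self.h[idx] = int(taken)         # assumption: h records the latest outcome
        self.h_writes += 1


table = SplitCounterTable(entries=16)
mispredicts = 0
for taken in [True, True, True, False, True]:
    guess = table.predict(0)
    wrong = guess != taken
    mispredicts += wrong
    table.update(0, taken, wrong)
print(table.p[0], table.h[0], mispredicts)
```

Per branch, the p-array sees one read plus an occasional write, and the h-array sees one write plus an occasional read, so neither array needs to be read and written on every access.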

GShare in single-ported SRAM
  - delayed ghistory update
  - super-scalar predictions
  - ghistory is fetch packet granularity
  - banked p-table
  - reset ghistory on misspeculations
  - update during commit
[Figure: gshare pipeline alongside fetch. Labels: NPC (BP0), IC-Access (BP1), IC-Resp (BP2), npc, hash, paddr, uDec, BTB, I$, Fetch Buffer, valid mask, insts, predictions, p-table, ghistory, wpredict, h-table, br_unit_pc + ghistory, S1, S2, write-buffer, h_out, addr, data]
Figure annotation: holds both rename-table snapshots and bpd snapshot information that can be released once a branch is resolved.

Abstract Branch Predictors
[Figure: abstract predictor pipeline alongside fetch. Labels: NPC (BP0), IC-Access (BP1), IC-Resp (BP2), npc, hash, paddr, uDec, BTB, I$, Fetch Buffer, valid mask, insts, predictions, ghistory, wpredict, S1, S2, prediction tables, addr, data, allocate resources, B-ROB, branch snapshots, Commit/update predictor]
Figure annotations:
◾ global history register
  - update at end of BP1 stage
    - it contains the branch prediction history of the fetch unit (including BTB decisions)
    - compresses entire fetch-packet decision
    - only includes branches, not JAL/JALRs
  - reset on fetch unit redirect (misprediction)
    - new prediction must use the new history
◾ B-ROB holds all inflight branches, and can bypass updates that haven't been committed to incoming predictions.
[Figure: predictor interface. Blocks: imem, predictor pipeline, predictor, predictor (base), br_unit, ROB, Decode/Rename/Dispatch. Signals: req, req_pc, resp, pred_resp (1 per packet), predictions (1 per inst), ras_update, update, kill, ghist, brUnitResp, commit]
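
The global history behavior described in the annotations (one update per fetch packet, nothing recorded for packets without a branch, recovery on a redirect) can be sketched as a speculative register with per-branch snapshots. This is an illustration under assumptions: the class, the integer `tag` identifying an in-flight branch, and the rule that larger tags are younger are all hypothetical, and the recovery step shown is one reasonable reading of "new prediction must use the new history".

```python
class GlobalHistory:
    """Speculative history register that records one bit per fetch packet."""

    def __init__(self, length):
        self.mask = (1 << length) - 1
        self.value = 0
        self.snapshots = {}   # tag -> history as it was before that packet

    def speculate(self, tag, packet_has_branch, predicted_taken):
        """Called once per fetch packet; tags are assumed to increase with age order."""
        if not packet_has_branch:          # branch-free or JAL/JALR-only packet
            return
        self.snapshots[tag] = self.value
        self.value = ((self.value << 1) | int(predicted_taken)) & self.mask

    def resolve(self, tag, taken, mispredicted):
        """Release the snapshot; after a misprediction rebuild history from it."""
        before = self.snapshots.pop(tag)
        if mispredicted:
            self.value = ((before << 1) | int(taken)) & self.mask
            # Younger packets were fetched down the wrong path; drop their snapshots.
            self.snapshots = {t: v for t, v in self.snapshots.items() if t < tag}


g = GlobalHistory(length=8)
g.speculate(0, True, True)     # history becomes 0b1
g.speculate(1, True, False)    # history becomes 0b10
g.speculate(2, True, True)     # history becomes 0b101
g.resolve(1, taken=True, mispredicted=True)
print(bin(g.value), sorted(g.snapshots))
```

After the redirect, the history reflects the corrected outcome of branch 1, and only the older, still-unresolved branch keeps its snapshot.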

BOOM's Branch Predictors
◾ Null
  - predicts not-taken
◾ Random
  - serves as the baseline worst-case predictor
  - useful for testing the pipeline
◾ Simple Gshare
  - demonstrates how to interface with the branch predictor framework
  - not synthesizable
◾ GShare
  - targeting 1r/1w SRAM (dual ported)...
  - ...or 1rw SRAM (banked)
◾ 2bc-GSkew
  - based on the EV8 (Alpha 21464) predictor
  - 1rw (or 1r/1w) SRAM
  - took 12 hours to implement
◾ TAGE
  - a super awesome predictor
  - still in prototyping; WIP to make it synthesizable and scalable
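
The value of an abstract predictor framework is that every predictor plugs into the same interface. The sketch below shows that idea in software; the base class, its two methods, and the `accuracy` harness are hypothetical and far simpler than BOOM's actual hardware interface. Only the two simplest predictors from the list are modeled.

```python
import random
from abc import ABC, abstractmethod


class BranchPredictor(ABC):
    """Hypothetical base class: subclasses supply a prediction and, if needed, training."""

    @abstractmethod
    def predict(self, pc: int) -> bool:
        ...

    def update(self, pc: int, taken: bool) -> None:
        pass   # stateless predictors need no training


class NullPredictor(BranchPredictor):
    def predict(self, pc):
        return False                      # always predicts not-taken


class RandomPredictor(BranchPredictor):
    def __init__(self, seed=0):
        self.rng = random.Random(seed)    # seeded so runs are repeatable

    def predict(self, pc):
        return self.rng.random() < 0.5


def accuracy(predictor, trace):
    """Fraction of (pc, taken) pairs in the trace that the predictor gets right."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


trace = [(0x10C, i % 4 != 3) for i in range(100)]   # taken three times out of four
null_acc = accuracy(NullPredictor(), trace)
random_acc = accuracy(RandomPredictor(seed=1), trace)
print(null_acc, random_acc)
```

A harness like this also gives a quick sanity check for a new predictor: if it cannot beat the Null and Random baselines on a simple trace, something is wired up wrong.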

Thank you!
◾ Source
  - https://ucb-bar.github.io/riscv-boom
◾ Documentation
  - https://ccelio.github.io/riscv-boom-doc
◾ Google group
  - https://groups.google.com/forum/#!forum/riscv-boom
◾ Tech Report
  - https://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-167.html
◾ Twitter
  - https://twitter.com/boom_cpu

Funding Acknowledgements
◾ Research partially funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Huawei, Nokia, NVIDIA, Oracle, and Samsung.
◾ Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.
◾ Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.