BOOMv2: an open-source out-of-order RISC-V core
Christopher Celio, Pi-Feng Chiu, Borivoje Nikolic, David Patterson, Krste Asanović
2017 RISC-V Workshop #7
UC Berkeley

What is BOOM?
http://ucb-bar.github.io/riscv-boom
◾ "Berkeley Out-of-Order Machine"
◾ out-of-order
◾ superscalar
◾ implements RV64G, boots Linux
◾ It is synthesizable
◾ it is open-source
◾ written in Chisel (16k loc)
◾ It is a parameterizable generator
◾ built on top of the Rocket-chip SoC ecosystem
[Figure: BOOMv2-2f4i pipeline. Labels: fetch; dec, ren, & dis; issue/rrd queues; integer, load/store, fp; wb]

BOOM fits into Rocket-chip SoC
[Figure: Rocket-chip SoC diagram. Labels: BOOM, Rocket, Rocket]

Lots of neat features
◾ advanced branch prediction
- BTB, RAS (1-cycle)
- gshare and TAGE-based conditional predictor implementations (3-cycle)
- fits in single-port SRAM
◾ loads can issue out-of-order wrt other loads, stores
◾ hardware performance counters (~40 events)
[Figure: core floorplan. Labels: FRF, lsu, btb, bpd, I$, D$, IRF, IQ0, IQ1, IQ2, ROB, imul, DFMA, SFMA, idiv, mshrs, csr, dec]
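
The slide names gshare as one of the conditional predictors. As a rough illustration of the general gshare idea (branch PC bits XORed with a global history register to index a table of 2-bit counters), here is a minimal behavioural sketch in plain Scala. It is not BOOM's Chisel implementation: the class name `GShare`, the table size, the counter initial value, and the PC shift are all assumptions made for this example.

```scala
// Behavioural gshare model in plain Scala (an illustration, not BOOM's RTL).
// Names, sizes and initial values are assumptions made for this sketch.
final class GShare(historyBits: Int) {
  require(historyBits > 0 && historyBits < 31)
  private val mask = (1 << historyBits) - 1
  // One 2-bit saturating counter per entry, initialised to 1 (weakly not-taken).
  private val counters = Array.fill(1 << historyBits)(1)
  // Global branch history; the newest outcome sits in bit 0.
  private var history = 0

  // gshare index: low PC bits XORed with the global history.
  private def index(pc: Long): Int = (((pc >>> 2) ^ history) & mask).toInt

  // Predict taken when the counter is in one of its two upper states.
  def predict(pc: Long): Boolean = counters(index(pc)) >= 2

  // Train the counter selected by the current history, then shift in the outcome.
  def update(pc: Long, taken: Boolean): Unit = {
    val i = index(pc)
    counters(i) = if (taken) math.min(3, counters(i) + 1) else math.max(0, counters(i) - 1)
    history = ((history << 1) | (if (taken) 1 else 0)) & mask
  }
}
```

Because the history takes part in the index, a strictly alternating taken/not-taken branch ends up using two different counters and is predicted correctly after a short warm-up, which a single per-PC counter could not do.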

Parameterized Superscalar
val exe_units = ArrayBuffer[ExecutionUnit]()
exe_units += Module(new ALUExeUnit(is_branch_unit = true, has_fpu = true, has_mul = true))
exe_units += Module(new ALUMemExeUnit(fp_mem_support = true, has_div = true))
[Figure: dual-issue (5r,3w) datapath. Labels: Issue Select, Regfile Read, bypass network, bypassing, ALU, imul, FPU, ALU, div, LSU, Agen, D$, Regfile Writeback]
OR
exe_units += Module(new ALUExeUnit(is_branch_unit = true))
exe_units += Module(new ALUExeUnit(has_fpu = true, has_mul = true))
exe_units += Module(new ALUExeUnit(has_div = true))
exe_units += Module(new MemExeUnit())
[Figure: quad-issue (9r,4w) datapath. Labels: Issue Select, Regfile Read, bypass network, bypassing, ALU, ALU, imul, FPU, ALU, div, LSU, Agen, D$, Regfile Writeback]
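
To make the link between the execution-unit list and the register-file port labels concrete, the sketch below models the two configurations in plain Scala (no Chisel dependency) and sums read and write ports. The per-unit port rules are assumptions chosen so that the totals match the slide's (5r,3w) and (9r,4w) labels; they are not taken from the BOOM source. The names `AluCfg`, `AluMemCfg`, `MemCfg` and `RegfilePorts` are hypothetical.

```scala
// Plain-Scala model of the execution-unit list (an illustration only).
// ASSUMPTION: the port-counting rules below are chosen to reproduce the
// slide's (5r,3w) and (9r,4w) labels; they are not BOOM's actual rules.
sealed trait ExeUnitCfg { def readPorts: Int; def writePorts: Int }

final case class AluCfg(isBranchUnit: Boolean = false, hasFpu: Boolean = false,
                        hasMul: Boolean = false, hasDiv: Boolean = false) extends ExeUnitCfg {
  // Assumed: a unit with an FPU reads three operands (fused multiply-add), otherwise two.
  val readPorts: Int = if (hasFpu) 3 else 2
  val writePorts: Int = 1
}

final case class AluMemCfg(fpMemSupport: Boolean = false, hasDiv: Boolean = false) extends ExeUnitCfg {
  val readPorts: Int = 2
  // Assumed: one write port for ALU results and one for returning load data.
  val writePorts: Int = 2
}

case object MemCfg extends ExeUnitCfg { val readPorts = 2; val writePorts = 1 }

object RegfilePorts {
  // Total (read, write) ports needed by a list of execution units.
  def ports(units: Seq[ExeUnitCfg]): (Int, Int) =
    (units.map(_.readPorts).sum, units.map(_.writePorts).sum)

  val dualIssue: Seq[ExeUnitCfg] = Seq(
    AluCfg(isBranchUnit = true, hasFpu = true, hasMul = true),
    AluMemCfg(fpMemSupport = true, hasDiv = true))

  val quadIssue: Seq[ExeUnitCfg] = Seq(
    AluCfg(isBranchUnit = true),
    AluCfg(hasFpu = true, hasMul = true),
    AluCfg(hasDiv = true),
    MemCfg)
}
```

Under these assumptions, `RegfilePorts.ports(RegfilePorts.dualIssue)` gives `(5, 3)` and `RegfilePorts.ports(RegfilePorts.quadIssue)` gives `(9, 4)`. The point of the generator style is that the port count follows from the unit list rather than being written down separately.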

Good news
◾ BOOM has (finally) been taped out!
- 2017 Aug 15
◾ TSMC 28nm HPM
◾ Academic test-chip
◾ BOOM is in support of studying other features
[Figure: test-chip block diagram. Labels: BOOM Domain (0.5V-1V), Uncore Domain, VDD, VDDCORE, Async. FIFOs + level shifters, BOOM Core, BTB, 6KB Br Pred, Issue Units, Load/Store Unit, int / int / agen, FPU, Phys. Int RF (semi-custom bit cell), Phys. FP RF, 16KB Scalar Inst. Cache (foundry 6T SRAM macros), 16KB Scalar Data Cache (foundry 8T SRAM macros), Arbiter, μArch Counters, L1-to-L2 Crossbar, 1MB L2 Cache Bank0, MMIO Router (read/write all counters and control registers), JTAG, Configurable Memory Traffic Switcher, Tile Link IO Bidirectional SERDES, Digital IO Pads, To FPGA GPIO]

BOOMv2: taping out an out-of-order processor with a 2 person team in 4 months
Christopher Celio, Pi-Feng Chiu, Borivoje Nikolic, David Patterson, Krste Asanović
2017 CARRV Workshop
UC Berkeley

Comparison: Industry
Processor    | SiFive U54 Rocket (RV64GC) | Berkeley BOOMv2 | UltraSPARC T2 | ARM Cortex-A9 | Intel Xeon Ivy
Language     | Chisel                     | Chisel          | Verilog       | -             | SystemVerilog
Core LoC     | 8,000                      | 16,000          | 290,000       | -             | NDA
Total LoC    | 34,000                     | 50,000          | 1,300,000     | -             | NDA
Foundry      | TSMC                       | TSMC            | TI            | TSMC          | Intel
Technology   | 28 nm (HPC)                | 28 nm (HPM)     | 65 nm         | 40 nm (G)     | 22 nm
Core+L1 Area | 0.54 mm²                   | 0.52 mm²        | ~12 mm²       | ~2.5 mm²      | ~12 mm²
Coremark/MHz | 2.75                       | 3.92            | 1.64*         | 3.71          | 5.60
Frequency    | 1.5 GHz                    | 1.2 GHz**       | 1.17 GHz      | 1.4 GHz       | 3.3 GHz
* From eembc.org. 32 threads/8 cores achieve 13 Cm/MHz.
** Estimated
Slide callouts on the table: core+L1+L2; 16kB/16kB
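
The table lists Coremark/MHz and frequency separately. Multiplying the two gives an absolute Coremark figure, which is one way to read the rows together. The sketch below does that arithmetic with the table's numbers; the names `Core` and `Comparison` are made up for this example, and the BOOMv2 frequency is the value the slide marks as estimated, so its product is an estimate too.

```scala
// Coremark/MHz multiplied by clock frequency (in MHz) gives an absolute Coremark score.
// Inputs are copied from the comparison table; the BOOMv2 frequency is marked
// "Estimated" on the slide, and the UltraSPARC T2 figure carries the slide's footnote.
final case class Core(name: String, coremarkPerMHz: Double, freqMHz: Double) {
  def coremark: Double = coremarkPerMHz * freqMHz
}

object Comparison {
  val cores: Seq[Core] = Seq(
    Core("SiFive U54 Rocket", 2.75, 1500.0),
    Core("Berkeley BOOMv2", 3.92, 1200.0),
    Core("UltraSPARC T2", 1.64, 1170.0),
    Core("ARM Cortex-A9", 3.71, 1400.0),
    Core("Intel Xeon Ivy", 5.60, 3300.0))

  // Print one line per core: name and rounded Coremark score.
  def main(args: Array[String]): Unit =
    cores.foreach(c => println(f"${c.name}%-18s ${c.coremark}%8.0f"))
}
```

These products ignore differences in process node, cache configuration and memory system, so they are a reading aid rather than a benchmark result.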

Limits of Educational Libraries
◾ BOOMv1
- Educational 45nm libraries
- CACTI model for SRAMs
- synthesized flip-flops for most non-cache memory arrays
- good for catching small timing bugs
- not accurate enough to justify sweeping redesigns
◾ BOOMv2
- TSMC 28nm HPM (high performance mobile)
- LVT-based standard cells
- memory compiler (one- and two-port)

4 months of agile tape-out
[Figure: chart of successive runs; plotted data not recoverable from the transcript]
* ignore the Y-axis:
-- too many parameters/variables changing between each run
-- doesn't capture DRC violations

BOOMv1
◾ short pipeline
- inspired by R10K, 21264, Cortex-A9
◾ unified issue window
[Figure: BOOMv1 datapath. Fetch: 2w; Issue: 3-issue; PRF: 7r3w (110). Labels: Fetch, Decode & Rename, Unified Issue Window, Physical Scalar Register File (7x3) (PRF), ROB, Commit, wakeup, uop tags, inst, LSU, Data cache, ALU, FPU, iMul, ALU, FDiv, IDiv]
[Figure: BOOMv1-2f3i pipeline. Labels: fetch; dec, ren, & dis; issue/rrd queues; int/idiv/fdiv, load/store, int/fma; wb]

BOOMv1: Frontend
[Figure: BOOMv1 frontend. Stages: Fetch0, Fetch1, Fetch2. Labels: PC1, PC2, µBTB (idx, next-pc, taken), hash, TLB, I$, I-Cache, pre-decode & target compute, Fetch Buffer, backend redirect target, BPD redirect]

Frontend
[Figure: BOOMv1 frontend. Stages: Fetch0, Fetch1, Fetch2. Labels: PC1, PC2, µBTB (idx, next-pc, taken), hash, TLB, I$, I-Cache, pre-decode & target compute, Fetch Buffer, backend redirect target, BPD redirect]
[Figure: BOOMv2 frontend. Stages: Fetch0, Fetch1, Fetch2, Fetch3. Labels: PC1, PC2, +8, BTB (idx, next-pc, taken), checker {hit, target}, RAS, BPD (hash, redirect), TLB, I$, I-Cache, pre-decode & target compute, decode info, Fetch Buffer, backend redirect target, redirect]
BOOMv2
◾ BTB in SRAM
- set-associative
- partially tagged
- Checker to verify integrity
◾ BPD (Conditional Predictor)
- hash gets entire stage
- redirect based on BTB
- redirect pushed back (I$)
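
A partially tagged BTB stores only some of the PC bits as its tag, so two different PCs can match the same entry. The behavioural sketch below, in plain Scala, shows that effect; it is presumably one reason a checker is useful, though the slide does not spell out what the checker verifies. The class name `PartialTagBtb`, the sizes, the PC shift and the round-robin replacement are all assumptions for illustration, not BOOM's design.

```scala
// Behavioural sketch of a set-associative, partially tagged BTB (illustration only).
// Sizes, field names and the replacement policy are assumptions, not BOOM's.
final class PartialTagBtb(sets: Int, ways: Int, tagBits: Int) {
  require(sets > 0 && (sets & (sets - 1)) == 0, "sets must be a power of two")
  require(ways > 0 && tagBits > 0 && tagBits < 32)

  private final case class Entry(tag: Int, target: Long)
  private val table = Array.fill(sets, ways)(Option.empty[Entry])
  // Round-robin replacement pointer, one per set.
  private val nextWay = Array.fill(sets)(0)

  private def fetchIdx(pc: Long): Long = pc >>> 2
  private def setOf(pc: Long): Int = (fetchIdx(pc) & (sets - 1)).toInt
  // Only the low tagBits of the remaining PC bits are kept, so distinct PCs can alias.
  private def tagOf(pc: Long): Int =
    ((fetchIdx(pc) / sets) & ((1L << tagBits) - 1)).toInt

  // Return the stored target of the first way whose partial tag matches.
  def lookup(pc: Long): Option[Long] =
    table(setOf(pc)).collectFirst { case Some(e) if e.tag == tagOf(pc) => e.target }

  // Install a branch target, overwriting the way chosen by round-robin.
  def insert(pc: Long, target: Long): Unit = {
    val s = setOf(pc)
    table(s)(nextWay(s)) = Some(Entry(tagOf(pc), target))
    nextWay(s) = (nextWay(s) + 1) % ways
  }
}
```

With 4 sets and 3 tag bits, inserting a branch at PC 0x100 makes a lookup at PC 0x180 hit as well, because both PCs map to the same set and the same partial tag. A later pipeline stage then has to confirm that the predicted target really belongs to the fetched instruction.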

Regfile: (the first P&R)
◾ Regfile blows up
◾ critical paths in issue-select, register read
[Figure: place-and-route floorplan. Labels: regfile, rob, iq, rename, lsu, mshr, bpd, alu, fpu, alu]

BOOMv2
◾ split unified issue window (1 day)
◾ split physical register file (1 week)
◾ issue/register read separate stages (~1 day)
◾ 2 stage rename (~1 week)
◾ 3 stage fetch (3 weeks)
[Figure: Before. Fetch: 2w; Issue: 3-issue; PRF: 7r3w (110). Labels: Fetch, Decode & Rename, Unified Issue Window, Physical Scalar Register File (7x3) (PRF), ROB, Commit, wakeup, uop tags, inst, LSU, Data cache, ALU, FPU, iMul, ALU, FDiv, IDiv]
[Figure: After. Fetch: 2w; Issue: 4-issue; PIRF: 6r3w (96); PFRF: 3r1w (48). Labels: Fetch, Decode & Rename1, Rename2 & Dispatch, FP Issue Window, ALU Issue Window, Mem Issue Window, Physical FP Register File (3x2) (PFRF), Physical Int Register File (6x3) (PIRF), ROB, Commit, wakeup, uop tags, inst, LSU, Data cache, FPU, FDv, ALU, iMul, iDiv, ALU, I2F, F2I, toInt (LL wport), toPFRF (LL wport)]
[Figure: BOOMv2-2f4i pipeline. Labels: fetch; dec, ren, & dis; issue/rrd queues; integer, load/store, fp; wb]

Regfile Challenges
◾ foundational IP provides single- and dual-port memories
◾ companies build their own hand-crafted register files
◾ not enough labor in academia to support a customized register file
[Figure: multi-ported register-file bit cell. Labels: WBL0/WBLB0 ... WBLn/WBLBn, WWL0 ... WWLn (x n write ports), RWL[m-1,0], RBL[m-1,0] (x m read ports)]

Regfile
• Routing 4480 wires causes congestion issues
[Figure: synthesized array of 70x64 flip-flops (70 entries, 64 bits). The Q outputs (4480 wires) feed a 7-deep logic tree that produces read_data{5-0}[63-0]. Callout: Congestion!]
• Reduce read_data routing wires to 384
[Figure: RF unit cell. Labels: WD0, WD1, WD2, WS[1:0], D, EN, WE, Q, OE[5-0], Tri-state buf [5-0], RD[5-0]]
Tri-state buffer mimics the PG in a custom RF cell so that read data wire can be shared
[Figure: Regfile Bit Block. Write Port (x3): valid, addr, data, decode, WE, WS, WD. Read Port (x6): addr, decode, RE, RD. Flip flop (dff): D, Q. read data wires]
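
The wire counts on this slide follow from the array dimensions. The short plain-Scala sketch below reproduces them: 70 entries of 64 bits give 4480 flip-flop outputs, and 6 read ports of 64 bits give 384 shared read wires. Reading the "7-deep logic tree" as a binary selection tree over 70 entries is an assumption of this sketch; the object and value names are hypothetical.

```scala
// Arithmetic behind the slide's wire counts (names are made up for this sketch).
object RegfileWires {
  val entries = 70   // register entries in the synthesized array (from the slide)
  val bits = 64      // bits per entry (from the slide)
  val readPorts = 6  // read_data{5-0} (from the slide)

  // In the synthesized array every flip-flop output is routed to the read logic.
  val flopOutputWires: Int = entries * bits

  // With one shared read wire per (read port, bit), only these leave the array.
  val sharedReadWires: Int = readPorts * bits

  // ASSUMPTION: a binary tree selecting 1 of `entries` values has depth ceil(log2(entries)).
  val selectTreeDepth: Int = 32 - Integer.numberOfLeadingZeros(entries - 1)

  def main(args: Array[String]): Unit =
    println(s"$flopOutputWires flop outputs, $sharedReadWires shared read wires, depth $selectTreeDepth")
}
```

The three values come out as 4480, 384 and 7, matching the numbers on the slide.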

Regfile: Our quick solution
◾ hand-craft our own bit block out of standard cells
- tri-state to drive read wires
- hierarchical bit lines
◾ pre-place arrays of bit blocks
◾ let router handle the 18 wires per 1 flip-flop
◾ black box in Chisel
- simulate control using Chisel model
- verify at gate-level

Before and After
[Figure: BOOMv1-2f3i pipeline. Labels: fetch; issue/rrd queues; int/idiv/fdiv, load/store, int/fma; wb]
[Figure: BOOMv2-2f4i pipeline. Labels: fetch; dec, ren, & dis; issue/rrd queues; integer, load/store, fp; wb]

Results
◾ Hard to compute -- not apples to oranges
- a lot of the work was about design rule checks, not performance
◾ Roughly 25% decrease in clock period
◾ Roughly 20% decrease in Coremark/MHz
- known issue from load-use delay
◾ Barely a win?
◾ Changes fixed DRC errors, geometry errors (shorts), so...
- infinitely Faster! Better! Stronger!
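
The "barely a win?" remark follows from combining the two rough percentages. A 25% shorter clock period means the frequency rises by a factor of 1/0.75, and that is multiplied by the 0.80 factor on Coremark/MHz. The sketch below works this out; it treats the slide's "roughly" figures as exact, which is an assumption, and the object name is hypothetical.

```scala
// Net performance change implied by the two rough figures on the slide.
// ASSUMPTION: the "roughly 25%" and "roughly 20%" values are treated as exact.
object NetSpeedup {
  val clockPeriodRatio = 0.75    // roughly 25% decrease in clock period
  val coremarkPerMHzRatio = 0.80 // roughly 20% decrease in Coremark/MHz

  // A shorter period means a proportionally higher frequency.
  val frequencyRatio: Double = 1.0 / clockPeriodRatio

  // Absolute Coremark scales with frequency times Coremark/MHz.
  val netPerformanceRatio: Double = frequencyRatio * coremarkPerMHzRatio

  def main(args: Array[String]): Unit =
    println(f"net Coremark ratio: $netPerformanceRatio%.3f")
}
```

The ratio is 0.80 / 0.75, about 1.07, so on these numbers the redesign gains roughly 7% in absolute Coremark while also clearing the DRC and geometry errors.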

Agile Hardware Development
◾ RTL hacking can be very agile
- ~6 minutes to compile, build, and run "riscv-tests" regression suite (10 kHz for Verilator simulator)
- Chisel allows for quick, far-reaching changes
- generator approach allows for late-binding design decisions
- small changes, improvements (that don't affect floorplan) are agile
◾ Physical design is a bottleneck
- 2-3 hours for synthesis results
- 8-24 hours for p&r results
- too much value provided by manual intervention
- reports are difficult to reason about
- sadly can't throw this at a computer and get 7 good designs a week later
◾ Verification is another bottleneck
- I can write bugs faster than I can find them

Future Directions (for BOOM)
◾ A lot of improvements to make to BOOM
- Clear direction of further IPC/QoR improvements

Future Directions (for me)
"Esperanto Technologies designs high-performance energy-efficient computing solutions." - riscv.org
◾ See Dave Ditzel's talk from yesterday to learn more
◾ RISC-V Founding Gold Member
- on lots of working groups!
◾ We are committed to maintaining the BOOM open-source repository
◾ We're hiring...

A 2-person tapeout takes a village!
◾ RISC-V ISA
- very out-of-order friendly!
◾ Chisel hardware construction language
- object-oriented, functional programming
◾ FIRRTL
- exposed RTL intermediate representation (IR)
◾ Rocket-chip
- A full working SoC platform built around the Rocket in-order core
◾ Thanks to:
- Krste Asanović, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Palmer Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Ben Keller, Donggyu Kim, Jack Koenig, Jim Lawson, Yunsup Lee, Richard Lin, Eric Love, Martin Maas, Chick Markley, Albert Magyar, Howard Mao, Miquel Moreto, Quan Nguyen, Albert Ou, David A. Patterson, Brian Richards, Colin Schmidt, Wenyu Tang, Stephen Twigg, Huy Vo, Andrew Waterman, Angie Wang, and more...

Questions?

Funding Acknowledgements
◾ Research partially funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Huawei, Nokia, NVIDIA, Oracle, and Samsung.
◾ Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.
◾ Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.