9/1/2016 CS152, Fall 2016
CS152 Computer Architecture and Engineering
Lecture 3 - From CISC to RISC
John Wawrzynek
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~johnw
http://inst.eecs.berkeley.edu/~cs152
Last Time in Lecture 2
§ ISA is the hardware/software interface
  – Defines set of programmer visible state
  – Defines instruction format (bit encoding) and instruction semantics
  – Examples: IBM 360, MIPS, RISC-V, x86, JVM
§ Many possible implementations of one ISA
  – 360 implementations: model 30 (c. 1964), z12 (c. 2012)
  – x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), Core 2 Duo, Nehalem, Sandy Bridge, Ivy Bridge, Atom, AMD Athlon, Transmeta Crusoe, SoftPC
  – MIPS implementations: R2000, R4000, R10000, R18K, …
  – JVM: HotSpot, PicoJava, ARM Jazelle, …
§ Microcoding: a straightforward, methodical way to implement machines using a low logic gate count; it also simplifies the implementation of complex instructions
"Iron Law" of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Clock Cycle}}$$

§ Instructions per program depends on compiler technology and ISA
§ Cycles per instruction (CPI) depends on ISA and µarchitecture
§ Time per clock cycle depends upon the µarchitecture and base technology

Microarchitecture           CPI   cycle time
Microcoded                  >1    short
Single-cycle unpipelined    1     long
Pipelined                   ~1    short

This lecture
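The Iron Law can be tried out numerically. The sketch below is only an illustration: the instruction count, CPI values and cycle times are made-up numbers chosen to match the qualitative table (CPI > 1 with a short cycle, CPI = 1 with a long cycle, CPI ~ 1 with a short cycle), not measurements of any real machine.

```python
def execution_time(instructions, cpi, cycle_time_s):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_s

# Hypothetical machines running the same 1,000,000-instruction program.
machines = {
    "microcoded": execution_time(1_000_000, cpi=4.0, cycle_time_s=2e-9),
    "single_cycle": execution_time(1_000_000, cpi=1.0, cycle_time_s=10e-9),
    "pipelined": execution_time(1_000_000, cpi=1.2, cycle_time_s=2e-9),
}
for name, seconds in machines.items():
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

With these assumed numbers the ranking depends entirely on the product of the three terms; changing any one factor can reorder the machines.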
Hardware Elements
§ Combinational circuits
  – Mux, Decoder, ALU, ...
§ Synchronous state elements
  – Flip-flop, Register, Register file, SRAM, DRAM
  – Edge-triggered: data is sampled at the rising edge
[Figure: edge-triggered flip-flop ff (inputs D, Clk, En; output Q); ALU (inputs A, B and OpSelect: Add, Sub, ... / And, Or, Xor, Not, ... / GT, LT, EQ, Zero, ...; outputs Result and Comp?); Mux (inputs A0, A1, ..., An-1; lg(n)-bit Sel; output O); Decoder (lg(n)-bit input A; outputs O0, O1, ..., On-1)]
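Register Files
§ Reads are combinational
[Figure: a register built from flip-flops ff with inputs D0, D1, D2, ..., Dn-1, outputs Q0, Q1, Q2, ..., Qn-1 and shared Clk and En; a 2R+1W register file with ports ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd), WE (we), Clock, ReadData1 (rd1), ReadData2 (rd2)]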
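Register File Implementation
§ RISC-V integer instructions have at most 2 register source operands
[Figure: 32 registers (reg0, reg1, …, reg31), each 32 bits wide, clocked by clk; the 5-bit rd together with we produces the write enables, and wdata is the 32-bit write data; the 5-bit rs1 and rs2 are the selects that produce rdata1 and rdata2]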
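A behavioural sketch of a 2R+1W register file, using the port names from the slides (rs1, rs2, ws, wd, we). This is a toy software model, not the lab's implementation; the class and method names are invented here, and it does not special-case register 0.

```python
class RegisterFile:
    """Toy 2R+1W register file: combinational reads, clocked writes."""

    def __init__(self, num_regs=32, width=32):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def read(self, rs1, rs2):
        # Reads are combinational: they reflect the current state immediately.
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        # State changes only at the rising clock edge, and only if we is asserted.
        if we:
            self.regs[ws] = wd & self.mask


rf = RegisterFile()
rf.clock_edge(we=True, ws=5, wd=42)
rf.clock_edge(we=False, ws=6, wd=99)  # write suppressed: we is low
print(rf.read(5, 6))
```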
A Simple Memory Model
[Figure: MAGIC RAM with inputs Address, WriteData, WriteEnable and Clock, and output ReadData]
Reads and writes are always completed in one cycle
• a Read can be done any time (i.e. combinational)
• a Write is performed at the rising clock edge iff the WriteEnable signal is asserted
  ⇒ the write address and data must be stable at the clock edge
Later in the course we will present a more realistic model of memory
Implementing RISC-V
Single-cycle per instruction datapath & control logic
(Similar to MIPS single-cycle processor in CS61C)
Instruction Execution Review
Execution of an instruction involves
1. Instruction fetch
2. Decode and register fetch
3. ALU operation
4. Memory operation (optional)
5. Write back (optional)
and compute address of next instruction
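Datapath: Reg-Reg ALU Instructions

Field:  rd     rs1    rs2    func   opcode
Bits:   31:27  26:22  21:17  16:7   6:0
Width:  5      5      5      10     7
rd ← (rs1) func (rs2)

[Figure: PC (clocked by clk) drives addr of Inst. Memory and is incremented by 0x4 (Add); Inst<26:22> and Inst<21:17> drive GPR ports rs1 and rs2, Inst<31:27> drives wa; Inst<16:0> (OpCode) drives ALU Control; the ALU takes rd1 and rd2 and its result goes to wd; RegWriteEn drives we]
RegWrite Timing?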
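The field extraction done by the wiring in the datapath can be sketched in software. Bit positions follow this lecture's instruction layout (which differs from the ratified RISC-V encoding); the helper names and the example field values are assumptions made for illustration.

```python
def bits(word, hi, lo):
    """Return word<hi:lo> as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_reg_reg(inst):
    # Field positions follow the slide's layout, not the ratified RISC-V encoding.
    return {
        "rd": bits(inst, 31, 27),
        "rs1": bits(inst, 26, 22),
        "rs2": bits(inst, 21, 17),
        "func": bits(inst, 16, 7),
        "opcode": bits(inst, 6, 0),
    }


# Hypothetical encoding: rd=3, rs1=1, rs2=2, func=0, opcode=0x33 (illustrative values).
inst = (3 << 27) | (1 << 22) | (2 << 17) | (0 << 7) | 0x33
print(decode_reg_reg(inst))
```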
Datapath: Reg-Imm ALU Instructions

Field:  rd     rs1    immediate12  func   opcode
Bits:   31:27  26:22  21:10        9:7    6:0
Width:  5      5      12           3      7
rd ← (rs1) op immediate

[Figure: same structure as the Reg-Reg datapath, but the ALU's second operand comes from the Imm Select unit, which takes inst<21:10> and is controlled by ImmSel; inst<26:22> drives rs1, inst<31:27> drives wa; inst<9:0> (OpCode) drives ALU Control; RegWriteEn drives we]
Conflicts in Merging Datapath

Reg-Imm:  rd  rs1  immediate12  func3  opcode             rd ← (rs1) op immediate
Reg-Reg:  rd  rs1  rs2  func10  opcode  (5 5 5 10 7)      rd ← (rs1) func (rs2)

[Figure: the Reg-Reg and Reg-Imm datapaths overlaid. inst<26:22> drives rs1 and Inst<31:27> drives wa in both; the conflicts are the ALU's second operand (rd2, selected by Inst<21:17>, versus Imm Select fed by Inst<21:10>) and the ALU Control input (Inst<16:0> versus Inst<9:0>)]
Introduce muxes
Datapath for ALU Instructions

Reg-Imm:  rd  rs1  immediate12  func3  opcode             rd ← (rs1) op immediate
Reg-Reg:  rd  rs1  rs2  func10  opcode  (5 5 5 10 7)      rd ← (rs1) func (rs2)

[Figure: merged datapath. The Op2Sel mux (Reg / Imm) chooses the ALU's second operand between rd2 and the Imm Select output; <26:22> drives rs1, <21:17> drives rs2, <31:27> drives wa; <6:0> is the OpCode and <16:0> feeds ALU Control. Control signals shown: ImmSel, Op2Sel, FuncSel, RegWriteEn]
Load/Store Instructions

Load:   rd  rs1  immediate12  func3  opcode
Store:  imm  rs1  rs2  imm  func3  opcode  (5 5 5 7 3 7)
Addressing mode: (rs) + displacement

rs1 is the base register; rd is the destination of a Load, rs2 is the data source for a Store

[Figure: ALU datapath extended with a Data Memory (addr, wdata, rdata, we, clk). The ALU adds the "base" (rd1) and disp (from Imm Select) to form addr; rd2 supplies wdata; MemWrite drives we; the WBSel mux (ALU / Mem) selects the value written back to wd. Control signals shown: ImmSel, Op2Sel, FuncSel, RegWriteEn, MemWrite, WBSel]
RISC-V Conditional Branches
§ Compare two integer registers for equality (BEQ/BNE), or for relative value, signed (BLT/BGE) or unsigned (BLTU/BGEU)
§ 12-bit immediate encodes the branch target address as a signed offset from PC, in units of 16 bits (i.e., shift left by 1 then add to PC).

Field:  imm[11:7]  rs1    rs2    imm[6:0]  func3  opcode
Bits:   31:27      26:22  21:17  16:10     9:7    6:0
Width:  5          5      5      7         3      7
Instructions: BEQ/BNE, BLT/BGE, BLTU/BGEU
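The branch target computation (reassemble the split immediate, sign-extend, shift left by 1, add to PC) can be sketched as follows. Field positions are taken from the format above; `encode_imm` is a hypothetical helper written only to build test inputs.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def branch_target(pc, inst):
    imm_hi = (inst >> 27) & 0x1F  # imm[11:7] in bits 31:27
    imm_lo = (inst >> 10) & 0x7F  # imm[6:0] in bits 16:10
    imm12 = (imm_hi << 7) | imm_lo
    # Offset is in units of 16 bits: shift left by 1, then add to PC.
    return pc + (sign_extend(imm12, 12) << 1)


def encode_imm(imm12):
    """Hypothetical helper: place a 12-bit immediate into the two split fields."""
    imm12 &= 0xFFF
    return ((imm12 >> 7) << 27) | ((imm12 & 0x7F) << 10)


print(hex(branch_target(0x1000, encode_imm(8))))   # forward by 16 bytes
print(hex(branch_target(0x1000, encode_imm(-4))))  # backward by 8 bytes
```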
Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)
[Figure: load/store datapath extended for branches. Br Logic takes Bcomp? and drives PCSel, which selects the next PC between pc+4 (from the 0x4 Add) and br (from a second Add). Control signals shown: PCSel, RegWrEn, MemWrite, WBSel, Op2Sel, ImmSel, FuncSel]
Including Jump and Jalr
[Figure: datapath extended for jumps. PCSel now selects among pc+4, br, rind and jabs; the WASel mux selects the register write address (rd or the constant 1); Br Logic takes Bcomp?. Control signals shown: PCSel, RegWriteEn, MemWrite, WBSel, WASel, Op2Sel, ImmSel, FuncSel]
Hardwired Control is pure Combinational Logic
[Figure: a combinational logic block with inputs opcode and Equal?, and outputs ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, WASel, RegWriteEn, PCSel]
ALU Control & Immediate Extension
[Figure: Inst<6:0> (Opcode) feeds a Decode Map; Inst<16:7> (Func), ALUop and + are the candidates chosen by FuncSel (Func, Op, +); ImmSel chooses among IType12, BsType12, BrType12]
Hardwired Control Table

Opcode     ImmSel    Op2Sel  FuncSel  MemWr  RFWen  WBSel  WASel  PCSel
ALU        *         Reg     Func     no     yes    ALU    rd     pc+4
ALUi       IType12   Imm     Op       no     yes    ALU    rd     pc+4
LW         IType12   Imm     +        no     yes    Mem    rd     pc+4
SW         BsType12  Imm     +        yes    no     *      *      pc+4
BEQ true   BrType12  *       *        no     no     *      *      br
BEQ false  BrType12  *       *        no     no     *      *      pc+4
J          *         *       *        no     no     *      *      jabs
JAL        *         *       *        no     yes    PC     X1     jabs
JALR       *         *       *        no     yes    PC     rd     rind

Op2Sel = Reg / Imm
WBSel = ALU / Mem / PC
WASel = rd / X1
PCSel = pc+4 / br / rind / jabs
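Because hardwired control is a pure function of the opcode and the comparison result, it can be modelled as a lookup table. The sketch below transcribes the control table into a Python dict; the dict layout and the `control` function are illustrative choices, and the two BEQ rows are folded into one entry whose PCSel depends on the Equal? input.

```python
# Each row: (ImmSel, Op2Sel, FuncSel, MemWr, RFWen, WBSel, WASel, PCSel); "*" is don't-care.
CONTROL = {
    "ALU":  ("*",        "Reg", "Func", False, True,  "ALU", "rd", "pc+4"),
    "ALUi": ("IType12",  "Imm", "Op",   False, True,  "ALU", "rd", "pc+4"),
    "LW":   ("IType12",  "Imm", "+",    False, True,  "Mem", "rd", "pc+4"),
    "SW":   ("BsType12", "Imm", "+",    True,  False, "*",   "*",  "pc+4"),
    "BEQ":  ("BrType12", "*",   "*",    False, False, "*",   "*",  None),
    "J":    ("*",        "*",   "*",    False, False, "*",   "*",  "jabs"),
    "JAL":  ("*",        "*",   "*",    False, True,  "PC",  "X1", "jabs"),
    "JALR": ("*",        "*",   "*",    False, True,  "PC",  "rd", "rind"),
}


def control(opcode, equal=False):
    """Pure function of (opcode, Equal?): no state, like combinational logic."""
    *signals, pc_sel = CONTROL[opcode]
    if opcode == "BEQ":
        pc_sel = "br" if equal else "pc+4"
    return (*signals, pc_sel)


print(control("BEQ", equal=True))
print(control("JAL"))
```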
RISC-V Unconditional Jumps
§ 25-bit immediate encodes the jump target address as a signed offset from PC, in units of 16 bits (i.e., shift left by 1 then add to PC). (+/- 16MB)
§ JAL is a subroutine call that also saves the return address (PC+4) in register x1

Field:  JumpOffset[24:0]  opcode
Bits:   31:7              6:0
Width:  25                7
Instructions: J, JAL
RISC-V Register Indirect Jumps
§ Jumps to the target address given by adding the 12-bit offset (not shifted by 1 bit) to register rs1. PC ← RF[rs1] + sign-ext(Imm)
§ The return address (PC+4) is written to rd (can be x0 if the value is not needed)
§ The RDNPC instruction simply writes the return address to register rd without jumping (used for dynamic linking)

Field:  rd     rs1    Imm[11:0]  func3  opcode
Bits:   31:27  26:22  21:10      9:7    6:0
Width:  5      5      12         3      7
Instructions: JALR, RDNPC
Full RISC-V 1-Stage Datapath (Lab 1)
[Figure: full RISC-V 1-stage datapath]
Note: Reg File shown twice for clarity. Immediate select changed.
Single-Cycle Hardwired Control
We will assume the clock period is sufficiently long for all of the following steps to be "completed":
1. Instruction fetch
2. Decode and register fetch
3. ALU operation
4. Data fetch if required
5. Register write-back setup time

⇒ $t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$

At the rising edge of the following clock, the PC, register file and memory are updated
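A quick worked example of the clock-period constraint. The stage delays below are invented for illustration (the lecture gives no numbers); the point is only that the single-cycle clock period must cover the sum of all five steps.

```python
# Hypothetical stage delays in nanoseconds (illustrative values only).
delays_ns = {"t_IFetch": 2.0, "t_RFetch": 1.0, "t_ALU": 2.0, "t_DMem": 2.0, "t_RWB": 1.0}

min_period_ns = sum(delays_ns.values())  # t_C must exceed this sum
max_freq_mhz = 1e3 / min_period_ns       # a period of 1 ns corresponds to 1000 MHz

print(f"t_C > {min_period_ns} ns, so f < {max_freq_mhz:.0f} MHz")
```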
"Iron Law" of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

§ Instructions per program depends on source code, compiler technology, and ISA
§ Cycles per instruction (CPI) depends on ISA and µarchitecture
§ Time per cycle depends upon the µarchitecture and base technology
CPI for Microcoded Machine
[Figure: timeline along the Time axis: Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles]
Total clock cycles = 7 + 5 + 10 = 22
Total instructions = 3
CPI = 22/3 = 7.33
CPI is always an average over a large number of instructions.
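The same CPI arithmetic as a small sketch; the function name is an assumption, and the cycle counts are the three from the example above.

```python
def average_cpi(cycles_per_inst):
    """Average CPI = total clock cycles / total instructions."""
    return sum(cycles_per_inst) / len(cycles_per_inst)


cpi = average_cpi([7, 5, 10])
print(f"CPI = {cpi:.2f}")
```

In practice the list would hold the cycle counts of a long dynamic instruction trace rather than three instructions.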
Technology Influence
§ When microcode appeared in the 50s, different technologies were used for:
  – Logic: Vacuum Tubes
  – Main Memory: Magnetic cores
  – Read-Only Memory: Diode matrix, punched metal cards, …
§ Logic very expensive compared to ROM or RAM
§ ROM cheaper than RAM
§ ROM much faster than RAM

But the seventies brought advances in integrated circuit technology and semiconductor memory…
First Microprocessor: Intel 4004, 1971
§ 4-bit accumulator architecture
§ 8 µm pMOS
§ 2,300 transistors
§ 3 x 4 mm²
§ 750 kHz clock
§ 8-16 cycles/inst.

Made possible by new integrated circuit technology
Microprocessors in the Seventies
§ Initial target was embedded control
  – First micro, 4-bit 4004 from Intel, designed for a desktop printing calculator
  – Constrained by what could fit on a single chip
  – Accumulator architectures, similar to earliest computers
  – Hardwired state machine control
§ 8-bit micros (8085, 6800, 6502) used in hobbyist personal computers
  – Micral, Altair, TRS-80, Apple-II
  – Usually had 16-bit address space (up to 64KB directly addressable)
  – Often came with a simple BASIC language interpreter built into ROM or loaded from cassette tape.
VisiCalc – the first "killer" app for micros
• Microprocessors had little impact on the conventional computer market until the VisiCalc spreadsheet for the Apple-II
• Apple-II used the MOS Technology 6502 microprocessor running at 1 MHz
[Personal Computing Ad, 1979]
Floppy disks were originally invented by IBM as a way of shipping IBM 360 microcode patches to customers!
DRAM in the Seventies
§ Dramatic progress in semiconductor memory technology
§ 1970, Intel introduces first DRAM, 1Kbit 1103
§ 1979, Fujitsu introduces 64Kbit DRAM
=> By mid-Seventies, obvious that PCs would soon have > 64 KBytes physical memory
Microprocessor Evolution
§ Rapid progress in the 70s, fueled by advances in MOSFET technology and expanding markets
§ Intel i432
  – Most ambitious seventies' micro; started in 1975, released 1981
  – 32-bit capability-based object-oriented architecture
  – Instructions a variable number of bits long
  – Severe performance, complexity, and usability problems
§ Motorola 68000 (1979, 8 MHz, 68,000 transistors)
  – Heavily microcoded (and nanocoded)
  – 32-bit general-purpose register architecture (24 address pins)
  – 8 address registers, 8 data registers
§ Intel 8086 (1978, 8 MHz, 29,000 transistors)
  – "Stopgap" 16-bit processor, architected in 10 weeks
  – Extended accumulator architecture, assembly-compatible with 8080
  – 20-bit addressing through segmented addressing scheme
IBM PC, 1981
§ Hardware
  – Team from IBM building PC prototypes in 1979
  – Motorola 68000 chosen initially, but 68000 was late
  – IBM builds "stopgap" prototypes using 8088 boards from DisplayWriter word processor
  – 8088 is 8-bit bus version of 8086 => allows cheaper system
  – Estimated sales of 250,000
  – 100,000,000s sold
§ Software
  – Microsoft negotiates to provide OS for IBM. Later buys and modifies QDOS from Seattle Computer Products.
§ Open System
  – Standard processor, Intel 8088
  – Standard interfaces
  – Standard OS, MS-DOS
  – IBM permits cloning and third-party software
Microprogramming: early Eighties
§ Evolution bred more complex micro-machines
  – Complex instruction sets led to need for subroutine and call stacks in µcode
  – Need for fixing bugs in control programs was in conflict with read-only nature of µROM
  – ⇒ Writable Control Store (WCS) (B1700, QMachine, Intel i432, …)
§ With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid → more complexity
§ Better compilers made complex instructions less important.
§ Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive
Analyzing Microcoded Machines
§ John Cocke and group at IBM
  – Working on a simple pipelined processor, 801, and advanced compilers inside IBM
  – Ported experimental PL.8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
  – Code ran faster than other existing compilers that used all 370 instructions! (up to 6 MIPS whereas 2 MIPS considered good before)
§ Emer, Clark, at DEC
  – Measured VAX-11/780 using external hardware
  – Found it was actually a 0.5 MIPS machine, although usually assumed to be a 1 MIPS machine
  – Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time!
§ VAX 8800
  – Control Store: 16K*147b RAM, Unified Cache: 64K*8b RAM
  – 4.5x more microstore RAM than cache RAM!
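The 4.5x figure for the VAX 8800 can be checked from the two RAM sizes. This sketch assumes K = 1024; the ratio is the same if K is read as 1000.

```python
control_store_bits = 16 * 1024 * 147  # 16K words of 147 bits
cache_bits = 64 * 1024 * 8            # 64K words of 8 bits
ratio = control_store_bits / cache_bits

print(control_store_bits, cache_bits, round(ratio, 2))
```

The ratio comes out just under 4.6, consistent with the slide's rounded "4.5x".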
IC Technology Changes Tradeoffs
§ Logic, RAM, ROM all implemented using MOS transistors
§ Semiconductor RAM ~ same speed as ROM
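Nanocoding
[Figure: the µPC (state) supplies the µaddress to the µcode ROM, which outputs the µcode next-state and a nanoaddress; the nanoaddress indexes the nanoinstruction ROM, which outputs data]
Exploits recurring control signal patterns in µcode, e.g.,
  ALU0   A ← Reg[rs1] ...
  ALUi0  A ← Reg[rs1] ...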
From CISC to RISC
§ Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
  – Contents of fast instruction memory change to fit what application needs right now
§ Use simple ISA to enable hardwired pipelined implementation
  – Most compiled code only used a few of the available CISC instructions
  – Simpler encoding allowed pipelined implementations
§ Further benefit with integration
  – In early '80s, could finally fit 32-bit datapath + small caches on a single chip
  – No chip crossings in common case allows faster operation
Berkeley RISC Chips
RISC-I (1982) contains 44,420 transistors, was fabbed in 5 µm NMOS, has a die area of 77 mm², and ran at 1 MHz. This chip is probably the first VLSI RISC.
RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, has a die area of 60 mm², and ran at 3 MHz.
Stanford built some too…
Summary
§ Microcoding became less attractive as the gap between RAM and ROM speeds reduced, and logic was implemented in the same technology as memory
§ Complex instruction sets are difficult to pipeline, so it was difficult to increase performance as gate count grew
§ Iron Law explains architecture design space
  – Trade instructions/program, cycles/instruction, and time/cycle
§ Load-Store RISC ISAs designed for efficient pipelined implementations
  – Very similar to vertical microcode
  – Inspired by earlier Cray machines (CDC 6600/7600)
§ RISC-V ISA will be used in lectures, problems, and labs
  – Berkeley RISC chips: RISC-I, RISC-II, SOAR (RISC-III), SPUR (RISC-IV)