CS 152 Computer Architecture and Engineering
Lecture 13 - VLIW Machines and Statically Scheduled ILP

John Wawrzynek
Electrical Engineering and Computer Sciences
University of California at Berkeley

http://www.eecs.berkeley.edu/~johnw
http://inst.eecs.berkeley.edu/~cs152

10/17/2016, CS152, Fall 2016



CS152 Administrivia

§ Quiz 3, Thursday Oct 21: L10-L12, PS3, Lab 3 directed portion
§ Make sure to understand PS3!
§ Lab 3 due Friday Oct 28

Last time in Lecture 12

§ Unified physical register file machines remove data values from ROB
  – All values only read and written during execution
  – Only register tags held in ROB
  – Allocate resources (ROB slot, destination physical register, memory reorder queue location) during decode
  – Issue window can be separated from ROB and made smaller than ROB (allocate in decode, free after instruction completes)
  – Free resources on commit
§ Speculative store buffer holds store values before commit to allow load-store forwarding
§ Can execute later loads past earlier stores when addresses known, or predicted no dependence

Superscalar Control Logic Scaling

[Figure: an issue group of issue width W, checked against previously issued instructions over lifetime L]

§ Each issued instruction must somehow check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
§ For in-order machines, L is related to pipeline latencies and check is done during issue (interlocks or scoreboard)
§ For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB), and check is done by broadcasting tags to waiting instructions at writeback (completion)
§ As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L

=> Out-of-order control logic grows faster than W^2 (~W^3)
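The growth claim on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original lecture: the function names are made up for illustration, and the rule that L grows linearly with W (factor `k`) is an assumption chosen only to show where the ~W^3 trend comes from.

```python
def dependency_checks(width: int, lifetime: int) -> int:
    """Checks per cycle: each of W issuing instructions against W*L in-flight ones."""
    return width * (width * lifetime)


def checks_with_growing_window(width: int, k: int = 4) -> int:
    """Same count under the ASSUMPTION that lifetime scales as L = k * W."""
    return dependency_checks(width, k * width)


for w in (1, 2, 4, 8):
    # Fixed L = 8 grows as W^2; L = k*W grows as W^3.
    print(w, dependency_checks(w, 8), checks_with_growing_window(w))
```

With a fixed L, doubling W quadruples the checks. Once L is allowed to grow with W, doubling W multiplies them by eight.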

Out-of-Order Control Complexity: MIPS R10000

[Figure: MIPS R10000 with a region labeled "Control Logic". SGI/MIPS Technologies Inc., 1995]

Sequential ISA Bottleneck

[Figure: sequential source code (a = foo(b); for (i=0, i<) -> superscalar compiler (find independent operations, schedule operations) -> sequential machine code -> superscalar processor (check instruction dependencies, schedule execution)]

VLIW: Very Long Instruction Word

Instruction format: | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |

  – Two integer units, single-cycle latency
  – Two load/store units, three-cycle latency
  – Two floating-point units, four-cycle latency

§ Multiple operations packed into one instruction
§ Each operation slot is for a fixed function
§ Constant operation latencies are specified
§ Architecture requires guarantee of:
  – Parallelism within an instruction => no cross-operation RAW check
  – No data use before data ready => no data interlocks
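Because the hardware performs no cross-operation RAW check, the compiler has to establish that property itself before packing operations into one instruction. The sketch below shows one way such a check could look. The `Op` type and the function name are hypothetical, and operations are assumed to be listed in original program order.

```python
from typing import NamedTuple, Optional, Tuple


class Op(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def has_intra_instruction_raw(ops) -> bool:
    """True if a later op reads a register written by an earlier op in the same instruction."""
    written = set()
    for op in ops:
        if any(src in written for src in op.srcs):
            return True
        if op.dest is not None:
            written.add(op.dest)
    return False


fld = Op("fld f1, 0(x1)", "f1", ("x1",))
add = Op("add x1, 8", "x1", ("x1",))
fadd = Op("fadd f2, f0, f1", "f2", ("f0", "f1"))

print(has_intra_instruction_raw([fld, add]))   # False: add only overwrites x1 after fld read it
print(has_intra_instruction_raw([fld, fadd]))  # True: fadd needs f1 produced by fld
```

Only true dependences are flagged here. A write after a read in the same instruction is treated as legal, on the assumption that every operation reads its sources at issue.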

Early VLIW Machines

§ FPS AP120B (1976)
  – scientific attached array processor
  – first commercial wide instruction machine
  – hand-coded vector math libraries using software pipelining and loop unrolling
§ Multiflow Trace (1987)
  – commercialization of ideas from Fisher's Yale group including "trace scheduling"
  – available in configurations with 7, 14, or 28 operations/instruction
  – 28 operations packed into a 1024-bit instruction word
§ Cydrome Cydra-5 (1987)
  – 7 operations encoded in 256-bit instruction word
  – rotating register file

VLIW Compiler Responsibilities

§ Schedule operations to maximize parallel execution
§ Guarantees intra-instruction parallelism
§ Schedule to avoid data hazards (no interlocks)
  – Typically separates operations with explicit NOPs

Loop Execution

for (i=0; i<N; i++)
    B[i] = A[i] + C;

Compile:

loop: fld f1, 0(x1)
      add x1, 8
      fadd f2, f0, f1
      fsd f2, 0(x2)
      add x2, 8
      bne x1, x3, loop

Schedule (rows are cycles; empty rows are NOP instructions):

Cycle  Int1    Int2  M1   M2  FP+   FPx
1      add x1        fld
2
3
4                             fadd
5
6
7
8      add x2  bne   fsd

How many FP ops/cycle? 1 fadd / 8 cycles = 0.125

Loop Unrolling

for (i=0; i<N; i++)
    B[i] = A[i] + C;

Unroll inner loop to perform 4 iterations at once:

for (i=0; i<N; i+=4)
{
    B[i] = A[i] + C;
    B[i+1] = A[i+1] + C;
    B[i+2] = A[i+2] + C;
    B[i+3] = A[i+3] + C;
}

Need to handle values of N that are not multiples of unrolling factor with final cleanup loop
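The cleanup loop mentioned on this slide is easy to get wrong, so here is a minimal Python sketch of the same transformation. The function name is illustrative. The unrolled body runs only over the largest multiple of 4 that fits, and a scalar loop finishes the remaining 0 to 3 elements.

```python
def add_constant_unrolled(a, c):
    """Compute b[i] = a[i] + c with a 4-way unrolled main loop plus a cleanup loop."""
    n = len(a)
    b = [0.0] * n
    main_end = n - (n % 4)          # largest multiple of 4 not exceeding n

    i = 0
    while i < main_end:             # unrolled body: 4 iterations per pass
        b[i] = a[i] + c
        b[i + 1] = a[i + 1] + c
        b[i + 2] = a[i + 2] + c
        b[i + 3] = a[i + 3] + c
        i += 4

    for j in range(main_end, n):    # cleanup loop: leftover 0..3 elements
        b[j] = a[j] + c
    return b


print(add_constant_unrolled([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 0.5))
```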

Scheduling Loop Unrolled Code

Unroll 4 ways:

loop: fld f1, 0(x1)
      fld f2, 8(x1)
      fld f3, 16(x1)
      fld f4, 24(x1)
      add x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd f5, 0(x2)
      fsd f6, 8(x2)
      fsd f7, 16(x2)
      fsd f8, 24(x2)
      add x2, 32
      bne x1, x3, loop

Schedule:

Cycle  Int1    Int2  M1      M2  FP+      FPx
1                    fld f1
2                    fld f2
3                    fld f3
4      add x1        fld f4      fadd f5
5                                fadd f6
6                                fadd f7
7                                fadd f8
8                    fsd f5
9                    fsd f6
10                   fsd f7
11     add x2  bne   fsd f8

How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36
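The 8-cycle and 11-cycle loop lengths can be reproduced with a small greedy scheduler. This is a rough sketch under stated assumptions, not the compiler algorithm from the lecture. It uses the unit counts and latencies from the VLIW slide (two integer units at 1 cycle, two memory units at 3 cycles, one FP adder at 4 cycles), it assumes A and B do not alias, it ignores WAW hazards, and it forces the branch into the last instruction. The tuple encoding of operations and all names are made up for illustration.

```python
from collections import defaultdict

UNITS = {"int": 2, "mem": 2, "fadd": 1}      # Int1/Int2, M1/M2, FP+
LATENCY = {"int": 1, "mem": 3, "fadd": 4}    # cycles until the result is usable


def schedule(ops):
    """ops: (unit, dest, srcs) tuples in program order, branch last.
    Returns the 1-based issue cycle chosen for each op."""
    ready = {}                      # register -> first cycle its new value can be read
    last_read = defaultdict(int)    # register -> latest cycle in which it is read
    used = defaultdict(int)         # (unit, cycle) -> slots already taken
    cycles = []
    for index, (unit, dest, srcs) in enumerate(ops):
        cycle = max([1] + [ready.get(s, 1) for s in srcs])    # RAW: wait for sources
        if dest is not None:
            cycle = max(cycle, last_read[dest])               # WAR: not before earlier readers
        if index == len(ops) - 1 and cycles:
            cycle = max(cycle, max(cycles))                   # branch goes in the final instruction
        while used[(unit, cycle)] >= UNITS[unit]:             # find a free slot of this unit type
            cycle += 1
        used[(unit, cycle)] += 1
        for s in srcs:
            last_read[s] = max(last_read[s], cycle)
        if dest is not None:
            ready[dest] = cycle + LATENCY[unit]
        cycles.append(cycle)
    return cycles


simple = [
    ("mem", "f1", ["x1"]),            # fld f1, 0(x1)
    ("int", "x1", ["x1"]),            # add x1, 8
    ("fadd", "f2", ["f0", "f1"]),     # fadd f2, f0, f1
    ("mem", None, ["f2", "x2"]),      # fsd f2, 0(x2)
    ("int", "x2", ["x2"]),            # add x2, 8
    ("int", None, ["x1", "x3"]),      # bne x1, x3, loop
]

unrolled = (
    [("mem", f"f{i}", ["x1"]) for i in range(1, 5)]                  # 4 x fld
    + [("int", "x1", ["x1"])]                                        # add x1, 32
    + [("fadd", f"f{i + 4}", ["f0", f"f{i}"]) for i in range(1, 5)]  # 4 x fadd
    + [("mem", None, [f"f{i}", "x2"]) for i in range(5, 9)]          # 4 x fsd
    + [("int", "x2", ["x2"]), ("int", None, ["x1", "x3"])]           # add x2, 32; bne
)

print(max(schedule(simple)), max(schedule(unrolled)))
```

Because this sketch will pair loads in M1 and M2, the placement of individual operations may differ from the slide's schedule. The loop length is still bounded by the single FP adder and the fld -> fadd -> fsd latency chain.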

Software Pipelining

Unroll 4 ways first:

loop: fld f1, 0(x1)
      fld f2, 8(x1)
      fld f3, 16(x1)
      fld f4, 24(x1)
      add x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd f5, 0(x2)
      fsd f6, 8(x2)
      fsd f7, 16(x2)
      add x2, 32
      fsd f8, -8(x2)
      bne x1, x3, loop

[Schedule table, columns Int1, Int2, M1, M2, FP+, FPx: a prolog, a steady-state loop (iterate), and an epilog. In the steady state the fld, fadd, and fsd groups of different iterations overlap.]

How many FLOPS/cycle? 4 fadds / 4 cycles = 1
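The prolog / steady-state / epilog structure can be seen in a scalar model of the same loop. This is a simplified sketch with a pipeline depth of three stages (load, add, store) and no unrolling, so it shows the shape of the transformation rather than the slide's exact schedule. The variable names `loaded` and `summed` are illustrative.

```python
def add_constant_pipelined(a, c):
    """Compute b[i] = a[i] + c with load, add, and store taken from different iterations."""
    n = len(a)
    b = [0.0] * n
    if n < 2:                       # too short to fill the pipeline
        for i in range(n):
            b[i] = a[i] + c
        return b

    # Prolog: start iterations 0 and 1 without storing anything yet.
    loaded = a[0]                   # load,  iteration 0
    summed = loaded + c             # add,   iteration 0
    loaded = a[1]                   # load,  iteration 1

    # Kernel: each pass stores iteration i, adds i+1, and loads i+2.
    for i in range(n - 2):
        b[i] = summed
        summed = loaded + c
        loaded = a[i + 2]

    # Epilog: drain the two iterations still in flight.
    b[n - 2] = summed
    summed = loaded + c
    b[n - 1] = summed
    return b


print(add_constant_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 0.5))
```

On a VLIW machine the three statements of the kernel would occupy different slots of the same instruction, which is what lets the steady state sustain one fadd per cycle.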

Software Pipelining vs. Loop Unrolling

[Figure: two plots of performance vs. time. Loop unrolled: startup overhead and wind-down overhead appear in each loop iteration. Software pipelined: the overheads appear only at the start and end of the loop.]

Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

What if there are no loops?

[Figure: control-flow graph; label: Basic block]

§ Branches limit basic block size in control-flow intensive irregular code
§ Difficult to find ILP in individual basic blocks

Trace Scheduling [Fisher, Ellis]

§ Pick string of basic blocks, a trace, that represents most frequent branch path
§ Use profiling feedback or compiler heuristics to find common branch paths
§ Schedule whole "trace" at once
§ Add fixup code to cope with branches jumping out of trace
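The first step of trace scheduling, picking the trace, can be sketched as a greedy walk over profiled edge counts. This is a deliberately simplified illustration and not Fisher's algorithm as published. It only grows the trace forward from a given entry block, and the `edge_counts` profile format and block names are hypothetical.

```python
def pick_trace(entry, edge_counts):
    """Follow the most frequently taken successor from `entry` until the path ends or repeats.

    edge_counts: {block: {successor: times_taken}}, e.g. from profiling feedback.
    """
    trace = []
    block = entry
    while block is not None and block not in trace:
        trace.append(block)
        successors = edge_counts.get(block, {})
        block = max(successors, key=successors.get) if successors else None
    return trace


profile = {
    "b0": {"b1": 90, "b2": 10},   # branch in b0 usually falls through to b1
    "b1": {"b3": 90},
    "b2": {"b3": 10},
    "b3": {},
}
print(pick_trace("b0", profile))  # ['b0', 'b1', 'b3']
```

The blocks left off the trace (here b2) are where the fixup code from the last bullet would be needed.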

Problems with "Classic" VLIW

§ Object-code compatibility
  – have to recompile all code for every machine, even for two machines in same generation
§ Object code size
  – instruction padding wastes instruction memory/cache
  – loop unrolling/software pipelining replicates code
§ Scheduling variable latency memory operations
  – caches and/or memory bank conflicts impose statically unpredictable variability
§ Knowing branch probabilities
  – optimal schedule varies with branch path
  – profiling requires a significant extra step in build process

VLIW Instruction Encoding

[Figure: instruction stream divided into Group 1, Group 2, Group 3]

§ Schemes to reduce effect of unused fields
  – Compressed format in memory, expand on I-cache refill
    • used in Multiflow Trace
    • introduces instruction addressing challenge
  – Mark parallel groups
    • used in TMS320C6x DSPs, Intel IA-64
  – Provide a single-op VLIW instruction
    • Cydra-5 UniOp instructions

Intel Itanium, EPIC IA-64

§ EPIC is the style of architecture (cf. CISC, RISC)
  – Explicitly Parallel Instruction Computing (really just VLIW)
§ IA-64 is Intel's chosen ISA (cf. x86, MIPS)
  – IA-64 = Intel Architecture 64-bit
  – An object-code-compatible VLIW
§ Merced was first Itanium implementation (cf. 8086)
  – First customer shipment expected 1997 (actually 2001)
  – McKinley, second implementation shipped in 2002
  – Recent version, Poulson, eight cores, 32nm, announced 2011

Eight Core Itanium "Poulson" [Intel 2011]

§ 8 cores
§ 1-cycle 16KB L1 I&D caches
§ 9-cycle 512KB L2 I-cache
§ 8-cycle 256KB L2 D-cache
§ 32 MB shared L3 cache
§ 544 mm^2 in 32nm CMOS
§ Over 3 billion transistors
§ Cores are 2-way multithreaded
§ 6 instruction/cycle fetch
  – Two 128-bit bundles
§ Up to 12 insts/cycle execute

IA-64 Instruction Format

128-bit instruction bundle: | Instruction 2 | Instruction 1 | Instruction 0 | Template |

§ Template bits describe grouping of these instructions with others in adjacent bundles
§ Each group contains instructions that can execute in parallel

[Figure: groups i-1, i, i+1, i+2 drawn across bundles j-1, j, j+1, j+2]
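The bundle layout can be made concrete with a pack/unpack pair. The field widths used below, a 5-bit template in the low bits followed by three 41-bit instruction slots, are not stated on the slide. They come from the published IA-64 bundle format and are consistent with the 128-bit total (5 + 3 x 41 = 128). The function names are illustrative, and the template value is treated as an opaque number.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS) or len(slots) != 3:
        raise ValueError("need a 5-bit template and exactly three slots")
    word = template
    for index, inst in enumerate(slots):
        if not 0 <= inst <= SLOT_MASK:
            raise ValueError("instruction does not fit in 41 bits")
        word |= inst << (TEMPLATE_BITS + index * SLOT_BITS)   # slot 0 sits just above the template
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: returns (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots


bundle = pack_bundle(0x10, [0x123, 0x456, 0x789])
print(hex(bundle), unpack_bundle(bundle))
```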

IA-64 Registers

§ 128 General Purpose 64-bit Integer Registers
§ 128 General Purpose 64/80-bit Floating Point Registers
§ 64 1-bit Predicate Registers
§ GPRs "rotate" to reduce code size for software pipelined loops
  – Rotation is a simple form of register renaming allowing one instruction to address different physical registers on each iteration
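Rotation as a form of renaming can be modeled with a base offset that is added to every register name. The class below is a simplified sketch and not the exact IA-64 mechanism. The class and method names are made up, the whole file rotates rather than a subset, and the base is decremented once per iteration so that a value written under name r is visible as r+1 in the next iteration.

```python
class RotatingRegisterFile:
    """Toy rotating register file: names are remapped by a base that changes each iteration."""

    def __init__(self, size):
        self.physical = [0] * size
        self.base = 0

    def _index(self, name):
        return (name + self.base) % len(self.physical)

    def read(self, name):
        return self.physical[self._index(name)]

    def write(self, name, value):
        self.physical[self._index(name)] = value

    def rotate(self):
        """Called once per loop iteration (assumed to be done by the loop branch)."""
        self.base = (self.base - 1) % len(self.physical)


regs = RotatingRegisterFile(4)
regs.write(0, "iter0")      # iteration 0 writes register 0
regs.rotate()
regs.write(0, "iter1")      # the same instruction now hits a different physical register
regs.rotate()
print(regs.read(1), regs.read(2))   # iter1 iter0
```

This is why a software-pipelined loop needs no unrolled copies of its body just to keep values from different iterations apart.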

IA-64 Predicated Execution

Problem: Mispredicted branches limit ILP
  – 20-30% of performance goes to branch mispredictions [Intel 98]

Solution: Eliminate hard to predict branches with predicated execution
  – Almost all IA-64 instructions can be executed conditionally under predicate
  – Instruction becomes NOP if predicate register false

Four basic blocks:

b0:  Inst 1                 (if)
     Inst 2
     br a==b, b2
b1:  Inst 3                 (else)
     Inst 4
     br b3
b2:  Inst 5                 (then)
     Inst 6
b3:  Inst 7
     Inst 8

After predication, a single basic block:

     Inst 1
     Inst 2
     p1,p2 <- cmp(a==b)
     (p1) Inst 3 || (p2) Inst 5
     (p1) Inst 4 || (p2) Inst 6
     Inst 7
     Inst 8

§ Simplifies scheduling
  – Turn "control flow" into "data flow"
§ Less code is needed
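The if-conversion on this slide can be modeled in a few lines: every operation from both sides of the branch is issued, and a predicate decides whether its result is committed. The sketch below is illustrative only. The arithmetic in each arm is invented, and the helper `under` stands in for a predicated instruction that becomes a NOP when its predicate is false.

```python
def branchy(a, b, x):
    """Original control flow: four basic blocks."""
    x = x + 1
    if a == b:
        x = x * 2       # then block
    else:
        x = x - 3       # else block
    return x + 10


def under(predicate, result, old_value):
    """One predicated instruction: commit `result` only if the predicate is true."""
    return result if predicate else old_value


def predicated(a, b, x):
    """Same computation as a single straight-line block."""
    x = x + 1
    p_then = (a == b)           # the compare sets two complementary predicates
    p_else = not p_then
    x = under(p_then, x * 2, x)
    x = under(p_else, x - 3, x)
    return x + 10


print(branchy(1, 1, 5), predicated(1, 1, 5))   # 22 22
print(branchy(1, 2, 5), predicated(1, 2, 5))   # 13 13
```

Since the two predicated operations never both commit, a scheduler is free to place them in the same instruction, as the `||` pairs on the slide suggest.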

IA-64 Data Speculation

Problem: Possible memory hazards limit code scheduling

     Inst 1
     Inst 2
     Store
     Load r1
     Use r1
     Inst 3

Can't move load above store because store might be to same address

Solution: Hardware to check pointer hazards

     Load.a r1      Data speculative load adds address to address check table
     Inst 1
     Inst 2
     Store          Store invalidates any matching loads in address check table
     Load.c         Check if load invalid (or missing), jump to fixup code if so
     Use r1
     Inst 3

Requires associative hardware in address check table
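The three annotations on this slide (Load.a inserts, Store invalidates, Load.c checks) map directly onto a small table. The following is a toy software model under simplifying assumptions: addresses are matched exactly, the table never overflows, and memory is a Python dict. The class and method names are hypothetical and not taken from any Intel documentation.

```python
class AddressCheckTable:
    """Toy model of the table consulted by data-speculative loads."""

    def __init__(self):
        self.entries = {}                       # destination register -> address it was loaded from

    def advanced_load(self, reg, address, memory):
        """Load.a: perform the load early and remember its address."""
        self.entries[reg] = address
        return memory[address]

    def store(self, address, value, memory):
        """Store: write memory and invalidate every entry with a matching address."""
        memory[address] = value
        for reg in [r for r, addr in self.entries.items() if addr == address]:
            del self.entries[reg]

    def check(self, reg):
        """Load.c: True if the early load is still valid, False if fixup code must run."""
        return reg in self.entries


memory = {0x10: 1, 0x20: 2}
table = AddressCheckTable()

r1 = table.advanced_load("r1", 0x10, memory)    # load hoisted above the store
table.store(0x10, 7, memory)                    # the store turns out to alias the load
if not table.check("r1"):
    r1 = memory[0x10]                           # fixup: redo the load
print(r1)                                       # 7
```

In hardware the scan in `store` has to happen against all entries at once, which is the associative lookup the slide refers to.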

Limits of Static Scheduling

§ Unpredictable branches
§ Variable memory latency (unpredictable cache misses)
§ Code size explosion
§ Compiler complexity

Despite several attempts, VLIW has failed in general-purpose computing arena (so far).
  – More complex VLIW architectures close to in-order superscalar in complexity, no real advantage on large complex apps.

Successful in embedded DSP market
  – Simpler VLIWs with more constrained environment, friendlier code.

Acknowledgements

§ These slides contain material developed and copyright by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
§ MIT material derived from course 6.823
§ UCB material derived from course CS252