
Page 1: Title

ECE 587 Advanced Computer Architecture I
Chapter 4: Code Sequences
Herbert G. Mayer, PSU
Status 7/6/2015

Page 2: Syllabus

Moore's Law

Key Architecture Messages

Code Sequences for 3 Different Architectures

Dependencies, AKA Dependences

Score Board

References

Page 3: Processor Performance Growth

Moore's Law, from Webopedia 8/27/2004:

"The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.

In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades."

Others coin a more general law, stating a bit lamely that "the circuit density increases predictably over time."

Page 4: Processor Performance Growth

So far, until 2015, Moore's Law has held true since ~1968.

Some Intel fellows believe that an end to Moore's Law will be reached ~2018, due to physical limitations in the process of manufacturing transistors from semiconductor material.

Such phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:

Cars would travel at 2,400,000 Mph, and get 600,000 MpG

Air travel LA to NYC would be at Mach 36,000, taking 0.5 seconds

Page 5: Key Architecture Messages

Page 6: Message 1: Memory is Slow

The inner core of the processor, the CPU or the µP, is getting faster at a steady rate.

Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on progressively slower memories, relatively speaking.

On MP servers it is possible that a processor has to wait more than 100 cycles before a memory access completes: one single memory access. On a multi-processor, the bus protocol is more complex due to snooping, backing-off, and arbitration, thus the number of cycles to complete a memory access can grow that high.

IO simply compounds the problem of slow memory access.

Page 7: Message 1: Memory is Slow

Discarding conventional memory altogether and relying only on cache-like memories is NOT an option for 64-bit architectures, due to the price/size/cost/power if you pursue full memory population with 2^64 bytes.

Another way of seeing this: using solely reasonably-priced cache memories (say, more than 10 times the cost of regular memory) is not feasible: the resulting physical address space would be too small, or the price too high.

Significant intellectual effort in computer architecture focuses on reducing the performance impact of fast processors accessing slow, virtualized memories.

All else, except IO, seems easy compared to this fundamental problem! IO is even slower by further orders of magnitude.
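To see Message 1 in numbers, here is a small pointer-chasing sketch in C (an illustration, not from the slides; the array sizes and iteration counts are assumptions). Each load depends on the previous one, so memory latency cannot be hidden: the large array defeats the caches, while the small one fits.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Chase 'steps' dependent pointers through a single cycle over n slots.
       Every load needs the previous result, so raw access latency dominates. */
    static double chase(size_t n, size_t steps) {
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {   /* Sattolo's shuffle: one cycle */
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t p = 0;
        clock_t t0 = clock();
        for (size_t s = 0; s < steps; s++) p = next[p];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        free(next);
        return p == (size_t)-1 ? 0.0 : secs;   /* keep 'p' live */
    }

    int main(void) {
        printf("cache-resident: %.3f s\n", chase(1u << 12, 50000000));  /* 32 KB  */
        printf("DRAM-bound:     %.3f s\n", chase(1u << 24, 50000000));  /* 128 MB */
        return 0;
    }

On typical hardware the DRAM-bound chase runs many times slower; that ratio is the processor-memory gap plotted on the next slide.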

Page 8: Message 1: Memory is Slow

[Figure: CPU vs. DRAM performance, 1980-2002, log scale 1 to 1000. µProc performance grows ~60%/yr. ("Moore's Law"), DRAM ~7%/yr.; the processor-memory performance gap grows ~50%/yr. Source: David Patterson, UC Berkeley]
Page 9: Message 2: Events Tend to Cluster

A strange thing happens during program execution: seemingly unrelated events tend to cluster.

Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time such clustering happens.

Intuitively, one memory access seems independent of another, but both happen to fall onto the same cache line or the same page (or working set of pages).

We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches, and to increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial from temporal locality.

Page 10: Message 2: Events Tend to Cluster

Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries.

An incoming search key (say, a C++ program identifier) is mapped to an index, but the next, completely unrelated key happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential, linear search.

The programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!
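As an illustration of clustering in hashing (a sketch, not from the slides; the identifiers, the weak hash, and the table size are assumptions), the code below maps program identifiers into a 16-entry table using a sum-of-characters hash. Anagrams hash identically under such a hash, so one bucket collects six keys and lookups there degenerate toward a linear search.

    #include <stdio.h>

    #define TABLE_SIZE 16

    /* Weak hash: sum of character codes mod table size. It ignores
       character order, so permuted identifiers cluster onto one bucket. */
    static unsigned weak_hash(const char *s) {
        unsigned h = 0;
        while (*s) h += (unsigned char)*s++;
        return h % TABLE_SIZE;
    }

    int main(void) {
        const char *ids[] = { "abc", "acb", "bac", "bca", "cab", "cba",
                              "name1", "name2" };
        unsigned count[TABLE_SIZE] = { 0 };
        for (size_t i = 0; i < sizeof ids / sizeof ids[0]; i++) {
            unsigned h = weak_hash(ids[i]);
            printf("%-5s -> bucket %u\n", ids[i], h);
            count[h]++;
        }
        /* A bucket holding k keys costs up to k comparisons per lookup:
           a linear search hiding inside the hash table. */
        for (unsigned b = 0; b < TABLE_SIZE; b++)
            if (count[b] > 1) printf("bucket %u holds %u keys\n", b, count[b]);
        return 0;
    }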

Page 11: Message 2: Events Tend to Cluster

Clustering happens in all diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by keeping a copy of frequently used data in a faster memory unit, it turns out that a small cache suffices to speed up execution.

This is due to Data Locality (spatial and temporal): data that have been accessed recently will be accessed again in the near future, or at least data that live close by will be accessed in the near future.

Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon.
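A concrete sketch of spatial locality (illustrative, not from the slides; the matrix size is an assumption): both loops below sum the same 2048 x 2048 matrix, but the row-major walk touches consecutive addresses and reuses each fetched cache line, while the column-major walk strides a whole row ahead on every access.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 2048

    int main(void) {
        /* One contiguous N x N matrix, indexed row-major as a[i*N + j]. */
        double *a = calloc((size_t)N * N, sizeof *a);
        double sum = 0.0;
        clock_t t0, t1;

        t0 = clock();
        for (int i = 0; i < N; i++)          /* row-major walk: consecutive */
            for (int j = 0; j < N; j++)      /* addresses share cache lines */
                sum += a[(size_t)i * N + j];
        t1 = clock();
        printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (int j = 0; j < N; j++)          /* column-major walk: each access */
            for (int i = 0; i < N; i++)      /* strides N doubles, defeating   */
                sum += a[(size_t)i * N + j]; /* spatial locality               */
        t1 = clock();
        printf("column-major: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        printf("checksum: %g\n", sum);       /* keep the loops live */
        free(a);
        return 0;
    }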

Page 12: Message 3: Heat is Bad

Clocking a processor fast (e.g. > 3-5 GHz) can increase performance and thus generally "is good".

Other performance parameters, such as memory access speed, peripheral access, etc., do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable.

This comes at the cost of higher current, thus more heat generated in the identical physical geometry (the so-called real estate) of the silicon processor, or also the chipset.

But the silicon part acts like a heat conductor, conducting better as it gets warmer (negative temperature coefficient resistor, or NTC). Since the power supply is a constant-current source, a lower resistance causes lower voltage, shown as VDroop in the figure below.

Page 13: Message 3: Heat is Bad

[Figure: VDroop, not reproduced in this transcript]

Page 14: Message 3: Heat is Bad

This in turn means voltage must be increased artificially to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the part.

Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep states of the silicon part (processor as well as chipset) and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time.

It is good that, to date, silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Else CPUs would become larger, more expensive, and above all: hotter.

Page 15: Message 4: Resource Replication

Architects cannot increase clock speed beyond physical limitations.

One cannot decrease the die size beyond evolving technology.

Yet speed improvements are desired, and must be achieved.

This conflict can partly be overcome with replicated resources! But careful!

Why careful? Resources could be used for other, better purposes! An optimization problem.

Page 16: Message 4: Resource Replication

The key obstacle to parallel execution is data dependence in the SW under execution. A datum cannot be used before it has been computed.

Compiler optimization technology calls this use-def dependence (short for use-before-definition), AKA true dependence, AKA data dependence.

The goal is to search for program portions that are independent of one another. This can be at multiple levels of focus, enumerated on the next page; a small source-level example follows below.
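A small source-level example of this (illustrative C, not from the slides): the first loop carries a use-def dependence from one iteration to the next, so it must run sequentially; the second loop's iterations are mutually independent and could run in parallel on replicated resources.

    #include <stdio.h>

    #define N 8

    int main(void) {
        int a[N] = { 1, 2, 3, 4, 5, 6, 7, 8 }, b[N], c[N];

        /* True (use-def) dependence across iterations: b[i] uses b[i-1],
           which the previous iteration defined -> must run sequentially. */
        b[0] = a[0];
        for (int i = 1; i < N; i++)
            b[i] = b[i - 1] + a[i];          /* running sum */

        /* No dependence between iterations: each c[i] uses only a[i],
           so all N computations could execute simultaneously. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] * a[i];

        for (int i = 0; i < N; i++)
            printf("b[%d]=%d c[%d]=%d\n", i, b[i], i, c[i]);
        return 0;
    }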

Page 17: Message 4: Resource Replication

At the very low level of registers, at the machine level: done by HW; see also score board.

At the low level of individual machine instructions: done by HW; see also superscalar architecture.

At the medium level of subexpressions in a program: done by the compiler; see CSE.

At the higher level of several statements written in sequence in a high-level language program: done by the optimizing compiler or by the programmer.

Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results: done by the user running concurrent programs.

Page 18: Message 4: Resource Replication

Whenever program portions are independent of one another, they can be computed at the same time: in parallel; but will they?

Architects provide resources for this parallelism.

Compilers need to uncover opportunities for parallelism in programs.

If two actions are independent of one another, they can be computed simultaneously.

Provided that HW resources exist, that the absence of dependence has been proven, and that independent execution paths are scheduled on these replicated HW resources.

Page 19: Code Samples for Three Different Architectures

Page 20: The 3 Different Architectures

1. Single Accumulator Architecture (SAA)
   Has one implicit register for all/any operations: the accumulator.
   Arithmetic operations frequently require intermediate temps!
   Code relies heavily on load-store to/from temps.

2. Three-Address GPR Architecture
   Allows complex operations with multiple operands all in one instruction.
   Hence complex opcode bits, many bits per instruction.

3. Stack Machine Architecture
   Operands are implied on the stack, except for load/store.
   Hence all operations are simple, few bits, but all are memory accesses.

Page 21: Code 1 for 3 Different Architectures

Example 1: Object Code Sequence Without Optimization

Strict left-to-right translation, no smarts in mapping.

Consider non-commutative subtraction and division operators.

We'll use no common subexpression elimination (CSE), and no register reuse.

Conventional operator precedence.

For Single Accumulator SAA, Three-Address GPR, and Stack Architectures.

Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c

Page 22: Code 1 for 3 Different Architectures

No  Single-Accumulator   Three-Address GPR       Stack Machine
                         (dest, op1 op op2)
 1  ld a                 add  r1, a, #3          push a
 2  add #3               mult r2, r1, b          pushlit #3
 3  mult b               add  r3, a, #3          add
 4  st temp1             div  r4, r3, c          push b
 5  ld a                 sub  d, r2, r4          mult
 6  add #3                                       push a
 7  div c                                        pushlit #3
 8  st temp2                                     add
 9  ld temp1                                     push c
10  sub temp2                                    div
11  st d                                         sub
12                                               pop d
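To make the Stack Machine column concrete, here is a minimal interpreter sketch in C (an illustration, not from the slides; the operand values and helper names are assumptions). It executes the twelve-instruction stack sequence from the table above.

    #include <stdio.h>

    /* Tiny stack machine: operands live on an explicit stack; only
       push/pushlit/pop touch named variables. */
    static double stack[64];
    static int sp = 0;                       /* next free slot */

    static void push(double v)   { stack[sp++] = v; }
    static double pop_(void)     { return stack[--sp]; }
    static void add(void)  { double r = pop_(), l = pop_(); push(l + r); }
    static void mult(void) { double r = pop_(), l = pop_(); push(l * r); }
    static void sub(void)  { double r = pop_(), l = pop_(); push(l - r); }
    static void div_(void) { double r = pop_(), l = pop_(); push(l / r); }

    int main(void) {
        double a = 5, b = 2, c = 4, d;       /* assumed sample values */

        /* d = ( a + 3 ) * b - ( a + 3 ) / c, exactly as in the table */
        push(a);            /*  1: push a      */
        push(3);            /*  2: pushlit #3  */
        add();              /*  3: add         */
        push(b);            /*  4: push b      */
        mult();             /*  5: mult        */
        push(a);            /*  6: push a      */
        push(3);            /*  7: pushlit #3  */
        add();              /*  8: add         */
        push(c);            /*  9: push c      */
        div_();             /* 10: div         */
        sub();              /* 11: sub         */
        d = pop_();         /* 12: pop d       */

        printf("d = %g (expected %g)\n", d, (a + 3) * b - (a + 3) / c);
        return 0;
    }

The trailing underscores in pop_ and div_ merely avoid clashes with C library names.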

Page 23: Code 1 for 3 Different Architectures

Three-address code looks shortest, w.r.t. the number of instructions.

Maybe an optical illusion; one must also consider the number of bits per instruction.

Must consider the number of I-fetches, operand fetches, and the total number of stores.

Numerous memory accesses on the SAA (Single Accumulator Architecture), due to temporary values held in memory.

We find the largest number of memory accesses on the SA (Stack Architecture): there are no registers, just memory accesses.

The Three-Address architecture is immune to ordering constraints, since operands may be placed in registers in either order.

No need for reverse-operation opcodes on the Three-Address architecture.

Page 24: Code 2 for Different Architectures

This time we eliminate the common subexpression (CSE).

The compiler handles left-to-right order for non-commutative operators on the SAA.

Better: d ← ( a + 3 ) * b - ( a + 3 ) / c
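At the source level, this CSE corresponds to the following sketch (illustrative C, not from the slides): the shared subexpression a + 3 is computed once into a temporary, which plays the role of temp1 on the SAA, r1 on the GPR machine, and the dup'ed slot on the Stack Machine.

    #include <stdio.h>

    int main(void) {
        double a = 5, b = 2, c = 4;

        /* Before CSE: (a + 3) is evaluated twice. */
        double d0 = (a + 3) * b - (a + 3) / c;

        /* After CSE: the common subexpression is computed once;
           't' stands in for temp1 / r1 / the duplicated stack slot. */
        double t  = a + 3;
        double d1 = t * b - t / c;

        printf("%g %g\n", d0, d1);           /* identical results */
        return 0;
    }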

Page 25: Code 2 for Different Architectures

No  Single-Accumulator   Three-Address GPR       Stack Machine
                         (dest, op1 op op2)
 1  ld a                 add  r1, a, #3          push a
 2  add #3               mult r2, r1, b          pushlit #3
 3  st temp1             div  r1, r1, c          add
 4  div c                sub  d, r2, r1          dup
 5  st temp2                                     push b
 6  ld temp1                                     mult
 7  mult b                                       xch
 8  sub temp2                                    push c
 9  st d                                         div
10                                               sub
11                                               pop d

Page 26: Code 2 for Different Architectures

The optimized Single Accumulator Architecture (SAA) still needs temporary storage; it uses temp1 for the common subexpression, and has no other register for temps!

The SAA could use a negate instruction or a reverse subtract.

Register use is optimized for the Three-Address architecture.

The common subexpression is optimized on the Stack Machine by duplicating (dup) and exchanging (xch).

20% reduction for Three-Address, 18% for SAA, only 8% for the Stack Machine.

Page 27: Code 3 for Different Architectures

Analyze 2 similar expressions, but with increasing operator precedence left-to-right; in the 2nd case precedences are overridden by ( ).

One operator sequence associates right-to-left, due to arithmetic precedence.

The compiler uses commutativity.

The other associates left-to-right, due to explicit parentheses ( ).

Use a simple-minded code generation model: no cache, no optimization.

Will there be advantages/disadvantages caused by the architecture?

Expression 1 is: e ← a + b * c ^ d

Page 28: Code 3 for Different Architectures

Expression 1 is: e ← a + b * c ^ d

No  Single-Accumulator   Three-Address GPR       Stack Machine
                         (dest, op1 op op2)      (implied operands)
 1  ld c                 expo r1, c, d           push a
 2  expo d               mult r1, b, r1          push b
 3  mult b               add  e, a, r1           push c
 4  add a                                        push d
 5  st e                                         expo
 6                                               mult
 7                                               add
 8                                               pop e

Page 29: Code 3 for Different Architectures

Expression 2 is: f ← ( ( g + h ) * i ) ^ j; here the operators associate left-to-right due to parentheses.

No  Single-Accumulator   Three-Address GPR       Stack Machine
                         (dest, op1 op op2)      (implied operands)
 1  ld g                 add  r1, g, h           push g
 2  add h                mult r1, i, r1          push h
 3  mult i               expo f, r1, j           add
 4  expo j                                       push i
 5  st f                                         mult
 6                                               push j
 7                                               expo
 8                                               pop f

Observations on the interaction of precedence and architecture: software eliminates the constraints imposed by precedence by looking ahead. Execution times are identical for the 2 different expressions on the same architecture, unless blurred by a secondary effect; see the cache example below. Conclusion: all architectures handle arithmetic and logic operations well.

Page 30: Code For Stack Architecture

A Stack Machine with no registers is inherently slow, due to: Memory Accesses!!!

Implement a few top-of-stack elements via HW shadow registers: a Cache.

Let us then measure equivalent code sequences with and without consideration for the cache.

The top-of-stack register tos identifies the last valid word on the physical stack.

Two shadow registers may hold 0, 1, or 2 true top words.

The top-of-stack cache counter tcc specifies the number of shadow registers actually used.

Thus tos plus tcc jointly specify the true top of stack.

Page 31: Code For Stack Architecture

[Figure: physical stack in memory topped by 2 shadow (tos) registers; tos points to the last valid word on the physical stack, tcc = 0, 1, or 2 counts the shadow registers in use, and the slots above are free]

Page 32: Code For Stack Architecture

Timings for the push, pushlit, add, and pop operations depend on tcc.

Operations in shadow registers are fastest, typically 1 cycle, which includes the register access and the operation itself.

In our simplistic model, a memory access adds 2 cycles.

For stack changes, define some policy, e.g. keep tcc 50% full.

The table below refines the timings for a stack with shadow registers.

Note: a push x into a cache with free space requires 2 cycles, which are for the memory fetch: the cache adjustment is done at the same time as the memory fetch.

Page 33: Code For Stack Architecture

operation    cycles  tcc before  tcc after  tos change  comment
add          1       tcc = 2     tcc = 1    no change
add          1+2     tcc = 1     tcc = 1    tos--       underflow?
add          1+2+2   tcc = 0     tcc = 1    tos -= 2    underflow?
push x       2       tcc = 0,1   tcc++      no change   tcc update in parallel
push x       2+2     tcc = 2     tcc = 2    tos++       overflow?
pushlit #3   1       tcc = 0,1   tcc++      no change
pushlit #3   1+2     tcc = 2     tcc = 2    tos++       overflow?
pop y        2       tcc = 1,2   tcc--      no change
pop y        2+2     tcc = 0     tcc = 0    tos--       underflow?

Page 34: Code For Stack Architecture

Code emission for: a + b * c ^ ( d + e * f ^ g )

Let + and * be commutative, by language rule.

The architecture here has 2 shadow registers, and the compiler exploits this.

Assume an initially empty 2-word cache.

Page 35: Code For Stack Architecture

#   Left-to-Right   cycles   Exploit Cache          cycles
1   push a          2        push f                 2
2   push b          2        push g                 2
3   push c          4        expo                   1
4   push d          4        push e                 2
5   push e          4        mult                   1
6   push f          4        push d                 2
7   push g          4        add                    1
8   expo            1        push c                 2
9   mult            3        r_expo = swap + expo   1
10  add             3        push b                 2
11  expo            3        mult                   1
12  mult            3        push a                 2
13  add             3        add                    1

Page 36: Code For Stack Architecture

Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance.

Code emission with shadow-register consideration costs 20 cycles.

The true penalty for memory access is worse in practice, based on the quotient of memory access time over register operation time.

Tremendous speed-up is always possible when fixing a system with severe flaws.

The return on investment for 2 registers is twice the original performance.

Such strong speedup is an indicator that the starting architecture was severely flawed!

A Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance.

Note that indexing, looping, indirection, and call/return are not addressed here.
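The 40- and 20-cycle counts can be checked with a small simulation of the Page 33 cost model (a sketch, not from the slides; only tcc and the cycle count are modeled, and swap + expo is assumed to cost the same as any other shadow-register operation).

    #include <stdio.h>

    /* Cost model from the Page 33 table: work in the 2 shadow registers
       costs 1 cycle; each physical-stack (memory) access adds 2 cycles. */
    static int tcc;                            /* shadow registers in use: 0..2 */
    static long cycles;

    static void push_var(void) {               /* push x: 2-cycle memory fetch; */
        cycles += (tcc < 2) ? 2 : 2 + 2;       /* a full cache spills one word  */
        if (tcc < 2) tcc++;
    }
    static void binop(void) {                  /* add/mult/expo; swap + expo is
                                                  charged the same here         */
        cycles += (tcc == 2) ? 1 : (tcc == 1) ? 1 + 2 : 1 + 2 + 2;
        tcc = 1;                               /* result sits in a shadow reg   */
    }
    static void done(const char *label) {
        printf("%-14s %ld cycles\n", label, cycles);
        cycles = 0; tcc = 0;
    }

    int main(void) {
        /* Left-to-right emission of a + b * c ^ ( d + e * f ^ g ) */
        for (int i = 0; i < 7; i++) push_var();  /* push a b c d e f g */
        for (int i = 0; i < 6; i++) binop();     /* expo mult add expo mult add */
        done("left-to-right");                   /* prints 40 cycles */

        /* Cache-exploiting emission: deepest operand first */
        push_var(); push_var(); binop();         /* push f, push g, expo */
        push_var(); binop();                     /* push e, mult         */
        push_var(); binop();                     /* push d, add          */
        push_var(); binop();                     /* push c, swap + expo  */
        push_var(); binop();                     /* push b, mult         */
        push_var(); binop();                     /* push a, add          */
        done("exploit cache");                   /* prints 20 cycles */
        return 0;
    }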

Page 37: Data Dependences (sic)

Register Dependencies

Page 38: Register Dependencies

Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers (or memory locations) being defined and used.

One instruction computes a result into a register (or memory); another instruction needs that result from that same register (or that memory location).

Or, one instruction uses a datum; and after its use, the same item is then recomputed.

Dependences require sequential execution, lest the result be unpredictable.

Page 39: Register Dependencies

True Dependence, AKA Data Dependence (synonymous): Read after Write, RAW

    r3 ← r1 op r2
    r5 ← r3 op r4

Anti-Dependence, not a true dependence; can parallelize under the right conditions: Write after Read, WAR

    r3 ← r1 op r2
    r1 ← r5 op r4

Output Dependence, similar to Anti-Dependence; can do something: Write after Write, WAW, with a use in between

    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7

Page 40: Register Dependencies

Control Dependence:

    // ri, i = 1..4 come in "live"
    if ( condition1 ) {
        r3 = r1 op r2;
    } else {                 // see the jump here?
        r5 = r3 op r4;
    } // end if
    write( r3 );

Page 41: Register Renaming

Only data dependence is a real dependence, hence called true dependence.

Other dependences are artifacts of insufficient resources, generally insufficient registers.

This means: if additional registers were available, then replacing some of these conflicting registers with new ones could make the conflict disappear!

Anti- and Output-Dependences are indeed such false dependences.

Page 42: Register Renaming

Original Code:

L1:  r1 ← r2 op r3
L2:  r4 ← r1 op r5
L3:  r1 ← r3 op r6
L4:  r3 ← r1 op r7

Dependences before:

Lx, Ly for x, y = 1..4: which dependences?

Page 43: Register Renaming

Original Code:              New Code, after adding regs:

L1:  r1 ← r2 op r3          r10 ← r2 op r30    -- r30 instead
L2:  r4 ← r1 op r5          r4  ← r10 op r5    -- r10 instead
L3:  r1 ← r3 op r6          r1  ← r30 op r6
L4:  r3 ← r1 op r7          r3  ← r1 op r7

Dependences before:              Dependences after:

L1, L2 true-Dep with r1          L1, L2 true-Dep with r10
L1, L3 output-Dep with r1        L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3          // ri, i = 1..7 are "live"
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3

Page 44: Register Renaming

With these additional, or renamed, regs, the new code could possibly run in half the time!

First: compute into r10 instead of r1; but you need to have the additional register r10; no time penalty!

Also: compute in the preceding code into r30 instead of r3, if r30 is available; also no time penalty!

Then the following regs are live afterwards: r1, r3, r4, plus the non-modified ones, i.e. r2! r2 came in live, must go out live!

While r10 and r30 are don't-cares afterwards. A sketch of this renaming as a small program follows below.
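As a sketch of renaming in code (illustrative, not the slides' algorithm): the routine below renames in SSA style, giving every definition a fresh register starting at r10 and redirecting later uses to the current name. Unlike the slide's hand renaming, it does not map live-out values back to r1/r3/r4, which a real renamer must do; the fresh-name numbering is an assumption.

    #include <stdio.h>

    /* One three-address instruction: d <- s op t, registers named by number. */
    typedef struct { int d, s, t; } Insn;

    #define MAXREG 64

    /* SSA-style renaming: every definition receives a fresh register, and
       every later use reads the defining instruction's fresh name. This
       removes all anti- (WAR) and output- (WAW) dependences; only true
       (RAW) dependences remain. */
    static void rename_regs(Insn code[], int n) {
        int cur[MAXREG];                 /* current name for each register  */
        int fresh = 10;                  /* fresh names start at r10        */
        for (int r = 0; r < MAXREG; r++) cur[r] = r;
        for (int i = 0; i < n; i++) {
            code[i].s = cur[code[i].s];  /* uses read the current name      */
            code[i].t = cur[code[i].t];
            cur[code[i].d] = fresh;      /* each definition gets a new name */
            code[i].d = fresh++;
        }
    }

    int main(void) {
        /* The slide's example, L1..L4 */
        Insn code[4] = {
            { 1, 2, 3 },                 /* L1: r1 <- r2 op r3 */
            { 4, 1, 5 },                 /* L2: r4 <- r1 op r5 */
            { 1, 3, 6 },                 /* L3: r1 <- r3 op r6 */
            { 3, 1, 7 },                 /* L4: r3 <- r1 op r7 */
        };
        rename_regs(code, 4);
        for (int i = 0; i < 4; i++)
            printf("L%d: r%d <- r%d op r%d\n",
                   i + 1, code[i].d, code[i].s, code[i].t);
        return 0;
    }

After renaming, only the true-dependence chains L1 to L2 and L3 to L4 remain, matching the "Dependences after" column on Page 43.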

Page 45: Score Board

The score-board is an array of HW programmable bits sb[].

It manages other HW resources, specifically registers.

In the single-bit HW array sb[], every bit i in sb[i] is associated with a specific register, the one identified by i, i.e. ri.

The association is by index, i.e. by name: sb[i] belongs to reg ri.

Only if sb[i] = 0 does that register i have valid data.

If sb[i] = 0, then register ri is NOT in the process of being written.

If bit i is set, i.e. if sb[i] = 1, then that register ri is reserved.

Initially all sb[*] are free to use, i.e. set to 0.

Page 46: Score Board

Execution constraints for rd ← rs op rt:

If sb[s] or sb[t] is set → RAW dependence; hence stall the computation: wait until both rs and rt are available.

If sb[d] is set → WAW dependence; hence stall the write: wait until rd has been used. SW can sometimes determine to use another register instead of rd.

Else, if none of the 3 registers is in use, dispatch the instruction immediately. A minimal issue-check sketch follows below.
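A minimal sketch of these constraints in C (illustrative, not the slides' hardware; the function names are assumptions): one scoreboard bit per register, checked before dispatch and set/cleared around the write.

    #include <stdio.h>
    #include <stdbool.h>

    #define NREGS 32

    static bool sb[NREGS];               /* sb[i] = 1: ri reserved (in flight) */

    /* rd <- rs op rt may dispatch only if no needed register is reserved. */
    static bool can_issue(int d, int s, int t) {
        if (sb[s] || sb[t]) return false;   /* RAW: a source is being written */
        if (sb[d])          return false;   /* WAW: dest is still being written */
        return true;
    }

    static void issue(int d)  { sb[d] = true;  }  /* reserve dest at dispatch  */
    static void retire(int d) { sb[d] = false; }  /* result written: clear bit */

    int main(void) {
        /* i1: r1 <- r2 op r3 ; i2: r4 <- r1 op r5 (RAW on r1) */
        printf("i1 can issue: %d\n", can_issue(1, 2, 3));   /* 1 */
        issue(1);
        printf("i2 can issue: %d\n", can_issue(4, 1, 5));   /* 0: stall on r1 */
        retire(1);                                          /* r1 now valid   */
        printf("i2 can issue: %d\n", can_issue(4, 1, 5));   /* 1 */
        return 0;
    }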

Page 47: Score Board

To allow out-of-order (ooo) execution, upon computing the value of rd:

Update rd, and clear sb[d].

For uses (references), HW may use any register i whose sb[i] is 0.

For definitions (assignments), HW may set any register j whose sb[j] is 0.

This is independent of the original order in which the source program was written, i.e. possibly ooo.

Page 48: References

1. The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html

2. Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations

3. Moore's Law: http://en.wikipedia.org/wiki/Moore's_law

4. C. A. R. Hoare's comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf

5. Gibbons, P. B., and Steven Muchnick [1986]. "Efficient Instruction Scheduling for a Pipelined Architecture", ACM SIGPLAN Notices, Proceedings of the '86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp. 11-16

6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/

7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm

8. Words of wisdom: http://www.cs.yale.edu/quotes.html

9. John von Neumann's computer design: A. H. Taub (ed.), "Collected Works of John von Neumann", vol. 5, pp. 34-79, The MacMillan Co., New York 1963