A3 Computer Architecture — University of Oxford (dwm/courses/3co_2000/3co-l6.pdf, 2012-03-24)
Computer Architecture MT 2011
A3 Computer Architecture
Engineering Science
3rd year A3 Lectures
Prof David Murray
[email protected]/∼dwm/Courses/3CO
Michaelmas 2000
6. Stacks, Subroutines, and Memory Hierarchies
3A3 Michaelmas 2000
In this lecture ...
We first continue looking at the support supplied to the high-level programmer at the macro level.
In particular we look at
the stack area of memory
passing parameters to subroutines
We end by looking at ways that memory hierarchies can be built using smaller and faster memories along with larger and slower memories.
Stacks and Subroutines
Looking back at the Bog-Standard Architecture ...
there is a register called SP, the stack pointer.
This register holds the address of the next free location in an area of memory reserved by a program as a temporary storage area. (Some architectures instead hold the address of the last occupied location.)
[Figure: the Bog-Standard Architecture datapath — Memory, MAR, MBR, IR (opcode and address fields), PC, SP, AC, ALU, CU, status register, Inc(PC) and control lines.]
The stack
The stack pointer uses memory as a last-in, first-out (LIFO) buffer. Usually it is placed at the top of memory, as far away from the program as possible. In the figure, the stack currently contains 5 items and grows downwards into free memory. The stack pointer points to the next free location.
[Figure: memory map — Program, fixed-size data, data allocated during execution, free memory, then the stack at the top of memory (locations 507–511 occupied); SP=506 points to the next free location and the stack grows down.]
Push
PUSH (PHA) and PULL (PLA) manipulate the stack:
PHA
MAR ← SP
MBR ← AC
〈MAR〉 ← MBR ; SP ← SP − 1
The accumulator gets pushed onto the stack at the address pointed to by the stack pointer. The stack pointer is then decremented.
[Figure: before the push, SP=506 and AC=327, with 234, 345, 456, 567 and 1234 already on the stack in locations 507–511; after PHA, 327 has been stored at location 506 and SP=505.]
Pull
PLA
SP ← SP + 1
MAR ← SP
MBR ← 〈MAR〉
AC ← MBR
The stack pointer is incremented and the content pointed to is transferred to the accumulator.
[Figure: before the pull, SP=506; after PLA, SP=507 and AC=1234, the value that was at location 507.]
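The PHA and PLA register-transfer sequences above can be sketched as a small simulation. This is an illustrative model only (the `Machine` class and its memory representation are assumptions, not the lecture's actual hardware):

```python
# Minimal sketch of the PHA/PLA register-transfer sequences.
# Memory is modelled as a dict from address to word; SP points
# at the next free location and the stack grows downwards,
# exactly as in the slides.

class Machine:
    def __init__(self, sp):
        self.mem = {}
        self.sp = sp        # stack pointer: next free location
        self.ac = 0         # accumulator

    def pha(self):
        # MAR <- SP ; MBR <- AC ; <MAR> <- MBR ; SP <- SP - 1
        self.mem[self.sp] = self.ac
        self.sp -= 1

    def pla(self):
        # SP <- SP + 1 ; MAR <- SP ; MBR <- <MAR> ; AC <- MBR
        self.sp += 1
        self.ac = self.mem[self.sp]

m = Machine(sp=506)
m.ac = 327
m.pha()                  # 327 stored at 506, SP becomes 505
print(m.sp, m.mem[506])  # 505 327
m.pla()                  # SP back to 506, AC reloaded
print(m.sp, m.ac)        # 506 327
```

Note that a PHA immediately followed by a PLA restores both SP and AC, mirroring the before/after pictures on the slides.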
Using the stack during a subroutine
One of the most useful constructs in a high level language is the subroutine. This allows the programmer to modularize code into small chunks which do a specific task and which may be reused.
When we come to compile into assembler, what happens to the subroutines?

main()
{
  ...
  xcomp = 4;
  ycomp = 2;
  mod = modulus(xcomp, ycomp);
  ...
}

modulus(a, b)
{
  msq = a*a + b*b;
  m = sqrt(msq);
  return(m);
}

sqrt(x)
{
  ... blah ...
}
Macro support for subroutines ...
You now know that instructions in the different routines will end up in different parts of the program.
To call a subroutine, we obviously need to
jump to the subroutine's instructions
transfer the necessary data to the subroutine
arrange for the result to be transferred back to the calling routine.
Is this enough? When you consider transferring back to the calling routine, the answer has to be no!
How do we know where to jump back to? What if the subroutine has messed up the registers?
There is an obvious need to
store the status quo before jumping
restore it after
Calling by value or by reference
There are two ways of using formal parameters to a subroutine.
1 By value
Passing by value means that a copy of the value of the parameter is transferred. This means that even if the subroutine changes that value, when the subroutine returns, the value in the calling routine is unchanged.
2 By reference
In this case the address of the parameter is passed, so that the subroutine works with the original data. If it changes the value, that change is seen by the calling routine on return from the subroutine.
This discussion continues assuming calling by value.
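The two conventions can be mimicked explicitly in a short sketch. This is illustrative only: Python passes object references, so here "by value" is modelled with a local copy and "by reference" with a one-element list standing in for the caller's storage — not how any particular compiler implements it:

```python
# Sketch of the two parameter-passing conventions.

# "By value": the subroutine works on a copy, so the caller's
# variable is unchanged on return.
def double_by_value(x):
    x = x * 2          # rebinds the local copy only
    return x

# "By reference": the subroutine receives the caller's storage
# (modelled here as a one-element list) and mutates it in place.
def double_by_reference(cell):
    cell[0] = cell[0] * 2

a = 21
double_by_value(a)
print(a)               # 21 -- caller's value unchanged

b = [21]
double_by_reference(b)
print(b[0])            # 42 -- caller sees the change
```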
Example with no parameters

      ...          \\ Calling routine
      JSR MYSUB    \\ Jump to subroutine
      ADD 103
      ...          \\ Rest of calling routine

MYSUB LDA 592      \\ Subroutine starts
      ...
      RTS          \\ Return from subroutine

The subroutine starts at a labelled address, so JSR is very like JMP. The difference: in its execute phase, JSR pushes the current value of the PC onto the stack, and then loads the operand into the PC. The RTS command ends the subroutine by pulling the stored program counter from the stack. Because the stack is a LIFO buffer, subroutines can be "nested" to any level until memory runs out.
JSR and RTS
The RTL for the JSR and RTS opcodes is
JSR
〈SP〉 ← PC
PC ← IR(address) ; SP ← SP − 1
RTS
SP ← SP + 1
PC ← 〈SP〉
Check Notes!!
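The JSR/RTS transfers can be sketched in the same style as the push/pull simulation, showing how the LIFO stack lets subroutines nest. Again this is a toy model (the `CPU` class and the addresses 100/200/300 are illustrative assumptions):

```python
# Sketch of JSR/RTS: the return PC is pushed on JSR and pulled
# on RTS, so nested calls unwind in the right (LIFO) order.

class CPU:
    def __init__(self, sp):
        self.mem = {}
        self.sp = sp
        self.pc = 0

    def jsr(self, target):
        # <SP> <- PC ; PC <- IR(address) ; SP <- SP - 1
        self.mem[self.sp] = self.pc
        self.sp -= 1
        self.pc = target

    def rts(self):
        # SP <- SP + 1 ; PC <- <SP>
        self.sp += 1
        self.pc = self.mem[self.sp]

cpu = CPU(sp=506)
cpu.pc = 100        # in main: about to call MYSUB at 200
cpu.jsr(200)
cpu.pc = 210        # inside MYSUB: about to call SQRT at 300
cpu.jsr(300)
cpu.rts()           # back into MYSUB
print(cpu.pc)       # 210
cpu.rts()           # back into main
print(cpu.pc)       # 100
```

(Strictly the pushed PC would be the address of the instruction after the JSR; the literal values here just stand in for those return addresses.)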
Example with parameters
Now we must worry how to transfer
the parameters to the subroutine, and
results from the subroutine.
We'll consider two ways
to pass parameters in a reserved area of memory (your pleasure in 3A3E)
to pass parameters on the stack
In practice the choice is made by your particular compiler.
Using the stack to pass parameters
During the calling routine ...
The calling routine
pushes the parameters onto the stack in order
uses JSR to push the return PC value onto the stack.
During the subroutine itself ...
The figure shows a very correct, but slow, way of using the parameters from the stack.
[Figure: the calling routine pushes Param1, Param2, Param3 and JSR pushes the return PC; the slow scheme pulls the return PC into a temporary register, pulls each parameter off to working storage, then pushes the return PC back onto the stack.]
Much more efficient ...
During the subroutine
access the parameters by indexed addressing, relative to the stack pointer SP. That is, the subroutine would access parameter 1, for example, by writing
LDA SP,2
When the subroutine RTS's it will pull the return PC off the stack, but the parameters will be left on.
However, the calling routine knows how many parameters there were. Rather than pulling them off, it can simply increase the stack pointer by the number of parameters (ie, by three in our example). There is no need to erase or reset the memory contents, because subsequent pushes onto the stack will over-write the stale contents.
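The whole calling sequence can be sketched end to end. The parameter values and the return PC value below are illustrative assumptions; the point is the SP arithmetic:

```python
# Sketch of passing three parameters on the stack: the caller
# pushes them (param3 first, param1 last), JSR pushes the return
# PC, the subroutine reads them relative to SP without pulling,
# and after return the caller drops them by adding 3 to SP.

mem = {}
sp = 506

def push(val):
    global sp
    mem[sp] = val
    sp -= 1

# Calling routine: parameters, then the return PC (via JSR).
for p in (30, 20, 10):          # param3, param2, param1
    push(p)
push(1234)                      # return PC

# Subroutine: indexed access relative to SP, e.g. LDA SP+1,1.
param1 = mem[sp + 1 + 1]
param2 = mem[sp + 1 + 2]
print(param1, param2)           # 10 20

sp += 1                         # RTS pulls the return PC ...
sp += 3                         # ... caller drops the stale params
print(sp)                       # 506 -- stack back where it began
```

The stale values at 504–506 are simply left behind; later pushes overwrite them, as the slide says.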
Example of indexed addressing relative to SP
[Figure: set up by the calling routine — Param1, Param2, Param3 then the return PC on the stack. During the subroutine, use M〈SP+1+1〉 as param 1, M〈SP+1+2〉 as param 2, etc; this can be achieved using indexing, e.g. LDA SP+1,1 for param 1. Just after return to the calling routine, SP ← SP+3 and the params have dropped out of the stack.]
For the rest of the lecture ...
We look first at arranging memory to give the illusion that one has a larger amount of fast SRAM than is actually the case.
We look second at arranging disk and memory to give the illusion that we have more main memory than is actually the case.
Cache memory
In Lecture 4 we noted that large main memories are often made using relatively inexpensive but relatively slow Dynamic RAM (DRAM).
However, in addition, there is often a cache of fast Static RAM (SRAM).
The cache is small, say < 1% of main memory size. For example, in a machine with a 64–128 MB main memory the cache might be only 512 KB.
Nonetheless, it has a very significant effect on the performance of memory accesses.
Cache memory
The cache memory works by copying parts of the main memory to itself.
The cache controller
intercepts an address requested by the cpu,
determines whether it is in the cache or not,
declares a HIT allowing the data to be recovered from cache, or
declares a MISS requiring it to be fetched from main memory.
[Figure: the CPU's address goes to the cache controller, which on a HIT reads the cache memory and on a MISS goes to main memory.]
Cache memory
We need to look at
1. The method used by the controller to determine if an address is in the cache.
Obviously a small cache memory will only have an appreciable effect on memory performance if the locations that are most frequently accessed are in the cache. So we also need to consider
2. The method used to decide what should be in the cache.
We will look at a directly mapped cache.
The directly mapped cache
The cache is divided up into a number of memory blocks — each contains a number of words. The main memory is many times the size of the cache, and so we partition main memory into cache-sized chunks called sets. A memory address is therefore divided up into three parts:
the lsb's, which indicate which word in a block;
the middle bits, which define which block in a set; and
the msb's, which define which set.
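The three-way split is just bit-slicing. In this sketch the field widths are illustrative assumptions (4 words per block giving 2 word bits, 64 blocks per cache giving 6 block bits), not sizes from the slides:

```python
# Splitting a memory address into set, block and word fields.
# Field widths here are illustrative: 2 word bits (4-word
# blocks) and 6 block bits (64 blocks per cache); all remaining
# most-significant bits give the set number.

WORD_BITS, BLOCK_BITS = 2, 6

def split(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    set_no = addr >> (WORD_BITS + BLOCK_BITS)
    return set_no, block, word

print(split(0b101_000011_10))   # (5, 3, 2): set 5, block 3, word 2
```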
[Figure: the cache holds blocks 0 … n; main memory is partitioned into sets 0, 1, 2, …, each itself containing blocks 0 … n. The memory address A23–A0 is split into set, block and word fields.]
Directly mapped cache ...
Note that the controller need not worry about the lsb's (ie the words) — a block is the least significant unit of memory in the cache.
Each block (0, 1, 2, ...) in the cache is required to come from a block with the same number in the main memory. Ie cache block 0 ↔ memory block 0, but from one set or another.
We see that
1 the cache controller need only record which memory set each block came from — given the set, the block number is already implied by the cache position, so the source address is fully determined.
2 the cache is not completely flexible.
Hit or Miss? Need a Look-Up Table
To resolve to which set a block in the cache corresponds, the controller maintains a look-up table called a cache tag ram:
The address into the tag ram is just the block address.
The contents are the set number S*.
To determine hit or miss ...
the controller takes the set and block parts of the address from the cpu, S and B.
It addresses the tag ram with B and recovers the contents S*.
If S = S* we have a HIT, otherwise a MISS.
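The tag-ram lookup, together with the update-on-miss rule described later, can be sketched as a toy controller (block count and the access pattern below are illustrative assumptions):

```python
# Toy directly mapped cache controller: the tag ram maps each
# cache block number B to the set number S* of the block it
# currently holds; HIT iff the requested set S equals S*.

class DirectMappedCache:
    def __init__(self, n_blocks):
        self.tag = [None] * n_blocks   # tag ram, indexed by block

    def access(self, set_no, block):
        if self.tag[block] == set_no:
            return "HIT"
        # MISS: the block would be fetched from main memory,
        # the word forwarded to the cpu, the block installed in
        # the cache, and the tag ram entry updated to S.
        self.tag[block] = set_no
        return "MISS"

c = DirectMappedCache(n_blocks=64)
print(c.access(5, 3))   # MISS  (cold cache)
print(c.access(5, 3))   # HIT   (same set, same block)
print(c.access(7, 3))   # MISS  (block 3 now holds set 7's copy)
```

The last line shows the inflexibility: block 3 from set 5 and block 3 from set 7 cannot be resident at the same time.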
How do you build the comparator?
[Figure: the set and block fields come off the address bus; the block field addresses the tag RAM, whose output S* is compared with the requested set S in a comparator to give Hit or Miss.]
The comparator
[Figure: bit-by-bit comparison of S0, S1, … with S0*, S1*, …; if every pair of bits matches the output is HIT, otherwise MISS.]
Hit or Miss?
If a hit:
the required word is recovered from cache.
If a miss:
the required block is recovered from main memory
the required word is forwarded to the cpu
the block is placed in the cache at block B
the tag ram contents at address B are changed to S.
Updating the cache
At first sight both the addressing scheme and the updating scheme seem strange! Is not the fact that we cannot have block B from set Sa and block B from set Sb together in the cache at the same time a major restriction?
Answers:
YES! — one could imagine accessing data in such a way that the cache failed all the time.
NO, because computation usually occurs on instructions/data clustered in memory.
Why?
Compilers cluster the code.
Compilers cluster allocated data (particularly arrays).
Run-time memory allocation clusters data (particularly arrays).
Hit-Rates and Speed-Ups
Suppose access times to main and cache memories are tm and tc.
Suppose the hit ratio — ie the probability of a hit — is H.
Then the average access time is
tave = H tc + (1 − H) tm = H(tc − tm) + tm
So
tave / tm = H(k − 1) + 1
where k = tc / tm.
The speed-up with the cache is then
s = tm / tave = 1 / (H(k − 1) + 1)
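As a numerical check of the formula (the hit ratios chosen below are illustrative, not from the slides):

```python
# Speed-up for the cache formula above: s = 1/(H*(k-1) + 1),
# where H is the hit ratio and k = t_c / t_m.

def speed_up(H, k):
    return 1.0 / (H * (k - 1.0) + 1.0)

# Cache 10x faster than main memory (k = 0.1):
print(round(speed_up(0.90, 0.1), 2))   # 5.26
print(round(speed_up(0.99, 0.1), 2))   # 9.17
```

Even with k = 0.1 the speed-up only approaches its ceiling of 1/k = 10 as H gets very close to 1 — which is what the plots on the next slide show.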
Hit-Rates and Speed-Ups
Speed-up is: s = 1 / (H(k − 1) + 1)
Top: dependence of speed-up on hit rate for a cache with k = 0.1.
Bottom: dependence of speed-up on hit rate for a cache with (unreasonably) k = 0.01.
[Figure: speed-up plotted against hit rate H from 0 to 1 (vertical axis 0–100); the k = 0.01 curve, 1/(−0.99H + 1), stays modest until H approaches 1, where it rises steeply.]
Extending the idea downwards
Nowadays, processors often have two levels of cache. The Pentium III has a 512 KB cache off the cpu and a 32 KB cache on the cpu.
Non-Blocking Level 1 Cache
The Pentium III processor includes two separate 16 KB level 1 (L1) caches, one for instructions and one for data. The L1 cache provides fast access to recently used data, increasing the overall performance of the system.
Non-Blocking Level 2 Cache
Certain versions of the Pentium III processor include a discrete, off-die level 2 (L2) cache. This L2 cache consists of a 512 KB unified, non-blocking cache that improves performance over cache-on-motherboard solutions by reducing the average memory access time and by providing fast access to recently used instructions and data. Performance is also enhanced over cache-on-motherboard implementations through a dedicated 64-bit cache bus.
Extending the idea upwards
But we can also think of extending the idea upwards. Main memory becomes the cache for a larger, slower memory. Where?
Virtual memory
Main memory holds the current working set of a much larger memory on hard disk.
Just as the tag ram monkeyed around with addresses between main and cache, now we have to monkey with those between main memory and disk. Distinguish two address spaces:
a logical or virtual address space, and
a physical address space.
The cpu deals in logical addresses using the full width of the address bus. Eg, a 32-bit address bus allows logical addressing of 4 GB.
The physical address refers to an actual address in main memory, which may be only 64 MB in size. But the user's program/data appears to have access to the logical address space.
Virtual memory ...
Memory is divided up into pages (rather than blocks as in the cache).
Each page of physical memory can map onto any page of logical memory — a more flexible approach than a directly mapped cache (equivalent to non-blocking).
A page table is maintained which describes the mapping between logical and physical addresses.
When the cpu requests a logical address, the relevant page is determined, and the page table is read to see whether that page is in physical memory.
Paging in Virtual memory ...
HIT! The page table returns the physical address.
MISS! The page table indicates where the page is on the disk. The page is read from disk and installed in physical memory.
This displaces a page of memory, which has to be written back to disk.
[Figure: a logical address from the cpu is looked up in the page table; if the page is in memory the physical address is returned (HIT) and data goes to the cpu; if not, the table gives its location on disk (MISS), the pager decides which page to swap out, and pages are swapped between memory and disk.]
A paging algorithm determines which is the best page to remove — the least used, or the least recently used, are obvious choices. The need to access disk is called a page fault.
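A toy page table with the least-recently-used policy (one of the obvious choices just mentioned) can be sketched as follows. The frame count, page numbers and eviction policy here are illustrative assumptions:

```python
# Toy page-table lookup: a logical page maps to a physical
# frame if resident; otherwise a page fault brings it in from
# disk, evicting the least-recently-used resident page.

from collections import OrderedDict

class Pager:
    def __init__(self, n_frames):
        self.n_frames = n_frames
        self.table = OrderedDict()   # page -> frame, in LRU order

    def translate(self, page):
        """Return (frame, page_fault?)."""
        if page in self.table:           # HIT
            self.table.move_to_end(page)
            return self.table[page], False
        # MISS (page fault): evict the LRU page if memory is full.
        if len(self.table) == self.n_frames:
            _, frame = self.table.popitem(last=False)
        else:
            frame = len(self.table)
        self.table[page] = frame
        return frame, True

p = Pager(n_frames=2)
print(p.translate(7))   # (0, True)   page fault
print(p.translate(7))   # (0, False)  hit
print(p.translate(3))   # (1, True)   fault
print(p.translate(9))   # (0, True)   fault: evicts page 7, reuses frame 0
```

(A real pager would also write the evicted page back to disk if it had been modified.)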
Hit ratio
Just as a cache was a method of speeding up access to main memory, one might regard main memory as a method of speeding up access to disk.
We need to worry about the hit rate. Whereas main memory is, say, an order of magnitude slower than cache, disk is some 6 orders of magnitude slower than main memory.
So k = tm/td can be all but ignored, giving
s = 1 / (1 − H)
so the speed-up is determined by the hit rate alone.
The size of page and the number of pages in a memory are important factors in maintaining a high hit rate.
Again, success requires data to be localized in memory.
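Numerically (the hit rates below are illustrative):

```python
# With disk ~10^6 times slower than memory, k = t_m/t_d is
# effectively 0 and the speed-up reduces to s = 1/(1 - H).

def vm_speed_up(H):
    return 1.0 / (1.0 - H)

print(round(vm_speed_up(0.99)))    # 100
print(round(vm_speed_up(0.999)))   # 1000
```

With disk a million times slower than memory, even a 99.9% hit rate leaves the system a thousand times faster than raw disk but still far from memory speed — hence the emphasis on locality.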
Disk Thrashing
If the program repeatedly generates page faults it thrashes the disk ... A problem with (large) data arrays, such as images ...
Data is stored in row order. A pixel in column C, row R may be on a different page from the pixel in column C, row R+1 ...
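The arithmetic behind this is simple to sketch. The image width and page size below are illustrative assumptions (one word per pixel):

```python
# Row-order storage: pixel (row, col) of a W-wide image lives at
# address row*W + col, so vertically adjacent pixels are W words
# apart and can land on different pages.

W = 1024          # image width in pixels (assumed)
PAGE = 4096       # page size in words (assumed)

def page_of(row, col):
    return (row * W + col) // PAGE

# Scanning DOWN a column crosses a page every PAGE // W = 4 rows:
print(page_of(0, 5), page_of(4, 5))    # 0 1
# Scanning ALONG a row stays on one page for PAGE pixels:
print(page_of(0, 5), page_of(0, 6))    # 0 0
```

So a column-major traversal of a large row-stored image touches a new page every few pixels — exactly the access pattern that causes thrashing.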
Multi-tasking
The use of virtual memory is important in multiuser or multitasking systems, allowing several users to feel they have access to large memories.
The topic is discussed in depth in Tanenbaum (Chapter 6).
An important feature of the hierarchy built from cache, main memory and disk is its transparency to the user.
The hierarchy can be extended further to slower online storage media such as CD-ROMs. The idea here would be that when accessed, large quantities of material would be brought onto faster disks, but would "decay" away if unused over a period of hours.