A3 Computer Architecture — University of Oxford (dwm/courses/3co_2000/3co-l6.pdf, 2012-03-24)
Computer Architecture MT 2011
A3 Computer Architecture
Engineering Science
3rd year A3 Lectures
Prof David Murray
[email protected]/∼dwm/Courses/3CO
Michaelmas 2000
6. Stacks, Subroutines, and Memory Hierarchies
3A3 Michaelmas 2000
In this lecture ...
We first continue looking at the support supplied to the high-level programmer at the macro level.
In particular we look at
the stack area of memory
passing parameters to subroutines
We end by looking at ways that memory hierarchies can be built using smaller and faster memories along with larger and slower memories.
Stacks and Subroutines
Looking back at the Bog-Standard Architecture ...
there is a register called SP, the stack pointer.
This register holds the address of the next free location in an area of memory reserved by a program as a temporary storage area. (Some architectures instead hold the address of the last occupied location.)
[Figure: the Bog-Standard Architecture datapath — Memory, MAR, MBR, IR (opcode and address fields), PC, SP, AC, ALU, CU, status register, Inc(PC) and control lines.]
The stack
The stack pointer uses memory as a last-in, first-out (LIFO) buffer. Usually it is placed at the top of memory, as far away from the program as possible. In the figure, the stack currently contains 5 items and grows downwards into free memory. The stack pointer points to the next free location.
[Figure: memory map — Program, fixed-size data, data allocated during execution, free memory, then the stack at the top of memory (locations 507–511 occupied); SP=506 points to the next free location and the stack grows down.]
Push
PUSH (PHA) and PULL (PLA) manipulate the stack:
PHA
MAR ← SP
MBR ← AC
〈MAR〉 ← MBR ; SP ← SP − 1
The accumulator gets pushed onto the stack at the address pointed to by the stack pointer. The stack pointer is then decremented.
[Figure: before the push, SP=506 and AC=327, with 234, 345, 456, 567 and 1234 already on the stack in locations 507–511; after PHA, 327 has been stored at location 506 and SP=505.]
Pull
PLA
SP ← SP + 1
MAR ← SP
MBR ← 〈MAR〉
AC ← MBR
The stack pointer is incremented and the content pointed to is transferred to the accumulator.
[Figure: before the pull, SP=506; after PLA, SP=507 and AC=1234, the value that was at location 507.]
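The PHA and PLA register-transfer sequences above can be sketched as a small simulation. This is an illustrative model only (the `Machine` class and its memory representation are assumptions, not the lecture's actual hardware):

```python
# Minimal sketch of the PHA/PLA register-transfer sequences.
# Memory is modelled as a dict from address to word; SP points
# at the next free location and the stack grows downwards,
# exactly as in the slides.

class Machine:
    def __init__(self, sp):
        self.mem = {}
        self.sp = sp        # stack pointer: next free location
        self.ac = 0         # accumulator

    def pha(self):
        # MAR <- SP ; MBR <- AC ; <MAR> <- MBR ; SP <- SP - 1
        self.mem[self.sp] = self.ac
        self.sp -= 1

    def pla(self):
        # SP <- SP + 1 ; MAR <- SP ; MBR <- <MAR> ; AC <- MBR
        self.sp += 1
        self.ac = self.mem[self.sp]

m = Machine(sp=506)
m.ac = 327
m.pha()                  # 327 stored at 506, SP becomes 505
print(m.sp, m.mem[506])  # 505 327
m.pla()                  # SP back to 506, AC reloaded
print(m.sp, m.ac)        # 506 327
```

Note that a PHA immediately followed by a PLA restores both SP and AC, mirroring the before/after pictures on the slides.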
Using the stack during a subroutine
One of the most useful constructs in a high level language is the subroutine. This allows the programmer to modularize code into small chunks which do a specific task and which may be reused.
When we come to compile into assembler, what happens to the subroutines?

main()
{
  ...
  xcomp = 4;
  ycomp = 2;
  mod = modulus(xcomp, ycomp);
  ...
}

modulus(a, b)
{
  msq = a*a + b*b;
  m = sqrt(msq);
  return(m);
}

sqrt(x)
{
  ... blah ...
}
Macro support for subroutines ...
You now know that instructions in the different routines will end up in different parts of the program.
To call a subroutine, we obviously need to
jump to the subroutine's instructions
transfer the necessary data to the subroutine
arrange for the result to be transferred back to the calling routine.
Is this enough? When you consider transferring back to the calling routine, the answer has to be no!
How do we know where to jump back to? What if the subroutine has messed up the registers?
There is an obvious need to
store the status quo before jumping
restore it after
Calling by value or by reference
There are two ways of using formal parameters to a subroutine.
1 By value
Passing by value means that a copy of the value of the parameter is transferred. This means that even if the subroutine changes that value, when the subroutine returns, the value in the calling routine is unchanged.
2 By reference
In this case the address of the parameter is passed, so that the subroutine works with the original data. If it changes the value, that change is seen by the calling routine on return from the subroutine.
This discussion continues assuming calling by value.
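The two conventions can be mimicked explicitly in a short sketch. This is illustrative only: Python passes object references, so here "by value" is modelled with a local copy and "by reference" with a one-element list standing in for the caller's storage — not how any particular compiler implements it:

```python
# Sketch of the two parameter-passing conventions.

# "By value": the subroutine works on a copy, so the caller's
# variable is unchanged on return.
def double_by_value(x):
    x = x * 2          # rebinds the local copy only
    return x

# "By reference": the subroutine receives the caller's storage
# (modelled here as a one-element list) and mutates it in place.
def double_by_reference(cell):
    cell[0] = cell[0] * 2

a = 21
double_by_value(a)
print(a)               # 21 -- caller's value unchanged

b = [21]
double_by_reference(b)
print(b[0])            # 42 -- caller sees the change
```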
Example with no parameters

      ...          \\ Calling routine
      JSR MYSUB    \\ Jump to subroutine
      ADD 103
      ...          \\ Rest of calling routine

MYSUB LDA 592      \\ Subroutine starts
      ...
      RTS          \\ Return from subroutine

The subroutine starts at a labelled address, so JSR is very like JMP. The difference: in its execute phase, JSR pushes the current value of the PC onto the stack, and then loads the operand into the PC. The RTS command ends the subroutine by pulling the stored program counter from the stack. Because the stack is a LIFO buffer, subroutines can be "nested" to any level until memory runs out.
JSR and RTS
The RTL for the JSR and RTS opcodes is
JSR
〈SP〉 ← PC
PC ← IR(address) ; SP ← SP − 1
RTS
SP ← SP + 1
PC ← 〈SP〉
Check Notes!!
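The JSR/RTS transfers can be sketched in the same style as the push/pull simulation, showing how the LIFO stack lets subroutines nest. Again this is a toy model (the `CPU` class and the addresses 100/200/300 are illustrative assumptions):

```python
# Sketch of JSR/RTS: the return PC is pushed on JSR and pulled
# on RTS, so nested calls unwind in the right (LIFO) order.

class CPU:
    def __init__(self, sp):
        self.mem = {}
        self.sp = sp
        self.pc = 0

    def jsr(self, target):
        # <SP> <- PC ; PC <- IR(address) ; SP <- SP - 1
        self.mem[self.sp] = self.pc
        self.sp -= 1
        self.pc = target

    def rts(self):
        # SP <- SP + 1 ; PC <- <SP>
        self.sp += 1
        self.pc = self.mem[self.sp]

cpu = CPU(sp=506)
cpu.pc = 100        # in main: about to call MYSUB at 200
cpu.jsr(200)
cpu.pc = 210        # inside MYSUB: about to call SQRT at 300
cpu.jsr(300)
cpu.rts()           # back into MYSUB
print(cpu.pc)       # 210
cpu.rts()           # back into main
print(cpu.pc)       # 100
```

(Strictly the pushed PC would be the address of the instruction after the JSR; the literal values here just stand in for those return addresses.)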
Example with parameters
Now we must worry how to transfer
the parameters to the subroutine, and
results from the subroutine.
We'll consider two ways
to pass parameters in a reserved area of memory (your pleasure in 3A3E)
to pass parameters on the stack
In practice the choice is made by your particular compiler.
Using the stack to pass parameters
During the calling routine ...
The calling routine
pushes the parameters onto the stack in order
uses JSR to push the return PC value onto the stack.
During the subroutine itself ...
The figure shows a very correct, but slow, way of using the parameters from the stack.
[Figure: the calling routine pushes Param1, Param2, Param3 and JSR pushes the return PC; the slow scheme pulls the return PC into a temporary register, pulls each parameter off to working storage, then pushes the return PC back onto the stack.]
Much more efficient ...
During the subroutine
access the parameters by indexed addressing, relative to the stack pointer SP. That is, the subroutine would access parameter 1, for example, by writing
LDA SP,2
When the subroutine RTS's it will pull the return PC off the stack, but the parameters will be left on.
However, the calling routine knows how many parameters there were. Rather than pulling them off, it can simply increase the stack pointer by the number of parameters (ie, by three in our example). There is no need to erase or reset the memory contents, because subsequent pushes onto the stack will over-write the stale contents.
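The whole calling sequence can be sketched end to end. The parameter values and the return PC value below are illustrative assumptions; the point is the SP arithmetic:

```python
# Sketch of passing three parameters on the stack: the caller
# pushes them (param3 first, param1 last), JSR pushes the return
# PC, the subroutine reads them relative to SP without pulling,
# and after return the caller drops them by adding 3 to SP.

mem = {}
sp = 506

def push(val):
    global sp
    mem[sp] = val
    sp -= 1

# Calling routine: parameters, then the return PC (via JSR).
for p in (30, 20, 10):          # param3, param2, param1
    push(p)
push(1234)                      # return PC

# Subroutine: indexed access relative to SP, e.g. LDA SP+1,1.
param1 = mem[sp + 1 + 1]
param2 = mem[sp + 1 + 2]
print(param1, param2)           # 10 20

sp += 1                         # RTS pulls the return PC ...
sp += 3                         # ... caller drops the stale params
print(sp)                       # 506 -- stack back where it began
```

The stale values at 504–506 are simply left behind; later pushes overwrite them, as the slide says.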
Example of indexed addressing relative to SP
[Figure: set up by the calling routine — Param1, Param2, Param3 then the return PC on the stack. During the subroutine, use M〈SP+1+1〉 as param 1, M〈SP+1+2〉 as param 2, etc; this can be achieved using indexing, e.g. LDA SP+1,1 for param 1. Just after return to the calling routine, SP ← SP+3 and the params have dropped out of the stack.]
For the rest of the lecture ...
We look first at arranging memory to give the illusion that one has a larger amount of fast SRAM than is actually the case.
We look second at arranging disk and memory to give the illusion that we have more main memory than is actually the case.
Cache memory
In Lecture 4 we noted that large main memories are often made using relatively inexpensive but relatively slow Dynamic RAM (DRAM).
However, in addition, there is often a cache of fast Static RAM (SRAM).
The cache is small, say < 1% of main memory size. For example, in a machine with a 64–128 MB main memory the cache might be only 512 KB.
Nonetheless, it has a very significant effect on the performance of memory accesses.
Cache memory
The cache memory works by copying parts of the main memory to itself.
The cache controller
intercepts an address requested by the cpu,
determines whether it is in the cache or not,
declares a HIT allowing the data to be recovered from cache, or
declares a MISS requiring it to be fetched from main memory.
[Figure: the CPU's address goes to the cache controller, which on a HIT reads the cache memory and on a MISS goes to main memory.]
Cache memory
We need to look at
1. The method used by the controller to determine if an address is in the cache.
Obviously a small cache memory will only have an appreciable effect on memory performance if the locations that are most frequently accessed are in the cache. So we also need to consider
2. The method used to decide what should be in the cache.
We will look at a directly mapped cache.
The directly mapped cache
The cache is divided up into a number of memory blocks — each contains a number of words. The main memory is many times the size of the cache, and so we partition main memory into cache-sized chunks called sets. A memory address is therefore divided up into three parts:
the lsb's, which indicate which word in a block;
the middle bits, which define which block in a set; and
the msb's, which define which set.
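The three-way split is just bit-slicing. In this sketch the field widths are illustrative assumptions (4 words per block giving 2 word bits, 64 blocks per cache giving 6 block bits), not sizes from the slides:

```python
# Splitting a memory address into set, block and word fields.
# Field widths here are illustrative: 2 word bits (4-word
# blocks) and 6 block bits (64 blocks per cache); all remaining
# most-significant bits give the set number.

WORD_BITS, BLOCK_BITS = 2, 6

def split(addr):
    word = addr & ((1 << WORD_BITS) - 1)
    block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    set_no = addr >> (WORD_BITS + BLOCK_BITS)
    return set_no, block, word

print(split(0b101_000011_10))   # (5, 3, 2): set 5, block 3, word 2
```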
[Figure: the cache holds blocks 0 … n; main memory is partitioned into sets 0, 1, 2, …, each itself containing blocks 0 … n. The memory address A23–A0 is split into set, block and word fields.]
Directly mapped cache ...
Note that the controller need not worry about the lsb's (ie the words) — a block is the least significant unit of memory in the cache.
Each block (0, 1, 2, ...) in the cache is required to come from a block with the same number in the main memory. Ie cache block 0 ↔ memory block 0, but from one set or another.
We see that
1 the cache controller need only record which memory set each block came from — given the set, the block number is already implied by the cache position, so the source address is fully determined.
2 the cache is not completely flexible.
Hit or Miss? Need a Look-Up Table
To resolve to which set a block in the cache corresponds, the controller maintains a look-up table called a cache tag ram:
The address into the tag ram is just the block address.
The contents are the set number S*.
To determine hit or miss ...
the controller takes the set and block parts of the address from the cpu, S and B.
It addresses the tag ram with B and recovers the contents S*.
If S = S* we have a HIT, otherwise a MISS.
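The tag-ram lookup, together with the update-on-miss rule described later, can be sketched as a toy controller (block count and the access pattern below are illustrative assumptions):

```python
# Toy directly mapped cache controller: the tag ram maps each
# cache block number B to the set number S* of the block it
# currently holds; HIT iff the requested set S equals S*.

class DirectMappedCache:
    def __init__(self, n_blocks):
        self.tag = [None] * n_blocks   # tag ram, indexed by block

    def access(self, set_no, block):
        if self.tag[block] == set_no:
            return "HIT"
        # MISS: the block would be fetched from main memory,
        # the word forwarded to the cpu, the block installed in
        # the cache, and the tag ram entry updated to S.
        self.tag[block] = set_no
        return "MISS"

c = DirectMappedCache(n_blocks=64)
print(c.access(5, 3))   # MISS  (cold cache)
print(c.access(5, 3))   # HIT   (same set, same block)
print(c.access(7, 3))   # MISS  (block 3 now holds set 7's copy)
```

The last line shows the inflexibility: block 3 from set 5 and block 3 from set 7 cannot be resident at the same time.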
How do you build the comparator?
[Figure: the set and block fields come off the address bus; the block field addresses the tag RAM, whose output S* is compared with the requested set S in a comparator to give Hit or Miss.]
The comparator
[Figure: bit-by-bit comparison of S0, S1, … with S0*, S1*, …; if every pair of bits matches the output is HIT, otherwise MISS.]
Hit or Miss?
If a hit:
the required word is recovered from cache.
If a miss:
the required block is recovered from main memory
the required word is forwarded to the cpu
the block is placed in the cache at block B
the tag ram contents at address B are changed to S.
Updating the cache
At first sight both the addressing scheme and the updating scheme seem strange! Is not the fact that we cannot have block B from set Sa and block B from set Sb together in the cache at the same time a major restriction?
Answers:
YES! — one could imagine accessing data in such a way that the cache failed all the time.
NO, because computation usually occurs on instructions/data clustered in memory.
Why?
Compilers cluster the code.
Compilers cluster allocated data (particularly arrays).
Run-time memory allocation clusters data (particularly arrays).
Hit-Rates and Speed-Ups
Suppose access times to main and cache memories are tm and tc.
Suppose the hit ratio — ie the probability of a hit — is H.
Then the average access time is
tave = H tc + (1 − H) tm = H(tc − tm) + tm
So
tave / tm = H(k − 1) + 1
where k = tc / tm.
The speed-up with the cache is then
s = tm / tave = 1 / (H(k − 1) + 1)
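As a numerical check of the formula (the hit ratios chosen below are illustrative, not from the slides):

```python
# Speed-up for the cache formula above: s = 1/(H*(k-1) + 1),
# where H is the hit ratio and k = t_c / t_m.

def speed_up(H, k):
    return 1.0 / (H * (k - 1.0) + 1.0)

# Cache 10x faster than main memory (k = 0.1):
print(round(speed_up(0.90, 0.1), 2))   # 5.26
print(round(speed_up(0.99, 0.1), 2))   # 9.17
```

Even with k = 0.1 the speed-up only approaches its ceiling of 1/k = 10 as H gets very close to 1 — which is what the plots on the next slide show.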
Hit-Rates and Speed-Ups
Speed-up is: s = 1 / (H(k − 1) + 1)
Top: dependence of speed-up on hit rate for a cache with k = 0.1.
Bottom: dependence of speed-up on hit rate for a cache with (unreasonably) k = 0.01.
[Figure: speed-up plotted against hit rate H from 0 to 1 (vertical axis 0–100); the k = 0.01 curve, 1/(−0.99H + 1), stays modest until H approaches 1, where it rises steeply.]
Extending the idea downwards
Nowadays, processors often have two levels of cache. The Pentium III has a 512 KB cache off the cpu and a 32 KB cache on the cpu.
Non-Blocking Level 1 Cache
The Pentium III processor includes two separate 16 KB level 1 (L1) caches, one for instructions and one for data. The L1 cache provides fast access to recently used data, increasing the overall performance of the system.
Non-Blocking Level 2 Cache
Certain versions of the Pentium III processor include a discrete, off-die level 2 (L2) cache. This L2 cache consists of a 512 KB unified, non-blocking cache that improves performance over cache-on-motherboard solutions by reducing the average memory access time and by providing fast access to recently used instructions and data. Performance is also enhanced over cache-on-motherboard implementations through a dedicated 64-bit cache bus.
Extending the idea upwards
But we can also think of extending the idea upwards. Main memory becomes the cache for a larger, slower memory. Where?
Virtual memory
Main memory holds the current working set of a much larger memory on hard disk.
Just as the tag ram monkeyed around with addresses between main and cache, now we have to monkey with those between main memory and disk. Distinguish two address spaces:
a logical or virtual address space, and
a physical address space.
The cpu deals in logical addresses using the full width of the address bus. Eg, a 32-bit address bus allows logical addressing of 4 GB.
The physical address refers to an actual address in main memory, which may be only 64 MB in size. But the user's program/data appears to have access to the logical address space.
Virtual memory ...
Memory is divided up into pages (rather than blocks as in the cache).
Each page of physical memory can map onto any page of logical memory — a more flexible approach than a directly mapped cache (equivalent to non-blocking).
A page table is maintained which describes the mapping between logical and physical addresses.
When the cpu requests a logical address, the relevant page is determined, and the page table is read to see whether that page is in physical memory.
Paging in Virtual memory ...
HIT! The page table returns the physical address.
MISS! The page table indicates where the page is on the disk. The page is read from disk and installed in physical memory.
This displaces a page of memory, which has to be written back to disk.
[Figure: a logical address from the cpu is looked up in the page table; if the page is in memory the physical address is returned (HIT) and data goes to the cpu; if not, the table gives its location on disk (MISS), the pager decides which page to swap out, and pages are swapped between memory and disk.]
A paging algorithm determines which is the best page to remove — the least used, or the least recently used, are obvious choices. The need to access disk is called a page fault.
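A toy page table with the least-recently-used policy (one of the obvious choices just mentioned) can be sketched as follows. The frame count, page numbers and eviction policy here are illustrative assumptions:

```python
# Toy page-table lookup: a logical page maps to a physical
# frame if resident; otherwise a page fault brings it in from
# disk, evicting the least-recently-used resident page.

from collections import OrderedDict

class Pager:
    def __init__(self, n_frames):
        self.n_frames = n_frames
        self.table = OrderedDict()   # page -> frame, in LRU order

    def translate(self, page):
        """Return (frame, page_fault?)."""
        if page in self.table:           # HIT
            self.table.move_to_end(page)
            return self.table[page], False
        # MISS (page fault): evict the LRU page if memory is full.
        if len(self.table) == self.n_frames:
            _, frame = self.table.popitem(last=False)
        else:
            frame = len(self.table)
        self.table[page] = frame
        return frame, True

p = Pager(n_frames=2)
print(p.translate(7))   # (0, True)   page fault
print(p.translate(7))   # (0, False)  hit
print(p.translate(3))   # (1, True)   fault
print(p.translate(9))   # (0, True)   fault: evicts page 7, reuses frame 0
```

(A real pager would also write the evicted page back to disk if it had been modified.)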
Hit ratio
Just as a cache was a method of speeding up access to main memory, one might regard main memory as a method of speeding up access to disk.
We need to worry about the hit rate. Whereas main memory is, say, an order of magnitude slower than cache, disk is some 6 orders of magnitude slower than main memory.
So k = tm/td can be all but ignored, giving
s = 1 / (1 − H)
so the speed-up is determined by the hit rate alone.
The size of page and the number of pages in a memory are important factors in maintaining a high hit rate.
Again, success requires data to be localized in memory.
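Numerically (the hit rates below are illustrative):

```python
# With disk ~10^6 times slower than memory, k = t_m/t_d is
# effectively 0 and the speed-up reduces to s = 1/(1 - H).

def vm_speed_up(H):
    return 1.0 / (1.0 - H)

print(round(vm_speed_up(0.99)))    # 100
print(round(vm_speed_up(0.999)))   # 1000
```

With disk a million times slower than memory, even a 99.9% hit rate leaves the system a thousand times faster than raw disk but still far from memory speed — hence the emphasis on locality.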
Disk Thrashing
If the program repeatedly generates page faults it thrashes the disk ... A problem with (large) data arrays, such as images ...
Data is stored in row order. A pixel in column C, row R may be on a different page from the pixel in column C, row R+1 ...
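The arithmetic behind this is simple to sketch. The image width and page size below are illustrative assumptions (one word per pixel):

```python
# Row-order storage: pixel (row, col) of a W-wide image lives at
# address row*W + col, so vertically adjacent pixels are W words
# apart and can land on different pages.

W = 1024          # image width in pixels (assumed)
PAGE = 4096       # page size in words (assumed)

def page_of(row, col):
    return (row * W + col) // PAGE

# Scanning DOWN a column crosses a page every PAGE // W = 4 rows:
print(page_of(0, 5), page_of(4, 5))    # 0 1
# Scanning ALONG a row stays on one page for PAGE pixels:
print(page_of(0, 5), page_of(0, 6))    # 0 0
```

So a column-major traversal of a large row-stored image touches a new page every few pixels — exactly the access pattern that causes thrashing.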
Multi-tasking
The use of virtual memory is important in multiuser or multitasking systems, allowing several users to feel they have access to large memories.
The topic is discussed in depth in Tanenbaum (Chapter 6).
An important feature of the hierarchy built from cache, main memory and disk is its transparency to the user.
The hierarchy can be extended further to slower online storage media such as CD-ROMs. The idea here would be that when accessed, large quantities of material would be brought onto faster disks, but would "decay" away if unused over a period of hours.