
Upload: lilian-parrish

Post on 16-Jan-2016


Gary Marsden Slide 1 University of Cape Town

Memory

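‘The Illusion of Unlimited Fast Memory’
– What programmers want, so we need to fake it

Can exploit
– Principle of temporal locality (recently used items are likely to be used again soon)
– Principle of spatial locality (items near a recently used item are likely to be used soon)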

Kind of like going to the library


Memory types

SRAM - static RAM
– Values don’t leak
– Made from 4-6 transistors per bit
– Fairly expensive, but fast and low powered

DRAM - dynamic RAM
– Data leaks out, so it needs refreshing, usually by frequent reads
– Only needs one transistor per bit
– Flavours
• EDO - pipelined DRAM
• SDRAM - synchronous DRAM (to warm up the memory)


Relative Cost

As per 2004

Technology    Access time         $ / GByte
SRAM          0.5-5 ns            4000-10000
DRAM          50-70 ns            100-200
Disk          5-20 million ns     0.5-2


Memory hierarchy

Exploit locality

Multiple levels of memory of different sizes and speeds
– Fast memory is expensive, so less of it is used than slower, cheap memory

Differences in cost and access times make it advantageous to have a hierarchy of memory, with faster closer to the CPU


Hierarchy


Goal

Present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory

Operation
– Hierarchy is multi-level, but data is only moved between adjacent levels
– Closer to CPU - fast and small
– Further from CPU - slow and large


Terminology

Block: minimum unit of information present (or not) in a multilevel hierarchy (think book)

Hit: data found in upper level (on desk)

Miss: data not found in upper level
– Lower level accessed to find block (go to shelves)

Hit rate: fraction of memory accesses found in the upper level
– Used to measure performance

Miss rate: fraction of memory accesses not found in the upper level (1 - hit rate)


More terminology

Hit time: time to access the upper level of the hierarchy (incl. time to determine whether the data is there); looking at desk

Miss penalty: time taken to replace a block in the upper level with the block from the lower level AND time to deliver the block to the processor; time to go to shelves and back
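The terms above combine into the standard average-access-time estimate: hit time plus miss rate times miss penalty. The sketch below is only an illustration; the function name and the sample figures (1-cycle hit, 5% miss rate, 65-cycle penalty) are assumptions, not values from the slides.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average time per memory access.

    hit_time and miss_penalty share one unit (e.g. clock cycles);
    miss_rate is a fraction between 0 and 1 (1 - hit rate).
    """
    return hit_time + miss_rate * miss_penalty


# Hypothetical figures chosen for illustration only.
print(average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=65))
```

Even a small miss rate dominates the average when the miss penalty is large, which is why the later slides focus on reducing both.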

Impacts: OS design, how code is compiled, how applications are written


Summary Diagrams

[Figure: levels in the memory hierarchy (Level 1, Level 2, ..., Level n). Data are transferred between the processor (CPU) and the levels; access time and the size of the memory at each level both increase with distance from the CPU.]


Caching

‘A safe place for hiding or storing things’

Used to mean the level of memory hierarchy between CPU and main memory

Now used to mean any system to exploit locality
– E.g. browser cache

Kick off by considering a simple cache: processor requests are each one word, and blocks are also one word


Caching in contemporary processors


Reference to missing block

Issues
1. How do we know if Xn is in cache?
2. If it is, where do we find it?

[Figure: cache contents (a) before the reference to Xn, holding X1, X2, X3, X4, ..., Xn – 2, Xn – 1, and (b) after the reference to Xn, with Xn added.]


Where’s the word?

Questions are related

If a given memory word can only go to one location in cache, there is only one place to look!

Direct mapped caching: fn(mem address)
– Usually (mem address) modulo (number of cache slots)
– Can use binary ‘trick’ where number of cache slots is a power of 2: the slot is simply the low-order log2(number of slots) bits of the address
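A minimal sketch of the mapping function described above, showing that the modulo and the bit-mask ‘trick’ agree when the number of slots is a power of 2. The function names are illustrative assumptions.

```python
def slot_by_modulo(address, num_slots):
    """Direct-mapped slot: (mem address) modulo (number of cache slots)."""
    return address % num_slots


def slot_by_mask(address, num_slots):
    """Same slot via the binary trick; only valid for power-of-2 slot counts."""
    if num_slots <= 0 or num_slots & (num_slots - 1) != 0:
        raise ValueError("num_slots must be a power of 2")
    return address & (num_slots - 1)  # keep the low-order log2(num_slots) bits


# With 8 slots, address 00101 (binary) lands in slot 101.
print(format(slot_by_mask(0b00101, 8), "03b"))
```

The mask version avoids a division, which is why hardware caches use power-of-2 sizes.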


Cache picture

Can have valid bit to show slot holds value

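[Figure: an eight-slot direct-mapped cache (slots 000-111) and memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101, each mapping to the cache slot given by its low-order three bits.]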


Cache details

Need to add ‘tag’ to each cache slot to show if the value there is indeed the one required
– Disambiguates the 1:M mapping (many memory addresses share one slot)

Tag consists of the remaining upper address bits not used in the modulo calculation

For now we concentrate on ‘read’, then look at cache design for two real machines


Cache - init, miss 10110, miss 11010, miss 10000


Cache - miss 00011, miss 10010
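The reference sequence named in these two slide titles can be replayed with a small simulation. This is a sketch under assumptions: an eight-slot direct-mapped cache with one-word blocks, the low three address bits as index and the upper bits as tag; the function name is illustrative.

```python
def simulate_direct_mapped(addresses, index_bits=3):
    """Replay word addresses through a direct-mapped cache of 2**index_bits slots.

    Returns one (address, index, outcome) tuple per reference.
    """
    num_slots = 1 << index_bits
    valid = [False] * num_slots   # valid bit per slot
    tags = [0] * num_slots        # tag per slot
    results = []
    for addr in addresses:
        index = addr & (num_slots - 1)  # low-order bits pick the slot
        tag = addr >> index_bits        # remaining upper bits form the tag
        hit = valid[index] and tags[index] == tag
        if not hit:                     # on a miss, load the block into the slot
            valid[index] = True
            tags[index] = tag
        results.append((format(addr, "05b"), format(index, "03b"),
                        "hit" if hit else "miss"))
    return results


references = [0b10110, 0b11010, 0b10000, 0b00011, 0b10010]
for line in simulate_direct_mapped(references):
    print(line)
```

Note that 10010 and 11010 share index 010 but differ in tag, so the later reference replaces the earlier block.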


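Cache datapath

[Figure: direct-mapped cache datapath. The 32-bit address (bit positions 31-0) is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0-1023), each holding a valid bit, a 20-bit tag and 32 bits of data; the stored tag is compared with the address tag to produce Hit, and the entry supplies Data.]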


Cache sizes

Function of word size, cache slots and address size - influences tag size

Assume 32 bit MIPS address and word, with 2^n slots in cache (n = index width in bits)

Total bits = 2^n * (block size + tag size + valid bit)
           = 2^n * (32 + (32 - n - 2) + 1)
           = 2^n * (63 - n)
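The size formula can be checked numerically. The sketch below assumes one-word blocks and a 2-bit byte offset as on the slide; the function and parameter names are illustrative.

```python
def cache_total_bits(n, address_bits=32, word_bits=32, byte_offset_bits=2):
    """Total storage bits for a direct-mapped cache with 2**n one-word slots."""
    tag_bits = address_bits - n - byte_offset_bits   # 32 - n - 2
    bits_per_slot = word_bits + tag_bits + 1         # data + tag + valid bit
    return (1 << n) * bits_per_slot


for n in (10, 14):
    total = cache_total_bits(n)
    data = (1 << n) * 32
    print(f"n={n}: {total} bits total, of which {data} are data")
```

For n = 10 this gives 1024 * 53 = 54272 bits, so roughly 40% of the storage is tag and valid-bit overhead.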


Handling cache misses

Control unit must detect miss and process miss by fetching data from lower level (memory)

Cache hit - no problem
– Data memory = data cache
– Instruction memory = instruction cache

Control for misses not so easy
– Instruction miss means the fetched instruction is not valid, so the wrong instruction would be executed
– Data miss means the data is invalid, so the calculation is meaningless


Steps to cope with a miss

Overview
– Stall processor; activate mem. controller; get value from next level; load value; continue

Detail
– Subtract 4 from PC
– Ask main memory to do a read and wait on completion
– Write entry to cache (mem data -> cache data; upper bits -> tag field; set valid bit)
– Restart instruction from first step; refetch finds the correct instruction now in the instruction cache


Example Cache - DECStation 3100

Start with a fairly simple design
– MIPS R2000, pipeline similar to chapter 6
– Has inst. and data caches (for pipeline)
– Fetches data and inst. word on every cycle

Cache is 64 KB
– 16K entries of 4-byte words

Steps for a cache read
– Send address to cache
– If hit, the requested data is read from the cache
– If miss, send full address to main memory; when data is returned, place it in cache


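Diagram of DECStation 3100

[Figure: DECStation 3100 cache. The 32-bit address (bit positions 31-0) is split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2) and a 2-bit byte offset (bits 1-0). The cache has 16K entries, each with a valid bit, a 16-bit tag and 32 bits of data; outputs are Hit and Data.]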


Writing to Cache

We can’t write to cache alone, as main memory would become inconsistent with cache

Can solve by write-through
– Write to memory and cache simultaneously

Implication is that there is no point checking the tag at the write location - it is overwritten anyway
– Index the cache using bits 15-2
– Write tag bits (31-16) and data value word
– Write data word to main memory
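The three write-through steps above can be sketched as follows. The class and its fields are hypothetical, a Python dict stands in for main memory, and the bit positions follow the slide (index = bits 15-2, tag = bits 31-16).

```python
INDEX_BITS = 14                  # 16K entries
NUM_ENTRIES = 1 << INDEX_BITS


class WriteThroughCache:
    """Toy model of a direct-mapped, one-word-block, write-through cache."""

    def __init__(self):
        self.valid = [False] * NUM_ENTRIES
        self.tag = [0] * NUM_ENTRIES
        self.data = [0] * NUM_ENTRIES
        self.memory = {}         # stand-in for main memory: word address -> value

    def write(self, address, word):
        index = (address >> 2) & (NUM_ENTRIES - 1)  # bits 15-2
        tag = (address >> 16) & 0xFFFF              # bits 31-16
        # No tag check: whatever was in the slot is overwritten anyway.
        self.valid[index] = True
        self.tag[index] = tag
        self.data[index] = word
        self.memory[address & ~0x3] = word          # write through to memory
        return index, tag


cache = WriteThroughCache()
index, tag = cache.write(0x12345678, 99)
print(hex(index), hex(tag))
```

Because memory is updated on every write, the cache and memory never disagree, at the cost of every write running at memory speed.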


Too slow

The problem with a write-through technique is that we are bound by the slower speed of main memory

A write buffer in front of main memory can help; usually several words long (4 in this case)
– Stall on buffer full

Can use write-back
– Value is only written to main memory when it drops out of cache
– Faster, but more complex to design and control


Exploiting Spatial Locality

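Simply: when we have a miss, load a group of adjacent words into cache

Implies a cache block size > 1 word

[Figure: cache with four-word blocks. The 32-bit address (bit positions 31-0) is split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). The cache has 4K entries, each with a valid bit (V), a 16-bit tag and 128 bits of data; a multiplexor uses the block offset to select one of the four 32-bit words. Outputs are Hit and Data.]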
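A sketch of how an address breaks down for the four-word-block cache in the figure: 16-bit tag, 12-bit index, 2-bit block offset, 2-bit byte offset. The function name is an assumption made for illustration.

```python
def split_address(address):
    """Split a 32-bit address for a 4K-entry cache with four-word blocks."""
    byte_offset = address & 0x3            # bits 1-0: byte within the word
    block_offset = (address >> 2) & 0x3    # bits 3-2: word within the block
    index = (address >> 4) & 0xFFF         # bits 15-4: which cache entry
    tag = (address >> 16) & 0xFFFF         # bits 31-16: compared on lookup
    return tag, index, block_offset, byte_offset


tag, index, block_offset, byte_offset = split_address(0x1234567C)
print(hex(tag), hex(index), block_offset, byte_offset)
```

Adjacent words differ only in the block offset, so they share one entry and one tag; that sharing is what lets a single miss bring in neighbours that are likely to be used next.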


Using wider cache

Read misses same as before

If we get a write hit
– Continue as usual

For a write miss
– Read entire block from memory
– Write word in block in cache
– Write block back


Block size

With bigger blocks
– Miss rate falls (until blocks become a large fraction of the cache)
– Cost of a miss increases; time is latency to first word + block transfer time
• Obviously greater for bigger block

[Figure: miss rate (0%-40%) versus block size (4, 16, 64, 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]


Memory design for caches

Miss is resolved from main (DRAM) memory

DRAM designed for density, not speed

Reduce miss penalty by increasing width of memory retrieved

[Figure: three memory organizations. a. One-word-wide memory organization (CPU, cache, bus, memory). b. Wide memory organization (CPU, multiplexor, cache, bus, memory). c. Interleaved memory organization (CPU, cache, bus, memory banks 0-3).]


Calculating penalty

Hypothetical access time for a DRAM
– 1 clock cycle for sending address
– 15 clock cycles for initiating the access
– 1 clock cycle for sending the data

Memory organisation
– Block with four words
– Memory access 1 word wide

Miss penalty
– 1 + 4 × 15 + 4 × 1 = 65 cycles
– Bytes / cycle = 4 × 4 / 65 ≈ 0.25


Widening access

Option 1 (one-word-wide memory) is what we were assuming

Option 2 (wide memory) reduces latency and transfer times
– 1 + 1 × 15 + 1 × 1 = 17 cycles

Option 3 (interleaving) reduces latency, but not transfer time
– 1 + 1 × 15 + 4 × 1 = 20 cycles

Option 3 is cheaper than option 2 and only marginally slower
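The three miss penalties can be reproduced from the hypothetical DRAM timings (1 cycle to send the address, 15 per access, 1 per transfer, four-word blocks). The function and constant names below are illustrative assumptions.

```python
ADDRESS_CYCLES = 1     # send the address
ACCESS_CYCLES = 15     # initiate one DRAM access
TRANSFER_CYCLES = 1    # send one bus-width of data
WORDS_PER_BLOCK = 4
BYTES_PER_WORD = 4


def miss_penalty(num_accesses, num_transfers):
    """Cycles to fetch one block given how many accesses and transfers are serialised."""
    return (ADDRESS_CYCLES
            + num_accesses * ACCESS_CYCLES
            + num_transfers * TRANSFER_CYCLES)


organisations = {
    "one-word-wide": miss_penalty(4, 4),      # four accesses, four transfers
    "four-word-wide": miss_penalty(1, 1),     # one access, one wide transfer
    "interleaved, 4 banks": miss_penalty(1, 4),  # accesses overlap, transfers do not
}

for name, cycles in organisations.items():
    bandwidth = WORDS_PER_BLOCK * BYTES_PER_WORD / cycles
    print(f"{name}: {cycles} cycles, {bandwidth:.2f} bytes/cycle")
```

Interleaving recovers most of the benefit of a wide memory because the 15-cycle access latency, not the transfer, dominates the penalty.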


Summary

Simplest cache: direct mapped
– 1 word: 1 location and 1 tag per word

Write-through/back used for consistency

To exploit spatial locality, have cache block > 1 word
– Tradeoff in block size

Updating of cache from memory sped up by
– Making memory & bus wider
– Interleaving
– Both schemes minimize the number of times we initiate a memory access


More summary

Since cycles spent on program = processor cycles + memory stall cycles, memory design has huge impact on performance

Faster processors mean relatively greater impact of memory stalls


Cache Performance

Coming back to Amdahl’s law, we can quickly show that faster processors can be undone by slow cache

Typically a poor cache can reduce a 2x performance increase to 1.2x
– Especially for highly clocked, low CPI processors

We shall look at some improvements


Where can block be placed?

Direct mapped - location is known

Fully associative - must search whole cache

Set associative:
– A fixed number of locations where a block can be placed
– A set associative cache with n locations per set is called n-way set associative
– Block is mapped to unique set in cache (like hashing)
– Increasing associativity usually decreases misses


Associative Cache

For direct mapped
– Location = (block number) modulo (# of cache blocks)

For set associative
– Location = (block number) modulo (# of cache sets)

What is a set?
– Group of blocks in which a given block can be placed
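A small sketch of the two modulo rules above, treating direct mapped as one-way set associative. The eight-block cache and block number 12 are illustrative assumptions, as is the function name.

```python
def set_for_block(block_number, num_cache_blocks, ways):
    """Set that a memory block maps to in a `ways`-way set associative cache."""
    num_sets = num_cache_blocks // ways   # direct mapped: ways = 1, so sets = blocks
    return block_number % num_sets


# Where can memory block 12 go in an eight-block cache?
for ways in (1, 2, 4, 8):
    s = set_for_block(12, 8, ways)
    print(f"{ways}-way: set {s}, any of {ways} block(s) in that set")
```

With eight ways there is a single set, so the block can go anywhere: the fully associative case.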


Associative caches

[Figure: block placement and tag search in a direct mapped cache (blocks 0-7), a set associative cache (sets 0-3) and a fully associative cache.]

[Figure: an eight-block cache organised as one-way set associative (direct mapped, blocks 0-7), two-way set associative (sets 0-3), four-way set associative (sets 0-1) and eight-way set associative (fully associative); each entry holds a tag and data.]


Tradeoff

[Figure: miss rate (0%-15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB and 128 KB.]


How is block found

In direct mapped cache: index the cache

In set associative cache:
– Index the set
– Check tags in the set to see if a match is found

Choice depends on cost of a miss
– High associativity balanced against search time
– Fully assoc. too much, unless cache is small


Replacing Blocks

Fully associative
– Any block can be replaced

Set associative
– Must choose from set

Direct mapped
– No choice

Algorithms
– Random: cheap, simple
– LRU: expensive as cache grows
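To make the LRU option concrete, here is a minimal sketch of exact LRU replacement within one set. The class name is hypothetical, and the use of Python's `OrderedDict` is a software convenience; as a later slide notes, real hardware usually only approximates LRU.

```python
from collections import OrderedDict


class LRUSet:
    """One set of an n-way set associative cache with exact LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit; on a miss, load the block, evicting the LRU one."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)    # evict least recently used
        self.blocks[tag] = None
        return False


two_way = LRUSet(ways=2)
print([two_way.access(tag) for tag in (1, 2, 1, 3, 2)])
```

Tracking the full recency order is what makes LRU expensive as the number of ways grows; a random choice needs no per-block state at all.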


Handling writes

Write back
– Words can be written at the rate of fastest memory
– Multiple writes in a block only need one write to slow memory

Write through
– Read misses are cheaper (no write on displaced block)
– Easier to implement


Virtual Memory

Cache used to provide fast access for processor to recently used code and data

Similarly, main memory can act as a cache for disk: this is called ‘virtual memory’

Main reason is to allow memory sharing among multiple programs
– Memory requirement for all programs greater than RAM available
– But only a fraction of memory is used at any given time
– Main memory need only hold in-use values of one program


The burden of memory

Want programs to be able to exceed size of main memory

Programmer used to do this explicitly in code

Divide program into mutually exclusive overlays

Overlays explicitly loaded/unloaded

Virtual memory removes this responsibility from application programmers


Terminology

Concepts similar to cache, but grew from different direction

Page: virtual memory block

Page fault: virtual memory miss

Virtual address: address produced by the CPU and translated into an absolute (physical) address

Memory mapping / address translation: the virtual to physical mapping process
– Book title -> Dewey Decimal number


Virtual / physical addresses

[Figure: address translation maps virtual addresses either to physical addresses in main memory or to disk addresses.]


More terminology

Relocation: provided by virtual memory, as the virtual addresses used by a program are mapped to physical addresses before memory access
– Programs relocated as fixed size pages (blocks); need not be contiguous

A virtual address consists of a virtual page number and a page offset; the virtual page number is translated to a physical page number, while the offset is unchanged


Translation process

[Figure: translation of a 32-bit virtual address (virtual page number in bits 31-12, page offset in bits 11-0) into a 30-bit physical address (physical page number in bits 29-12, page offset in bits 11-0). Only the page number is translated; the page offset is copied unchanged.]
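The translation step can be sketched as below, assuming 4 KB pages (a 12-bit page offset) as in the figure. The dictionary standing in for the page table and the sample mapping are hypothetical.

```python
PAGE_OFFSET_BITS = 12                  # 4 KB pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS


def translate(virtual_address, page_table):
    """Translate a virtual address using a {virtual page: physical page} mapping.

    A missing entry raises KeyError, standing in for a page fault.
    """
    virtual_page = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & (PAGE_SIZE - 1)       # unchanged by translation
    physical_page = page_table[virtual_page]
    return (physical_page << PAGE_OFFSET_BITS) | offset


# Hypothetical mapping: virtual page 3 lives in physical page 0x15.
print(hex(translate(0x00003ABC, {3: 0x15})))
```

Because only the page number changes, consecutive virtual pages can live in unrelated physical pages, which is what makes relocation cheap.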


Design decisions

The cost of a page miss is massive: hundreds of thousands of clock cycles to rectify
– Pages should be large enough to amortize high access times (4 KB - 16 KB typical)
– Need to reduce page fault rate, primarily through flexible page placement
– Virtual memory misses can be handled in software, because the overhead is small compared with disk access time
– Write through is just too darned slow - need better system


Page placement

Try to reduce page misses by optimising placement

Allow any virtual page to map to any physical page, then the OS can choose to replace any page it wants
– Sophisticated replacement algorithms involved


Finding pages

Fully associative mapping allows any page (or block) to be associated with any location in physical memory (or cache)

If things can go anywhere, could be hard to find
– Have a page table which resides in memory and provides an address service for pages
– Indexed with the virtual page number and provides location in memory hierarchy
– Each program has own page table (located via the page table register)


Page table

[Figure: page table lookup. The page table register points to the page table, which is indexed by the 20-bit virtual page number (virtual address bits 31-12). Each entry holds a valid bit (if 0 then page is not present in memory) and an 18-bit physical page number, which is joined with the 12-bit page offset (bits 11-0) to form the 30-bit physical address (bits 29-0).]


Page fault

If valid bit is 0, then a fault occurs

OS is given control through an exception

Once OS has control, must decide where to place page in physical memory

Principle of temporal locality - throw out least recently used page (LRU scheme)


Page fault mechanism

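[Figure: page table indexed by virtual page number; each entry holds a valid bit and a physical page or disk address. Entries with valid = 1 point to pages in physical memory; entries with valid = 0 point to pages in disk storage.]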


Coping with writing

With cache, could use buffer on main memory to hide difference in speed for write-through

Not possible with main memory / disk
– Difference too great

Changes are made in memory, and the page is written back to disk once it drops out of memory

Called copy-back
– Only happens if values on page are changed
• ‘Dirty’ bit
– Just as easy to copy a whole page as the altered values


Improving Performance

Every instruction that accesses memory incurs two memory accesses
– One to look up physical address in page table
– Another to actually access the value

Time to use the answer to all memory problems - Principle of Locality
– This time, we exploit the fact that we are often looking up the same address translations in the page table
– We need a page table cache!


Translation-Lookaside Buffer (TLB)

Cache for the page table

As it is a cache, we need a tag field
– Page table is 1:1 so does not have a tag

Each entry is a physical page number

Need valid and dirty bits
– May never get to the page table, so need to know if values in deallocated page need writing

If hit, use address in TLB, otherwise
– Look in page table. If miss there
– Page fault
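The lookup order just described (TLB first, then page table, then page fault) can be sketched as follows. This is a simplification under stated assumptions: both structures are plain dictionaries keyed by virtual page number, the TLB has no size limit or dirty bits, and `PageFault` is a hypothetical exception standing in for the OS trap.

```python
class PageFault(Exception):
    """Raised when the page is in neither the TLB nor the page table."""


def lookup(virtual_page, tlb, page_table):
    """Return (physical_page, where_found) for a virtual page number."""
    if virtual_page in tlb:                    # tag match in the TLB
        return tlb[virtual_page], "TLB hit"
    physical_page = page_table.get(virtual_page)
    if physical_page is None:                  # valid bit would be 0
        raise PageFault(virtual_page)
    tlb[virtual_page] = physical_page          # cache the translation
    return physical_page, "TLB miss, page table hit"


tlb = {}
page_table = {3: 0x15}
print(lookup(3, tlb, page_table))   # first access goes to the page table
print(lookup(3, tlb, page_table))   # repeat access is served by the TLB
```

The second lookup avoids the extra memory access to the page table, which is the whole point of the TLB.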


TLB diagram

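[Figure: the TLB (valid bit, tag = virtual page number, physical page address) caches a subset of page table entries. The page table (valid bit, physical page or disk address), indexed by virtual page number, points to pages in physical memory or in disk storage.]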


Typical Values for TLB

Size: 16-512 entries
Hit time: 0.5 - 1 clock cycles
Miss penalty: 10-100 clock cycles
Miss rate: 0.01% - 1%


DECStation 3100

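[Figure: DECStation 3100 TLB and cache. The 20-bit virtual page number (virtual address bits 31-12) is looked up in the TLB (valid, dirty, tag, physical page number), producing TLB hit and a 20-bit physical page number. Joined with the 12-bit page offset, this forms the physical address, which is split into a 16-bit physical address tag, a 14-bit cache index and a 2-bit byte offset to access the cache (valid, tag, data), producing cache hit and 32 bits of data.]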


Processor Support

To enable the OS to implement protection in the VM system, the hardware must support the following:
1. Support at least two modes: user and O/S (kernel, supervisor, executive)
2. Provide a portion of CPU state that a user can read but not write
3. Support movement from user to supervisor mode
– All this is needed to allow OS to change page tables, but disallow users from doing so


Handling faults

When a fault occurs, the interrupt mechanism halts the user process and invokes the OS to find the correct page, then returns control to the user process

On fault:
1. Look up the page table entry to find the page’s location on disk
2. Choose page to replace (check dirty bit; a dirty page must be written back first)
3. Start a read to bring page from disk to memory


Memory access diagram

[Figure: flowchart of a memory access. Virtual address -> TLB access -> TLB hit? If no: TLB miss exception. If yes: physical address -> Write? If no: try to read data from cache -> Cache hit? If no: cache miss stall; if yes: deliver data to the CPU. If write: write access bit on? If no: write protection exception; if yes: write data into cache, update the tag, and put the data and the address into the write buffer.]


Summary

VM is the level in the memory hierarchy that lets main memory act as a cache for disk

Cost of miss is high, so:
– Pages are large (spatial locality)
– Virtual - physical address mapping is fully associative

Disk writes are expensive: write-back and dirty bit

Allows multiple processes

Speeded up via address translation cache
– TLB


VM Performance

If a program needs more virtual memory than there is physical memory: big trouble
– System thrashes
– Best buy more memory

More common are TLB misses
– A 64 entry TLB gives 64 × 4 KB = 0.25 MB!
– Cheat by having variable page sizes
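The TLB coverage figure above is simply entries times page size. The sketch below reproduces it and shows why larger pages help; the function name and the 64 KB comparison page size are assumptions chosen for illustration.

```python
def tlb_reach_bytes(entries, page_size_bytes):
    """Amount of memory whose translations fit in the TLB at once."""
    return entries * page_size_bytes


KB = 1024
MB = 1024 * KB

small_pages = tlb_reach_bytes(64, 4 * KB)
large_pages = tlb_reach_bytes(64, 64 * KB)   # hypothetical larger page size

print(small_pages / MB, "MB with 4 KB pages")
print(large_pages / MB, "MB with 64 KB pages")
```

A program whose working set exceeds this reach will take TLB misses even when every page it touches is already in physical memory.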


Memory hierarchy framework

By now you will have spotted that there are similarities in different memory hierarchies

We want to look at common issues

Feature          L1 Cache     Virt. Mem.               TLB
Blocks           250-2000     16000 - 250 000          16-512
Kbytes           16-64        250000-1000000000        0.25 - 16
Block size (B)   32-64        4000 - 64 000            4 - 32
Miss penalty     10 - 25      10 - 100 million         10-1000
Miss rate        2% - 5%      0.000001% - 0.0001%      0.01% - 2%


How to compare

Four questions to apply between any two levels of hierarchy
1. Where can a block be placed?
2. How is block found?
3. Which block replaced on cache miss?
4. What happens on write?

Gary Marsden Slide 67University of Cape Town

Placement

Direct mapped
Set associative
Fully associative

(see cache section)
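The three placement schemes can be seen as one scheme with a different number of ways. The sketch below is an illustration only: `candidate_frames` is a made-up helper, and it assumes the usual "block number modulo number of sets" mapping.

```python
def candidate_frames(block_number, num_frames, ways):
    """Return the frames in which a block may be placed.

    ways == 1 is direct mapped, ways == num_frames is fully associative,
    anything in between is set associative.
    """
    if ways < 1 or num_frames % ways != 0:
        raise ValueError("ways must divide num_frames")
    num_sets = num_frames // ways
    set_index = block_number % num_sets  # assumed modulo mapping
    start = set_index * ways
    return list(range(start, start + ways))


# Block 12 in a cache with 8 frames:
print(candidate_frames(12, 8, 1))  # direct mapped: exactly one frame
print(candidate_frames(12, 8, 2))  # 2-way set associative: one set of two frames
print(candidate_frames(12, 8, 8))  # fully associative: any frame
```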

Gary Marsden Slide 68University of Cape Town

VM associativity

In VM, there are three key factors
1. Miss rates are crucial as the cost is high
2. Mapping is implemented in software, so no cycle time impact
3. Large page size means table size is relatively small
– Therefore, VM is always fully associative
– Cache and TLB: often set associative (recently a move toward direct mapped)

Gary Marsden Slide 69University of Cape Town

Find a block

Direct mapped: index (1 comparison)
Set associative: search the set (comparisons = set size)
Fully associative
– Search the whole cache (comparisons = cache size)
– Use a lookup table (0 comparisons)
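The lookup-table option is how a page table finds a page without comparing any tags. A minimal sketch, assuming 4 KB pages and a flat page table held in a Python list (both are illustrative assumptions, not details from the slides):

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(virtual_address, page_table):
    """Translate a virtual address using a flat page table.

    The virtual page number indexes the table directly, so finding the
    page needs zero tag comparisons. None marks a page that is not in
    main memory.
    """
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[vpn]
    if frame is None:
        raise LookupError("page fault: page %d is not resident" % vpn)
    return frame * PAGE_SIZE + offset


# Hypothetical table: page 0 -> frame 7, page 1 on disk, page 2 -> frame 3.
page_table = [7, None, 3]
print(translate(2 * PAGE_SIZE + 5, page_table))
```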

Gary Marsden Slide 70University of Cape Town

Replace a block

In practice LRU is not truly used; there is always some approximation

Often a random scheme is employed due to the low overhead in calculation
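One well-known approximation to LRU is the clock (second chance) scheme, which keeps a single reference bit per frame instead of a full ordering. The slides do not say which approximation is meant, so treat this as one possible example; the class name is made up.

```python
class ClockReplacer:
    """Approximate LRU with one reference bit per frame (clock algorithm)."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames
        self.ref = [False] * num_frames
        self.hand = 0

    def access(self, block):
        """Access a block; return True on a hit, False on a miss."""
        if block in self.frames:
            self.ref[self.frames.index(block)] = True
            return True
        # Miss: sweep the hand, clearing reference bits, until a frame
        # whose bit is already clear is found. That frame is the victim.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = block
        self.ref[self.hand] = True
        self.hand = (self.hand + 1) % len(self.frames)
        return False


cache = ClockReplacer(3)
hits = [cache.access(b) for b in [1, 2, 3, 4, 2, 5]]
print(hits)
print(cache.frames)
```

In this trace, block 2 survives the final miss because it was referenced after the last sweep, while block 3 is replaced. A random scheme would drop the reference bits altogether and pick any frame.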

Gary Marsden Slide 71University of Cape Town

Writing

Write-through
– Usually for cache
Write-back
– Only workable scheme for virtual memory
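The difference between the two policies shows up in how many writes reach the next level down. The sketch below models a cache holding a single block, with a dirty bit for write-back. It assumes write-allocate and counts a final flush of any dirty block; the function name and trace are invented for illustration.

```python
def count_lower_level_writes(ops, write_back):
    """Count writes sent to the next level for a one-block cache.

    ops is a sequence of ("read" or "write", block_name) pairs.
    """
    resident = None
    dirty = False
    writes = 0
    for kind, block in ops:
        if block != resident:
            if write_back and dirty:
                writes += 1  # evicting a dirty block: write it back once
            resident = block  # assumed write-allocate: load block on any miss
            dirty = False
        if kind == "write":
            if write_back:
                dirty = True  # just mark the block, defer the real write
            else:
                writes += 1  # write-through: every store goes down
    if write_back and dirty:
        writes += 1  # final flush of the last dirty block
    return writes


ops = [("write", "A")] * 3 + [("read", "B"), ("write", "A")]
print(count_lower_level_writes(ops, write_back=False))
print(count_lower_level_writes(ops, write_back=True))
```

With a disk as the lower level, paying for every single store is not workable, which is the point of the write-back bullet above.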

Gary Marsden Slide 72University of Cape Town

Intuitive model

All misses can be classified as:
– Compulsory: first access to a block (cold start)
  • Increase block size
– Capacity: cache cannot contain all the blocks it needs
  • Increase cache size, but not at the cost of access time
– Conflict (or collision) misses: occur in direct mapped or set associative caches
  • Increase associativity, but not at the cost of access time
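A common way to apply this classification to a trace is to run a fully associative LRU cache of the same size alongside the real cache. A miss that the fully associative cache would also take is a capacity miss; a miss it would have avoided is a conflict miss. The sketch below does this for a direct mapped cache. The function name and traces are illustrative assumptions.

```python
from collections import OrderedDict


def classify_misses(trace, num_frames):
    """Classify direct mapped cache misses as compulsory, capacity or conflict."""
    seen = set()            # blocks referenced at least once
    lru = OrderedDict()     # fully associative LRU cache of the same size
    direct = [None] * num_frames
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        # Reference model: fully associative with LRU replacement.
        fa_hit = block in lru
        if fa_hit:
            lru.move_to_end(block)
        else:
            if len(lru) == num_frames:
                lru.popitem(last=False)  # evict the least recently used block
            lru[block] = True
        # Real cache: direct mapped, block number modulo number of frames.
        slot = block % num_frames
        dm_hit = direct[slot] == block
        direct[slot] = block
        if not dm_hit:
            if block not in seen:
                counts["compulsory"] += 1
            elif not fa_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
        seen.add(block)
    return counts


print(classify_misses([0, 4, 0, 4], 4))        # blocks 0 and 4 fight over slot 0
print(classify_misses([0, 1, 2, 3, 4, 0], 4))  # five blocks do not fit in four frames
```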

Gary Marsden Slide 73University of Cape Town

AMD K7

[Figure not available in this transcript]

Gary Marsden Slide 74University of Cape Town

Intel P4 - Prescott original (1 MB L2 cache)

[Figure not available in this transcript]

Gary Marsden Slide 75University of Cape Town

Intel P4 - Prescott newer (2 MB L2 cache)

[Figure not available in this transcript]

Gary Marsden Slide 76University of Cape Town

Intel Centrino (2 MB L2 cache)

[Figure not available in this transcript]

Gary Marsden Slide 77University of Cape Town

Cell processor

[Figure not available in this transcript]

Gary Marsden Slide 78University of Cape Town

Relative performance

[Figure not available in this transcript]

Gary Marsden Slide 79University of Cape Town

AMD and Intel

Both have L1 and L2 cache
– L1 serves to make register data available
– L2 serves to get correct machine instructions lined up
P4 has a TLB for instructions and a TLB for data
AMD has TLBs for instructions and data, for L1 and L2 cache
Both cope with page sizes ranging from 4 KB to 4 MB

Gary Marsden Slide 80University of Cape Town

Multiport cache

From the section on pipelines, you will realise that multiple instructions are executed simultaneously

Therefore the cache is being accessed simultaneously by multiple instructions
– Hence multi-port functionality

Gary Marsden Slide 81University of Cape Town

Future

L3 cache is now moving on-chip
Xeons have 1 MB L3 cache
Clear that efforts to increase clock speed are not so important any more
Better performance to be gained by
– Improving the memory hierarchy
– Understanding how pipelines and memory interact

Gary Marsden Slide 82University of Cape Town

Thanks for the memory!