TRANSCRIPT
Gary Marsden Slide 1 University of Cape Town
Memory
‘The Illusion of Unlimited Fast Memory’– What programmers want, so we need to fake it
Can exploit
– Principle of temporal locality
– Principle of spatial locality
Kind of like going to the library
Slide 2
Memory types
SRAM - static RAM
– Values don’t leak
– Made from 4-6 transistors
– Fairly expensive, but fast and low powered
DRAM - dynamic RAM
– Data leaks out – needs refreshing, usually by frequent reads
– Only needs one transistor
– Flavours
• EDO - pipelined DRAM
• SDRAM - synchronised DRAM
Slide 3
Relative Cost
As per 2004
Technology   Access time        $ / Gbyte
SRAM         0.5-5 ns           4000-10 000
DRAM         50-70 ns           100-200
Disk         5-20 million ns    0.5-2
Slide 4
Memory hierarchy
Exploit locality
Multiple levels of memory of different sizes and speeds
– Fast memory is expensive, so less of it is used than slower, cheap memory
Differences in cost and access times make it advantageous to have a hierarchy of memory, with faster memory closer to the CPU
Slide 5
Hierarchy
Slide 6
Goal
Present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory
Operation
– Hierarchy is multi-level, but data only moved between adjacent layers
– Closer to CPU - fast and small
– Further from CPU - slow and large
Slide 7
Terminology
Block: minimum unit of information present (or not) in a multilevel hierarchy (think book)
Hit: data found in upper level (on desk)
Miss: data not found in upper level
– Lower level accessed to find block (go to shelves)
Hit rate: fraction of memory accesses found in upper levels
– Used to measure performance
Miss rate: fraction of memory accesses not found in upper levels (1 - hit rate)
Slide 8
More terminology
Hit time: time to access upper level of hierarchy (incl. time to determine if it is there) – looking at desk
Miss penalty: time taken to replace block at upper level with block from lower level AND time to deliver block to processor – time to go to shelves and back
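These quantities combine into the standard average memory access time formula; a minimal Python sketch with illustrative numbers (not figures from the slides):

```python
# Average memory access time (AMAT) combines hit time, miss rate and
# miss penalty into a single expected cost per access.
def amat(hit_time, miss_rate, miss_penalty):
    """All times are in clock cycles; miss_rate is a fraction."""
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle hit, 5% miss rate and 100-cycle miss penalty
cycles = amat(1, 0.05, 100)   # 1 + 0.05 * 100 = 6.0 cycles on average
```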
Impacts: OS design, how code is compiled, how applications are written
Slide 9
Summary Diagrams
[Diagram: pyramid of memory levels below the processor/CPU — level 1 at the top down to level n; data are transferred between adjacent levels; distance from the CPU in access time increases, and the size of the memory at each level grows, moving down the hierarchy.]
Slide 10
Caching
‘a safe place for hiding or storing things’
Used to mean the level of memory hierarchy between CPU and main memory
Now used to mean any system to exploit locality
– E.g. browser cache
Kick off by considering a simple cache: processor requests are one block; blocks are one word
Slide 11
Caching in contemporary processors
Slide 12
Reference to missing block
Issues
1. How do we know if Xn is in the cache?
2. If it is, where do we find it?
[Diagram: cache contents (a) before the reference to Xn — X1, X2, X3, X4, Xn-2, Xn-1 present — and (b) after, with Xn added.]
Slide 13
Where’s the word?
The two questions are related: if a given memory word can only go to one location in the cache, there is only one place to look!
Direct mapped caching: fn(mem address)
– Usually (mem address) modulo (number of cache slots)
– Can use a binary ‘trick’ where the number of cache slots is a power of 2
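The modulo mapping and the power-of-two ‘trick’ can be sketched in Python; the slot count and addresses here are illustrative, not from a real machine:

```python
# Direct-mapped placement: slot = (block address) mod (number of slots).
# When the slot count is a power of two, the modulo is just the
# low-order index bits; the remaining upper bits become the tag.
NUM_SLOTS = 8   # must be a power of two for the bit trick to work

def slot_and_tag(block_addr):
    slot = block_addr % NUM_SLOTS    # identical to block_addr & (NUM_SLOTS - 1)
    tag = block_addr // NUM_SLOTS    # upper bits identify which block is held
    return slot, tag

# Address 0b10110 (22): index = low 3 bits = 0b110 (6), tag = 0b10 (2)
```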
Slide 14
Cache picture
Can have valid bit to show slot holds value
[Diagram: a direct-mapped cache with 8 slots (indices 000-111) below main memory; memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001 and 11101 each map to the slot given by their low-order 3 bits.]
Slide 15
Cache details
Need to add a ‘tag’ to each cache slot to show whether the value there is indeed the one required
– Disambiguates the 1:M mapping
Tag is comprised of the remaining bits not used in the modulo calculation
For now we concentrate on ‘read’, then look at cache design for two real machines
Slide 16
Cache - init, miss 10110, miss 11010, miss 10000
Slide 17
Cache - miss 00011, miss 10010
Slide 18
Cache datapath
[Diagram: 32-bit address (bits 31-0) split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset; the index selects one of 1024 (valid, tag, data) entries; comparing the stored tag with the incoming tag (qualified by the valid bit) produces Hit, alongside the 32-bit Data output.]
Slide 19
Cache sizes
Function of word size, cache slots and address size - influences tag size
Assume a 32-bit MIPS address and word, with 2^n slots in the cache (so the index is n bits wide)
2^n × (block size + tag size + valid bit)
= 2^n × (32 + (32 - n - 2) + 1)
= 2^n × (63 - n)
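The formula can be checked numerically; a small Python sketch of the same arithmetic:

```python
# Total bits in the direct-mapped cache above: 32-bit address and word,
# 2**n slots, a (32 - n - 2)-bit tag (2 bits lost to the byte offset)
# and one valid bit per slot.
def cache_bits(n):
    data, tag, valid = 32, 32 - n - 2, 1
    return 2**n * (data + tag + valid)   # == 2**n * (63 - n)

# A 1024-slot (n = 10) cache holds 1024 * 53 bits in total.
```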
Slide 20
Handling cache misses
Control unit must detect miss and process miss by fetching data from lower level (memory)
Cache hit - no problem
– Data memory = data cache
– Instruction memory = instruction cache
Control for misses not so easy
– Miss means the instruction is not valid, so the wrong instruction is executed
– Miss means the data is invalid, so the calculation is meaningless
Slide 21
Steps to cope with a miss
Overview
– Stall processor; activate memory controller; get value from next level; load value; continue
Detail
– Subtract 4 from PC
– Ask main memory to do a read and wait on completion
– Write entry to cache (mem data -> cache data; upper bits -> tag field; set valid bit)
– Restart instruction from first step; refetch correct instruction now in instruction cache
Slide 22
Example Cache - DECStation 3100
Start with a fairly simple design
– MIPS R2000, pipeline similar to chapter 6
– Has instruction and data caches (for pipeline)
– Fetches data and instruction word on every cycle
Cache is 64 KB
– 16K entries of 4-byte words
Steps for a cache read
– Send address to cache
– If hit, request data
– If miss, send full address to main memory – when data is returned, place it in cache
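The read steps above can be sketched as a toy direct-mapped cache in Python (tiny illustrative sizes, not the real 16K-entry DECStation cache):

```python
# Toy direct-mapped cache read: index the cache, compare the tag,
# and on a miss fetch from a backing 'memory' dict and fill the slot.
NUM_SLOTS = 8
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_SLOTS)]

def read(addr, memory):
    slot = cache[addr % NUM_SLOTS]
    tag = addr // NUM_SLOTS
    if slot["valid"] and slot["tag"] == tag:
        return slot["data"], "hit"
    # Miss: go to the next level and fill the slot before returning
    slot.update(valid=True, tag=tag, data=memory[addr])
    return slot["data"], "miss"

mem = {a: a * 10 for a in range(64)}
read(0b10110, mem)   # miss: slot 6 is filled from memory
read(0b10110, mem)   # same address again: hit
```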
Slide 23
Diagram of DECStation 3100
[Diagram: address split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2) and a 2-bit byte offset; 16K entries of (valid, 16-bit tag, 32-bit data); a tag compare produces Hit and the 32-bit Data.]
Slide 24
Writing to Cache
We can’t write to the cache alone, as main memory would become inconsistent with the cache
Can solve by write-through
– Write to memory and cache simultaneously
Implication is that there is no point checking the tag or write location - it is overwritten anyway
– Index the cache using bits 15-2
– Write tag bits (31-16) and data value word
– Write data word to main memory
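A write-through store can be sketched in the same style; toy sizes, illustrative only:

```python
# Write-through: every store updates both the cache slot and main
# memory, so memory never goes stale relative to the cache.
NUM_SLOTS = 8
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_SLOTS)]
memory = {}

def write(addr, value):
    # No need to check the old tag: the slot is overwritten anyway.
    cache[addr % NUM_SLOTS].update(valid=True, tag=addr // NUM_SLOTS, data=value)
    memory[addr] = value   # simultaneous write to main memory

write(22, 99)
```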
Slide 25
Too slow
The problem with a write-through technique is that we are bound by the slower speed of main memory
A main-memory write buffer can help; usually several words long (4 in this case)
– Stall on buffer full
Can use write-back
– Value is only written to main memory when it drops out of cache
– Faster, but more complex to design and control
Slide 26
Exploiting Spatial Locality
Simply: When we have a miss, load a group of adjacent blocks into cache
Implies a cache block size > 1
[Diagram: address split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and the byte offset (bits 1-0); 4K entries of (valid, 16-bit tag, 128-bit data); a multiplexor driven by the block offset selects one of the four 32-bit words in the block.]
Slide 27
Using wider cache
Read misses same as before
If we get a write hit
– Continue as usual
For a write miss
– Read entire block from memory
– Write word into block in cache
– Write block back
Slide 28
Block size
With bigger blocks
– Miss rate falls
– Cost of a miss increases; miss time is latency to first word + block transfer time
• Obviously greater for a bigger block
[Plot: miss rate (0%-40%) against block size (4-256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]
Slide 29
Memory design for caches
Miss is resolved from main (DRAM) memory
DRAM is designed for density, not speed
Reduce miss penalty by increasing the width of memory retrieved
[Diagram: three organisations between CPU, cache, bus and memory — (a) one-word-wide memory; (b) wide memory, with a multiplexor between the cache and CPU; (c) interleaved memory with four banks (bank 0 to bank 3) on a one-word bus.]
Slide 30
Calculating penalty
Hypothetical access times for a DRAM
– 1 clock cycle for sending the address
– 15 clock cycles for initiating the access
– 1 clock cycle for sending the data
Memory organisation
– Block of four words
– Memory access is 1 word wide
Miss penalty
– 1 + 4 × 15 + 4 × 1 = 65 cycles
– Bytes / cycle = 4 × 4 / 65 ≈ 0.25
Slide 31
Widening access
Option one is what we were assuming
Option 2 reduces latency and transfer times
– 1 + 1 × 15 + 1 × 1 = 17 cycles
Option 3 (interleaving) reduces latency, but not transfer time
– 1 + 1 × 15 + 4 × 1 = 20 cycles
Option 3 is cheaper than 2 and only marginally less quick
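The three penalty calculations can be reproduced directly from the slide's figures:

```python
# Miss penalty for a four-word block under the three organisations:
# 1 cycle to send the address, 15 cycles to initiate each DRAM access,
# 1 cycle per word transferred on the bus.
SEND, INIT, XFER, WORDS = 1, 15, 1, 4

one_word_wide = SEND + WORDS * INIT + WORDS * XFER  # serial accesses: 65
wide_memory   = SEND + 1 * INIT + 1 * XFER          # one wide access: 17
interleaved   = SEND + 1 * INIT + WORDS * XFER      # parallel banks, serial bus: 20
```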
Slide 32
Summary
Simplest cache: direct mapped
– 1 word: 1 location and 1 tag per word
Write-through/back used for consistency
To exploit spatial locality, have cache block > 1 word
– Tradeoff in block size
Updating of cache from memory is sped up by
– Making memory & bus wider
– Interleaving
– Both schemes minimise the times we initiate a memory access
Slide 33
More summary
Since cycles spent on program = processor cycles + memory stall cycles, memory design has huge impact on performance
Faster processors mean relatively greater impact of memory stalls
Slide 34
Cache Performance
Coming back to Amdahl’s law, we can quickly show that faster processors can be undone by a slow cache
Typically a poor cache can reduce a 2x performance increase to 1.2x
– Especially for highly clocked, low-CPI processors
We shall look at some improvements
Slide 35
Where can block be placed?
Direct mapped - location is known
Fully associative - must search whole cache
Set associative:
– a fixed number of locations where a block can be placed
– A set-associative cache with n locations is called n-way associative
– Block is mapped to a unique set in the cache (like hashing)
– Increasing associativity usually decreases misses
Slide 36
Associative Cache
For direct mapped
– Location = (block number) modulo (# of cache blocks)
For set associative
– Location = (block number) modulo (# of cache sets)
What is a set?
– The group of blocks where a given word can be placed
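The set mapping can be sketched in Python with illustrative parameters:

```python
# Set-associative placement: a block maps to exactly one set, then any
# of the n ways within that set may hold it.
NUM_BLOCKS, WAYS = 8, 2
NUM_SETS = NUM_BLOCKS // WAYS   # a 2-way cache of 8 blocks has 4 sets

def set_index(block_addr):
    return block_addr % NUM_SETS

# Block 12 maps to set 12 % 4 = 0; a direct-mapped (1-way) cache of the
# same size would instead force it into slot 12 % 8 = 4.
```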
Slide 37
Associative caches
[Diagram: locating a block — direct mapped (check one of blocks # 0-7), set associative (search one of sets # 0-3), fully associative (search the whole cache); and an eight-block cache organised as one-way set associative (direct mapped), two-way, four-way and eight-way set associative (fully associative), each set holding that many (tag, data) pairs.]
Slide 38
Tradeoff
[Plot: miss rate (0%-15%) against associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB.]
Slide 39
How is block found
In a direct mapped cache: index the cache
In a set associative cache:
– Index the set
– Check tags in the set to see if a match is found
Choice depends on the cost of a miss
– High associativity is balanced against search time
– Fully associative costs too much, unless the cache is small
Slide 40
Replacing Blocks
Fully associative
– Any block can be replaced
Set associative
– Must choose from set
Direct mapped
– No choice
Algorithms
– Random: cheap, simple
– LRU: expensive as cache grows
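An LRU policy for a single set can be sketched with Python's OrderedDict (a software illustration of the policy, not how hardware implements it):

```python
# LRU replacement for one set: move a block to the end on every access,
# evict from the front (least recently used) when the set is full.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways, self.blocks = ways, OrderedDict()

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # now the most recently used
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = True
        return "miss"

s = LRUSet(2)
results = [s.access(t) for t in ("A", "B", "A", "C", "B")]
# "B" is evicted when "C" arrives, so the final access to "B" misses
```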
Slide 41
Handling writes
Write back
– Words can be written at the rate of the fastest memory
– Multiple writes to a block only need one write to slow memory
Write through
– Read misses are cheaper (no write of a displaced block)
– Easier to implement
Slide 42
Virtual Memory
Cache is used to provide fast access for the processor to recently used code
Similarly, main memory can act as a cache for disk: this is called ‘virtual memory’
Main reason is to allow memory sharing among multiple programs
– Memory requirement of all programs is greater than the RAM available
– But only a fraction of memory is in use at any given time
– Main memory need only hold the in-use values of one program
Slide 43
The burden of memory
Want programs to be able to exceed the size of main memory
Programmers used to do this explicitly in code
– Divide the program into mutually exclusive overlays
– Overlays explicitly loaded/unloaded
Virtual memory removes this responsibility from application programmers
Slide 44
Terminology
Concepts similar to cache, but grew from a different direction
Page: virtual memory block
Page fault: virtual memory miss
Virtual address: address produced by the CPU and translated into an absolute address
Memory mapping / address translation: the virtual-to-physical mapping process
– Book title -> Dewey Decimal number
Slide 45
Virtual / physical addresses
[Diagram: virtual addresses pass through address translation to physical addresses in main memory or to disk addresses.]
Slide 46
More terminology
Relocation: provided by virtual memory, as the virtual addresses used by a program are mapped to physical addresses before memory is accessed
– Programs are relocated as fixed-size pages (blocks); these need not be contiguous
Virtual memory addresses consist of a virtual page number and offset, which is translated to a physical page
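The split can be sketched in Python, assuming 4 KB pages (12 offset bits), as in the translation diagram that follows:

```python
# Split a 32-bit virtual address into virtual page number and page
# offset; with 4 KB pages the offset occupies the low 12 bits.
PAGE_BITS = 12

def split(vaddr):
    vpn = vaddr >> PAGE_BITS                 # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # unchanged by translation
    return vpn, offset

vpn, offset = split(0x12345678)   # vpn 0x12345, offset 0x678
```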
Slide 47
Translation process
[Diagram: a 32-bit virtual address = virtual page number (bits 31-12) + page offset (bits 11-0); translation maps the virtual page number to a physical page number (bits 29-12), with the page offset carried over unchanged into the physical address.]
Slide 48
Design decisions
The cost of a page miss is massive: hundreds of thousands of clock cycles to rectify
– Pages should be large enough to amortise the high access times (4 KB - 16 KB typical)
– Need to reduce the page fault rate, primarily through flexible page placement
– Virtual memory misses can be handled in software, since disk is so slow that the software overhead is negligible
– Write-through is just too darned slow - need a better system
Slide 49
Page placement
Try to reduce page misses by optimising placement
Allow any virtual page to map to any physical page; the OS can then choose to replace any page it wants
– Sophisticated replacement algorithms are involved
Slide 50
Finding pages
Fully associative mapping allows any page (or block) to be associated with any location in physical memory (or cache)
If things can go anywhere, they could be hard to find
– Have a page table which resides in memory and provides an address service for pages
– Indexed with the virtual address; provides the location in the memory hierarchy
– Each program has its own page table (page table register)
Slide 51
Page table
[Diagram: the page table register points to the page table; the 20-bit virtual page number indexes it; each entry holds a valid bit and an 18-bit physical page number (valid = 0 means the page is not present in memory); the physical page number plus the 12-bit page offset forms the physical address.]
Slide 52
Page fault
If the valid bit is 0, then a fault occurs
The OS is given control through an exception
Once the OS has control, it must decide where to place the page in physical memory
Principle of temporal locality - throw out the least recently used page (LRU scheme)
Slide 53
Page fault mechanism
[Diagram: page table entries indexed by virtual page number; entries with valid = 1 hold physical page numbers pointing into physical memory, entries with valid = 0 hold disk addresses pointing into disk storage.]
Slide 54
Coping with writing
With cache, we could use a buffer on main memory to hide the difference in speed for write-through
Not possible with main memory / disk
– Difference too great
Changes are made in memory, and the page is written back to disk once it drops out of memory
Called copy-back
– Only happens if values on the page have changed
• ‘Dirty’ bit
– Just as easy to copy a whole page as the altered values
Slide 55
Improving Performance
Every instruction that accesses memory incurs two memory accesses
– One to look up the physical address in the page table
– Another to actually access the value
Time to use the answer to all memory problems - the Principle of Locality
– This time, we exploit the fact that we are often looking up the same address translations in the page table
– We need a page table cache!
Slide 56
Translation-Lookaside Buffer (TLB)
Cache for the page table
As it is a cache, we need a tag field
– The page table is 1:1, so it does not need a tag
Each entry is a physical page number
Need valid and dirty bits
– May never get to the page table, so we need to know if values in a deallocated page need writing back
If hit, use the address in the TLB; otherwise
– Look in the page table; if we miss there
– Page fault
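The hit/miss/fault sequence above can be sketched with dictionaries standing in for the TLB and page table (a software illustration, not real hardware):

```python
# TLB lookup with fallback to the page table: TLB hit -> use it;
# TLB miss -> consult the page table and refill the TLB;
# invalid page-table entry -> page fault.
tlb = {}                                          # vpn -> physical page number
page_table = {0: (True, 7), 1: (False, None), 2: (True, 3)}

def translate(vpn):
    if vpn in tlb:
        return tlb[vpn], "tlb hit"
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        return None, "page fault"                 # OS takes over here
    tlb[vpn] = ppn                                # refill from the page table
    return ppn, "tlb miss"

translate(0)   # tlb miss: page table consulted, TLB refilled
translate(0)   # tlb hit this time
translate(1)   # page fault: entry invalid
```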
Slide 57
TLB diagram
[Diagram: the TLB holds (valid, tag, physical page address) entries for recently used virtual page numbers; the full page table maps every virtual page number to a physical page or disk address; a TLB miss falls back to the page table, and invalid page-table entries point into disk storage rather than physical memory.]
Slide 58
Typical Values for TLB
Size: 16-512 entries
Hit time: 0.5-1 clock cycles
Miss penalty: 10-100 clock cycles
Miss rate: 0.01%-1%
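Plugging illustrative values from these ranges into the average-access-time formula shows how little a good TLB costs per access on average:

```python
# Effective translation cost: hit time + miss rate * miss penalty.
# The specific picks below are illustrative values from the ranges above.
hit_time, miss_rate, miss_penalty = 1, 0.001, 30

effective = hit_time + miss_rate * miss_penalty   # 1.03 cycles on average
```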
Slide 59
DECStation 3100
[Diagram: the 32-bit virtual address splits into a 20-bit virtual page number and a 12-bit page offset; the TLB (valid, dirty, tag, physical page number) yields the physical page number, which with the page offset forms the physical address; the upper 16 bits of the physical address form the physical address tag and the next 14 bits the cache index (plus a 2-bit byte offset); the direct-mapped cache (valid, tag, data) then produces Cache hit and 32-bit Data.]
Slide 60
Processor Support
To enable the OS to implement protection in the VM system, the hardware must support the following:
1. Support at least two modes: user and O/S (kernel, supervisor, executive)
2. Provide a portion of CPU state that a user can read but not write
3. Support movement from user to supervisor mode
– All this is needed to allow the OS to change page tables, but disallow users from doing so
Slide 61
Handling faults
When a fault occurs, the interrupt mechanism halts the user process and invokes the OS to find the correct page, then returns control to the user process
On fault:
1. Find physical address from page table
2. Choose page to replace (check dirty bit)
3. Start a read to bring the page from disk to memory
Slide 62
Memory access diagram
[Flowchart: virtual address → TLB access; a TLB miss raises a TLB miss exception; on a TLB hit the physical address is formed; a read tries the cache, delivering data to the CPU on a cache hit and stalling on a cache miss; a write first checks the write access bit (raising a write protection exception if it is off), then writes data into the cache, updates the tag, and puts the data and address into the write buffer.]
Slide 63
Summary
VM is the level in the memory hierarchy bridging main memory and disk
Cost of a miss is high, so:
– Pages are large (spatial locality)
– Virtual-to-physical address mapping is fully associative
Disk writes are expensive: write-back and dirty bit
Allows multiple processes
Sped up via an address translation cache
– TLB
Slide 64
VM Performance
If a program needs more virtual memory than there is physical memory: big trouble
– System thrashes
– Best to buy more memory
More common are TLB misses
– A 64-entry TLB covers only 64 × 4 KB = 0.25 MB!
– Cheat by having variable page sizes
Slide 65
Memory hierarchy framework
By now you will have spotted that there are similarities in different memory hierarchies
We want to look at common issues

Feature               L1 cache   Virtual memory           TLB
Blocks                250-2000   16 000-250 000           16-512
Size (Kbytes)         16-64      250 000-1 000 000 000    0.25-16
Block size (bytes)    32-64      4000-64 000              4-32
Miss penalty (clks)   10-25      10-100 million           10-1000
Miss rate             2%-5%      0.000001%-0.0001%        0.01%-2%
Slide 66
How to compare
Four questions to apply between any two levels of the hierarchy
1. Where can a block be placed?
2. How is a block found?
3. Which block is replaced on a miss?
4. What happens on a write?
Slide 67
Placement
Direct
Set associative
Fully associative
(see cache section)
Slide 68
VM associativity
In VM, there are three key factors
1. Miss rates are crucial, as the cost of a miss is high
2. Mapping is implemented in software, so there is no cycle time impact
3. Large page size means the table size is relatively small
– Therefore, VM is always fully associative
– Cache and TLB: often set associative (with a recent move toward direct mapped)
Slide 69
Find a block
Direct - index (1 comparison)
Set - search the set (set size comparisons)
Fully associative
– Search the whole cache (cache size comparisons)
– Use a lookup table (0 comparisons)
Slide 70
Replace a block
In practice LRU is not truly used; there is always some approximation
Often a random scheme is employed due to the low overhead in calculation
Slide 71
Writing
Write-through
– Usually for cache
Write-back
– Only workable scheme for virtual memory
Slide 72
Intuitive model
All misses can be classified as:
– Compulsory: first access to a block (cold start)
• Increase block size
– Capacity: cache cannot contain all the blocks it needs
• Increase cache size, but not at the cost of access time
– Conflict (or collision) misses: in direct mapped or set associative caches
• Increase associativity, but not at the cost of access time
Slide 73
AMD K7
Slide 74
Intel P4 - Prescott original (1 MB L2 cache)
Slide 75
Intel P4 - Prescott newer (2 MB L2 cache)
Slide 76
Intel Centrino (2 MB L2 cache)
Slide 77
Cell processor
Slide 78
Relative performance
Slide 79
AMD and Intel
Both have L1 and L2 cache
– L1 serves to make register data available
– L2 serves to get the correct machine instructions lined up
P4 has a TLB for instructions and a TLB for data
AMD has TLBs for instructions and data, for both L1 and L2 cache
Both cope with page sizes ranging from 4 KB to 4 MB
Slide 80
Multiport cache
From the section on pipelines, you will realise that multiple instructions are executed simultaneously
Therefore the cache is being accessed simultaneously by multiple instructions
– Hence multi-port functionality
Slide 81
Future
L3 cache is now moving on-chip
Xeons have 1 MB L3 cache
Clear that efforts to increase clock speed are not so important any more
Better performance is to be gained by
– Improving the memory hierarchy
– Understanding how pipelines and memory interact
Slide 82
Thanks for the memory!