TRANSCRIPT
Gary Marsden Slide 1 University of Cape Town
Memory
‘The Illusion of Unlimited Fast Memory’– What programmers want, so we need to fake it
Can exploit
– Principle of temporal locality
– Principle of spatial locality
Kind of like going to the library
Slide 2
Memory types
SRAM - static RAM
– Values don’t leak
– Made from 4-6 transistors
– Fairly expensive, but fast and low powered
DRAM - dynamic RAM
– Data leaks out – needs refreshing, usually by frequent reads
– Only needs one transistor
– Flavours
• EDO - pipelined DRAM
• SDRAM - synchronised DRAM
Slide 3
Relative Cost
As per 2004
Technology   Access time        $ / Gbyte
SRAM         0.5-5 ns           4000-10 000
DRAM         50-70 ns           100-200
Disk         5-20 million ns    0.5-2
Slide 4
Memory hierarchy
Exploit locality
Multiple levels of memory of different sizes and speeds
– Fast memory is expensive, so less of it is used than slower, cheap memory
Differences in cost and access times make it advantageous to have a hierarchy of memory, with faster memory closer to the CPU
Slide 5
Hierarchy
Slide 6
Goal
Present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory
Operation
– Hierarchy is multi-level, but data only moved between adjacent layers
– Closer to CPU - fast and small
– Further from CPU - slow and large
Slide 7
Terminology
Block: minimum unit of information present (or not) in a multilevel hierarchy (think book)
Hit: data found in upper level (on desk)
Miss: data not found in upper level
– Lower level accessed to find block (go to shelves)
Hit rate: fraction of memory accesses found in upper levels
– Used to measure performance
Miss rate: fraction of memory accesses not found in upper levels (1 - hit rate)
Slide 8
More terminology
Hit time: time to access upper level of hierarchy (incl. time to determine if it is there) – looking at desk
Miss penalty: time taken to replace block at upper level with block from lower level AND time to deliver block to processor – time to go to shelves and back
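These quantities combine into the standard average memory access time formula; a minimal Python sketch with illustrative numbers (not figures from the slides):

```python
# Average memory access time (AMAT) combines hit time, miss rate and
# miss penalty into a single expected cost per access.
def amat(hit_time, miss_rate, miss_penalty):
    """All times are in clock cycles; miss_rate is a fraction."""
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle hit, 5% miss rate and 100-cycle miss penalty
cycles = amat(1, 0.05, 100)   # 1 + 0.05 * 100 = 6.0 cycles on average
```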
Impacts: OS design, how code is compiled, how applications are written
Slide 9
Summary Diagrams
[Diagram: pyramid of memory levels below the processor/CPU — level 1 at the top down to level n; data are transferred between adjacent levels; distance from the CPU in access time increases, and the size of the memory at each level grows, moving down the hierarchy.]
Slide 10
Caching
‘a safe place for hiding or storing things’
Used to mean the level of memory hierarchy between CPU and main memory
Now used to mean any system to exploit locality
– E.g. browser cache
Kick off by considering a simple cache: processor requests are one block; blocks are one word
Slide 11
Caching in contemporary processors
Slide 12
Reference to missing block
Issues
1. How do we know if Xn is in the cache?
2. If it is, where do we find it?
[Diagram: cache contents (a) before the reference to Xn — X1, X2, X3, X4, Xn-2, Xn-1 present — and (b) after, with Xn added.]
Slide 13
Where’s the word?
The two questions are related: if a given memory word can only go to one location in the cache, there is only one place to look!
Direct mapped caching: fn(mem address)
– Usually (mem address) modulo (number of cache slots)
– Can use a binary ‘trick’ where the number of cache slots is a power of 2
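The modulo mapping and the power-of-two ‘trick’ can be sketched in Python; the slot count and addresses here are illustrative, not from a real machine:

```python
# Direct-mapped placement: slot = (block address) mod (number of slots).
# When the slot count is a power of two, the modulo is just the
# low-order index bits; the remaining upper bits become the tag.
NUM_SLOTS = 8   # must be a power of two for the bit trick to work

def slot_and_tag(block_addr):
    slot = block_addr % NUM_SLOTS    # identical to block_addr & (NUM_SLOTS - 1)
    tag = block_addr // NUM_SLOTS    # upper bits identify which block is held
    return slot, tag

# Address 0b10110 (22): index = low 3 bits = 0b110 (6), tag = 0b10 (2)
```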
Slide 14
Cache picture
Can have valid bit to show slot holds value
[Diagram: a direct-mapped cache with 8 slots (indices 000-111) below main memory; memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001 and 11101 each map to the slot given by their low-order 3 bits.]
Slide 15
Cache details
Need to add a ‘tag’ to each cache slot to show whether the value there is indeed the one required
– Disambiguates the 1:M mapping
Tag is comprised of the remaining bits not used in the modulo calculation
For now we concentrate on ‘read’, then look at cache design for two real machines
Slide 16
Cache - init, miss 10110, miss 11010, miss 10000
Slide 17
Cache - miss 00011, miss 10010
Slide 18
Cache datapath
[Diagram: 32-bit address (bits 31-0) split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset; the index selects one of 1024 (valid, tag, data) entries; comparing the stored tag with the incoming tag (qualified by the valid bit) produces Hit, alongside the 32-bit Data output.]
Slide 19
Cache sizes
Function of word size, cache slots and address size - influences tag size
Assume a 32-bit MIPS address and word, with 2^n slots in the cache (so the index is n bits wide)
2^n × (block size + tag size + valid bit)
= 2^n × (32 + (32 - n - 2) + 1)
= 2^n × (63 - n)
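The formula can be checked numerically; a small Python sketch of the same arithmetic:

```python
# Total bits in the direct-mapped cache above: 32-bit address and word,
# 2**n slots, a (32 - n - 2)-bit tag (2 bits lost to the byte offset)
# and one valid bit per slot.
def cache_bits(n):
    data, tag, valid = 32, 32 - n - 2, 1
    return 2**n * (data + tag + valid)   # == 2**n * (63 - n)

# A 1024-slot (n = 10) cache holds 1024 * 53 bits in total.
```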
Slide 20
Handling cache misses
Control unit must detect miss and process miss by fetching data from lower level (memory)
Cache hit - no problem
– Data memory = data cache
– Instruction memory = instruction cache
Control for misses not so easy
– Miss means the instruction is not valid, so the wrong instruction is executed
– Miss means the data is invalid, so the calculation is meaningless
Slide 21
Steps to cope with a miss
Overview
– Stall processor; activate memory controller; get value from next level; load value; continue
Detail
– Subtract 4 from PC
– Ask main memory to do a read and wait on completion
– Write entry to cache (mem data -> cache data; upper bits -> tag field; set valid bit)
– Restart instruction from first step; refetch correct instruction now in instruction cache
Slide 22
Example Cache - DECStation 3100
Start with a fairly simple design
– MIPS R2000, pipeline similar to chapter 6
– Has instruction and data caches (for pipeline)
– Fetches data and instruction word on every cycle
Cache is 64 KB
– 16K entries of 4-byte words
Steps for a cache read
– Send address to cache
– If hit, request data
– If miss, send full address to main memory – when data is returned, place it in cache
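The read steps above can be sketched as a toy direct-mapped cache in Python (tiny illustrative sizes, not the real 16K-entry DECStation cache):

```python
# Toy direct-mapped cache read: index the cache, compare the tag,
# and on a miss fetch from a backing 'memory' dict and fill the slot.
NUM_SLOTS = 8
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_SLOTS)]

def read(addr, memory):
    slot = cache[addr % NUM_SLOTS]
    tag = addr // NUM_SLOTS
    if slot["valid"] and slot["tag"] == tag:
        return slot["data"], "hit"
    # Miss: go to the next level and fill the slot before returning
    slot.update(valid=True, tag=tag, data=memory[addr])
    return slot["data"], "miss"

mem = {a: a * 10 for a in range(64)}
read(0b10110, mem)   # miss: slot 6 is filled from memory
read(0b10110, mem)   # same address again: hit
```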
Slide 23
Diagram of DECStation 3100
[Diagram: address split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2) and a 2-bit byte offset; 16K entries of (valid, 16-bit tag, 32-bit data); a tag compare produces Hit and the 32-bit Data.]
Slide 24
Writing to Cache
We can’t write to the cache alone, as main memory would become inconsistent with the cache
Can solve by write-through
– Write to memory and cache simultaneously
Implication is that there is no point checking the tag or write location - it is overwritten anyway
– Index the cache using bits 15-2
– Write tag bits (31-16) and data value word
– Write data word to main memory
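A write-through store can be sketched in the same style; toy sizes, illustrative only:

```python
# Write-through: every store updates both the cache slot and main
# memory, so memory never goes stale relative to the cache.
NUM_SLOTS = 8
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_SLOTS)]
memory = {}

def write(addr, value):
    # No need to check the old tag: the slot is overwritten anyway.
    cache[addr % NUM_SLOTS].update(valid=True, tag=addr // NUM_SLOTS, data=value)
    memory[addr] = value   # simultaneous write to main memory

write(22, 99)
```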
Slide 25
Too slow
The problem with a write-through technique is that we are bound by the slower speed of main memory
A main-memory write buffer can help; usually several words long (4 in this case)
– Stall on buffer full
Can use write-back
– Value is only written to main memory when it drops out of cache
– Faster, but more complex to design and control
Slide 26
Exploiting Spatial Locality
Simply: When we have a miss, load a group of adjacent blocks into cache
Implies a cache block size > 1
[Diagram: address split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and the byte offset (bits 1-0); 4K entries of (valid, 16-bit tag, 128-bit data); a multiplexor driven by the block offset selects one of the four 32-bit words in the block.]
Slide 27
Using wider cache
Read misses same as before
If we get a write hit
– Continue as usual
For a write miss
– Read entire block from memory
– Write word into block in cache
– Write block back
Slide 28
Block size
With bigger blocks
– Miss rate falls
– Cost of a miss increases; miss time is latency to first word + block transfer time
• Obviously greater for a bigger block
[Plot: miss rate (0%-40%) against block size (4-256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]
Slide 29
Memory design for caches
Miss is resolved from main (DRAM) memory
DRAM is designed for density, not speed
Reduce miss penalty by increasing the width of memory retrieved
[Diagram: three organisations between CPU, cache, bus and memory — (a) one-word-wide memory; (b) wide memory, with a multiplexor between the cache and CPU; (c) interleaved memory with four banks (bank 0 to bank 3) on a one-word bus.]
Slide 30
Calculating penalty
Hypothetical access times for a DRAM
– 1 clock cycle for sending the address
– 15 clock cycles for initiating the access
– 1 clock cycle for sending the data
Memory organisation
– Block of four words
– Memory access is 1 word wide
Miss penalty
– 1 + 4 × 15 + 4 × 1 = 65 cycles
– Bytes / cycle = 4 × 4 / 65 ≈ 0.25
Slide 31
Widening access
Option one is what we were assuming
Option 2 reduces latency and transfer times
– 1 + 1 × 15 + 1 × 1 = 17 cycles
Option 3 (interleaving) reduces latency, but not transfer time
– 1 + 1 × 15 + 4 × 1 = 20 cycles
Option 3 is cheaper than 2 and only marginally less quick
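The three penalty calculations can be reproduced directly from the slide's figures:

```python
# Miss penalty for a four-word block under the three organisations:
# 1 cycle to send the address, 15 cycles to initiate each DRAM access,
# 1 cycle per word transferred on the bus.
SEND, INIT, XFER, WORDS = 1, 15, 1, 4

one_word_wide = SEND + WORDS * INIT + WORDS * XFER  # serial accesses: 65
wide_memory   = SEND + 1 * INIT + 1 * XFER          # one wide access: 17
interleaved   = SEND + 1 * INIT + WORDS * XFER      # parallel banks, serial bus: 20
```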
Slide 32
Summary
Simplest cache: direct mapped
– 1 word: 1 location and 1 tag per word
Write-through/back used for consistency
To exploit spatial locality, have cache block > 1 word
– Tradeoff in block size
Updating of cache from memory is sped up by
– Making memory & bus wider
– Interleaving
– Both schemes minimise the times we initiate a memory access
Slide 33
More summary
Since cycles spent on program = processor cycles + memory stall cycles, memory design has huge impact on performance
Faster processors mean relatively greater impact of memory stalls
Slide 34
Cache Performance
Coming back to Amdahl’s law, we can quickly show that faster processors can be undone by a slow cache
Typically a poor cache can reduce a 2x performance increase to 1.2x
– Especially for highly clocked, low-CPI processors
We shall look at some improvements
Slide 35
Where can block be placed?
Direct mapped - location is known
Fully associative - must search whole cache
Set associative:
– a fixed number of locations where a block can be placed
– A set-associative cache with n locations is called n-way associative
– Block is mapped to a unique set in the cache (like hashing)
– Increasing associativity usually decreases misses
Slide 36
Associative Cache
For direct mapped
– Location = (block number) modulo (# of cache blocks)
For set associative
– Location = (block number) modulo (# of cache sets)
What is a set?
– The group of blocks where a given word can be placed
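The set mapping can be sketched in Python with illustrative parameters:

```python
# Set-associative placement: a block maps to exactly one set, then any
# of the n ways within that set may hold it.
NUM_BLOCKS, WAYS = 8, 2
NUM_SETS = NUM_BLOCKS // WAYS   # a 2-way cache of 8 blocks has 4 sets

def set_index(block_addr):
    return block_addr % NUM_SETS

# Block 12 maps to set 12 % 4 = 0; a direct-mapped (1-way) cache of the
# same size would instead force it into slot 12 % 8 = 4.
```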
Slide 37
Associative caches
[Diagram: locating a block — direct mapped (check one of blocks # 0-7), set associative (search one of sets # 0-3), fully associative (search the whole cache); and an eight-block cache organised as one-way set associative (direct mapped), two-way, four-way and eight-way set associative (fully associative), each set holding that many (tag, data) pairs.]
Slide 38
Tradeoff
[Plot: miss rate (0%-15%) against associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB.]
Slide 39
How is block found
In a direct mapped cache: index the cache
In a set associative cache:
– Index the set
– Check tags in the set to see if a match is found
Choice depends on the cost of a miss
– High associativity is balanced against search time
– Fully associative costs too much, unless the cache is small
Slide 40
Replacing Blocks
Fully associative
– Any block can be replaced
Set associative
– Must choose from set
Direct mapped
– No choice
Algorithms
– Random: cheap, simple
– LRU: expensive as cache grows
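An LRU policy for a single set can be sketched with Python's OrderedDict (a software illustration of the policy, not how hardware implements it):

```python
# LRU replacement for one set: move a block to the end on every access,
# evict from the front (least recently used) when the set is full.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways, self.blocks = ways, OrderedDict()

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # now the most recently used
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = True
        return "miss"

s = LRUSet(2)
results = [s.access(t) for t in ("A", "B", "A", "C", "B")]
# "B" is evicted when "C" arrives, so the final access to "B" misses
```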
Slide 41
Handling writes
Write back
– Words can be written at the rate of the fastest memory
– Multiple writes to a block only need one write to slow memory
Write through
– Read misses are cheaper (no write of a displaced block)
– Easier to implement
Slide 42
Virtual Memory
Cache is used to provide fast access for the processor to recently used code
Similarly, main memory can act as a cache for disk: this is called ‘virtual memory’
Main reason is to allow memory sharing among multiple programs
– Memory requirement of all programs is greater than the RAM available
– But only a fraction of memory is in use at any given time
– Main memory need only hold the in-use values of one program
Slide 43
The burden of memory
Want programs to be able to exceed the size of main memory
Programmers used to do this explicitly in code
– Divide the program into mutually exclusive overlays
– Overlays explicitly loaded/unloaded
Virtual memory removes this responsibility from application programmers
Slide 44
Terminology
Concepts similar to cache, but grew from a different direction
Page: virtual memory block
Page fault: virtual memory miss
Virtual address: address produced by the CPU and translated into an absolute address
Memory mapping / address translation: the virtual-to-physical mapping process
– Book title -> Dewey Decimal number
Slide 45
Virtual / physical addresses
[Diagram: virtual addresses pass through address translation to physical addresses in main memory or to disk addresses.]
Slide 46
More terminology
Relocation: provided by virtual memory, as the virtual addresses used by a program are mapped to physical addresses before memory is accessed
– Programs are relocated as fixed-size pages (blocks); these need not be contiguous
Virtual memory addresses consist of a virtual page number and offset, which is translated to a physical page
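The split can be sketched in Python, assuming 4 KB pages (12 offset bits), as in the translation diagram that follows:

```python
# Split a 32-bit virtual address into virtual page number and page
# offset; with 4 KB pages the offset occupies the low 12 bits.
PAGE_BITS = 12

def split(vaddr):
    vpn = vaddr >> PAGE_BITS                 # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # unchanged by translation
    return vpn, offset

vpn, offset = split(0x12345678)   # vpn 0x12345, offset 0x678
```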
Slide 47
Translation process
[Diagram: a 32-bit virtual address = virtual page number (bits 31-12) + page offset (bits 11-0); translation maps the virtual page number to a physical page number (bits 29-12), with the page offset carried over unchanged into the physical address.]
Slide 48
Design decisions
The cost of a page miss is massive: hundreds of thousands of clock cycles to rectify
– Pages should be large enough to amortise the high access times (4 KB - 16 KB typical)
– Need to reduce the page fault rate, primarily through flexible page placement
– Virtual memory misses can be handled in software, since disk is so slow that the software overhead is negligible
– Write-through is just too darned slow - need a better system
Slide 49
Page placement
Try to reduce page misses by optimising placement
Allow any virtual page to map to any physical page; the OS can then choose to replace any page it wants
– Sophisticated replacement algorithms are involved
Slide 50
Finding pages
Fully associative mapping allows any page (or block) to be associated with any location in physical memory (or cache)
If things can go anywhere, they could be hard to find
– Have a page table which resides in memory and provides an address service for pages
– Indexed with the virtual address; provides the location in the memory hierarchy
– Each program has its own page table (page table register)
Slide 51
Page table
[Diagram: the page table register points to the page table; the 20-bit virtual page number indexes it; each entry holds a valid bit and an 18-bit physical page number (valid = 0 means the page is not present in memory); the physical page number plus the 12-bit page offset forms the physical address.]
Slide 52
Page fault
If the valid bit is 0, then a fault occurs
The OS is given control through an exception
Once the OS has control, it must decide where to place the page in physical memory
Principle of temporal locality - throw out the least recently used page (LRU scheme)
Slide 53
Page fault mechanism
[Diagram: page table entries indexed by virtual page number; entries with valid = 1 hold physical page numbers pointing into physical memory, entries with valid = 0 hold disk addresses pointing into disk storage.]
Slide 54
Coping with writing
With cache, we could use a buffer on main memory to hide the difference in speed for write-through
Not possible with main memory / disk
– Difference too great
Changes are made in memory, and the page is written back to disk once it drops out of memory
Called copy-back
– Only happens if values on the page have changed
• ‘Dirty’ bit
– Just as easy to copy a whole page as the altered values
Slide 55
Improving Performance
Every instruction that accesses memory incurs two memory accesses
– One to look up the physical address in the page table
– Another to actually access the value
Time to use the answer to all memory problems - the Principle of Locality
– This time, we exploit the fact that we are often looking up the same address translations in the page table
– We need a page table cache!
Slide 56
Translation-Lookaside Buffer (TLB)
Cache for the page table
As it is a cache, we need a tag field
– The page table is 1:1, so it does not need a tag
Each entry is a physical page number
Need valid and dirty bits
– May never get to the page table, so we need to know if values in a deallocated page need writing back
If hit, use the address in the TLB; otherwise
– Look in the page table; if we miss there
– Page fault
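The hit/miss/fault sequence above can be sketched with dictionaries standing in for the TLB and page table (a software illustration, not real hardware):

```python
# TLB lookup with fallback to the page table: TLB hit -> use it;
# TLB miss -> consult the page table and refill the TLB;
# invalid page-table entry -> page fault.
tlb = {}                                          # vpn -> physical page number
page_table = {0: (True, 7), 1: (False, None), 2: (True, 3)}

def translate(vpn):
    if vpn in tlb:
        return tlb[vpn], "tlb hit"
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        return None, "page fault"                 # OS takes over here
    tlb[vpn] = ppn                                # refill from the page table
    return ppn, "tlb miss"

translate(0)   # tlb miss: page table consulted, TLB refilled
translate(0)   # tlb hit this time
translate(1)   # page fault: entry invalid
```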
Slide 57
TLB diagram
[Diagram: the TLB holds (valid, tag, physical page address) entries for recently used virtual page numbers; the full page table maps every virtual page number to a physical page or disk address; a TLB miss falls back to the page table, and invalid page-table entries point into disk storage rather than physical memory.]
Slide 58
Typical Values for TLB
Size: 16-512 entries
Hit time: 0.5-1 clock cycles
Miss penalty: 10-100 clock cycles
Miss rate: 0.01%-1%
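Plugging illustrative values from these ranges into the average-access-time formula shows how little a good TLB costs per access on average:

```python
# Effective translation cost: hit time + miss rate * miss penalty.
# The specific picks below are illustrative values from the ranges above.
hit_time, miss_rate, miss_penalty = 1, 0.001, 30

effective = hit_time + miss_rate * miss_penalty   # 1.03 cycles on average
```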
Slide 59
DECStation 3100
[Diagram: the 32-bit virtual address splits into a 20-bit virtual page number and a 12-bit page offset; the TLB (valid, dirty, tag, physical page number) yields the physical page number, which with the page offset forms the physical address; the upper 16 bits of the physical address form the physical address tag and the next 14 bits the cache index (plus a 2-bit byte offset); the direct-mapped cache (valid, tag, data) then produces Cache hit and 32-bit Data.]
Slide 60
Processor Support
To enable the OS to implement protection in the VM system, the hardware must support the following:
1. Support at least two modes: user and O/S (kernel, supervisor, executive)
2. Provide a portion of CPU state that a user can read but not write
3. Support movement from user to supervisor mode
– All this is needed to allow the OS to change page tables, but disallow users from doing so
Slide 61
Handling faults
When a fault occurs, the interrupt mechanism halts the user process and invokes the OS to find the correct page, then returns control to the user process
On fault:
1. Find physical address from page table
2. Choose page to replace (check dirty bit)
3. Start a read to bring the page from disk to memory
Slide 62
Memory access diagram
[Flowchart: virtual address → TLB access; a TLB miss raises a TLB miss exception; on a TLB hit the physical address is formed; a read tries the cache, delivering data to the CPU on a cache hit and stalling on a cache miss; a write first checks the write access bit (raising a write protection exception if it is off), then writes data into the cache, updates the tag, and puts the data and address into the write buffer.]
Slide 63
Summary
VM is the level in the memory hierarchy bridging main memory and disk
Cost of a miss is high, so:
– Pages are large (spatial locality)
– Virtual-to-physical address mapping is fully associative
Disk writes are expensive: write-back and dirty bit
Allows multiple processes
Sped up via an address translation cache
– TLB
Slide 64
VM Performance
If a program needs more virtual memory than there is physical memory: big trouble
– System thrashes
– Best to buy more memory
More common are TLB misses
– A 64-entry TLB covers only 64 × 4 KB = 0.25 MB!
– Cheat by having variable page sizes
Slide 65
Memory hierarchy framework
By now you will have spotted that there are similarities in different memory hierarchies
We want to look at common issues

Feature               L1 cache   Virtual memory           TLB
Blocks                250-2000   16 000-250 000           16-512
Size (Kbytes)         16-64      250 000-1 000 000 000    0.25-16
Block size (bytes)    32-64      4000-64 000              4-32
Miss penalty (clks)   10-25      10-100 million           10-1000
Miss rate             2%-5%      0.000001%-0.0001%        0.01%-2%
Slide 66
How to compare
Four questions to apply between any two levels of the hierarchy
1. Where can a block be placed?
2. How is a block found?
3. Which block is replaced on a miss?
4. What happens on a write?
Slide 67
Placement
Direct
Set associative
Fully associative
(see cache section)
Slide 68
VM associativity
In VM, there are three key factors
1. Miss rates are crucial, as the cost of a miss is high
2. Mapping is implemented in software, so there is no cycle time impact
3. Large page size means the table size is relatively small
– Therefore, VM is always fully associative
– Cache and TLB: often set associative (with a recent move toward direct mapped)
Slide 69
Find a block
Direct - index (1 comparison)
Set - search the set (set size comparisons)
Fully associative
– Search the whole cache (cache size comparisons)
– Use a lookup table (0 comparisons)
Slide 70
Replace a block
In practice LRU is not truly used; there is always some approximation
Often a random scheme is employed due to the low overhead in calculation
Slide 71
Writing
Write-through
– Usually for cache
Write-back
– Only workable scheme for virtual memory
Slide 72
Intuitive model
All misses can be classified as:
– Compulsory: first access to a block (cold start)
• Increase block size
– Capacity: cache cannot contain all the blocks it needs
• Increase cache size, but not at the cost of access time
– Conflict (or collision) misses: in direct mapped or set associative caches
• Increase associativity, but not at the cost of access time
Slide 73
AMD K7
Slide 74
Intel P4 - Prescott original (1 MB L2 cache)
Slide 75
Intel P4 - Prescott newer (2 MB L2 cache)
Slide 76
Intel Centrino (2 MB L2 cache)
Slide 77
Cell processor
Slide 78
Relative performance
Slide 79
AMD and Intel
Both have L1 and L2 cache
– L1 serves to make register data available
– L2 serves to get the correct machine instructions lined up
P4 has a TLB for instructions and a TLB for data
AMD has TLBs for instructions and data, for both L1 and L2 cache
Both cope with page sizes ranging from 4 KB to 4 MB
Slide 80
Multiport cache
From the section on pipelines, you will realise that multiple instructions are executed simultaneously
Therefore the cache is being accessed simultaneously by multiple instructions
– Hence multi-port functionality
Slide 81
Future
L3 cache is now moving on-chip
Xeons have 1 MB L3 cache
Clear that efforts to increase clock speed are not so important any more
Better performance is to be gained by
– Improving the memory hierarchy
– Understanding how pipelines and memory interact
Slide 82
Thanks for the memory!