TRANSCRIPT
-
5. Memory HierarchyComputer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3
Emil Sekerinski, McMaster University, Fall Term 2015/16
-
Movie Rental Store
• You have a huge warehouse with every movie ever made.
• Getting a movie from the warehouse takes 15 minutes.
• You are not competitive if every rental takes 15 minutes.
• You have some small shelves in the front office.
Here are some suggested improvements to the store:
1. Whenever someone rents a movie, just keep it in the front office for a while in case someone else wants to rent it.
2. Watch the trends in movie watching and attempt to guess movies that will be rented soon – put those in the front office.
3. Whenever someone rents a movie in a series (Star Wars), grab the other movies in the series and put them in the front office.
4. Use scooters to get the movies faster.
[Figure: small shelves in the front office, large warehouse behind]

Which pair of changes would likely be most effective? (Each selection pairs a suggestion exploiting spatial locality with one exploiting temporal locality.)

Selection  Spatial  Temporal
A          2        1
B          4        2
C          4        3
D          3        1
E          None of the above
-
Principle of Locality
Programs access a small proportion of their address space at any time
• Temporal locality: items accessed recently are likely to be accessed again soon, e.g. instructions in a loop, local variables
• Spatial locality: items near those accessed recently are likely to be accessed soon, e.g. sequential instruction access, array data
Memory hierarchy takes advantage of locality
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to smaller DRAM memory: main memory
• Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory: cache memory attached to CPU
-
Memory Hierarchy
Block (aka line): unit of copying, may be multiple words
• Hit: access satisfied by the upper level
  Hit ratio: hits/accesses
• Miss: accessed data is absent; block copied from the lower level
  Time taken: miss penalty
  Miss ratio: misses/accesses = 1 − hit ratio
Current memory technology:
• Static RAM (SRAM): 0.5ns – 2.5ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM): 50ns – 70ns, $20 – $75 per GB
• Magnetic disk: 5ms – 20ms, $0.20 – $2 per GB
• Ideal memory: access time of SRAM, capacity and cost/GB of disk
-
DRAM Technology
Data stored as a charge in a capacitor: single transistor used to access the charge
Bits are organized as a rectangular array
DRAM accesses an entire row
• Burst mode: supply successive words from a row with reduced latency
Must periodically be refreshed:
• Read contents and write back
• Performed on a DRAM row
Double data rate (DDR) DRAM
• Transfer on rising and falling clock edges
Quad data rate (QDR) DRAM
• Separate DDR inputs and outputs
-
Flash Storage
Nonvolatile semiconductor storage
• 100× – 1000× faster than disk
• Smaller, lower power, more robust
• But more $/GB (between disk and DRAM)
NOR flash: bit cell like a NOR gate
• Random read/write access
• Used for instruction memory in embedded systems
NAND flash: bit cell like a NAND gate
• Denser (bits/area), but block-at-a-time access
• Cheaper per GB
• Used for USB keys, media storage, …
Flash bits wear out after 1000s of accesses
• Wear leveling: remap data to less used blocks
-
Cache Memory
The level of the memory hierarchy closest to the CPU
Given accesses X1, …, Xn−1, Xn:
• How do we know if the data is present?
• Where do we look?
-
Direct Mapped Cache
Location determined by address
Direct mapped: only one choice
(Block address) modulo (#Blocks in cache)
• #Blocks is a power of 2
• Use low-order address bits
How do we know which particular block is stored in a cache location?
• Store the block address as well as the data
• Actually, only the high-order bits are needed
• Called the tag
What if there is no data in a location?
• Valid bit: 1 = present, 0 = not present
• Initially 0
-
Cache Example: Initial State
8 blocks, 1 word/block, direct mapped
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N
-
Cache Example
Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110
-
Cache Example
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010
-
Cache Example
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010
-
Cache Example
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011
16         10 000       Hit       000
-
Cache Example
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
18         10 010       Miss      010
-
Address Subdivision
-
Larger Block Size
64 blocks, 16 bytes/block
• To what block number does address 1200 map?
Block address = 1200/16 = 75
Block number = 75 modulo 64 = 11
Larger blocks should reduce miss rate due to spatial locality
• But in a fixed-size cache, larger blocks ⇒ fewer blocks ⇒ increased miss rate
• Larger blocks ⇒ pollution
• Larger miss penalty: can outweigh the benefit of reduced miss rate
Address fields (32-bit address):
Field   Bits   Width
Tag     10–31  22 bits
Index   4–9    6 bits
Offset  0–3    4 bits
[Figure: miss rate vs. block size; data based on SPEC92]
-
Cache Misses
On cache hit, CPU proceeds normally; on cache miss
• Stall the CPU pipeline
• Fetch block from next level of hierarchy
• Instruction cache miss: restart instruction fetch
• Data cache miss: complete data access
On data-write hit, could just update the block in cache
• But then cache and memory would be inconsistent
Write through: also update memory
• But makes writes take longer, e.g. if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles: effective CPI = 1 + 0.1×100 = 11
Solution: write buffer
• Holds data waiting to be written to memory
• CPU continues immediately, only stalls on write if write buffer is already full
-
Write Allocation
Write back alternative: On data-write hit, just update the block in cache
• Keep track of whether each block is dirty
When a dirty block is replaced, write it back to memory
What should happen on a write miss?
For write-through
• Allocate on miss: fetch the block
• Write around: don’t fetch the block, since programs often write a whole block before reading it (e.g., initialization)
For write-back
• Usually fetch the block
-
Associative Caches
Fully associative: allow a given block to go in any cache entry
• Requires all entries to be searched at once
• Comparator per entry (expensive)
n-way set associative: each set contains n entries
• Block number determines which set: (Block number) modulo (#Sets in cache)
• Search all entries in a given set at once: n comparators (less expensive)
-
Associativity Example
Compare 4-block caches on the block access sequence 0, 8, 0, 6, 8.

Direct mapped:
Block addr  Index  Hit/miss  Content after access [0 | 1 | 2 | 3]
0           0      miss      Mem[0] | – | –      | –
8           0      miss      Mem[8] | – | –      | –
0           0      miss      Mem[0] | – | –      | –
6           2      miss      Mem[0] | – | Mem[6] | –
8           0      miss      Mem[8] | – | Mem[6] | –

2-way set associative:
Block addr  Set  Hit/miss  Content after access [set 0 | set 1]
0           0    miss      Mem[0]         | –
8           0    miss      Mem[0], Mem[8] | –
0           0    hit       Mem[0], Mem[8] | –
6           0    miss      Mem[0], Mem[6] | –
8           0    miss      Mem[8], Mem[6] | –

Fully associative:
Block addr  Hit/miss  Content after access
0           miss      Mem[0]
8           miss      Mem[0], Mem[8]
0           hit       Mem[0], Mem[8]
6           miss      Mem[0], Mem[8], Mem[6]
8           hit       Mem[0], Mem[8], Mem[6]
-
How much associativity?
Increased associativity decreases miss rate, but with diminishing returns
-
Descriptions of caches
1. Exceptional usage of the cache space in exchange for a slow hit time
2. Poor usage of the cache space in exchange for an excellent hit time
3. Reasonable usage of cache space in exchange for a reasonable hit time
Selection  Fully-Associative  8-way Set Associative  Direct Mapped
A          3                  2                      1
B          3                  3                      2
C          1                  2                      3
D          3                  2                      1
E          None of the above
-
Interactions with Software
Misses depend on memory access patterns
• Algorithm behavior
• Compiler optimization for memory access
-
Optimization via Blocking
Maximize accesses to data before it is replaced
Consider the inner loops of matrix multiply, with one-dimensional arrays (shown for a fixed i):

for (int j = 0; j < n; ++j)
{
    double cij = C[i + j*n];
    for (int k = 0; k < n; k++)
        cij += A[i + k*n] * B[k + j*n];
    C[i + j*n] = cij;
}

Equivalently, with two-dimensional arrays:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
    {
        double cij = C[i][j];
        for (int k = 0; k < n; k++)
            cij += A[i][k] * B[k][j];
        C[i][j] = cij;
    }

[Figure: access patterns in C, A, and B — recent access / older access / not yet accessed]
-
Blocked Matrix Multiply Access Pattern
[Figure: access patterns, unoptimized vs. blocked]
-
Multilevel On-Chip Caches
-
Virtual Memory
Another level in the cache/memory hierarchy: Virtual memory allows us to view main memory as a cache of a larger memory space (on disk).
[Figure: CPU with cache, main memory, and disk; caching between CPU and memory, virtual memory between memory and disk]

Approximate access latencies (in cycles):
• L1 cache: 1–4 (L2: ~14)
• Last-level cache: 40–60
• Main memory: 150–300
• Disk: 10,000,000–80,000,000
-
Each program gets a private virtual address space holding its frequently used code and data, protected from other programs
CPU and OS translate virtual addresses to physical addresses
• VM “block” is called a page
• VM translation “miss” is called a page fault
-
Page Faults
On page fault, the page must be fetched from disk
• Takes millions of clock cycles, handled by OS code
Try to minimize page fault rate
• Fully associative placement
• Smart replacement algorithms
-
Page Table
Stores placement information
• Array of page table entries, indexed by virtual page number
• Page table register in CPU points to page table in physical memory
If page is present in memory
• PTE stores the physical page number
• Plus other status bits (referenced, dirty, …)
If page is not present
• PTE can refer to location in swap space on disk
-
Replacement and Writes
To reduce page fault rate, prefer least-recently used (LRU) replacement
• Reference bit (aka use bit) in PTE set to 1 on access to page
• Periodically cleared to 0 by OS
• A page with reference bit = 0 has not been used recently
Disk writes take millions of cycles
• Write a whole block at once, not individual locations
• Write-through is impractical
• Use write-back
• Dirty bit in PTE set when page is written
-
Fast Translation Using a TLB
Address translation would appear to require extra memory references
• One to access the PTE
• Then the actual memory access
But access to page tables has good locality
• So use a fast cache of PTEs within the CPU
• Called a Translation Look-aside Buffer (TLB)
• Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
• Misses could be handled by hardware or software
-
Fast Translation Using a TLB
If page is in memory
• Load the PTE from memory and retry
• Could be handled in hardware: can get complex for more complicated page table structures
• Or in software: raise a special exception, with optimized handler
If page is not in memory (page fault)
• OS handles fetching the page and updating the page table
• Then restart the faulting instruction