TRANSCRIPT
-
5. Memory HierarchyComputer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3
Emil Sekerinski, McMaster University, Fall Term 2015/16
-
Movie Rental Store
• You have a huge warehouse with every movie ever made.
• Getting a movie from the warehouse takes 15 minutes.
• You are not competitive if every rental takes 15 minutes.
• You have some small shelves in the front office.
Here are some suggested improvements to the store:
1. Whenever someone rents a movie, just keep it in the front office for a while in case someone else wants to rent it.
2. Watch the trends in movie watching and attempt to guess movies that will be rented soon – put those in the front office.
3. Whenever someone rents a movie in a series (Star Wars), grab the other movies in the series and put them in the front office.
4. Use scooters to get the movies faster.
[Figure: small shelves in the front office, large warehouse behind]

Which pair of changes would likely be most effective? (Each selection pairs a suggestion exploiting spatial locality with one exploiting temporal locality.)

Selection  Spatial  Temporal
A          2        1
B          4        2
C          4        3
D          3        1
E          None of the above
-
Principle of Locality
Programs access a small proportion of their address space at any time
• Temporal locality: items accessed recently are likely to be accessed again soon, e.g. instructions in a loop, local variables
• Spatial locality: items near those accessed recently are likely to be accessed soon, e.g. sequential instruction access, array data
Memory hierarchy takes advantage of locality
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to smaller DRAM memory: main memory
• Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory: cache memory attached to CPU
-
Memory Hierarchy
Block (aka line): unit of copying, may be multiple words
• Hit: access satisfied by the upper level
  Hit ratio: hits/accesses
• Miss: accessed data is absent; block copied from the lower level
  Time taken: miss penalty
  Miss ratio: misses/accesses = 1 − hit ratio
Current memory technology:
• Static RAM (SRAM): 0.5ns – 2.5ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM): 50ns – 70ns, $20 – $75 per GB
• Magnetic disk: 5ms – 20ms, $0.20 – $2 per GB
• Ideal memory: access time of SRAM, capacity and cost/GB of disk
-
DRAM Technology
Data stored as a charge in a capacitor: single transistor used to access the charge
Bits are organized as a rectangular array
DRAM accesses an entire row
• Burst mode: supply successive words from a row with reduced latency
Must periodically be refreshed:
• Read contents and write back
• Performed on a DRAM row
Double data rate (DDR) DRAM
• Transfer on rising and falling clock edges
Quad data rate (QDR) DRAM
• Separate DDR inputs and outputs
-
Flash Storage
Nonvolatile semiconductor storage
• 100× – 1000× faster than disk
• Smaller, lower power, more robust
• But more $/GB (between disk and DRAM)
NOR flash: bit cell like a NOR gate
• Random read/write access
• Used for instruction memory in embedded systems
NAND flash: bit cell like a NAND gate
• Denser (bits/area), but block-at-a-time access
• Cheaper per GB
• Used for USB keys, media storage, …
Flash bits wear out after 1000s of accesses
• Wear leveling: remap data to less used blocks
-
Cache Memory
The level of the memory hierarchy closest to the CPU
Given accesses X1, …, Xn−1, Xn:
• How do we know if the data is present?
• Where do we look?
-
Direct Mapped Cache
Location determined by address
Direct mapped: only one choice
(Block address) modulo (#Blocks in cache)
• #Blocks is a power of 2
• Use low-order address bits
How do we know which particular block is stored in a cache location?
• Store the block address as well as the data
• Actually, only the high-order bits are needed
• Called the tag
What if there is no data in a location?
• Valid bit: 1 = present, 0 = not present
• Initially 0
-
Cache Example: Initial State
8 blocks, 1 word/block, direct mapped
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N
-
Cache Example
Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110
-
Cache Example
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010
-
Cache Example
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010
-
Cache Example
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011
16         10 000       Hit       000
-
Cache Example
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

Word addr  Binary addr  Hit/miss  Cache block
18         10 010       Miss      010
-
Address Subdivision
-
Larger Block Size
64 blocks, 16 bytes/block
• To what block number does address 1200 map?
Block address = 1200/16 = 75
Block number = 75 modulo 64 = 11
Larger blocks should reduce miss rate due to spatial locality
• But in a fixed-size cache, larger blocks ⇒ fewer blocks ⇒ increased miss rate
• Larger blocks ⇒ pollution
• Larger miss penalty: can outweigh the benefit of reduced miss rate
Address fields (32-bit address):
Field   Bits   Width
Tag     10–31  22 bits
Index   4–9    6 bits
Offset  0–3    4 bits
[Figure: miss rate vs. block size; data based on SPEC92]
-
Cache Misses
On cache hit, CPU proceeds normally; on cache miss
• Stall the CPU pipeline
• Fetch block from next level of hierarchy
• Instruction cache miss: restart instruction fetch
• Data cache miss: complete data access
On data-write hit, could just update the block in cache
• But then cache and memory would be inconsistent
Write through: also update memory
• But makes writes take longer, e.g. if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles: effective CPI = 1 + 0.1×100 = 11
Solution: write buffer
• Holds data waiting to be written to memory
• CPU continues immediately, only stalls on write if write buffer is already full
-
Write Allocation
Write back alternative: On data-write hit, just update the block in cache
• Keep track of whether each block is dirty
When a dirty block is replaced, write it back to memory
What should happen on a write miss?
For write-through
• Allocate on miss: fetch the block
• Write around: don’t fetch the block, since programs often write a whole block before reading it (e.g., initialization)
For write-back
• Usually fetch the block
-
Associative Caches
Fully associative: allow a given block to go in any cache entry
• Requires all entries to be searched at once
• Comparator per entry (expensive)
n-way set associative: each set contains n entries
• Block number determines which set: (Block number) modulo (#Sets in cache)
• Search all entries in a given set at once: n comparators (less expensive)
-
Associativity Example
Compare 4-block caches on the block access sequence 0, 8, 0, 6, 8.

Direct mapped:
Block addr  Index  Hit/miss  Content after access [0 | 1 | 2 | 3]
0           0      miss      Mem[0] | – | –      | –
8           0      miss      Mem[8] | – | –      | –
0           0      miss      Mem[0] | – | –      | –
6           2      miss      Mem[0] | – | Mem[6] | –
8           0      miss      Mem[8] | – | Mem[6] | –

2-way set associative:
Block addr  Set  Hit/miss  Content after access [set 0 | set 1]
0           0    miss      Mem[0]         | –
8           0    miss      Mem[0], Mem[8] | –
0           0    hit       Mem[0], Mem[8] | –
6           0    miss      Mem[0], Mem[6] | –
8           0    miss      Mem[8], Mem[6] | –

Fully associative:
Block addr  Hit/miss  Content after access
0           miss      Mem[0]
8           miss      Mem[0], Mem[8]
0           hit       Mem[0], Mem[8]
6           miss      Mem[0], Mem[8], Mem[6]
8           hit       Mem[0], Mem[8], Mem[6]
-
How much associativity?
Increased associativity decreases miss rate, but with diminishing returns
-
Descriptions of caches
1. Exceptional usage of the cache space in exchange for a slow hit time
2. Poor usage of the cache space in exchange for an excellent hit time
3. Reasonable usage of cache space in exchange for a reasonable hit time
Selection  Fully-Associative  8-way Set Associative  Direct Mapped
A          3                  2                      1
B          3                  3                      2
C          1                  2                      3
D          3                  2                      1
E          None of the above
-
Interactions with Software
Misses depend on memory access patterns
• Algorithm behavior
• Compiler optimization for memory access
-
Optimization via Blocking
Maximize accesses to data before it is replaced
Consider the inner loops of matrix multiply, with one-dimensional arrays (shown for a fixed i):

for (int j = 0; j < n; ++j)
{
    double cij = C[i + j*n];
    for (int k = 0; k < n; k++)
        cij += A[i + k*n] * B[k + j*n];
    C[i + j*n] = cij;
}

Equivalently, with two-dimensional arrays:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
    {
        double cij = C[i][j];
        for (int k = 0; k < n; k++)
            cij += A[i][k] * B[k][j];
        C[i][j] = cij;
    }

[Figure: access patterns in C, A, and B — recent access / older access / not yet accessed]
-
Blocked Matrix Multiply Access Pattern
[Figure: access patterns, unoptimized vs. blocked]
-
Multilevel On-Chip Caches
-
Virtual Memory
Another level in the cache/memory hierarchy: Virtual memory allows us to view main memory as a cache of a larger memory space (on disk).
[Figure: CPU with cache, main memory, and disk; caching between CPU and memory, virtual memory between memory and disk]

Approximate access latencies (in cycles):
• L1 cache: 1–4 (L2: ~14)
• Last-level cache: 40–60
• Main memory: 150–300
• Disk: 10,000,000–80,000,000
-
Each program gets a private virtual address space holding its frequently used code and data, protected from other programs
CPU and OS translate virtual addresses to physical addresses
• VM “block” is called a page
• VM translation “miss” is called a page fault
-
Page Faults
On page fault, the page must be fetched from disk
• Takes millions of clock cycles, handled by OS code
Try to minimize page fault rate
• Fully associative placement
• Smart replacement algorithms
-
Page Table
Stores placement information
• Array of page table entries, indexed by virtual page number
• Page table register in CPU points to page table in physical memory
If page is present in memory
• PTE stores the physical page number
• Plus other status bits (referenced, dirty, …)
If page is not present
• PTE can refer to location in swap space on disk
-
Replacement and Writes
To reduce page fault rate, prefer least-recently used (LRU) replacement
• Reference bit (aka use bit) in PTE set to 1 on access to page
• Periodically cleared to 0 by OS
• A page with reference bit = 0 has not been used recently
Disk writes take millions of cycles
• Write a whole block at once, not individual locations
• Write-through is impractical
• Use write-back
• Dirty bit in PTE set when page is written
-
Fast Translation Using a TLB
Address translation would appear to require extra memory references
• One to access the PTE
• Then the actual memory access
But access to page tables has good locality
• So use a fast cache of PTEs within the CPU
• Called a Translation Look-aside Buffer (TLB)
• Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
• Misses could be handled by hardware or software
-
Fast Translation Using a TLB
If page is in memory
• Load the PTE from memory and retry
• Could be handled in hardware: can get complex for more complicated page table structures
• Or in software: raise a special exception, with optimized handler
If page is not in memory (page fault)
• OS handles fetching the page and updating the page table
• Then restart the faulting instruction