
Post on 13-Jan-2016


Memory/Storage Architecture Lab

Computer Architecture

Memory Hierarchy


Technology Trends

[Chart: Trac (row access time) and Tcac (column access time) by year, 1980–2007; vertical axis 0–300]

Year   Capacity   $/GB
1980   64 Kbit    $1,500,000
1983   256 Kbit   $500,000
1985   1 Mbit     $200,000
1989   4 Mbit     $50,000
1992   16 Mbit    $15,000
1996   64 Mbit    $10,000
1998   128 Mbit   $4,000
2000   256 Mbit   $1,000
2004   512 Mbit   $250
2007   1 Gbit     $50

Memory Hierarchy

Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

Burks, Goldstine, and von Neumann, 1946


[Figure: the memory hierarchy, with the CPU at the top above Level 1, Level 2, …, Level n. Moving down the hierarchy, the size of the memory at each level increases and cost decreases; moving up, speed and bandwidth increase.]

Memory Technology (Big Picture)

[Figure: the processor (control + datapath) connected to a chain of progressively larger memories. The memory closest to the processor is the fastest, smallest, and highest cost per byte; the farthest is the slowest, biggest, and lowest cost.]

Memory Technology (Real-world Realization)

[Figure: registers and on-chip caches inside the processor, backed by off-chip level caches (SRAM), main memory (DRAM), and secondary storage (disk).]

            Register   Cache      Main Memory   Disk Memory
Speed       <1 ns      <5 ns      50–70 ns      5–20 ms
Size        100 B      KB–MB      MB–GB         GB–TB
Management  Compiler   Hardware   OS            OS

Memory Hierarchy

An optimization resulting from a perfect match between memory technology and two types of program locality:

- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.

Goal: to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.
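As an illustrative sketch (not from the slides), the two localities can be made visible by mapping byte addresses onto cache-block numbers: a sequential scan keeps landing in the same few blocks (spatial locality), while repeated use of one word lands in a single block (temporal locality). The 64-byte block size is an assumption.

```python
# Hypothetical sketch: map byte addresses to 64-byte cache-block numbers.
BLOCK_SIZE = 64  # assumed block size in bytes

# Spatial locality: a sequential scan of 64 four-byte words (addresses 0..252)
sequential = [addr // BLOCK_SIZE for addr in range(0, 256, 4)]
# Temporal locality: eight repeated reads of the same word at address 128
repeated = [addr // BLOCK_SIZE for addr in [128] * 8]

print(len(sequential), len(set(sequential)))  # 64 accesses touch only 4 distinct blocks
print(len(repeated), len(set(repeated)))      # 8 accesses touch 1 distinct block
```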


Temporal and Spatial Localities

Source: Glass & Cao (1997 ACM SIGMETRICS)


Memory Hierarchy Terminology

Hit: accessed data is found in the upper level
- Hit rate = fraction of accesses found in the upper level
- Hit time = time to access the upper level

Miss: accessed data is found only in a lower level
- The processor waits until the data is fetched from the next level, then restarts/continues the access
- Miss rate = 1 − hit rate
- Miss penalty = time to get the block from the lower level + time to replace it in the upper level

Hit time << miss penalty, so average memory access time << worst-case access time:

Average memory access time = hit time + miss rate × miss penalty

Data are transferred between levels in units of blocks.
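The average-access-time formula can be expressed as a small helper; the numbers in the example call are made up for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g., a 1-cycle hit, 5% miss rate, 100-cycle miss penalty:
print(amat(1, 0.05, 100))  # 6.0 cycles on average, far below the 101-cycle worst case
```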


(CPU) Cache

Upper level: SRAM (small, fast, expensive)
Lower level: DRAM (large, slow, cheap)

Goal: to provide a "virtual" memory technology that has the access time of SRAM with the size and cost of DRAM.

Additional benefits:
- Reduced memory bandwidth consumed by the processor, leaving more memory bandwidth for I/O
- No need to change the ISA

Direct-mapped Cache

Each memory block is mapped to a single cache block. The mapped cache block is determined by:

(memory block address) mod (number of cache blocks)
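In code, the mapping rule looks like this (a sketch; the block size and block count are assumptions matching the example on the next slide):

```python
BLOCK_SIZE = 4      # bytes per block (assumed)
NUM_BLOCKS = 1024   # cache blocks (assumed: 4 KB / 4 B)

def cache_index(addr):
    block_address = addr // BLOCK_SIZE    # which memory block the byte lives in
    return block_address % NUM_BLOCKS     # memory block address mod number of cache blocks

print(cache_index(0), cache_index(8188), cache_index(16384))  # 0 1023 0
```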


Direct-Mapped Cache Example

Consider a direct-mapped cache with a block size of 4 bytes and a total capacity of 4 KB (1024 blocks), i.e., one word per block.

- The 2 lowest address bits specify the byte within a block.
- The next 10 address bits specify the block's index within the cache.
- The 20 highest address bits are the unique tag for the memory block.
- A valid bit specifies whether the block is an accurate copy of memory.

This organization exploits temporal locality.
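The 20/10/2 address split described above can be sketched as a helper (field widths follow the slide; the function name is mine):

```python
def split_address(addr):
    """Split a 32-bit address for a direct-mapped cache with 4-byte blocks
    and 1024 blocks: 20-bit tag | 10-bit index | 2-bit byte offset."""
    byte_offset = addr & 0x3          # 2 lowest bits: byte within the block
    index = (addr >> 2) & 0x3FF       # next 10 bits: block index in the cache
    tag = addr >> 12                  # 20 highest bits: unique tag
    return tag, index, byte_offset

print(split_address(8188))   # (1, 1023, 0)
print(split_address(16384))  # (4, 0, 0)
```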


On cache read

On a cache hit, the CPU proceeds normally.

On a cache miss (handled completely by hardware):
- Stall the CPU pipeline.
- Fetch the missed block from the next level of the hierarchy.
- On an instruction cache miss, restart the instruction fetch.
- On a data cache miss, complete the data access.

On cache write

Write-through
- Always write the data into both the cache and main memory.
- Simple, but slow, and increases memory traffic (requires a write buffer).

Write-back
- Write the data into the cache only, and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer).
- Fast, but complex to implement, and causes a consistency problem.
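A minimal sketch of the write-back bookkeeping (one cache line; class and field names are illustrative, not from the slides): a write touches only the cache and sets the dirty bit, and memory is updated only when the dirty line is evicted.

```python
class WriteBackLine:
    """One cache line with the dirty-bit bookkeeping used by write-back."""
    def __init__(self):
        self.valid, self.dirty, self.tag, self.data = False, False, None, None

    def write(self, tag, data):
        self.valid, self.dirty, self.tag, self.data = True, True, tag, data
        # note: main memory is NOT updated here

    def evict(self, memory):
        if self.valid and self.dirty:
            memory[self.tag] = self.data   # write the dirty block back only now
        self.valid = self.dirty = False

memory = {0: "old"}
line = WriteBackLine()
line.write(0, "new")
print(memory[0])       # still "old": the write went to the cache only
line.evict(memory)
print(memory[0])       # "new": the dirty block is written back on replacement
```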


Write allocation

What should happen on a write miss?

Alternatives for write-through:
- Allocate on miss: fetch the block.
- Write around: don't fetch the block, since programs often write a whole block before reading it (e.g., on initialization).

For write-back:
- Usually fetch the block.

Memory Reference Sequence

Look at the following sequence of memory references for the previous direct-mapped cache:

0, 4, 8188, 0, 16384, 0

Index  Valid  Tag   Data
0      0      XXXX  XXXX
1      0      XXXX  XXXX
2      0      XXXX  XXXX
3      0      XXXX  XXXX
…      …      …     …
1021   0      XXXX  XXXX
1022   0      XXXX  XXXX
1023   0      XXXX  XXXX

Cache initially empty.

After Reference 1

Reference 1 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      0      XXXX                  XXXX
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX

Miss: place the block at index 0.

After Reference 2

Reference 2 (address 4) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000001 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX

Miss: place the block at index 1.

After Reference 3

Reference 3 (address 8188) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000001 1111111111 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Miss: place the block at index 1023.

After Reference 4

Reference 4 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Hit to the block at index 0.

After Reference 5

Reference 5 (address 16384) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000100 0000000000 00 (same index as address 0!)

Index  Valid  Tag                   Data
0      1      00000000000000000100  Memory bytes 16384…16387 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Miss: replace the block at index 0.

After Reference 6

Reference 6 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000000 00 (same index again!)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Miss again: replace the block at index 0. Total: 1 hit and 5 misses.
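The six-reference walkthrough can be checked with a short direct-mapped simulator (a sketch with names of my choosing; the parameters match the slides: 4-byte blocks, 1024 blocks):

```python
def simulate_direct_mapped(addresses, block_size=4, num_blocks=1024):
    """Count (hits, misses) for a direct-mapped cache; tags[i] holds block i's tag."""
    tags = [None] * num_blocks   # None means the valid bit is clear
    hits = misses = 0
    for addr in addresses:
        block_addr = addr // block_size
        index = block_addr % num_blocks    # memory block address mod number of cache blocks
        tag = block_addr // num_blocks
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag              # fetch and (re)place the block
    return hits, misses

print(simulate_direct_mapped([0, 4, 8188, 0, 16384, 0]))  # (1, 5)
```

Addresses 0 and 16384 map to the same index with different tags, so they keep evicting each other, which is exactly where the five misses come from.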


Exploiting Spatial Locality: Block Size Larger Than One Word

A 16 KB direct-mapped cache with 256 blocks of 64 B (16 words) each.


Miss Rate vs. Block Size


Set-Associative Caches

Allow multiple entries per index to improve hit rates.

- An n-way set-associative cache allows up to n conflicting references to be cached:
  - n is the number of cache blocks in each set
  - n comparisons are needed to search all blocks in the set in parallel
  - When there is a conflict, which block is replaced? (This was easy for direct-mapped caches: there's only one entry!)
- Fully-associative caches: a single (very large!) set allows a memory location to be placed in any cache block.
- Direct-mapped caches are essentially 1-way set-associative caches.

For a fixed cache capacity, higher associativity leads to higher hit rates, because more combinations of memory blocks can be present in the cache. Set associativity optimizes cache contents, but at what cost?

Cache Organization Spectrum


Implementation of Set Associative Cache


Cache Organization Example

[Figure: four organizations of an 8-block cache.
One-way set associative (direct mapped): blocks 0–7, each a single (tag, data) entry.
Two-way set associative: sets 0–3, each holding two (tag, data) entries.
Four-way set associative: sets 0–1, each holding four (tag, data) entries.
Eight-way set associative (fully associative): one set holding all eight (tag, data) entries.]

Cache Block Replacement Policy

Direct-mapped caches
- No replacement policy is needed, since each memory block can be placed in only one cache block.

N-way set-associative caches
- Each memory block can be placed in any of the n cache blocks in its mapped set.
- A Least Recently Used (LRU) replacement policy is typically used to select the block to be replaced among the blocks in the mapped set.
- LRU replaces the block that has not been used for the longest time.
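A sketch of LRU for one set, keeping the set's tags in a list ordered from least to most recently used (names are illustrative, not from the slides):

```python
def lru_access(set_blocks, tag, n_ways):
    """Access `tag` in one cache set kept in LRU order (front = least recent).
    Returns True on a hit. On a miss with a full set, evicts the LRU block."""
    if tag in set_blocks:
        set_blocks.remove(tag)       # hit: move to the most-recently-used position
        set_blocks.append(tag)
        return True
    if len(set_blocks) == n_ways:    # miss in a full set:
        set_blocks.pop(0)            # evict the least recently used block
    set_blocks.append(tag)
    return False

s = []
for t in ["A", "B", "A", "C"]:       # accesses to one 2-way set
    lru_access(s, t, 2)
print(s)  # ['A', 'C']: B was least recently used when C arrived, so B was evicted
```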


Miss Rate vs. Set Associativity


Memory Reference Sequence

Look again at the following sequence of memory references, now for a 2-way set-associative cache of the same capacity with a block size of two words (8 bytes):

0, 4, 8188, 0, 16384, 0

This sequence had 5 misses and 1 hit for the direct-mapped cache.

Sets 0…255, each with two ways of (Valid, Tag, Data); all sets initially invalid (cache empty).

After Reference 1

Reference 1 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
(all other sets invalid)

Miss: place in the first block of set 0.

After Reference 2

Reference 2 (address 4) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 100 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
(all other sets invalid)

Hit to the first block of set 0: address 4 lies in the same 8-byte block as address 0.

After Reference 3

Reference 3 (address 8188) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000011 11111111 100 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Miss: place in the first block of set 255 (the 8-byte block containing byte 8188).

After Reference 4

Reference 4 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Hit to the first block of set 0.

After Reference 5

Reference 5 (address 16384) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000001000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: V=1, tag 000000000000000001000, Memory bytes 16384…16391 (copy)
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Miss: place in the second block of set 0, so no replacement is needed.

After Reference 6

Reference 6 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: V=1, tag 000000000000000001000, Memory bytes 16384…16391 (copy)
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Hit to the first block of set 0. Total: 3 hits and 3 misses.
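The 2-way result (3 hits and 3 misses, versus 1 hit and 5 misses direct mapped) can likewise be checked with a small LRU set-associative simulator (a sketch with names of my choosing; the parameters match the slides: 8-byte blocks, 256 sets, 2 ways):

```python
def simulate_set_assoc(addresses, block_size=8, num_sets=256, ways=2):
    """Count (hits, misses) for an LRU set-associative cache; each set is a
    list of tags ordered from least- to most-recently used."""
    sets = [[] for _ in range(num_sets)]
    hits = misses = 0
    for addr in addresses:
        block_addr = addr // block_size
        index = block_addr % num_sets
        tag = block_addr // num_sets
        blocks = sets[index]
        if tag in blocks:
            hits += 1
            blocks.remove(tag)          # refresh the LRU position
        else:
            misses += 1
            if len(blocks) == ways:
                blocks.pop(0)           # evict the least recently used block
        blocks.append(tag)
    return hits, misses

print(simulate_set_assoc([0, 4, 8188, 0, 16384, 0]))  # (3, 3)
```

The extra hits come from the larger block (addresses 0 and 4 share a block, a spatial-locality win) and from the second way, which lets blocks 0 and 16384 coexist in set 0.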


Improving Cache Performance

Cache performance is determined by:

Average memory access time = hit time + (miss rate × miss penalty)

Decrease hit time
- Make the cache smaller, but the miss rate increases
- Use direct mapping, but the miss rate increases

Decrease miss rate
- Make the cache larger, but this can increase hit time
- Add associativity, but this can increase hit time
- Increase the block size, but this increases the miss penalty

Decrease miss penalty
- Reduce the transfer-time component of the miss penalty
- Add another level of cache
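Adding another cache level shrinks the miss penalty seen by L1: an L1 miss now costs an L2 access plus L2's own misses. A sketch with made-up numbers (the function name and the cycle counts are illustrative):

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    """AMAT with an L2 cache: L1's miss penalty is itself an AMAT for L2.
    l2_miss_rate is the local miss rate (misses per L2 access)."""
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_penalty
    return l1_hit + l1_miss_rate * l1_miss_penalty

# Illustrative numbers (cycles): without L2 the full 100-cycle memory penalty
# would apply to every L1 miss, giving 1 + 0.05 * 100 = 6.0 cycles.
print(amat_two_level(1, 0.05, 10, 0.2, 100))  # 2.5
```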


Current Cache Organizations

Intel Nehalem
- L1 I-cache (per core): 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a
- L1 D-cache (per core): 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
- L2 unified cache (per core): 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a

AMD Opteron X4
- L1 I-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles
- L1 D-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles
- L2 unified cache (per core): 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 2 MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles

n/a: data not available

Cache Coherence Problem

Suppose two CPU cores share a physical address space and use write-through caches.

Time step  Event                 CPU A's cache  CPU B's cache  Memory
0                                                              0
1          CPU A reads X         0                             0
2          CPU B reads X         0              0              0
3          CPU A writes 1 to X   1              0              1

After step 3, CPU B's cache still holds the stale value 0: the two caches are incoherent.

Snoopy Protocols

Write-invalidate protocol
- On a write to shared data, an invalidate is sent to all caches, which snoop the bus and invalidate any copies.

Write-broadcast protocol
- On a write to shared data, the write is broadcast on the bus; processors snoop and update their copies.

Write serialization: the bus serializes requests, since the bus is a single point of arbitration.

Write-Invalidate Protocol

A cache gets exclusive access to a block when the block is to be written:
- It broadcasts an invalidate message on the bus.
- A subsequent read in another cache misses, and the owning cache supplies the updated value.

CPU activity          Bus activity       CPU A's cache  CPU B's cache  Memory
                                                                       0
CPU A reads X         Cache miss for X   0                             0
CPU B reads X         Cache miss for X   0              0              0
CPU A writes 1 to X   Invalidate for X   1                             0
CPU B reads X         Cache miss for X   1              1              1
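The table above can be mimicked with a toy write-invalidate simulation (a sketch; the class and function names are mine, and the protocol is simplified to a single block X with write-back caches, so memory is updated only when an owning cache supplies the block):

```python
class Cache:
    def __init__(self):
        self.value = None          # None means "not present / invalidated"

def read(cpu, others, memory):
    if cpu.value is None:          # cache miss: snoop for an owning cache first
        owner = next((c for c in others if c.value is not None), None)
        if owner is not None:
            memory[0] = owner.value    # owner supplies the block and updates memory
        cpu.value = memory[0]
    return cpu.value

def write(cpu, others, value):
    for c in others:               # broadcast an invalidate on the bus
        c.value = None
    cpu.value = value              # write-back: memory is not updated yet

memory = [0]
a, b = Cache(), Cache()
read(a, [b], memory)         # A misses, loads 0
read(b, [a], memory)         # B misses, loads 0
write(a, [b], 1)             # A writes 1 and invalidates B's copy
print(read(b, [a], memory))  # B misses again; A supplies 1 and memory becomes 1
```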


Summary

Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality:
- Temporal locality
- Spatial locality

The goal is to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.

Cache memory is an instance of a memory hierarchy:
- It exploits both temporal and spatial locality
- Direct-mapped caches are simple and fast but have higher miss rates
- Set-associative caches have lower miss rates but are more complex and slower
- Multilevel caches are becoming increasingly popular
- Cache coherence protocols ensure consistency among multiple caches