Memory/Storage Architecture Lab
Computer Architecture
Memory Hierarchy
Technology Trends
[Figure: DRAM row-access (Trac) and column-access (Tcac) times, on a 0–300 ns scale, falling from 1980 to 2007]
Year  Capacity  $/GB
1980  64 Kbit   $1,500,000
1983  256 Kbit  $500,000
1985  1 Mbit    $200,000
1989  4 Mbit    $50,000
1992  16 Mbit   $15,000
1996  64 Mbit   $10,000
1998  128 Mbit  $4,000
2000  256 Mbit  $1,000
2004  512 Mbit  $250
2007  1 Gbit    $50
Memory Hierarchy
Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.
Burks, Goldstine, and von Neumann, 1946
[Figure: memory hierarchy pyramid — the CPU sits atop Level 1, Level 2, …, Level n; memory size grows toward the lower levels, while speed, bandwidth, and cost per bit grow toward the CPU]
Memory Technology (Big Picture)
[Figure: the big picture — a processor (control + datapath) backed by a chain of memories; the memory nearest the processor is fastest, smallest, and most expensive, and the farthest is slowest, biggest, and cheapest]
Memory Technology (Real-world Realization)
[Figure: real-world realization — a processor containing control, datapath, registers, and on-chip caches; then off-chip level caches (SRAM), main memory (DRAM), and secondary storage (disk)]
            Register   Cache     Main Memory  Disk
Speed       < 1 ns     < 5 ns    50–70 ns     5–20 ms
Size        ~100 B     KB–MB     MB–GB        GB–TB
Management  Compiler   Hardware  OS           OS
Memory Hierarchy
An optimization resulting from a perfect match between memory technology and two types of program locality:
  Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
Goal: to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.
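Both localities show up in ordinary code. The sketch below (illustrative, not from the slides) sums the same matrix twice: the accumulator is touched on every iteration (temporal locality), the row-major walk visits adjacent elements in address order (good spatial locality), and the column-major walk jumps between rows (poor spatial locality):

```python
N = 4
matrix = [[r * N + c for c in range(N)] for r in range(N)]

def row_major_sum(m):
    # Good spatial locality: consecutive accesses touch adjacent addresses.
    total = 0
    for r in range(N):
        for c in range(N):
            total += m[r][c]
    return total

def column_major_sum(m):
    # Poor spatial locality: consecutive accesses jump between rows.
    total = 0
    for c in range(N):
        for r in range(N):
            total += m[r][c]
    return total

print(row_major_sum(matrix) == column_major_sum(matrix))  # True
```

Same result either way; only the access pattern, and hence the cache behavior on a real machine, differs.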
Temporal and Spatial Localities
[Figure: memory-reference trace illustrating temporal and spatial localities. Source: Glass & Cao (1997 ACM SIGMETRICS)]
Memory Hierarchy Terminology
Hit: accessed data is found in the upper level
  Hit rate = fraction of accesses found in the upper level
  Hit time = time to access the upper level
Miss: accessed data is found only in a lower level
  The processor waits until the data is fetched from the next level, then restarts/continues the access
  Miss rate = 1 − hit rate
  Miss penalty = time to get the block from the lower level + time to replace it in the upper level
Since hit time << miss penalty, average memory access time << worst-case access time
Average memory access time = hit time + miss rate × miss penalty
Data are transferred between levels in units of blocks
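The average-access-time formula is easy to sanity-check numerically. A minimal sketch; the numbers (1 ns hit time, 5% miss rate, 60 ns miss penalty) are illustrative assumptions, not values from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# With only 5% misses, the average stays close to the 1 ns hit time,
# far below the worst case (hit time + miss penalty = 61 ns).
print(amat(1.0, 0.05, 60.0))  # 4.0
```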
(CPU) Cache
Upper level: SRAM (small, fast, expensive)
Lower level: DRAM (large, slow, cheap)
Goal: to provide a “virtual” memory technology that has the access time of SRAM with the size and cost of DRAM
Additional benefits:
  Reduces the memory bandwidth consumed by the processor, leaving more memory bandwidth for I/O
  No need to change the ISA
Direct-mapped Cache
Each memory block is mapped to a single cache block.
The mapped cache block is determined by (memory block address) mod (number of cache blocks).
Direct-Mapped Cache Example
Consider a direct-mapped cache with a block size of 4 bytes (1 word per block) and a total capacity of 4 KB:
  The 2 lowest address bits specify the byte within a block
  The next 10 address bits specify the block’s index within the cache
  The 20 highest address bits are the unique tag for the memory block
  The valid bit specifies whether the block is an accurate copy of memory
This organization exploits temporal locality.
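For this 4 KB cache with 4-byte blocks, the tag/index/offset split can be computed directly from a byte address. A small Python sketch (the helper name is an illustrative assumption):

```python
BLOCK_SIZE = 4     # bytes per block -> 2 byte-offset bits
NUM_BLOCKS = 1024  # 4 KB / 4 B     -> 10 index bits

def decompose(addr):
    """Split a byte address into (tag, index, byte offset)."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_BLOCKS
    tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
    return tag, index, offset

print(decompose(8188))  # (1, 1023, 0): tag 1, index 1023, offset 0
```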
On cache read
On a cache hit, the CPU proceeds normally.
On a cache miss (handled completely by hardware):
  Stall the CPU pipeline
  Fetch the missed block from the next level of the hierarchy
  Instruction cache miss: restart the instruction fetch
  Data cache miss: complete the data access
On cache write
Write-through
  Always write the data into both the cache and main memory
  Simple, but slow, and increases memory traffic (requires a write buffer)
Write-back
  Write the data into the cache only, and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer)
  Fast, but complex to implement, and causes a consistency problem
Write allocation
What should happen on a write miss?
Alternatives for write-through:
  Allocate on miss: fetch the block
  Write around: don’t fetch the block, since programs often write a whole block before reading it (e.g., initialization)
For write-back: usually fetch the block
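The traffic difference between the two write policies can be made concrete with a toy model (an illustrative assumption, not a full simulator) that counts main-memory writes for repeated stores to a single cached block:

```python
def write_through_traffic(num_stores):
    # Write-through: every store goes to memory as well as the cache.
    return num_stores

def write_back_traffic(num_stores):
    # Write-back: stores only dirty the cached block; memory is
    # written once, when the dirty block is eventually replaced.
    return 1 if num_stores > 0 else 0

print(write_through_traffic(100), write_back_traffic(100))  # 100 1
```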
Memory Reference Sequence
Look at the following sequence of memory references for the previous direct-mapped cache:
0, 4, 8188, 0, 16384, 0

Cache initially empty:

Index  Valid  Tag   Data
0      0      XXXX  XXXX
1      0      XXXX  XXXX
2      0      XXXX  XXXX
3      0      XXXX  XXXX
…
1021   0      XXXX  XXXX
1022   0      XXXX  XXXX
1023   0      XXXX  XXXX
After Reference 1
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)
Cache miss; place the block at index 0.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      0      XXXX                  XXXX
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX
After Reference 2
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000001 00
Cache miss; place the block at index 1.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX
After Reference 3
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000001 1111111111 00
Cache miss; place the block at index 1023.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
After Reference 4
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00
Cache hit to the block at index 0.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
After Reference 5
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000100 0000000000 00 [same index!]
Cache miss; replace the block at index 0.

Index  Valid  Tag                   Data
0      1      00000000000000000100  Memory bytes 16384…16387 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
After Reference 6
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00 [same index again!]
Cache miss; replace the block at index 0.
Total: 1 hit and 5 misses.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
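The six-reference walkthrough above can be replayed in a few lines. This is an illustrative sketch (not from the slides) that models only valid bits and tags for the 4 KB direct-mapped cache with 4-byte blocks:

```python
BLOCK_SIZE, NUM_BLOCKS = 4, 1024    # the 4 KB direct-mapped cache above

cache = {}                          # index -> tag of the block held there
hits = misses = 0
for addr in [0, 4, 8188, 0, 16384, 0]:
    index = (addr // BLOCK_SIZE) % NUM_BLOCKS
    tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
    if cache.get(index) == tag:
        hits += 1
    else:
        misses += 1
        cache[index] = tag          # fetch on miss, replacing any old block
print(hits, misses)  # 1 5
```

Addresses 0 and 16384 conflict at index 0, so that block keeps getting replaced.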
Exploiting Spatial Locality: Block Size Larger than One Word
16 KB direct-mapped cache with 256 blocks of 64 bytes (16 words) each
Miss Rate vs. Block Size
Set-Associative Caches
Allow multiple entries per index to improve hit rates n-way set associative caches allow up to n conflicting references to be
cached− n is the number of cache blocks in each set− n comparisons are needed to search all blocks in the set in parallel− When there is a conflict, which block is replaced (this was easy for direct mapped caches
– there`s only one entry!)
Fully-associative caches− a single (very large!) set allows a memory location to be placed in any cache block
Direct-mapped caches are essentially 1-way set-associative caches
For fixed cache capacity, higher associativity leads to higher hit rates Because more combinations of memory blocks can be present in the
cache Set associativity optimizes cache contents, but at what cost?
Cache Organization Spectrum
Implementation of Set Associative Cache
Cache Organization Example
[Figure: the same eight cache blocks organized four ways — one-way set associative (direct mapped), eight one-block sets; two-way set associative, four sets; four-way set associative, two sets; eight-way set associative (fully associative), one set. Each block holds a tag and data.]
Cache Block Replacement Policy
Direct-mapped caches
  No replacement policy is needed, since each memory block can be placed in only one cache block
N-way set-associative caches
  Each memory block can be placed in any of the n cache blocks in the mapped set
  The Least Recently Used (LRU) policy is typically used to select the block to be replaced within the mapped set
  LRU replaces the block that has not been used for the longest time
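LRU bookkeeping for one set can be sketched with an ordered dictionary: a hit refreshes a block’s recency, and a full set evicts its oldest entry. The tags 'A', 'B', 'C' are hypothetical placeholders:

```python
from collections import OrderedDict

def lru_set(capacity, refs):
    """Return the tags an n-way set would evict under LRU."""
    set_blocks = OrderedDict()   # tag -> None, ordered oldest-first
    evictions = []
    for tag in refs:
        if tag in set_blocks:
            set_blocks.move_to_end(tag)                  # refresh recency on a hit
        else:
            if len(set_blocks) == capacity:
                victim, _ = set_blocks.popitem(last=False)  # evict the LRU block
                evictions.append(victim)
            set_blocks[tag] = None
    return evictions

print(lru_set(2, ['A', 'B', 'A', 'C']))  # ['B']  (A was refreshed, so B is LRU)
```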
Miss Rate vs. Set Associativity
Memory Reference Sequence
Look again at the same sequence of memory references, now for a 2-way set-associative cache with a block size of two words (8 bytes):
0, 4, 8188, 0, 16384, 0
This sequence had 5 misses and 1 hit with the direct-mapped cache of the same capacity.

Cache initially empty:

Set  Valid  Tag   Data
0    0      XXXX  XXXX
     0      XXXX  XXXX
1    0      XXXX  XXXX
     0      XXXX  XXXX
…
255  0      XXXX  XXXX
     0      XXXX  XXXX
After Reference 1
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)
Cache miss; place in the first block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  0      XXXX                   XXXX
     0      XXXX                   XXXX
After Reference 2
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 100
Cache hit to the first block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  0      XXXX                   XXXX
     0      XXXX                   XXXX
After Reference 3
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000011 11111111 100
Cache miss; place in the first block of set 255.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
After Reference 4
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000
Cache hit to the first block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
After Reference 5
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000001000 00000000 000 [same set as address 0!]
Cache miss; place in the second block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     1      000000000000000001000  Memory bytes 16384…16391 (copy)
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
After Reference 6
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000
Cache hit to the first block of set 0.
Total: 3 hits and 3 misses.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     1      000000000000000001000  Memory bytes 16384…16391 (copy)
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
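Replaying the same sequence for this 2-way cache (8-byte blocks, 256 sets, LRU within a set) confirms the count. An illustrative sketch that tracks only tags:

```python
BLOCK_SIZE, NUM_SETS, WAYS = 8, 256, 2   # same 4 KB capacity as before

sets = [[] for _ in range(NUM_SETS)]     # each set: list of tags, LRU first
hits = misses = 0
for addr in [0, 4, 8188, 0, 16384, 0]:
    s = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    if tag in sets[s]:
        hits += 1
        sets[s].remove(tag)
        sets[s].append(tag)              # refresh LRU order on a hit
    else:
        misses += 1
        if len(sets[s]) == WAYS:
            sets[s].pop(0)               # evict the LRU tag
        sets[s].append(tag)
print(hits, misses)  # 3 3
```

The conflicting blocks at addresses 0 and 16384 now coexist in set 0, which is where the two extra hits come from.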
Improving Cache Performance
Cache performance is determined by: average memory access time = hit time + (miss rate × miss penalty)
Decrease hit time
  Make the cache smaller, but the miss rate increases
  Use direct mapping, but the miss rate increases
Decrease miss rate
  Make the cache larger, but this can increase hit time
  Add associativity, but this can increase hit time
  Increase the block size, but this increases the miss penalty
Decrease miss penalty
  Reduce the transfer-time component of the miss penalty
  Add another level of cache
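Adding another cache level works because the L1 miss penalty becomes the (much smaller) average access time of the L2, not the full memory latency. A sketch with assumed, illustrative timings:

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    # The L1 miss penalty is the average access time of the L2 level.
    l2_amat = l2_hit + l2_miss_rate * mem_time
    return l1_hit + l1_miss_rate * l2_amat

# Assumed numbers: 1 ns L1 hit, 5% L1 misses, 10 ns L2 hit, 20% L2 misses,
# 60 ns memory. Without the L2, the same L1 would average 1 + 0.05 * 60 = 4 ns;
# with it, 1 + 0.05 * (10 + 0.2 * 60) = 2.1 ns.
print(round(amat_two_level(1.0, 0.05, 10.0, 0.2, 60.0), 2))
```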
Current Cache Organizations
Intel Nehalem
  L1 I-cache (per core): 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a
  L1 D-cache (per core): 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  L2 unified cache (per core): 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  L3 unified cache (shared): 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
AMD Opteron X4
  L1 I-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles
  L1 D-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles
  L2 unified cache (per core): 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a
  L3 unified cache (shared): 2 MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles
n/a: data not available
Cache Coherence Problem
Suppose two CPU cores share a physical address space and use write-through caches.

Time step  Event                CPU A’s cache  CPU B’s cache  Memory
0                                                             0
1          CPU A reads X        0                             0
2          CPU B reads X        0              0              0
3          CPU A writes 1 to X  1              0              1

After step 3, CPU B’s cache still holds the stale value 0 for X.
Snoopy Protocols
Write-invalidate protocol
  On a write to shared data, an invalidate is sent to all caches, which snoop the bus and invalidate any copies
Write-broadcast protocol
  On a write to shared data, the new value is broadcast on the bus; processors snoop and update their copies
Write serialization: the bus serializes requests, since it is a single point of arbitration
Write-Invalidate Protocol
A cache gets exclusive access to a block when the block is to be written:
  It broadcasts an invalidate message on the bus
  A subsequent read in another cache misses, and the owning cache supplies the updated value

CPU activity         Bus activity      CPU A’s cache  CPU B’s cache  Memory
                                                                     0
CPU A reads X        Cache miss for X  0                             0
CPU B reads X        Cache miss for X  0              0              0
CPU A writes 1 to X  Invalidate for X  1                             0
CPU B reads X        Cache miss for X  1              1              1
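The table above can be animated with a toy model. This sketch (an assumed single-location model, not a real protocol implementation) keeps write-back semantics: memory stays stale until the owning cache services another core’s miss:

```python
caches = {'A': None, 'B': None}   # per-CPU cached value of X (None = invalid)
memory = 0
owner = None                      # CPU holding a dirty (modified) copy of X

def read(cpu):
    global memory, owner
    if caches[cpu] is None:               # cache miss for X
        if owner is not None:
            memory = caches[owner]        # owning cache supplies the value
            owner = None                  # ... and memory is updated
        caches[cpu] = memory
    return caches[cpu]

def write(cpu, value):
    global owner
    for other in caches:
        if other != cpu:
            caches[other] = None          # broadcast invalidate on the bus
    caches[cpu] = value                   # write-back: memory stays stale
    owner = cpu

read('A'); read('B')      # both caches now hold 0
write('A', 1)             # B's copy is invalidated; memory is still 0
print(read('B'))          # 1: B misses, and A supplies the updated value
```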
Summary
Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality:
  Temporal locality
  Spatial locality
The goal is to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.
Cache memory is an instance of a memory hierarchy:
  It exploits both temporal and spatial localities
  Direct-mapped caches are simple and fast but have higher miss rates
  Set-associative caches have lower miss rates but are more complex and slower
  Multilevel caches are becoming increasingly popular
  Cache coherence protocols ensure consistency among multiple caches