Memory/Storage Architecture Lab
Computer Architecture
Memory Hierarchy
Technology Trends
[Figure: DRAM row-access (Trac) and column-access (Tcac) times, on a 0–300 ns scale, falling from 1980 to 2007]
Year  Capacity  $/GB
1980  64 Kbit   $1,500,000
1983  256 Kbit  $500,000
1985  1 Mbit    $200,000
1989  4 Mbit    $50,000
1992  16 Mbit   $15,000
1996  64 Mbit   $10,000
1998  128 Mbit  $4,000
2000  256 Mbit  $1,000
2004  512 Mbit  $250
2007  1 Gbit    $50
Memory Hierarchy
Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.
Burks, Goldstine, and von Neumann, 1946
[Figure: memory hierarchy pyramid — the CPU sits atop Level 1, Level 2, …, Level n; memory size grows toward the lower levels, while speed, bandwidth, and cost per bit grow toward the CPU]
Memory Technology (Big Picture)
[Figure: the big picture — a processor (control + datapath) backed by a chain of memories; the memory nearest the processor is fastest, smallest, and most expensive, and the farthest is slowest, biggest, and cheapest]
Memory Technology (Real-world Realization)
[Figure: real-world realization — a processor containing control, datapath, registers, and on-chip caches; then off-chip level caches (SRAM), main memory (DRAM), and secondary storage (disk)]
            Register   Cache     Main Memory  Disk
Speed       < 1 ns     < 5 ns    50–70 ns     5–20 ms
Size        ~100 B     KB–MB     MB–GB        GB–TB
Management  Compiler   Hardware  OS           OS
Memory Hierarchy
An optimization resulting from a perfect match between memory technology and two types of program locality:
  Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
Goal: to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.
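Both localities show up in ordinary code. The sketch below (illustrative, not from the slides) sums the same matrix twice: the accumulator is touched on every iteration (temporal locality), the row-major walk visits adjacent elements in address order (good spatial locality), and the column-major walk jumps between rows (poor spatial locality):

```python
N = 4
matrix = [[r * N + c for c in range(N)] for r in range(N)]

def row_major_sum(m):
    # Good spatial locality: consecutive accesses touch adjacent addresses.
    total = 0
    for r in range(N):
        for c in range(N):
            total += m[r][c]
    return total

def column_major_sum(m):
    # Poor spatial locality: consecutive accesses jump between rows.
    total = 0
    for c in range(N):
        for r in range(N):
            total += m[r][c]
    return total

print(row_major_sum(matrix) == column_major_sum(matrix))  # True
```

Same result either way; only the access pattern, and hence the cache behavior on a real machine, differs.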
Temporal and Spatial Localities
[Figure: memory-reference trace illustrating temporal and spatial localities. Source: Glass & Cao (1997 ACM SIGMETRICS)]
Memory Hierarchy Terminology
Hit: accessed data is found in the upper level
  Hit rate = fraction of accesses found in the upper level
  Hit time = time to access the upper level
Miss: accessed data is found only in a lower level
  The processor waits until the data is fetched from the next level, then restarts/continues the access
  Miss rate = 1 − hit rate
  Miss penalty = time to get the block from the lower level + time to replace it in the upper level
Since hit time << miss penalty, average memory access time << worst-case access time
Average memory access time = hit time + miss rate × miss penalty
Data are transferred between levels in units of blocks
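The average-access-time formula is easy to sanity-check numerically. A minimal sketch; the numbers (1 ns hit time, 5% miss rate, 60 ns miss penalty) are illustrative assumptions, not values from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# With only 5% misses, the average stays close to the 1 ns hit time,
# far below the worst case (hit time + miss penalty = 61 ns).
print(amat(1.0, 0.05, 60.0))  # 4.0
```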
(CPU) Cache
Upper level: SRAM (small, fast, expensive)
Lower level: DRAM (large, slow, cheap)
Goal: to provide a “virtual” memory technology that has the access time of SRAM with the size and cost of DRAM
Additional benefits:
  Reduces the memory bandwidth consumed by the processor, leaving more memory bandwidth for I/O
  No need to change the ISA
Direct-mapped Cache
Each memory block is mapped to a single cache block.
The mapped cache block is determined by (memory block address) mod (number of cache blocks).
Direct-Mapped Cache Example
Consider a direct-mapped cache with a block size of 4 bytes (1 word per block) and a total capacity of 4 KB:
  The 2 lowest address bits specify the byte within a block
  The next 10 address bits specify the block’s index within the cache
  The 20 highest address bits are the unique tag for the memory block
  The valid bit specifies whether the block is an accurate copy of memory
This organization exploits temporal locality.
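For this 4 KB cache with 4-byte blocks, the tag/index/offset split can be computed directly from a byte address. A small Python sketch (the helper name is an illustrative assumption):

```python
BLOCK_SIZE = 4     # bytes per block -> 2 byte-offset bits
NUM_BLOCKS = 1024  # 4 KB / 4 B     -> 10 index bits

def decompose(addr):
    """Split a byte address into (tag, index, byte offset)."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_BLOCKS
    tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
    return tag, index, offset

print(decompose(8188))  # (1, 1023, 0): tag 1, index 1023, offset 0
```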
On cache read
On a cache hit, the CPU proceeds normally.
On a cache miss (handled completely by hardware):
  Stall the CPU pipeline
  Fetch the missed block from the next level of the hierarchy
  Instruction cache miss: restart the instruction fetch
  Data cache miss: complete the data access
On cache write
Write-through
  Always write the data into both the cache and main memory
  Simple, but slow, and increases memory traffic (requires a write buffer)
Write-back
  Write the data into the cache only, and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer)
  Fast, but complex to implement, and causes a consistency problem
Write allocation
What should happen on a write miss?
Alternatives for write-through:
  Allocate on miss: fetch the block
  Write around: don’t fetch the block, since programs often write a whole block before reading it (e.g., initialization)
For write-back: usually fetch the block
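The traffic difference between the two write policies can be made concrete with a toy model (an illustrative assumption, not a full simulator) that counts main-memory writes for repeated stores to a single cached block:

```python
def write_through_traffic(num_stores):
    # Write-through: every store goes to memory as well as the cache.
    return num_stores

def write_back_traffic(num_stores):
    # Write-back: stores only dirty the cached block; memory is
    # written once, when the dirty block is eventually replaced.
    return 1 if num_stores > 0 else 0

print(write_through_traffic(100), write_back_traffic(100))  # 100 1
```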
Memory Reference Sequence
Look at the following sequence of memory references for the previous direct-mapped cache:
0, 4, 8188, 0, 16384, 0

Cache initially empty:

Index  Valid  Tag   Data
0      0      XXXX  XXXX
1      0      XXXX  XXXX
2      0      XXXX  XXXX
3      0      XXXX  XXXX
…
1021   0      XXXX  XXXX
1022   0      XXXX  XXXX
1023   0      XXXX  XXXX
After Reference 1
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)
Cache miss; place the block at index 0.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      0      XXXX                  XXXX
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX
After Reference 2
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000001 00
Cache miss; place the block at index 1.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX
After Reference 3
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000001 1111111111 00
Cache miss; place the block at index 1023.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
After Reference 4
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00
Cache hit to the block at index 0.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
After Reference 5
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000100 0000000000 00 [same index!]
Cache miss; replace the block at index 0.

Index  Valid  Tag                   Data
0      1      00000000000000000100  Memory bytes 16384…16387 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
After Reference 6
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00 [same index again!]
Cache miss; replace the block at index 0.
Total: 1 hit and 5 misses.

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)
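The six-reference walkthrough above can be replayed in a few lines. This is an illustrative sketch (not from the slides) that models only valid bits and tags for the 4 KB direct-mapped cache with 4-byte blocks:

```python
BLOCK_SIZE, NUM_BLOCKS = 4, 1024    # the 4 KB direct-mapped cache above

cache = {}                          # index -> tag of the block held there
hits = misses = 0
for addr in [0, 4, 8188, 0, 16384, 0]:
    index = (addr // BLOCK_SIZE) % NUM_BLOCKS
    tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
    if cache.get(index) == tag:
        hits += 1
    else:
        misses += 1
        cache[index] = tag          # fetch on miss, replacing any old block
print(hits, misses)  # 1 5
```

Addresses 0 and 16384 conflict at index 0, so that block keeps getting replaced.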
Exploiting Spatial Locality: Block Size Larger than One Word
16 KB direct-mapped cache with 256 blocks of 64 bytes (16 words) each
Miss Rate vs. Block Size
Set-Associative Caches
Allow multiple entries per index to improve hit rates n-way set associative caches allow up to n conflicting references to be
cached− n is the number of cache blocks in each set− n comparisons are needed to search all blocks in the set in parallel− When there is a conflict, which block is replaced (this was easy for direct mapped caches
– there`s only one entry!)
Fully-associative caches− a single (very large!) set allows a memory location to be placed in any cache block
Direct-mapped caches are essentially 1-way set-associative caches
For fixed cache capacity, higher associativity leads to higher hit rates Because more combinations of memory blocks can be present in the
cache Set associativity optimizes cache contents, but at what cost?
Cache Organization Spectrum
Implementation of Set Associative Cache
Cache Organization Example
[Figure: the same eight cache blocks organized four ways — one-way set associative (direct mapped), eight one-block sets; two-way set associative, four sets; four-way set associative, two sets; eight-way set associative (fully associative), one set. Each block holds a tag and data.]
Cache Block Replacement Policy
Direct-mapped caches
  No replacement policy is needed, since each memory block can be placed in only one cache block
N-way set-associative caches
  Each memory block can be placed in any of the n cache blocks in the mapped set
  The Least Recently Used (LRU) policy is typically used to select the block to be replaced within the mapped set
  LRU replaces the block that has not been used for the longest time
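LRU bookkeeping for one set can be sketched with an ordered dictionary: a hit refreshes a block’s recency, and a full set evicts its oldest entry. The tags 'A', 'B', 'C' are hypothetical placeholders:

```python
from collections import OrderedDict

def lru_set(capacity, refs):
    """Return the tags an n-way set would evict under LRU."""
    set_blocks = OrderedDict()   # tag -> None, ordered oldest-first
    evictions = []
    for tag in refs:
        if tag in set_blocks:
            set_blocks.move_to_end(tag)                  # refresh recency on a hit
        else:
            if len(set_blocks) == capacity:
                victim, _ = set_blocks.popitem(last=False)  # evict the LRU block
                evictions.append(victim)
            set_blocks[tag] = None
    return evictions

print(lru_set(2, ['A', 'B', 'A', 'C']))  # ['B']  (A was refreshed, so B is LRU)
```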
Miss Rate vs. Set Associativity
Memory Reference Sequence
Look again at the same sequence of memory references, now for a 2-way set-associative cache with a block size of two words (8 bytes):
0, 4, 8188, 0, 16384, 0
This sequence had 5 misses and 1 hit with the direct-mapped cache of the same capacity.

Cache initially empty:

Set  Valid  Tag   Data
0    0      XXXX  XXXX
     0      XXXX  XXXX
1    0      XXXX  XXXX
     0      XXXX  XXXX
…
255  0      XXXX  XXXX
     0      XXXX  XXXX
After Reference 1
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)
Cache miss; place in the first block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  0      XXXX                   XXXX
     0      XXXX                   XXXX
After Reference 2
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 100
Cache hit to the first block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  0      XXXX                   XXXX
     0      XXXX                   XXXX
After Reference 3
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000011 11111111 100
Cache miss; place in the first block of set 255.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
After Reference 4
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000
Cache hit to the first block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     0      XXXX                   XXXX
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
After Reference 5
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000001000 00000000 000 [same set as address 0!]
Cache miss; place in the second block of set 0.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     1      000000000000000001000  Memory bytes 16384…16391 (copy)
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
After Reference 6
Sequence: 0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000
Cache hit to the first block of set 0.
Total: 3 hits and 3 misses.

Set  Valid  Tag                    Data
0    1      000000000000000000000  Memory bytes 0…7 (copy)
     1      000000000000000001000  Memory bytes 16384…16391 (copy)
1    0      XXXX                   XXXX
     0      XXXX                   XXXX
…
255  1      000000000000000000011  Memory bytes 8184…8191 (copy)
     0      XXXX                   XXXX
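Replaying the same sequence for this 2-way cache (8-byte blocks, 256 sets, LRU within a set) confirms the count. An illustrative sketch that tracks only tags:

```python
BLOCK_SIZE, NUM_SETS, WAYS = 8, 256, 2   # same 4 KB capacity as before

sets = [[] for _ in range(NUM_SETS)]     # each set: list of tags, LRU first
hits = misses = 0
for addr in [0, 4, 8188, 0, 16384, 0]:
    s = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    if tag in sets[s]:
        hits += 1
        sets[s].remove(tag)
        sets[s].append(tag)              # refresh LRU order on a hit
    else:
        misses += 1
        if len(sets[s]) == WAYS:
            sets[s].pop(0)               # evict the LRU tag
        sets[s].append(tag)
print(hits, misses)  # 3 3
```

The conflicting blocks at addresses 0 and 16384 now coexist in set 0, which is where the two extra hits come from.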
Improving Cache Performance
Cache performance is determined by: average memory access time = hit time + (miss rate × miss penalty)
Decrease hit time
  Make the cache smaller, but the miss rate increases
  Use direct mapping, but the miss rate increases
Decrease miss rate
  Make the cache larger, but this can increase hit time
  Add associativity, but this can increase hit time
  Increase the block size, but this increases the miss penalty
Decrease miss penalty
  Reduce the transfer-time component of the miss penalty
  Add another level of cache
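Adding another cache level works because the L1 miss penalty becomes the (much smaller) average access time of the L2, not the full memory latency. A sketch with assumed, illustrative timings:

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    # The L1 miss penalty is the average access time of the L2 level.
    l2_amat = l2_hit + l2_miss_rate * mem_time
    return l1_hit + l1_miss_rate * l2_amat

# Assumed numbers: 1 ns L1 hit, 5% L1 misses, 10 ns L2 hit, 20% L2 misses,
# 60 ns memory. Without the L2, the same L1 would average 1 + 0.05 * 60 = 4 ns;
# with it, 1 + 0.05 * (10 + 0.2 * 60) = 2.1 ns.
print(round(amat_two_level(1.0, 0.05, 10.0, 0.2, 60.0), 2))
```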
Current Cache Organizations
Intel Nehalem
  L1 I-cache (per core): 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a
  L1 D-cache (per core): 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  L2 unified cache (per core): 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  L3 unified cache (shared): 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
AMD Opteron X4
  L1 I-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles
  L1 D-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles
  L2 unified cache (per core): 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a
  L3 unified cache (shared): 2 MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles
n/a: data not available
Cache Coherence Problem
Suppose two CPU cores share a physical address space and use write-through caches.

Time step  Event                CPU A’s cache  CPU B’s cache  Memory
0                                                             0
1          CPU A reads X        0                             0
2          CPU B reads X        0              0              0
3          CPU A writes 1 to X  1              0              1

After step 3, CPU B’s cache still holds the stale value 0 for X.
Snoopy Protocols
Write-invalidate protocol
  On a write to shared data, an invalidate is sent to all caches, which snoop the bus and invalidate any copies
Write-broadcast protocol
  On a write to shared data, the new value is broadcast on the bus; processors snoop and update their copies
Write serialization: the bus serializes requests, since it is a single point of arbitration
Write-Invalidate Protocol
A cache gets exclusive access to a block when the block is to be written:
  It broadcasts an invalidate message on the bus
  A subsequent read in another cache misses, and the owning cache supplies the updated value

CPU activity         Bus activity      CPU A’s cache  CPU B’s cache  Memory
                                                                     0
CPU A reads X        Cache miss for X  0                             0
CPU B reads X        Cache miss for X  0              0              0
CPU A writes 1 to X  Invalidate for X  1                             0
CPU B reads X        Cache miss for X  1              1              1
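The table above can be animated with a toy model. This sketch (an assumed single-location model, not a real protocol implementation) keeps write-back semantics: memory stays stale until the owning cache services another core’s miss:

```python
caches = {'A': None, 'B': None}   # per-CPU cached value of X (None = invalid)
memory = 0
owner = None                      # CPU holding a dirty (modified) copy of X

def read(cpu):
    global memory, owner
    if caches[cpu] is None:               # cache miss for X
        if owner is not None:
            memory = caches[owner]        # owning cache supplies the value
            owner = None                  # ... and memory is updated
        caches[cpu] = memory
    return caches[cpu]

def write(cpu, value):
    global owner
    for other in caches:
        if other != cpu:
            caches[other] = None          # broadcast invalidate on the bus
    caches[cpu] = value                   # write-back: memory stays stale
    owner = cpu

read('A'); read('B')      # both caches now hold 0
write('A', 1)             # B's copy is invalidated; memory is still 0
print(read('B'))          # 1: B misses, and A supplies the updated value
```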
Summary
Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality:
  Temporal locality
  Spatial locality
The goal is to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.
Cache memory is an instance of a memory hierarchy:
  It exploits both temporal and spatial localities
  Direct-mapped caches are simple and fast but have higher miss rates
  Set-associative caches have lower miss rates but are more complex and slower
  Multilevel caches are becoming increasingly popular
  Cache coherence protocols ensure consistency among multiple caches