ECE 4750 Computer Architecture
Topic 5: Memory System Basics
Christopher Batten
School of Electrical and Computer Engineering
Cornell University
http://www.csl.cornell.edu/courses/ece4750
slide revision: 2013-09-22-21-12
Memory Technology Motivation for Caches Classifying Caches Cache Performance
Processors, Memories, Networks
ECE 4750 T05: Memory System Basics 2 / 37
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
Latches and Registers
Naive Register File
Memory Arrays: Register File
Memory Arrays: SRAMs
Memory Arrays: DRAMs
Relative Memory Sizes of SRAM vs. DRAM
[ Foss, “Implementing Application-Specific Memory”, ISSCC 1996 ]
[ Figure: DRAM on a memory chip vs. on-chip SRAM in a logic chip ]
Memory Technology Trade-offs
Latches/Registers → Register File → SRAM → DRAM

▶ Latches/Registers: low capacity, low latency, high bandwidth (more and wider ports)
▶ DRAM: high capacity, high latency, low bandwidth
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
CPU-Memory Bottleneck
Processor ↔ Main Memory

▶ Performance of high-speed computers is usually limited by memory bandwidth and latency
▶ Latency is the time for a single access. Main memory latency is usually much greater than the processor cycle time
▶ Bandwidth is the number of accesses per unit time. If a fraction m of instructions are loads/stores, there are 1 + m memory accesses per instruction, so CPI = 1 requires at least 1 + m memory accesses per cycle
▶ Bandwidth-delay product is the amount of data that can be in flight at the same time (Little's Law)
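To make the bandwidth-delay product concrete, here is a small sketch. The numbers (16 GB/s of bandwidth, 100 ns latency) are hypothetical, chosen only for illustration:

```python
# Little's Law applied to a memory system: to keep the memory busy,
# roughly (bandwidth x latency) bytes must be in flight at once.

def bandwidth_delay_product(bandwidth_bytes_per_s, latency_ns):
    """Bytes that must be in flight to saturate the memory."""
    return bandwidth_bytes_per_s * latency_ns // 10**9

# Hypothetical numbers: 16 GB/s of bandwidth, 100 ns latency
bdp = bandwidth_delay_product(16 * 10**9, 100)
print(bdp)        # 1600 bytes in flight
print(bdp // 64)  # i.e., 25 outstanding 64-byte cache lines
```

With these numbers, a processor that keeps fewer than 25 cache-line misses outstanding cannot use the full memory bandwidth.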
Processor-DRAM Latency Gap
[ Figure: performance (log scale, 1 to 1000) vs. year, 1980–2000; processor performance grows ~60%/year while DRAM performance grows ~7%/year, so the processor-memory performance gap grows ~50%/year ]
▶ A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!
▶ Long latencies mean large bandwidth-delay products, which can be difficult to saturate, meaning bandwidth is wasted
Physical Size Affects Latency
Processor + Small Memory vs. Processor + Large Memory

▶ Signals have further to travel
▶ Fan out to more locations
Memory Hierarchy
Processor ↔ Small/Fast Memory (RF, SRAM) ↔ Big/Slow Memory (DRAM)

▶ Capacity: Register << SRAM << DRAM
▶ Latency: Register << SRAM << DRAM
▶ Bandwidth: on-chip >> off-chip
▶ On a data access:
  . if data is in fast memory → low-latency access to SRAM
  . if data is not in fast memory → long-latency access to DRAM
▶ Memory hierarchies only work if the small, fast memory actually stores data that is reused by the processor
Management of Memory Hierarchy
▶ Small/fast storage (e.g., registers)
  . Address usually specified in the instruction
  . Generally implemented directly as a register file
▶ Large/slower storage (e.g., main memory)
  . Addresses usually computed from values in registers
  . Generally implemented with large off-chip DRAM
▶ Ideally we would like an intermediate level in between registers and main memory (e.g., a cache)
  . Need to predict what is kept in fast memory
  . Generally implemented with small on-chip/off-chip SRAM
Desk/Library Analogy
▶ How many books to check out at once?
▶ How long to keep books on the desk?
▶ Which books to return?
▶ How to find books on your desk?
Common And Predictable Memory Reference Patterns
[ Figure: address vs. time for common access patterns. Instruction fetches show n loop iterations; stack accesses show subroutine call/return and argument access; data accesses show vector accesses and scalar accesses ]
▶ Temporal Locality – if a location is referenced, it is likely to be referenced again in the near future
▶ Spatial Locality – if a location is referenced, it is likely that locations near it will be referenced in the near future
Real Memory Reference Patterns

[ Figure: memory address vs. time, one dot per access to that address at that time; regions show spatial locality, temporal locality, and both spatial and temporal locality ]

[ D.J. Hatfield and J. Gerald, "Program Restructuring for Virtual Memory," IBM Systems Journal, 10(3): 168–192, 1971 ]
Caches Exploit Both Types of Locality
Processor ↔ Small/Fast Memory (RF, SRAM) ↔ Big/Slow Memory (DRAM)

▶ Exploit temporal locality by remembering the contents of recently accessed locations
▶ Exploit spatial locality by fetching blocks of data around recently accessed locations
PARCv4 Multicore Systems
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
Inside a Cache
[ Figure: Processor ↔ Cache ↔ Main Memory, with address and data passed between each. Each cache line holds an address tag (e.g., 100, 304, 416, 6848) and a data block of several data bytes; the line tagged 100 holds copies of main memory locations 100, 101, ... ]
Basic Cache Algorithm for a Load
▶ Processor issues a load request to the cache
▶ Compare the request address to the cache tags to see if there is a match
  . Cache Hit (found in cache): return a copy of the data from the cache
  . Cache Miss (not in cache): read the block of data from main memory, replace a victim block in the cache with the new block, then return a copy of the data from the cache
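The load algorithm can be sketched as code for a direct-mapped cache. This is an illustrative model only; the class name and layout are assumptions, not the course's code:

```python
# Minimal direct-mapped cache model. Backing memory is a dict of
# address -> byte; each cache line is None or a (tag, block) pair.

class Cache:
    def __init__(self, num_lines, block_size, memory):
        self.num_lines = num_lines
        self.block_size = block_size
        self.memory = memory
        self.lines = [None] * num_lines

    def load(self, addr):
        """Return (data, hit); on a miss, fetch the block and replace."""
        block_addr = addr // self.block_size
        index = block_addr % self.num_lines        # which line to check
        tag = block_addr // self.num_lines
        line = self.lines[index]
        hit = line is not None and line[0] == tag  # tag match = cache hit
        if not hit:                                # miss: fetch from memory,
            block = [self.memory.get(block_addr * self.block_size + i, 0)
                     for i in range(self.block_size)]
            self.lines[index] = (tag, block)       # replace the victim line
        tag, block = self.lines[index]
        return block[addr % self.block_size], hit

cache = Cache(4, 8, {100: 42})
print(cache.load(100))  # (42, False) -- first access misses
print(cache.load(100))  # (42, True)  -- second access hits
```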
Classifying Caches
Processor ↔ Small/Fast Memory (RF, SRAM) ↔ Big/Slow Memory (DRAM)
▶ Block Placement – Where can a block be placed in the cache?
▶ Block Identification – How is a block found if it is in the cache?
▶ Block Replacement – Which block should be replaced on a miss?
▶ Write Strategy – What happens on a write?
Block Placement – Where place block in cache?
[ Figure: a cache with 8 blocks (block numbers 0–7; set numbers 0–3 when 2-way) and a memory with 32 blocks (block numbers 0–31) ]

Block 12 can be placed:
  . Fully Associative: anywhere
  . (2-way) Set Associative: anywhere in set 0 (12 mod 4)
  . Direct Mapped: only into block 4 (12 mod 8)
Block Identification – How find block in cache?
[ Figure: the block address is split into Tag | Index | Offset; the cache holds a valid bit (V), a tag, and a data block for each of Set 0–Set 3 ]

▶ Cache uses the index and offset to find a potential match, then checks the tag
▶ Tag check only includes the higher-order bits; the low-order bits are implicit
▶ In this example (direct-mapped, 8 B blocks, 4-line cache)
  . Number of bits in offset? (3, since 2^3 = 8 B per block)
  . Number of bits in index? (2, since 2^2 = 4 lines)
  . Number of bits in tag? (the remaining address bits)
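The tag/index/offset split can be computed directly. A sketch assuming byte addressing and power-of-two sizes; the function name is illustrative:

```python
from math import log2

def split_address(addr, block_size, num_lines, ways=1):
    """Split a byte address into (tag, index, offset) fields."""
    num_sets = num_lines // ways               # direct-mapped: sets == lines
    offset_bits = int(log2(block_size))
    index_bits = int(log2(num_sets))
    offset = addr & (block_size - 1)           # byte within the block
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)   # all remaining high bits
    return tag, index, offset

# Direct-mapped, 8 B blocks, 4 lines: 3 offset bits, 2 index bits
print(split_address(0b110110101, 8, 4))           # (13, 2, 5)
# 2-way, 8 B blocks, 4 lines: 2 sets, so only 1 index bit
print(split_address(0b110110101, 8, 4, ways=2))   # (27, 0, 5)
```

Note how increasing associativity moves bits from the index into the tag, since there are fewer sets to select among.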
Block Identification – How find block in cache?
[ Figure: the block address is split into Tag | Index | Offset; each of the two ways holds a valid bit (V), a tag, and a data block for Set 0 and Set 1 ]

▶ Cache checks all potential blocks with a fast parallel tag check
▶ In this example (2-way set-associative, 8 B blocks, 4-line cache)
  . Number of bits in offset? (3, since 2^3 = 8 B per block)
  . Number of bits in index? (1, since there are 2 sets)
  . Number of bits in tag? (the remaining address bits)
Block Replacement – Which block to replace?
▶ No choice in a direct-mapped cache
▶ In an associative cache, which block from the set should be evicted when the set becomes full?
▶ Random
▶ Least Recently Used (LRU)
  . LRU cache state must be updated on every access
  . True implementation only feasible for small sets (2-way)
  . Pseudo-LRU binary tree often used for 4–8 way
▶ First In, First Out (FIFO), aka Round-Robin
  . Used in highly associative caches
▶ Not Least Recently Used (NLRU)
  . FIFO with an exception for the most recently used block(s)
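True LRU for a single set can be sketched with a recency-ordered list. This is illustrative only; hardware keeps per-way age bits or a pseudo-LRU tree, not a list:

```python
# True-LRU bookkeeping for one cache set: least recently used tag
# first, most recently used last.

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = []                 # LRU order: oldest first, MRU last

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU tag if full."""
        if tag in self.tags:
            self.tags.remove(tag)      # hit: move tag to the MRU position
            self.tags.append(tag)
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)           # evict the least recently used tag
        self.tags.append(tag)          # fill with the new tag (now MRU)
        return False

s = LRUSet(2)
s.access("A"); s.access("B")           # two misses fill the 2-way set
s.access("A")                          # hit: A becomes MRU
print(s.access("C"))                   # miss: evicts B (the LRU) -- False
```

This list-walking cost on every access is exactly why true LRU is only feasible for small sets.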
Write Strategy – How are writes handled?
▶ Cache Hit
  . Write Through – write both cache and memory; generally higher traffic but simpler to design
  . Write Back – write cache only; memory is written when the block is evicted; a dirty bit per block avoids unnecessary write backs; more complicated
▶ Cache Miss
  . No Write Allocate – only write to main memory
  . Write Allocate – fetch the block into the cache, then write
▶ Common Combinations
  . Write Through & No Write Allocate
  . Write Back & Write Allocate
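The write-back & write-allocate combination can be sketched for a direct-mapped cache. A minimal model under assumed names and layout (each line is `None` or a `(tag, block, dirty)` tuple):

```python
# Write-back, write-allocate store: on a miss, write back the dirty
# victim, fetch the block (write allocate), then write the cache only
# and set the dirty bit. Memory is untouched until eviction.

def store(lines, memory, addr, value, block_size):
    block_addr = addr // block_size
    index = block_addr % len(lines)
    tag = block_addr // len(lines)
    line = lines[index]
    if line is None or line[0] != tag:                 # store miss
        if line is not None and line[2]:               # dirty victim:
            v_tag, v_block, _ = line                   # write it back
            base = (v_tag * len(lines) + index) * block_size
            for i, byte in enumerate(v_block):
                memory[base + i] = byte
        block = [memory.get(block_addr * block_size + i, 0)
                 for i in range(block_size)]           # write allocate
        line = (tag, block, False)
    tag, block, _ = line
    block[addr % block_size] = value                   # write cache only
    lines[index] = (tag, block, True)                  # mark line dirty

lines, memory = [None] * 4, {}
store(lines, memory, 100, 7, 8)
print(memory)          # {} -- write back defers the memory update
store(lines, memory, 4, 9, 8)   # same index, different tag: eviction
print(memory[100])     # 7 -- dirty victim was written back
```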
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
Processor Performance
Processor ↔ Cache ↔ Main Memory (hits served by the cache, misses go to main memory)

[ Figure: pipeline diagram for 100 iterations of a loop (lw, addu, lw, addu, beq, subu); a cache miss on a lw stalls in the M stage for several cycles, stalling the following instructions ]

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
What is time / program for this simple example?
Average Memory Access Time
Processor ↔ Cache ↔ Main Memory (hits served by the cache, misses go to main memory)

[ Figure: the same pipeline diagram – 100 loop iterations with a cache miss stalling the pipeline ]
Avg Mem Access Time = Hit Time + ( Miss Rate × Miss Penalty )
What is average memory access time for this simple example?
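Plugging numbers into the formula makes the trade-off concrete. The values below (1-cycle hit time, 5% miss rate, 40-cycle miss penalty) are hypothetical, not from the example above:

```python
# Average memory access time = hit time + miss rate x miss penalty

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Hypothetical: 1-cycle hit, 5% miss rate, 40-cycle miss penalty
print(amat(1, 0.05, 40))   # 3.0 cycles on average
```

Even a small miss rate can triple the average access time when the miss penalty is large, which is why reducing miss rate and miss penalty both matter.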
Categorizing Misses: The Three C’s
▶ Compulsory – first reference to a block; these occur even with an infinite cache
▶ Capacity – the cache is too small to hold all the data needed by the program; these occur even under a perfect replacement policy (e.g., a loop over 5 cache lines with a 4-line cache)
▶ Conflict – misses that occur because of collisions due to less-than-full associativity (e.g., a loop over 3 cache lines that map to the same set)
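The three categories can be separated by simulation in the standard way: compulsory misses are the misses of an infinite cache, capacity misses are the extra misses of a fully associative LRU cache of the target size, and conflict misses are whatever remains. A sketch (illustrative, not the course's simulator):

```python
def count_misses(trace, num_lines=None):
    """LRU miss count over a block trace; num_lines=None = infinite cache."""
    resident, misses = [], 0
    for block in trace:
        if block in resident:
            resident.remove(block)         # hit: refresh LRU position
        else:
            misses += 1
            if num_lines is not None and len(resident) == num_lines:
                resident.pop(0)            # evict least recently used
        resident.append(block)             # block is now MRU
    return misses

def direct_mapped_misses(trace, num_lines):
    lines, misses = [None] * num_lines, 0
    for block in trace:
        index = block % num_lines          # blocks may collide on an index
        if lines[index] != block:
            misses += 1
            lines[index] = block
    return misses

trace = [0, 4, 8, 0, 4, 8] * 2             # 3 blocks colliding in 4 lines
compulsory = count_misses(trace)                    # infinite cache
capacity = count_misses(trace, 4) - compulsory      # fully assoc., 4 lines
conflict = direct_mapped_misses(trace, 4) - compulsory - capacity
print(compulsory, capacity, conflict)      # 3 0 9
```

The 3 colliding blocks fit easily in a fully associative 4-line cache (no capacity misses), yet every access after the first three misses in the direct-mapped cache: pure conflict misses.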
Summary
▶ Technology constraints lead to a fundamental trade-off between memory size (capacity) and memory latency
▶ Creating a memory hierarchy enables both low latency and high capacity, IF the higher levels of the hierarchy store data that is reused by the processor
  . Temporal Locality – refers to likely reuse in time
  . Spatial Locality – refers to likely reuse in space
▶ Caches exploit temporal and spatial locality, and involve a variety of different design decisions
  . Block Placement – Where can a block be placed in the cache?
  . Block Identification – How is a block found if it is in the cache?
  . Block Replacement – Which block should be replaced on a miss?
  . Write Strategy – What happens on a write?
▶ Average Memory Access Time – a simple metric for evaluating trade-offs
Acknowledgements
Some of these slides contain material developed and copyrighted by:
Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB)

MIT material derived from course 6.823; UCB material derived from courses CS152 and CS252