ECE 4750 Computer Architecture
Topic 5: Memory System Basics
Christopher Batten
School of Electrical and Computer Engineering
Cornell University
http://www.csl.cornell.edu/courses/ece4750
slide revision: 2013-09-22-21-12
Memory Technology Motivation for Caches Classifying Caches Cache Performance
Processors, Memories, Networks
ECE 4750 T05: Memory System Basics 2 / 37
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
Latches and Registers
Naive Register File
Memory Arrays: Register File
Memory Arrays: SRAMs
Memory Arrays: DRAMs
Relative Memory Sizes of SRAM vs. DRAM
[ Foss, “Implementing Application-Specific Memory”, ISSCC 1996 ]
[ Figure: DRAM on a memory chip vs. on-chip SRAM in a logic chip ]
Memory Technology Trade-offs
Latches/Registers → Register File → SRAM → DRAM

▶ Latches/Registers: low capacity, low latency, high bandwidth (more and wider ports)
▶ DRAM: high capacity, high latency, low bandwidth
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
CPU-Memory Bottleneck
Processor ↔ Main Memory

▶ Performance of high-speed computers is usually limited by memory bandwidth and latency
▶ Latency is the time for a single access. Main memory latency is usually much greater than the processor cycle time
▶ Bandwidth is the number of accesses per unit time. If a fraction m of instructions are loads/stores, there are 1 + m memory accesses per instruction, so CPI = 1 requires at least 1 + m memory accesses per cycle
▶ Bandwidth-delay product is the amount of data that can be in flight at the same time (Little's Law)
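To make the bandwidth-delay product concrete, here is a small sketch. The numbers (16 GB/s of bandwidth, 100 ns latency) are hypothetical, chosen only for illustration:

```python
# Little's Law applied to a memory system: to keep the memory busy,
# roughly (bandwidth x latency) bytes must be in flight at once.

def bandwidth_delay_product(bandwidth_bytes_per_s, latency_ns):
    """Bytes that must be in flight to saturate the memory."""
    return bandwidth_bytes_per_s * latency_ns // 10**9

# Hypothetical numbers: 16 GB/s of bandwidth, 100 ns latency
bdp = bandwidth_delay_product(16 * 10**9, 100)
print(bdp)        # 1600 bytes in flight
print(bdp // 64)  # i.e., 25 outstanding 64-byte cache lines
```

With these numbers, a processor that keeps fewer than 25 cache-line misses outstanding cannot use the full memory bandwidth.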
Processor-DRAM Latency Gap
[ Figure: performance (log scale, 1 to 1000) vs. year, 1980–2000; processor performance grows ~60%/year while DRAM performance grows ~7%/year, so the processor-memory performance gap grows ~50%/year ]
▶ A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!
▶ Long latencies mean large bandwidth-delay products, which can be difficult to saturate, meaning bandwidth is wasted
Physical Size Affects Latency
Processor + Small Memory vs. Processor + Large Memory

▶ Signals have further to travel
▶ Fan out to more locations
Memory Hierarchy
Processor ↔ Small/Fast Memory (RF, SRAM) ↔ Big/Slow Memory (DRAM)

▶ Capacity: Register << SRAM << DRAM
▶ Latency: Register << SRAM << DRAM
▶ Bandwidth: on-chip >> off-chip
▶ On a data access:
  . if data is in fast memory → low-latency access to SRAM
  . if data is not in fast memory → long-latency access to DRAM
▶ Memory hierarchies only work if the small, fast memory actually stores data that is reused by the processor
Management of Memory Hierarchy
▶ Small/fast storage (e.g., registers)
  . Address usually specified in the instruction
  . Generally implemented directly as a register file
▶ Large/slower storage (e.g., main memory)
  . Addresses usually computed from values in registers
  . Generally implemented with large off-chip DRAM
▶ Ideally we would like an intermediate level in between registers and main memory (e.g., a cache)
  . Need to predict what is kept in fast memory
  . Generally implemented with small on-chip/off-chip SRAM
Desk/Library Analogy
▶ How many books to check out at once?
▶ How long to keep books on the desk?
▶ Which books to return?
▶ How to find books on your desk?
Common And Predictable Memory Reference Patterns
[ Figure: address vs. time for common access patterns. Instruction fetches show n loop iterations; stack accesses show subroutine call/return and argument access; data accesses show vector accesses and scalar accesses ]
▶ Temporal Locality – if a location is referenced, it is likely to be referenced again in the near future
▶ Spatial Locality – if a location is referenced, it is likely that locations near it will be referenced in the near future
Real Memory Reference Patterns

[ Figure: memory address vs. time, one dot per access to that address at that time; regions show spatial locality, temporal locality, and both spatial and temporal locality ]

[ D.J. Hatfield and J. Gerald, "Program Restructuring for Virtual Memory," IBM Systems Journal, 10(3): 168–192, 1971 ]
Caches Exploit Both Types of Locality
Processor ↔ Small/Fast Memory (RF, SRAM) ↔ Big/Slow Memory (DRAM)

▶ Exploit temporal locality by remembering the contents of recently accessed locations
▶ Exploit spatial locality by fetching blocks of data around recently accessed locations
PARCv4 Multicore Systems
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
Inside a Cache
[ Figure: Processor ↔ Cache ↔ Main Memory, with address and data passed between each. Each cache line holds an address tag (e.g., 100, 304, 416, 6848) and a data block of several data bytes; the line tagged 100 holds copies of main memory locations 100, 101, ... ]
Basic Cache Algorithm for a Load
▶ Processor issues a load request to the cache
▶ Compare the request address to the cache tags to see if there is a match
  . Cache Hit (found in cache): return a copy of the data from the cache
  . Cache Miss (not in cache): read the block of data from main memory, replace a victim block in the cache with the new block, then return a copy of the data from the cache
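The load algorithm can be sketched as code for a direct-mapped cache. This is an illustrative model only; the class name and layout are assumptions, not the course's code:

```python
# Minimal direct-mapped cache model. Backing memory is a dict of
# address -> byte; each cache line is None or a (tag, block) pair.

class Cache:
    def __init__(self, num_lines, block_size, memory):
        self.num_lines = num_lines
        self.block_size = block_size
        self.memory = memory
        self.lines = [None] * num_lines

    def load(self, addr):
        """Return (data, hit); on a miss, fetch the block and replace."""
        block_addr = addr // self.block_size
        index = block_addr % self.num_lines        # which line to check
        tag = block_addr // self.num_lines
        line = self.lines[index]
        hit = line is not None and line[0] == tag  # tag match = cache hit
        if not hit:                                # miss: fetch from memory,
            block = [self.memory.get(block_addr * self.block_size + i, 0)
                     for i in range(self.block_size)]
            self.lines[index] = (tag, block)       # replace the victim line
        tag, block = self.lines[index]
        return block[addr % self.block_size], hit

cache = Cache(4, 8, {100: 42})
print(cache.load(100))  # (42, False) -- first access misses
print(cache.load(100))  # (42, True)  -- second access hits
```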
Classifying Caches
Processor ↔ Small/Fast Memory (RF, SRAM) ↔ Big/Slow Memory (DRAM)
▶ Block Placement – Where can a block be placed in the cache?
▶ Block Identification – How is a block found if it is in the cache?
▶ Block Replacement – Which block should be replaced on a miss?
▶ Write Strategy – What happens on a write?
Block Placement – Where place block in cache?
[ Figure: a cache with 8 blocks (block numbers 0–7; set numbers 0–3 when 2-way) and a memory with 32 blocks (block numbers 0–31) ]

Block 12 can be placed:
  . Fully Associative: anywhere
  . (2-way) Set Associative: anywhere in set 0 (12 mod 4)
  . Direct Mapped: only into block 4 (12 mod 8)
Block Identification – How find block in cache?
[ Figure: the block address is split into Tag | Index | Offset; the cache holds a valid bit (V), a tag, and a data block for each of Set 0–Set 3 ]

▶ Cache uses the index and offset to find a potential match, then checks the tag
▶ Tag check only includes the higher-order bits; the low-order bits are implicit
▶ In this example (direct-mapped, 8 B blocks, 4-line cache)
  . Number of bits in offset? (3, since 2^3 = 8 B per block)
  . Number of bits in index? (2, since 2^2 = 4 lines)
  . Number of bits in tag? (the remaining address bits)
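The tag/index/offset split can be computed directly. A sketch assuming byte addressing and power-of-two sizes; the function name is illustrative:

```python
from math import log2

def split_address(addr, block_size, num_lines, ways=1):
    """Split a byte address into (tag, index, offset) fields."""
    num_sets = num_lines // ways               # direct-mapped: sets == lines
    offset_bits = int(log2(block_size))
    index_bits = int(log2(num_sets))
    offset = addr & (block_size - 1)           # byte within the block
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)   # all remaining high bits
    return tag, index, offset

# Direct-mapped, 8 B blocks, 4 lines: 3 offset bits, 2 index bits
print(split_address(0b110110101, 8, 4))           # (13, 2, 5)
# 2-way, 8 B blocks, 4 lines: 2 sets, so only 1 index bit
print(split_address(0b110110101, 8, 4, ways=2))   # (27, 0, 5)
```

Note how increasing associativity moves bits from the index into the tag, since there are fewer sets to select among.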
Block Identification – How find block in cache?
[ Figure: the block address is split into Tag | Index | Offset; each of the two ways holds a valid bit (V), a tag, and a data block for Set 0 and Set 1 ]

▶ Cache checks all potential blocks with a fast parallel tag check
▶ In this example (2-way set-associative, 8 B blocks, 4-line cache)
  . Number of bits in offset? (3, since 2^3 = 8 B per block)
  . Number of bits in index? (1, since there are 2 sets)
  . Number of bits in tag? (the remaining address bits)
Block Replacement – Which block to replace?
▶ No choice in a direct-mapped cache
▶ In an associative cache, which block from the set should be evicted when the set becomes full?
▶ Random
▶ Least Recently Used (LRU)
  . LRU cache state must be updated on every access
  . True implementation only feasible for small sets (2-way)
  . Pseudo-LRU binary tree often used for 4–8 way
▶ First In, First Out (FIFO), aka Round-Robin
  . Used in highly associative caches
▶ Not Least Recently Used (NLRU)
  . FIFO with an exception for the most recently used block(s)
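True LRU for a single set can be sketched with a recency-ordered list. This is illustrative only; hardware keeps per-way age bits or a pseudo-LRU tree, not a list:

```python
# True-LRU bookkeeping for one cache set: least recently used tag
# first, most recently used last.

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = []                 # LRU order: oldest first, MRU last

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU tag if full."""
        if tag in self.tags:
            self.tags.remove(tag)      # hit: move tag to the MRU position
            self.tags.append(tag)
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)           # evict the least recently used tag
        self.tags.append(tag)          # fill with the new tag (now MRU)
        return False

s = LRUSet(2)
s.access("A"); s.access("B")           # two misses fill the 2-way set
s.access("A")                          # hit: A becomes MRU
print(s.access("C"))                   # miss: evicts B (the LRU) -- False
```

This list-walking cost on every access is exactly why true LRU is only feasible for small sets.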
Write Strategy – How are writes handled?
▶ Cache Hit
  . Write Through – write both cache and memory; generally higher traffic but simpler to design
  . Write Back – write cache only; memory is written when the block is evicted; a dirty bit per block avoids unnecessary write backs; more complicated
▶ Cache Miss
  . No Write Allocate – only write to main memory
  . Write Allocate – fetch the block into the cache, then write
▶ Common Combinations
  . Write Through & No Write Allocate
  . Write Back & Write Allocate
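The write-back & write-allocate combination can be sketched for a direct-mapped cache. A minimal model under assumed names and layout (each line is `None` or a `(tag, block, dirty)` tuple):

```python
# Write-back, write-allocate store: on a miss, write back the dirty
# victim, fetch the block (write allocate), then write the cache only
# and set the dirty bit. Memory is untouched until eviction.

def store(lines, memory, addr, value, block_size):
    block_addr = addr // block_size
    index = block_addr % len(lines)
    tag = block_addr // len(lines)
    line = lines[index]
    if line is None or line[0] != tag:                 # store miss
        if line is not None and line[2]:               # dirty victim:
            v_tag, v_block, _ = line                   # write it back
            base = (v_tag * len(lines) + index) * block_size
            for i, byte in enumerate(v_block):
                memory[base + i] = byte
        block = [memory.get(block_addr * block_size + i, 0)
                 for i in range(block_size)]           # write allocate
        line = (tag, block, False)
    tag, block, _ = line
    block[addr % block_size] = value                   # write cache only
    lines[index] = (tag, block, True)                  # mark line dirty

lines, memory = [None] * 4, {}
store(lines, memory, 100, 7, 8)
print(memory)          # {} -- write back defers the memory update
store(lines, memory, 4, 9, 8)   # same index, different tag: eviction
print(memory[100])     # 7 -- dirty victim was written back
```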
Agenda
Memory Technology
Motivation for Caches
Classifying Caches
Cache Performance
Processor Performance
Processor ↔ Cache ↔ Main Memory (hits served by the cache, misses go to main memory)

[ Figure: pipeline diagram for 100 iterations of a loop (lw, addu, lw, addu, beq, subu); a cache miss on a lw stalls in the M stage for several cycles, stalling the following instructions ]

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
What is time / program for this simple example?
Average Memory Access Time
Processor ↔ Cache ↔ Main Memory (hits served by the cache, misses go to main memory)

[ Figure: the same pipeline diagram – 100 loop iterations with a cache miss stalling the pipeline ]
Avg Mem Access Time = Hit Time + ( Miss Rate × Miss Penalty )
What is average memory access time for this simple example?
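Plugging numbers into the formula makes the trade-off concrete. The values below (1-cycle hit time, 5% miss rate, 40-cycle miss penalty) are hypothetical, not from the example above:

```python
# Average memory access time = hit time + miss rate x miss penalty

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Hypothetical: 1-cycle hit, 5% miss rate, 40-cycle miss penalty
print(amat(1, 0.05, 40))   # 3.0 cycles on average
```

Even a small miss rate can triple the average access time when the miss penalty is large, which is why reducing miss rate and miss penalty both matter.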
Categorizing Misses: The Three C’s
▶ Compulsory – first reference to a block; these occur even with an infinite cache
▶ Capacity – the cache is too small to hold all the data needed by the program; these occur even under a perfect replacement policy (e.g., a loop over 5 cache lines with a 4-line cache)
▶ Conflict – misses that occur because of collisions due to less-than-full associativity (e.g., a loop over 3 cache lines that map to the same set)
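The three categories can be separated by simulation in the standard way: compulsory misses are the misses of an infinite cache, capacity misses are the extra misses of a fully associative LRU cache of the target size, and conflict misses are whatever remains. A sketch (illustrative, not the course's simulator):

```python
def count_misses(trace, num_lines=None):
    """LRU miss count over a block trace; num_lines=None = infinite cache."""
    resident, misses = [], 0
    for block in trace:
        if block in resident:
            resident.remove(block)         # hit: refresh LRU position
        else:
            misses += 1
            if num_lines is not None and len(resident) == num_lines:
                resident.pop(0)            # evict least recently used
        resident.append(block)             # block is now MRU
    return misses

def direct_mapped_misses(trace, num_lines):
    lines, misses = [None] * num_lines, 0
    for block in trace:
        index = block % num_lines          # blocks may collide on an index
        if lines[index] != block:
            misses += 1
            lines[index] = block
    return misses

trace = [0, 4, 8, 0, 4, 8] * 2             # 3 blocks colliding in 4 lines
compulsory = count_misses(trace)                    # infinite cache
capacity = count_misses(trace, 4) - compulsory      # fully assoc., 4 lines
conflict = direct_mapped_misses(trace, 4) - compulsory - capacity
print(compulsory, capacity, conflict)      # 3 0 9
```

The 3 colliding blocks fit easily in a fully associative 4-line cache (no capacity misses), yet every access after the first three misses in the direct-mapped cache: pure conflict misses.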
Summary
▶ Technology constraints lead to a fundamental trade-off between memory size (capacity) and memory latency
▶ Creating a memory hierarchy enables both low latency and high capacity, IF the higher levels of the hierarchy store data that is reused by the processor
  . Temporal Locality – refers to likely reuse in time
  . Spatial Locality – refers to likely reuse in space
▶ Caches exploit temporal and spatial locality, and involve a variety of different design decisions
  . Block Placement – Where can a block be placed in the cache?
  . Block Identification – How is a block found if it is in the cache?
  . Block Replacement – Which block should be replaced on a miss?
  . Write Strategy – What happens on a write?
▶ Average Memory Access Time – a simple metric for evaluating trade-offs
Acknowledgements
Some of these slides contain material developed and copyrighted by:
Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB)

MIT material derived from course 6.823; UCB material derived from courses CS152 and CS252