Chapter 7: Memory System Design
7-1 Chapter 7—Memory System Design
Computer Systems Design and Architecture by V. Heuring and H. Jordan © 1997 V. Heuring and H. Jordan
Chapter 7: Memory System Design
Topics
Introduction: The Components of the Memory System
The CPU - Memory Interface
Memory Structure
Memory Boards and Modules
Two-Level Memory Hierarchy
The Cache
Virtual Memory
Components of a Computer System
• CPU
• Memory System
• Input/Output System
Memory System
• The memory system houses the information that is processed by the computer system:
• Data - variables, images, documents, etc.
• Programs - operating system, applications
• The memory system has many levels and categories:
• CPU registers, cache, main memory, disk, tape
• Volatile, Non-volatile
• Speed
• Access pattern
• Capacity
Tbl 7.3 The Memory Hierarchy, Cost, and Performance
Some typical values:

Component        CPU (registers)    Cache      Main memory  Disk memory  Tape memory
Access pattern   Random             Random     Random       Direct       Sequential
Capacity, bytes  64–1024            8–1024 KB  8–64 MB      1–10 GB      1 TB
Latency          1–10 ns            20 ns      50 ns        10 ms        10 ms–10 s
Block size       1 word             16 words   16 words     4 KB         4 KB
Bandwidth        System clock rate  8 MB/s     1 MB/s       1 MB/s       1 MB/s
Cost/MB          High               $500       $30          $0.25        $0.02
The Main Memory System
• The system’s disk space and tape backup facility are part of the memory system
• Information storage (programs, documents, etc.)
• Increasing the amount of storage space available to programs (virtual memory)
• However, for this discussion, we will concentrate on the construction and operation of the physical memory in the system (cache, main memory)
• Interfacing memory to the CPU
• Memory organization
• Implementation
Components of the Main Memory System
• The main memory system typically consists of two types of memory
• ROM - read only memory - used to hold the non-volatile part of the system information
• RAM - random access memory - used to hold the system information that can be lost when the system powers down
• The term RAM for the volatile portion of the memory is somewhat of a misnomer, since ROM is also random access; nevertheless, that is the term that has become standard
• Every system must contain some amount of non-volatile ROM to tell the CPU what to do when it first powers up
• ROM contains the entire program that the CPU executes - RAM is used only for temporary data storage (dedicated microcontrollers)
• ROM holds a small program that loads the main program (an application or operating system) into RAM and executes it - called a “boot ROM” or “bootstrap” program (complex embedded systems, desktop systems, mainframes)
Tbl 7.1 Some Memory Properties

Symbol   Definition                          Intel 8088     Intel 8086     PowerPC 601
w        CPU word size                       16 bits        16 bits        64 bits
m        Bits in a logical memory address    20 bits        20 bits        32 bits
s        Bits in smallest addressable unit   8 bits         8 bits         8 bits
b        Data bus size                       8 bits         16 bits        64 bits
2^m      Memory word capacity, s-sized wds   2^20 words     2^20 words     2^32 words
2^m × s  Memory bit capacity                 2^20 × 8 bits  2^20 × 8 bits  2^32 × 8 bits
Big-Endian and Little-Endian Storage
0xABCD: AB = most significant byte, CD = least significant byte

            Little-Endian    Big-Endian
Address 0:  CD               AB
Address 1:  AB               CD
• When data types having a word size larger than the smallest addressable unit are stored in memory (e.g., a 32-bit floating-point number stored in a memory addressed in 16-bit words), there are two ways to do it:
• Store the least significant part of the word at the lowest address (little-Endian, little end first)
• Store the most significant part of the word at the lowest address (big-Endian, big end first)
• Example: The hexadecimal 16-bit number 0xABCD, stored at address 0 in a byte addressable memory:
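The 0xABCD example can be checked with Python's struct module (a quick sketch; the "<" format prefix packs little-endian, ">" packs big-endian):

```python
import struct

value = 0xABCD  # most significant byte = 0xAB, least significant byte = 0xCD

little = struct.pack("<H", value)  # little-endian: little end (0xCD) first
big = struct.pack(">H", value)     # big-endian: big end (0xAB) first

print(little.hex())  # cdab -> address 0 holds 0xCD, address 1 holds 0xAB
print(big.hex())     # abcd -> address 0 holds 0xAB, address 1 holds 0xCD
```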
Tbl 7.2 Memory Performance Parameters

Symbol             Definition         Units       Meaning
t_a                Access time        time        Time to access a memory word
t_c                Cycle time         time        Time from start of access to start of next access
k                  Block size         words       Number of words per block
                   Bandwidth          words/time  Word transmission rate
t_l                Latency            time        Time to access first word of a sequence of words
t_bl = t_l + k/bw  Block access time  time        Time to access an entire block of words
                   (bw = bandwidth)

(Information is often stored and moved in blocks at the cache and disk level.)
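As a worked sketch of the block-access-time formula above (the numbers here are hypothetical, chosen only for illustration):

```python
def block_access_time(t_l, k, bandwidth):
    """Time to access a k-word block: t_bl = t_l + k / bandwidth.

    t_l       -- latency (seconds) to reach the first word
    k         -- block size in words
    bandwidth -- transfer rate in words per second
    """
    return t_l + k / bandwidth

# Hypothetical values: 50 ns latency, 1e8 words/s (10 ns/word), 16-word block.
t_bl = block_access_time(50e-9, 16, 1e8)
print(t_bl)  # 2.1e-07 -> 210 ns: 50 ns latency + 160 ns of transfer
```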
Fig 7.1 The CPU–Memory Interface
Sequence of events:
Read:
1. CPU loads MAR, issues Read, and REQUEST
2. Main memory transmits words to MDR
3. Main memory asserts COMPLETE
Write:
1. CPU loads MAR and MDR, asserts Write, and REQUEST
2. Value in MDR is written into address specified by MAR
3. Main memory asserts COMPLETE
[Figure: the CPU's register file, MAR (m bits), and MDR (w bits) connect to main memory (addresses 0 to 2^m − 1, s bits each) over an m-bit address bus A0–A(m−1), a b-bit data bus D0–D(b−1), and the control lines R/W, REQUEST, and COMPLETE.]
Fig 7.1 The CPU–Memory Interface (cont’d.)
Additional points:
• If b < w, main memory must make w/b b-bit transfers
• Some CPUs allow reading and writing of word sizes < w
  Example: Intel 8088: m = 20, w = 16, s = b = 8; 8- and 16-bit values can be read and written
• If memory is sufficiently fast, or if its response is predictable, then COMPLETE may be omitted
• Some systems use separate R and W lines, and omit REQUEST
Specific Example - the SRC Memory Interface
[Figure: the SRC CPU and the memory subsystem are connected by R, W, ADDR, DATA, Done, and MEMStrobe lines.]
• Memory Read operation
  • CPU places address on ADDR bus and asserts Read line
  • Memory places valid data on the DATA bus and asserts Done and MEMStrobe
• Memory Write operation
  • CPU places address on ADDR bus, data on the DATA bus, and asserts Write line
  • Memory stores the data and asserts Done
SRC Memory Read Bus Cycle
[Timing diagram: System Clock, R, W, ADDR, DATA, Done, MEMStrobe. The CPU drives a valid address at the start of the read, the memory returns valid data to the CPU, and Done/MEMStrobe mark the end; the operation takes 1 to n clock cycles.]
SRC Memory Write Bus Cycle
[Timing diagram: System Clock, R, W, ADDR, DATA, Done, MEMStrobe. The CPU drives a valid address and valid data at the start of the write, and Done marks the end; the operation takes 1 to n clock cycles.]
Wait States
• Memory operations can take from 1 to n clock cycles
• If the memory is fast enough to perform the operation in a single clock cycle, the COMPLETE signal (Done and MEMStrobe in the SRC) can be asserted in the same clock cycle
• If the memory can not complete the request in a single clock cycle, it simply continues working on it and delays sending the COMPLETE signal.
• If the memory operation takes n > 1 clock cycles to complete, the memory is said to have inserted n − 1 “wait states” into the operation
• n = 1 is a zero-wait-state access - this is usually not possible given the increasing speed of processors vs. the speed of memory
SRC Wait State Examples
• Zero wait states (n = 1) [timing diagram]
• One wait state (n = 2) [timing diagram]
More SRC Wait State Examples
• Three wait states (n = 4) [timing diagram]
• Many wait states [timing diagram]
Addressing Memory - Decoding the Chip Selects
• The main memory system is typically made up of “blocks” of memory implemented as physically separate units
• Each block must be activated based on its portion of the address space of the CPU - this is accomplished using individual Chip Selects (CS) to activate each block
• Activation of the CS is accomplished by decoding the address from the CPU - one of the functions of the memory controller
[Figure: the memory controller sits between the SRC CPU (R, W, ADDR, DATA, Done, MEMStrobe) and the ROM, RAM1, and RAM2 blocks, decoding the address to drive each block's CS line.]
Building an Address Decoder
• Determine how you want the various ROM and RAM memory modules arranged in the CPU’s address space
• Calculate the starting and ending address for each block of memory
• Given a starting address of s and a memory block size of k, the ending address, e = s + (k-1)
• If the ending address of a previous memory block is e, the starting address of the next block, s = e+1
• Calculate the number of lower address bits that must be connected to the block to address the memory within it
• Determine the unique pattern on the upper address bits that will signify activity on that block
• Construct logic that will activate the CS for that block given the above pattern
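The start/end arithmetic above can be sketched in a few lines (the block names and sizes are the ones used in the next example; addresses are word addresses):

```python
# Lay out contiguous memory blocks and compute each block's address range
# using e = s + (k - 1), with the next block starting at e + 1.
blocks = [("ROM", 32 * 1024), ("RAM1", 16 * 1024), ("RAM2", 16 * 1024)]

start = 0x00000000
for name, size in blocks:
    end = start + (size - 1)          # e = s + (k - 1)
    print(f"{name}: 0x{start:08X}-0x{end:08X}")
    start = end + 1                   # next block starts at e + 1

# ROM: 0x00000000-0x00007FFF
# RAM1: 0x00008000-0x0000BFFF
# RAM2: 0x0000C000-0x0000FFFF
```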
Building an Address Decoder - An Example
• Consider a processor with a 32-bit address bus which uses word addressing, connected to the following memory:
• 32K word EEPROM starting at address 0x00000000
• 16K word RAM module above the EEPROM
• a second 16K word RAM module above the first
Memory Map (word addresses):
• 32K ROM block: 0x00000000 - 0x00007FFF (32K = 2^15, so 15 bits address memory within this block)
• 16K RAM block: 0x00008000 - 0x0000BFFF (16K = 2^14, so 14 bits address memory within this block)
• 16K RAM block: 0x0000C000 - 0x0000FFFF (16K = 2^14, so 14 bits address memory within this block)
Address Decoding Example (cont.)
• First block of ROM uses bits (14:0) to address it
  • Starts at address 0x00000000 and goes to 0x00007FFF
  • Selected when all bits above (14), i.e., bits (31:15), are ‘0’
• First block of RAM uses bits (13:0) to address it
  • Starts at address 0x00008000 and goes to 0x0000BFFF
  • Selected when bits (31:16) are ‘0’ and bits (15:14) are “10”
• Second block of RAM uses bits (13:0) to address it
  • Starts at address 0x0000C000 and goes to 0x0000FFFF
  • Selected when bits (31:16) are ‘0’ and bits (15:14) are “11”
[Figure: decode logic — ADDR(31:15) are NORed to form CS_ROM1, with ADDR(14:0) going to ROM1's address lines; ADDR(31:16) together with ADDR(15) and ADDR(14) are gated to form CS_RAM1 and CS_RAM2, with ADDR(13:0) going to each RAM's address lines.]
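The chip-select decode can be sketched behaviorally (a hypothetical helper of my own, not from the text; bit numbering as above):

```python
def chip_selects(addr):
    """Return which chip select is active for a 32-bit word address."""
    if (addr >> 15) == 0:                  # bits 31:15 all zero -> ROM
        return "CS_ROM1"
    if (addr >> 16) == 0:                  # bits 31:16 zero; examine bits 15:14
        if (addr >> 14) & 0b11 == 0b10:
            return "CS_RAM1"
        if (addr >> 14) & 0b11 == 0b11:
            return "CS_RAM2"
    return None                            # unmapped address

print(chip_selects(0x00007FFF))  # CS_ROM1
print(chip_selects(0x00008000))  # CS_RAM1
print(chip_selects(0x0000C000))  # CS_RAM2
```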
Another Address Decoding Example
• Decoding is more complex when a larger block sits above a smaller one
• Consider a processor with a 16-bit address bus which uses word addressing, connected to the following memory:
• 2K word EEPROM starting at address 0x00000000
• 16K word RAM module above the EEPROM
Memory Map (word addresses):
• 2K ROM block: 0x0000 - 0x07FF (2K = 2^11, so 11 bits address memory within this block)
• 16K RAM block: 0x0800 - 0x47FF (16K = 2^14, so 14 bits address memory within this block)
Another Decoding Example (cont.)
• Selecting the first ROM is simple
  • Selected when all bits above ADDR(10) are ‘0’
• Selecting (and addressing within) the RAM is more complex
  • The starting address of the RAM is 0x0800 = 0b 0000 1000 0000 0000
  • However, ADDR(13:0) are used to address within the RAM, and they must be ‘0’ to address the first RAM location
  • We would have to subtract 0b 1000 0000 0000 from the ADDR(13:0) value to get the proper address within the RAM - too complex
• Solution - adjust the memory map so that all blocks are “aligned” on the size of the largest block
Adjusted Memory Map (word addresses):
• 2K ROM block: 0x0000 - 0x07FF (2K = 2^11, so 11 bits address memory within this block)
• 14K block unused: 0x0800 - 0x3FFF
• 16K RAM block: 0x4000 - 0x7FFF (16K = 2^14, so 14 bits address memory within this block)
Another Decoding Example (cont.)
• First block of ROM uses bits (10:0) to address it
  • Starts at address 0x0000 and goes to 0x07FF
  • Selected when all bits above (10) are ‘0’
• First block of RAM uses bits (13:0) to address it
  • Starts at address 0x4000 and goes to 0x7FFF
  • Selected when bit (15) is ‘0’ and bit (14) is ‘1’
[Figure: decode logic — ADDR(15:11) are NORed to form CS_ROM1, with ADDR(10:0) going to ROM1's address lines; ADDR(15) and ADDR(14) are gated to form CS_RAM1, with ADDR(13:0) going to RAM1's address lines.]
Additional Functions of the Memory Controller
• As just discussed, the memory controller must decode the addresses from the CPU and select the proper memory block
• In addition, the memory controller must provide the proper handshaking between the CPU and the memory
• Most memory chips are asynchronous (they don't work on a clock) and have no outputs that indicate when the data is valid - you simply have to apply the correct inputs and wait the proper amount of time
• Some memory chips require a sequence of signals to be applied for different operations
• Each CPU type has a different set of control signals and a different bus cycle
• Finally, some memory controllers are required to perform complex address calculations to translate addresses from the CPU into a different format that the memory system requires
Fig 7.3 Conceptual Structure of a Memory Cell
Regardless of the technology, all RAM memory cells must provide these four functions: Select, DataIn, DataOut, and R/W.
This “static” RAM cell is unrealistic in practice, but it is functionally correct. We will discuss more practical designs later.
Fig 7.4 An 8-Bit Register as a 1-D RAM Array
Data bus is bidirectional and buffered
The entire register is selected with one select line, and uses one R/W line
[Figure: eight D cells d0–d7 share one Select line and one R/W line; each cell's DataIn/DataOut connects to one line of the bidirectional data bus.]
Fig 7.5 A 4 x 8 2-D Memory Cell Array
• R/W is common to all cells
• A 2-bit address drives the 2–4 line decoder, which selects one of the four 8-bit arrays
• Bidirectional 8-bit buffered data bus
[Figure: a 4 × 8 array of D cells (columns d0–d7); a 2–4 decoder on address bits A1, A0 selects one of the four rows.]
• This type of N-to-2^N decoder requires 2^N gates, which is practical for only very small arrays
Fig 7.6 A 64 K x 1 Static RAM Chip
• A roughly square array fits the IC design paradigm
• Selecting rows separately from columns means only 256 × 2 = 512 decoding/selection circuit elements instead of 65,536!
• CS, Chip Select, allows chips in arrays to be selected individually
• This chip requires 21 pins including power and ground, and so will fit in a 22-pin package.
[Figure: a 256 × 256 cell array; row address A0–A7 drives an 8–256 row decoder, column address A8–A15 drives a 256–1 mux and a 1–256 demux, and R/W and CS control the single data bit.]
This chip requires 24 pins including power and ground, and so will require a 24-pin package. Package size and pin count can dominate chip cost.
Fig 7.7 A 16 K x 4 SRAM Chip
There is little difference between this chip and the previous one, except that there are 4 64–1 multiplexers instead of 1 256–1 multiplexer.
[Figure: four 64 × 256 cell arrays; row address A0–A7 drives an 8–256 row decoder, column address A8–A13 drives 4 64–1 muxes and 4 1–64 demuxes, and R/W and CS control the 4-bit data path.]
Fig 7.8 Matrix and Tree Decoders
• 2-level decoders are limited in size because of gate fan-in. Most technologies limit fan-in to ~8.
• When decoders must be built with fan-in > 8, additional levels of gates are required.
• Tree and matrix decoders are two ways to design decoders with large fan-in:
  • 3-to-8 line tree decoder constructed from 2-input gates
  • 4-to-16 line matrix decoder constructed from 2-input gates
[Figures: a tree decoder built from a 2–4 decoder and 2-input gates produces m0–m7 from x0–x2; a matrix decoder built from two 2–4 decoders produces m0–m15 from x0–x3.]
Fig 7.9 Six-Transistor Static RAM Cell
This is a more practical design than the 8-gate design shown earlier.
A value is read by precharging the bit lines to a value 1/2 way between a 0 and a 1, while asserting the word line. This allows the latch to drive the bit lines to the value stored in the latch.
[Figure: the storage cell (a cross-coupled latch with +5 V active loads) sits on word line w_i; switches controlling access to the cell connect it to dual-rail bit lines b_i and b̄_i for reading and writing, which run through sense/write amplifiers — gated by the column select (from the column address decoder), CS, and R/W — to data line d_i. Additional cells share the bit lines.]
Fig 7.10 Static RAM Read Operation
Access time from Address—the time required of the RAM array to decode the address and provide value to the data bus.
[Timing diagram: Memory address, Read/write, CS, and Data; t_AA runs from address valid to data valid.]
Fig 7.11 Static RAM Write Operations
Write time—the time the data must be held valid in order to decode address and store value in memory cells.
[Timing diagram: Memory address, Read/write, CS, and Data; t_w is the write time, and data is latched on the rising edge of R/W.]
Fig 7.12 Dynamic RAM Cell Organization
Write: place value on bit line and assert word line.
Read: precharge bit line, assert word line, sense value on bit line with sense/write amplifier.
The capacitor will discharge in 4–15 ms. Refresh the capacitor by reading (sensing) the value on the bit line, amplifying it, and placing it back on the bit line, where it recharges the capacitor.
This need to refresh the storage cells of dynamic RAM chips complicates DRAM system design.
[Figure: a switch on word line w_j controls access to the cell, connecting the storage capacitor (charge = 1, no charge = 0) to a single bit line b_i, which runs through sense/write amplifiers — gated by the column select, CS, and R/W — to data line d_i. Additional cells share the bit line.]
Fig 7.13 Dynamic RAM Chip Organization
• Addresses are time-multiplexed on the address bus using RAS and CAS as strobes of rows and columns.
• CAS is normally used as the CS function.
• Notice pin counts:
  • Without address multiplexing: 27 pins including power and ground.
  • With address multiplexing: 17 pins including power and ground.
[Figure: a 1024 × 1024 cell array; A0–A9 feed row latches and a decoder (strobed by RAS) and 10 column address latches with 1–1024 muxes and demuxes (strobed by CAS); 1024 sense/write amplifiers and column latches sit between the array and the column logic; control logic handles RAS, CAS, and R/W; data in d_i, data out d_o.]
Figs 7.14, 7.15 DRAM Read and Write Cycles
Typical DRAM Read operation; typical DRAM Write operation.
Notice that it is the bit line precharge operation that causes the difference between access time and cycle time.
t_DHR is the data hold time from RAS.
[Timing diagrams: the memory address lines carry the row address (strobed by RAS) and then the column address (strobed by CAS); the read cycle shows access time t_A, cycle time t_C, t_RAS, and precharge time t_prechg; the write cycle shows the data hold time from RAS, t_DHR.]
DRAM Refresh and Row Access
• Refresh is usually accomplished by a “RAS-only” cycle. The row address is placed on the address lines and RAS asserted. This refreshes the entire row. CAS is not asserted. The absence of a CAS phase signals the chip that a row refresh is requested, and thus no data is placed on the external data lines.
• Many chips use “CAS before RAS” to signal a refresh. The chip has an internal counter, and whenever CAS is asserted before RAS, it is a signal to refresh the row pointed to by the counter, and to increment the counter.
• Most DRAM vendors also supply one-chip DRAM controllers that encapsulate the refresh and other functions.
• Page mode, nibble mode, and static column mode allow rapid access to the entire row that has been read into the column latches.
• Video RAMs (VRAMs) clock an entire row into a shift register, where it can be rapidly read out, bit by bit, for display.
ROM (Non-Volatile) Memory
[Fig 7.16 A 2-D CMOS ROM Chip: a row decoder driven by the address and CS selects a row, and the pattern stored on that row (e.g., 1 0 1 0) appears on the outputs.]
Tbl 7.4 ROM Types

ROM Type             Cost              Programmability    Time to Program  Time to Erase
Mask-programmed ROM  Very inexpensive  At factory only    Weeks            N/A
PROM                 Inexpensive       Once, by end user  Seconds          N/A
EPROM                Moderate          Many times         Seconds          20 minutes
Flash EPROM          Expensive         Many times         100 µs           1 s, large block
EEPROM               Very expensive    Many times         100 µs           10 ms, byte
Memory Boards and Modules
• There is a need for memories that are larger and wider than a single chip
• Chips can be organized into “boards”
  • Boards may not be actual, physical boards, but may consist of structured chip arrays present on the motherboard
• A board or collection of boards makes up a memory module
• Memory modules:
  • Satisfy the processor–main memory interface requirements
  • May have DRAM refresh capability
  • May expand the total main memory capacity
  • May be interleaved to provide faster access to blocks of words
This is a slightly different view of the memory chip than the previous ones.
Multiple chip selects ease the assembly of chips into chip arrays; the combined select is usually provided by an external AND gate.
Fig 7.17 General Structure of a Memory Chip
[Figure: a memory cell array with a row decoder and an I/O multiplexer; the chip takes an m-bit Address, s data lines, R/W, and multiple chip selects (CS) that must all be asserted to enable the chip.]
Fig 7.18 Word Assembly from Narrow Chips
p chips expand the word size from s bits to p × s bits.
All chips have common CS, R/W, and Address lines.
[Figure: p chips in a row; Select drives every chip's CS, the Address lines and R/W fan out to all chips, and each chip supplies s bits of the p × s-bit data word.]
Fig 7.19 Increasing the Number of Words by a Factor of 2^k
The additional k address bits are used to select one of 2^k chips, each of which has 2^m words.
Word size remains at s bits.
[Figure: the k high-order address bits drive a k-to-2^k decoder whose outputs feed the chip selects; the m low-order address bits and R/W go to every chip, and all chips share the s-bit data lines.]
Fig 7.20 Chip Matrix Using Two Chip Selects
Multiple chip select lines are used to replace the last level of gates in this matrix decoder scheme.
This scheme simplifies the decoding from use of a (q + k)-bit decoder to using one q-bit and one k-bit decoder.
[Figure: a matrix of chips; a horizontal decoder on q address bits drives the CS1 inputs and a vertical decoder on k address bits drives the CS2 inputs, together selecting one of 2^(m+q+k) s-bit words; the m remaining address bits and R/W go to every chip.]
Fig 7.21 Three-Dimensional Dynamic RAM Array
• CAS is used to enable the top decoder in the decoder tree.
• Use one 2-D array for each bit; each 2-D array is on a separate board.
[Figure: w boards, one per data bit; the multiplexed address (m/2 bits) feeds 2^kr and 2^kc decoders, with CAS enabling the top decoder of the tree; RAS, CAS, R/W, Address, and Data lines fan out to every board.]
Fig 7.22 A Memory Module and Its Interface
Must provide:
• Read and Write signals
• Ready: memory is ready to accept commands
• Address — to be sent with Read/Write command
• Data — sent with Write or available upon Read when Ready is asserted
• Module select — needed when there is more than one module
Control signal generator: for SRAM, just strobes data on Read and provides Ready on Read/Write.
For DRAM, it also provides CAS, RAS, and R/W, multiplexes the address, generates refresh signals, and provides Ready.
[Figure: the module's bus interface — the module select (k bits) and address (k + m bits) feed chip/board selection and an address register; a control signal generator drives Read, Write, and Ready; a data register buffers the w-bit data path to the memory boards and/or chips.]
Fig 7.23 Dynamic RAM Module with Refresh Control
[Figure: a refresh clock and control block arbitrates (Request/Grant/Refresh) with the memory timing generator; an address multiplexer chooses between the refresh counter and the halves of the address register (m/2 bits each) to drive the dynamic RAM array's address lines along with RAS, CAS, and R/W; board and chip selects and a data register complete the module.]
Memory System Performance
Breaking the memory access process into steps:
For all accesses:
• transmission of address to memory
• transmission of control information to memory (R/W, Request, etc.)
• decoding of address by memory
For a Read:
• return of data from memory
• transmission of completion signal
For a Write:
• transmission of data to memory (usually simultaneous with address)
• storage of data into memory cells
• transmission of completion signal
The next slide shows the access process in more detail.
Fig 7.26 Sequence of Steps in Accessing Memory
A “hidden refresh” cycle is shown; a normal cycle would exclude the pending refresh step.
[Figure: (a) static RAM — the command, address, and (for a write) data go to memory, the address is decoded, data is written or returned, and Complete is signaled; access time t_a and cycle time t_c are marked. (b) dynamic RAM — row address with RAS, then column address with CAS, the read or write, a possible pending refresh, and precharge before the next cycle can begin.]
Example SRAM Timings
Approximate values for static RAM Read timing:
• Address bus driver turn-on time: 40 ns
• Bus propagation and bus skew: 10 ns
• Board select decode time: 20 ns
• Time to propagate select to another board: 30 ns
• Chip select: 20 ns
Propagation time for address and command to reach the chip: 120 ns
• On-chip memory read access time: 80 ns
• Delay from chip to memory board data bus: 30 ns
• Bus driver and propagation delay (as before): 50 ns
Total memory read access time: 280 ns
Moral: 70 ns chips do not necessarily provide 70 ns access time!
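The two totals can be verified with a quick sum (the dictionary keys paraphrase the delay names above):

```python
# Delays on the outbound path: address and command from CPU to chip (ns).
outbound = {
    "address bus drivers turn-on": 40,
    "bus propagation and skew": 10,
    "board select decode": 20,
    "propagate select to other board": 30,
    "chip select": 20,
}

# Delays on the chip itself and the return path (ns).
on_chip_and_return = {
    "on-chip read access": 80,
    "chip to board data bus": 30,
    "bus driver and propagation": 50,
}

to_chip = sum(outbound.values())
total = to_chip + sum(on_chip_and_return.values())
print(to_chip)  # 120
print(total)    # 280
```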
Cache Memory
• Caches exploit two properties of program behavior:
  • Spatial locality: the property that if a given memory location is referenced, those locations physically near it are likely to be referenced “soon”
  • Temporal locality: the property that if a given memory location is referenced, it is likely to be referenced again, “soon”
• The solution is to place a small amount of very fast memory “close” to the CPU and use the above rules to keep copies of memory contents that are likely to be used in the future there
[Figure: CPU ↔ Cache (small, fast: 1 to 2 clock cycles; usually on the same chip as the CPU) ↔ Main Memory (large, slower: 10 to 100 clock cycles) ↔ Disk, etc. (very large, very slow: > 1000s of clock cycles).]
Cache Memory
• Caches exploit spatial and temporal locality by performing two functions:
• When a given word is requested from memory by the CPU, the cache fetches not only that word from memory, but also other words near it (spatial locality)
• The total group of memory words fetched by the cache is called a cache “block”
• The cache stores words (and their associated blocks) that have been recently requested by the CPU in case they are needed again (temporal locality)
• When a new block is requested, if the cache is full, it must decide which old block must be removed (placed back in main memory) to make room for the new block
[Figure: the CPU issues an address; the mapping function locates the word's block in the cache, and the cache exchanges blocks with main memory.]
Fig 7.30 The Cache Mapping Function
The cache mapping function is responsible for all cache operations:• Placement strategy: where to place an incoming block in the cache• Replacement strategy: which block to replace upon a miss• Read and write policy: how to handle reads and writes upon cache misses
Mapping function must be implemented in hardware
Three different types of mapping functions:
• Associative
• Direct mapped
• Block-set associative
Example: a 256 KB cache with 16-word blocks, backed by a 32 MB main memory
7-55 Chapter 7—Memory System Design
Cache Performance—Hits and Misses
Hit: the word requested by the CPU was found in the cache.
Miss: the word requested by the CPU was not found in the cache. (A miss results in a request to main memory for the block containing the word.)
Hit ratio (or hit rate): h = (number of hits) / (total number of references)
Miss ratio: 1 − h
tc = cache memory access time; tm = main memory access time
Access time: ta = h · tc + (1 − h) · tm
Example:
Cache access time = 2 clock cycles, main memory access time = 40 clock cycles, hit ratio = 95%:
ta = 0.95 × 2 + (1.0 − 0.95) × 40 = 3.9 clock cycles
Faster than main memory alone by a factor of 10!
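The access-time formula lends itself to a quick check in code. A minimal sketch using the slide's example numbers (illustrative only):

```python
def effective_access_time(h, t_cache, t_main):
    """ta = h*tc + (1 - h)*tm: hits served by the cache, misses by main memory."""
    return h * t_cache + (1 - h) * t_main

# The slide's numbers: 2-cycle cache, 40-cycle main memory, 95% hit ratio.
ta = effective_access_time(0.95, 2, 40)   # 3.9 clock cycles
```

Note how sensitive ta is to h: dropping the hit ratio to 90% nearly doubles the miss contribution.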
7-56 Chapter 7—Memory System Design
Memory Fields and Address Translation for Caches
Example of a processor-issued 32-bit address (bits 31…0, 32 bits wide):
• That same 32-bit address can be partitioned into two fields, a block field, and a word field
• The word field gives the offset of the word within the block specified in the block field
• If the block size is N words, the word field is log₂N bits
• The block number serves to uniquely identify the block within the cache (called the tag)
Block number (26 bits) | Word (6 bits)
With 2²⁶ blocks of 64 words each, the address 00…001001 001011 is a specific memory reference: block 9, word 11.
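The field split can be expressed with shifts and masks. A minimal sketch, assuming the 26-bit block / 6-bit word layout above:

```python
BLOCK_BITS = 6                       # 64-word blocks -> log2(64) = 6-bit word field

def split_address(addr):
    """Split an address into (block number, word within block)."""
    return addr >> BLOCK_BITS, addr & ((1 << BLOCK_BITS) - 1)

addr = (9 << BLOCK_BITS) | 11        # the slide's reference: block 9, word 11
block, word = split_address(addr)
```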
7-57 Chapter 7—Memory System Design
Associative mapped cache model: any block from main memory can be put anywhere in the cache. Assume a 16-bit main memory.*
*16 bits, while unrealistically small, simplifies the examples.
Fig 7.31 Associative Cache
[Figure: a 256-line cache (cache blocks 0–255, one 8-byte line each) with a tag memory holding a 13-bit tag and a 1-bit valid bit per line, and a main memory of 8192 blocks. Any MM block—0, 1, 2, 119, 421, …, 8191—may occupy any cache line; the 16-bit main memory address splits into a 13-bit tag field and a byte field.]
7-58 Chapter 7—Memory System Design
Fig 7.32 Associative Cache Mechanism
Because any block can reside anywhere in the cache, an associative (content-addressable) memory is used: all locations are searched simultaneously.
[Figure: the tag field of the main memory address is loaded into an argument register and compared against every entry of the associative tag memory at once; on a match with the valid bit set, a selector gates the addressed byte of the matching 8-byte cache line to the CPU.]
7-59 Chapter 7—Memory System Design
Advantages and Disadvantages of the Associative Mapped Cache
Advantage
• Most flexible of all—any MM block can go anywhere in the cache.
Disadvantages
• Large tag memory.
• The need to search the entire tag memory simultaneously means lots of hardware.
Replacement Policy is an issue when the cache is full. –more later–
Q.: How is an associative search conducted at the logic gate level?
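Leaving the gate-level question aside, the behavior of an associative lookup can be modeled in software. This sketch uses a Python dict as a stand-in for the parallel tag search; the eviction choice is a placeholder, not a real policy, and all names are illustrative:

```python
# Behavioral model only: real hardware compares every tag in parallel;
# a dict keyed by tag gives the same hit/miss answer in software.
class AssociativeCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}                  # tag -> cached block

    def lookup(self, tag):
        return self.lines.get(tag)       # None signals a miss

    def fill(self, tag, block):
        if tag not in self.lines and len(self.lines) >= self.num_lines:
            victim = next(iter(self.lines))   # placeholder eviction, not a real policy
            del self.lines[victim]
        self.lines[tag] = block

cache = AssociativeCache(num_lines=256)
cache.fill(421, "block-421")             # MM block 421 may go in any line
```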
Direct-mapped caches simplify the hardware by allowing each MM block to go into only one place in the cache:
7-60 Chapter 7—Memory System Design
Fig 7.33 Direct-Mapped Cache
Key idea: all the MM blocks from a given group can go into only one location in the cache, corresponding to the group number.
Now the cache need only examine the single group that its reference specifies.
[Figure: a 256-line cache (groups 0–255, one 8-byte line each) with a 5-bit tag and a valid bit per line. Main memory's 8192 blocks form 32 tag classes of 256 groups each: blocks 0–255 carry tag 0, 256–511 tag 1, 512–767 tag 2, …, 7936–8191 tag 31, so block 2305 (tag 9, group 1) competes for the same cache line as blocks 1, 257, 513, …. The 16-bit main memory address splits into a 5-bit tag, an 8-bit group, and a byte field; the cache address is just the group and byte fields.]
7-61 Chapter 7—Memory System Design
Fig 7.34 Direct-Mapped Cache Operation
1. Decode the group number of the incoming MM address to select the group.
2. Gate out the tag field of the selected entry.
3. Compare the cache tag with the incoming tag.
4. If Match AND Valid, it is a hit:
5. gate out the cache line,
6. and use the word field to select the desired word.
[Figure: the 8-bit group field drives an 8-to-256 decoder that selects one tag/valid/line entry; a 5-bit comparator checks the stored tag against the incoming tag, and Match AND Valid signals a cache hit (otherwise a cache miss). On a hit, a selector uses the byte field to gate the desired byte of the 8-byte cache line out.]
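The placement rule behind the figure can be sketched in a few lines; this toy function (hypothetical, using the example's 256 groups) just computes which line a block must use and what tag gets stored:

```python
NUM_GROUPS = 256                     # as in the figure's 256-line cache

def direct_mapped_slot(block_number):
    """Each MM block may occupy only the one cache line named by its group."""
    group = block_number % NUM_GROUPS    # low bits of the block number
    tag = block_number // NUM_GROUPS     # high bits, stored as the tag
    return group, tag

slot = direct_mapped_slot(2305)          # the figure's example block
```

Blocks 1 and 257 map to the same group, which is exactly the thrashing hazard discussed on the next slide.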
7-62 Chapter 7—Memory System Design
Direct-Mapped Caches
• The direct-mapped cache uses less hardware, but is much more restrictive in block placement.
• If two blocks from the same group are frequently referenced, the cache will “thrash”: repeatedly bring the two competing blocks into and out of the cache, degrading performance.
• The block replacement strategy is trivial.
• Compromise—allow several cache blocks in each group—the block-set-associative cache:
7-63 Chapter 7—Memory System Design
Fig 7.35 2-Way Set-Associative Cache
Example shows 256 groups, a set of two lines per group. Sometimes referred to as a 2-way set-associative cache.
[Figure: 256 groups, each holding a set of two 8-byte lines with their own 5-bit tags and valid bits. Main memory's 8192 blocks are grouped as in Fig 7.33 (tags 0–31), but now two blocks from the same group—e.g. MM blocks 2304 and 7680, both in group 0—can reside in the cache simultaneously. The main memory address splits into 5-bit tag, 8-bit set, and byte fields.]
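A behavioral sketch of the 2-way lookup, with hypothetical names and a deliberately naive eviction choice (replacement policies come later):

```python
# Each group holds a set of two (tag, data) ways; only the addressed
# group is searched, unlike the fully associative cache.
NUM_GROUPS = 256

sets = [[] for _ in range(NUM_GROUPS)]   # each group: up to two (tag, data) pairs

def lookup(block_number):
    group, tag = block_number % NUM_GROUPS, block_number // NUM_GROUPS
    for t, data in sets[group]:
        if t == tag:
            return data
    return None                          # miss

def fill(block_number, data):
    group, tag = block_number % NUM_GROUPS, block_number // NUM_GROUPS
    ways = sets[group]
    if len(ways) == 2:
        ways.pop(0)                      # evict one way (policy left open here)
    ways.append((tag, data))

fill(1, "b1")                            # blocks 1 and 257 share group 1...
fill(257, "b257")                        # ...but can now coexist in the set
```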
7-64 Chapter 7—Memory System Design
Getting Specific: The Intel Pentium Cache
• The Pentium actually has two separate caches—one for instructions and one for data. The Pentium issues 32-bit MM addresses.
• Each cache is 2-way set-associative.
• Each cache is 8 KB = 2¹³ bytes in size.
• There are 32 = 2⁵ bytes per line.
• Thus there are 64 = 2⁶ bytes per set, and therefore 2¹³/2⁶ = 2⁷ = 128 groups.
• This leaves 32 − 5 − 7 = 20 bits for the tag field:

Tag (20 bits) | Set/group (7 bits) | Word (5 bits)    (bits 31…0)
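The field split can be checked with shifts and masks; a sketch under the geometry above (the example address is arbitrary):

```python
def pentium_fields(addr):
    """Split a 32-bit address per the 8 KB, 2-way, 32-byte-line geometry:
    20-bit tag, 7-bit set (group), 5-bit word field."""
    word = addr & 0x1F               # low 5 bits: byte within the 32-byte line
    group = (addr >> 5) & 0x7F       # next 7 bits: one of 128 groups
    tag = addr >> 12                 # remaining 20 bits
    return tag, group, word

fields = pentium_fields(0x12345)     # arbitrary example address
```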
7-65 Chapter 7—Memory System Design
Cache Read and Write Policies
• Read and write cache hit policies:
• Write-through—updates both cache and MM upon each write.
• Write-back—updates only the cache; updates MM only upon block removal.
• A “dirty bit” is set upon the first write to indicate that the block must be written back.
• Read and write cache miss policies:
• Read miss—bring the block in from MM.
• Either forward the desired word as it is brought in, or
• wait until the entire line is filled, then repeat the cache request.
• Write miss:
• Write-allocate—bring the block into the cache, then update it.
• Write–no-allocate—write the word to MM without bringing the block into the cache.
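The write-hit distinction can be sketched as follows; the names and structure are illustrative, not any real controller's:

```python
# Write-back marks the line dirty and defers the MM update;
# write-through updates MM on every write.
class Line:
    def __init__(self, block):
        self.block = block
        self.dirty = False

def write_back(line, i, value):
    line.block[i] = value
    line.dirty = True            # MM is updated only when the line is evicted

def write_through(line, i, value, mm, block_addr):
    line.block[i] = value
    mm[block_addr][i] = value    # MM updated immediately

mm = {0: [0, 0, 0, 0]}
line = Line(list(mm[0]))         # cache holds a copy of MM block 0
write_back(line, 2, 99)          # cache changed; MM stays stale until eviction
```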
7-66 Chapter 7—Memory System Design
Block Replacement Strategies
• Not needed with direct-mapped cache
• Least Recently Used (LRU):
• Track usage with a counter. Each time a block is accessed:
• clear the counter of the accessed block,
• increment counters with values less than the one accessed,
• all others remain unchanged.
• When the set is full, remove the line with the highest count.
• Random replacement—replace a block at random.
• Even random replacement is a fairly effective strategy.
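The LRU counter scheme above can be written out directly; a sketch with hypothetical block names:

```python
def lru_access(counters, accessed):
    """The slide's scheme: clear the accessed block's counter, increment
    only counters that were below it; all others are unchanged."""
    old = counters[accessed]
    for block in counters:
        if counters[block] < old:
            counters[block] += 1
    counters[accessed] = 0

def lru_victim(counters):
    # Highest count = least recently used.
    return max(counters, key=counters.get)

# A full 4-way set; 0 = most recently used, 3 = least.
counters = {"A": 0, "B": 1, "C": 2, "D": 3}
lru_access(counters, "C")        # A and B move down one; D stays least recent
```

The invariant is that the counters always form a permutation of 0…n−1, so the set's entire recency order is maintained with one small counter per line.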
7-67 Chapter 7—Memory System Design
Getting Specific: The PowerPC 601 Cache
• The PPC 601 has a unified cache—that is, a single cache for both instructions and data.
• It is 32 KB in size, organized as 64 × 8 block-set associative, with blocks being 8 8-byte words organized as 2 independent 4-word sectors for convenience in the updating process.
• A cache line can be updated in two single-cycle operations of 4 words each.
• Normal operation is write-back, but write-through can be selected on a per-line basis via software. The cache can also be disabled via software.
Fig 7.36 The PowerPC 601 Cache Structure
[Figure: the cache is 64 sets (line 0 through line 63) of 8 lines each; each 64-byte line holds two sectors (sector 0, sector 1) of 8 words each, with a 20-bit address tag per line in the tag memory. The physical address splits into a 20-bit tag, a 6-bit line (set) number, and a word number.]
7-68 Chapter 7—Memory System Design
Virtual Memory
• Virtual memory is a memory organization in which the processor issues all memory references as effective addresses in a flat address space that does not directly correspond to addresses in physical memory.
• Virtual addresses issued by the processor must be translated into physical addresses.
• These physical addresses may correspond to addresses in main memory, OR to a section of memory that is stored on disk.
• Virtual memory has two major benefits:
• It allows the computer system to use a portion of its disk space as an extension to main memory, thus allowing larger programs to be executed.
• It facilitates multiprogramming—where the computer services more than one user simultaneously:
• each user can have the same amount of memory space in which to run programs, and
• each user can have a logically separate memory space for protection.
7-69 Chapter 7—Memory System Design
Virtual Memory - How it works
[Figure: individual programs' virtual address spaces (large and flat) map onto the CPU, main (physical) memory, and system disk space; only the portion of the active program currently being accessed by the CPU resides in main memory.]
The operating system (e.g. Unix, Windows 98, etc.) is responsible for keeping the information currently being executed by the CPU in main memory. Portions of programs that are not currently being accessed are stored temporarily on disk as needed.
7-70 Chapter 7—Memory System Design
Virtual Memory - Terms
• Portions of a program are transferred between main memory and disk in blocks, just as they are between main memory and the cache—here the blocks are called pages and the transfer process is called paging.
• When the CPU wants to access a portion of the program that is only on disk, the page must be fetched and brought into main memory—the triggering event is called a page fault, roughly synonymous with a cache miss.
• Pages are moved from disk to main memory only when a word in the page is requested by the processor - called Demand Paging
• In most virtual memory systems, pages can be placed anywhere in main memory (like a fully associative cache)
• Block placement and replacement decisions must be made each time a block is moved - just as with a fully associative cache
7-71 Chapter 7—Memory System Design
Virtual Memory - Implementation
[Figure: the CPU (with the MMU on the CPU chip) issues a logical address; the MMU, using its mapping tables, translates the resulting virtual address into a physical address presented to the cache, main memory, and disk.]
The memory management unit (MMU) is responsible for mapping logical addresses issued by the CPU to physical addresses that are presented to the cache and main memory.
A word about addresses:
• Effective address—an address computed by the processor while executing a program. Synonymous with logical address.
• The term effective address is often used when referring to activity inside the CPU; logical address is most often used when referring to addresses viewed from outside the CPU.
• Virtual address—an intermediate address generated from the logical address by the MMU; often the same as the logical address.
• Physical address—the translation of the virtual address by the MMU into an address that can be presented to the memory unit.
(Note: every address reference must be translated.)
7-72 Chapter 7—Memory System Design
Virtual Addressing—Advantages
• Simplified addressing. Each program unit can be compiled into its own memory space, beginning at address 0 and potentially extending far beyond the amount of physical memory present in the system.
• No address relocation is required at load time.
• No need to fragment the program to accommodate memory size limitations.
• Cost-effective use of physical memory.
• Less expensive secondary (disk) storage can replace primary storage. (The MMU brings portions of the program into physical memory as required.)
• Access control. As each memory reference is translated, it can be simultaneously checked for read, write, and execute privileges.
• This allows access/security control at the most fundamental levels.
• It can be used to prevent buggy programs and intruders from causing damage to other users or the system.
This is the origin of those “bus error” and “segmentation fault” messages.
7-73 Chapter 7—Memory System Design
• This figure shows the mapping between virtual memory pages, physical memory pages, and pages in secondary memory. Page n - 1 is not present in physical memory, but only in secondary memory.
• The MMU manages this mapping.
Fig 7.41 Memory Management by Paging
[Figure: a program unit's virtual memory—pages 0, 1, 2, …, n − 1—maps onto scattered pages of physical memory and secondary memory; page n − 1 resides only in secondary memory.]
7-74 Chapter 7—Memory System Design
Page Tables
• For each portion of virtual memory (each page) of a given program, the system must keep a pointer to where the page is actually located—either in main (physical) memory or on disk (secondary memory).
• The group of pointers for a program is called a page table.
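A page table can be sketched as a map from virtual page number to either a physical page or a disk location; the 4 KB page size and all names here are assumptions for illustration:

```python
PAGE_SIZE = 4096                 # assumed page size, for illustration only

# One entry per page: either ("mem", physical page #) or ("disk", disk address).
page_table = {0: ("mem", 7), 1: ("mem", 3), 2: ("disk", 0x2A00)}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    where, loc = page_table[vpn]
    if where == "disk":
        # A real OS would now fetch the page from disk: a page fault.
        raise LookupError("page fault: page %d on disk" % vpn)
    return loc * PAGE_SIZE + offset

paddr = translate(1 * PAGE_SIZE + 20)    # virtual page 1, offset 20
```

The offset passes through unchanged; only the page number is translated.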
[Figure: the page table holds one entry per virtual page—the address of page 0, page 1, page 2, …, page n − 1—each pointing into either physical memory or secondary memory.]
7-75 Chapter 7—Memory System Design
Fig 7.42 Virtual Address Translation in a Paged MMU
• One table per user per program unit
• One translation per memory access
• Potentially large page table
A page fault will typically result in 100,000 or more cycles passing before the page has been brought from secondary storage to MM.
[Figure: the virtual address from the CPU splits into a page number and an offset in the page. The page number, added to a page table base register (and bounds-checked against a page table limit register), selects a page table entry containing access-control bits (presence bit, dirty bit, usage bits) and either a physical page number or a pointer to secondary storage. On a hit (page in primary memory) the physical page number is concatenated with the offset to form the physical address of the desired word in main memory; on a miss (page fault, page in secondary memory) the entry is translated to a disk address.]
7-76 Chapter 7—Memory System Design
Page Placement and Replacement
Page tables are direct mapped, since the physical page is computed directly from the virtual page number.
But physical pages can reside anywhere in physical memory.
Page tables such as those on the previous slide result in large page tables, since there must be a page table entry for every page in the program unit.
Some implementations resort to hash tables instead, which need have entries only for those pages actually present in physical memory.
Replacement strategies are generally LRU, or at least employ a “use bit” to guide replacement.
7-77 Chapter 7—Memory System Design
Fast Address Translation: Regaining Lost Ground
• The concept of virtual memory is very attractive, but it leads to considerable overhead:
• There must be a translation for every memory reference.
• There must be two memory references for every program reference:
• one to retrieve the page table entry,
• one to retrieve the value.
• Most caches are addressed by physical address, so there must be a virtual-to-physical translation before the cache can be accessed.
The answer: a small cache in the processor that retains the last few virtual to physical translations: a Translation Lookaside Buffer, TLB.
The TLB contains not only the virtual to physical translations, but also the valid, dirty, and protection bits, so a TLB hit allows the processor to access physical memory directly.
The TLB is usually implemented as a fully associative cache:
7-78 Chapter 7—Memory System Design
Fig 7.43 Translation Lookaside Buffer Structure and Operation
[Figure: the page number of the virtual address from the CPU is looked up associatively in the TLB, whose entries hold a virtual page number, access-control bits (presence bit, dirty bit, valid bit, usage bits), and a physical page number. On a TLB hit the page is in primary memory: the physical page number is concatenated with the word field to form the physical address of the desired word in main memory or cache. On a TLB miss, the physical page is looked up in the page table.]
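The TLB-then-page-table order can be sketched behaviorally (names hypothetical; a real TLB is a small associative hardware cache, not a dict):

```python
tlb = {}                             # virtual page # -> physical page #
page_table = {0: 7, 1: 3, 5: 12}     # backing translation (lives in main memory)

def translate(vpn):
    if vpn in tlb:
        return tlb[vpn], "tlb hit"   # no extra memory reference needed
    ppn = page_table[vpn]            # TLB miss: costs a page-table access
    tlb[vpn] = ppn                   # cache the translation for next time
    return ppn, "tlb miss"

first = translate(5)
second = translate(5)                # same page again: now a TLB hit
```

Because most references land on recently used pages, the common case avoids the second memory reference entirely.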
7-79 Chapter 7—Memory System Design
Fig 7.44 Operation of the Memory Hierarchy
[Figure: a flowchart spanning the CPU, cache, main memory, and secondary memory. The CPU's virtual address first searches the TLB; on a TLB miss the page table is searched, and on a page-table miss a page fault fetches the page from secondary memory and updates MM, the cache, and the page table. Once a translation is found, the TLB is updated and a physical address is generated, which searches the cache; on a cache miss the cache is updated from MM, and the value is returned from the cache.]
7-80 Chapter 7—Memory System Design
Fig 7.45 PowerPC 601 MMU Operation
“Segments” are actually more akin to large (256 MB) blocks.
[Figure: the 32-bit logical address from the CPU splits into a 4-bit segment number, a 16-bit virtual page number, and a 12-bit word field. The segment number selects one of 16 registers holding a 24-bit virtual segment ID (VSID) plus access control and miscellaneous bits; the VSID concatenated with the virtual page number forms a 40-bit virtual page. The two-way set-associative UTLB (entries 0–127 in set 0 and set 1) compares the 40-bit virtual page in both sets through a 2–1 mux; a hit yields a 20-bit physical page number, which with the 12-bit word field forms the physical address driving the cache (hit—data d0–d31 to CPU; miss—cache load), while a UTLB miss starts a page table search.]
7-81 Chapter 7—Memory System Design
Chapter 7 Summary
• Most memory systems are multileveled—cache, main memory, and disk.
• Static and dynamic RAM are the fastest components, and their speed has the strongest effect on system performance.
• Chips are organized into boards and modules.
• Larger, slower memory is attached to faster memory in a hierarchical structure.
• The cache to main memory interface requires hardware address translation.
• Virtual memory—the main memory–disk interface—can employ software for address translation because of the slower speeds involved.
• The hierarchy must be carefully designed to ensure optimum price-performance.