cmput229 - fall 2003

CMPUT 229 - Computer Organization and Architecture I


Topic D: The Memory Hierarchy

José Nelson Amaral



Reading Assignment

Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003. (B&H)

Chapter 6: The Memory Hierarchy



Types of Memories

Read/Write Memory (RWM): we can store and retrieve data.

Random Access Memory (RAM): the time required to read or write a bit of memory is independent of the bit’s location.

Static Random Access Memory (SRAM): once a word is written to a location, it remains stored as long as power is applied to the chip, unless the location is written again.

Dynamic Random Access Memory (DRAM): the data stored at each location must be refreshed periodically by reading it and then writing it back again, or else it disappears.

[Figure: internal structure of an 8x4 SRAM. A 3-to-8 decoder driven by the address inputs A2, A1, A0 selects one of eight rows (0 to 7); each row has four storage cells with IN, OUT, SEL, and WR terminals. Data inputs DIN3 to DIN0 feed the columns and data outputs DOUT3 to DOUT0 leave them. The control inputs WE_L, CS_L, and OE_L generate the internal signals WR_L and IOE_L.]

Refreshing the Memory

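[Figure: capacitor voltage Vcap versus time, between 0V and VCC, with LOW and HIGH regions marked. Annotations: 0 stored, 1 written, refreshes.]

The charge stored in a memory cell leaks away over time. The solution is to periodically refresh the memory cells by reading and writing back each one of them.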

SRAM with Bi-directional Data Bus

[Figure: a row of four SRAM cells (IN, OUT, SEL, WR) connected to the bidirectional data pins DIO3 to DIO0, which lead to the microprocessor. Control signals: WE_L, CS_L, OE_L, WR_L, IOE_L.]

DRAM High Level View

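[Figure: a DRAM chip organized as a 4x4 array of supercells (rows 0 to 3, cols 0 to 3) with an internal row buffer. The memory controller, which connects to the CPU, drives a 2-bit addr bus and an 8-bit data bus.]

Bryant/O’Hallaron, p. 459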

DRAM RAS Request

[Figure: the memory controller sends the row address (RAS = 2) over the 2-bit addr bus, and the DRAM chip copies row 2 into its internal row buffer.]

RAS = Row Address Strobe

Bryant/O’Hallaron, p. 460

DRAM CAS Request

[Figure: the memory controller sends the column address (CAS = 1) over the 2-bit addr bus, and the DRAM chip returns supercell (2,1) from its internal row buffer over the 8-bit data bus.]

CAS = Column Address Strobe

Bryant/O’Hallaron, p. 460
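The two-step addressing of the RAS and CAS requests can be sketched in C. This is only an illustration under an assumption that is not on the slides: a hypothetical linear supercell number n (0 to 15) whose upper two bits are sent as the row address and whose lower two bits are sent as the column address. The names dram_row and dram_col are made up for this sketch.

```c
/* Hypothetical model of the 4x4 supercell array in the figure.
   The addr bus is 2 bits wide, so the address is sent in two halves. */
enum { DRAM_ADDR_BITS = 2 };

/* Row address: sent first, together with RAS. */
static unsigned dram_row(unsigned n)
{
    return (n >> DRAM_ADDR_BITS) & 0x3u;
}

/* Column address: sent second, together with CAS. */
static unsigned dram_col(unsigned n)
{
    return n & 0x3u;
}
```

Under this assumed numbering, supercell number 9 (binary 1001) corresponds to row 2 and column 1, the supercell (2,1) used in the figures.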

Memory Modules: Supercell (i,j)

[Figure: a 64 MB memory module consisting of eight 8Mx8 DRAMs (DRAM 0 to DRAM 7). The memory controller sends addr (row = i, col = j) to the module, and each DRAM returns the 8 bits of its supercell (i,j): DRAM 0 supplies bits 0-7, followed by bits 8-15, 16-23, 24-31, 32-39, 40-47, 48-55, and finally bits 56-63 from DRAM 7. The controller assembles the 64-bit doubleword at main memory address A and sends it to the CPU chip.]
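How the memory controller might combine the eight bytes into one doubleword can be sketched as follows. This is a minimal illustration, assuming (as the bit ranges in the figure suggest) that DRAM k supplies bits 8k to 8k+7; the function name and array parameter are hypothetical.

```c
#include <stdint.h>

enum { NUM_DRAMS = 8 };

/* Combine the eight supercell bytes into one 64-bit doubleword.
   Assumption: supercell[k] is the byte returned by DRAM k, which
   holds bits 8k .. 8k+7 of the doubleword. */
static uint64_t assemble_doubleword(const uint8_t supercell[NUM_DRAMS])
{
    uint64_t word = 0;
    for (int k = 0; k < NUM_DRAMS; k++)
        word |= (uint64_t)supercell[k] << (8 * k);
    return word;
}
```

The module capacity follows from the same organization: eight chips, each with 8M supercells of 8 bits (one byte), give 8 x 8 MB = 64 MB.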


Read Cycle on an Asynchronous DRAM

Step 1: Apply the row address.

Step 2: RAS goes from high to low and remains low.

Step 3: Apply the column address.

Step 4: WE must be high.

Step 5: CAS goes from high to low and remains low.

Step 6: OE goes low.

Step 7: Data appears.

Step 8: RAS and CAS return to high.

Improved DRAMs

Central Idea: Each read to a DRAM actually reads a complete row of bits, or word line, from the DRAM core into an array of sense amps.

A traditional asynchronous DRAM interface then selects a small number of these bits to be delivered to the cache/microprocessor.

All the other bits already extracted from the DRAM cells into the sense amps are wasted.

Fast Page Mode DRAMs

In a DRAM with Fast Page Mode, a page is defined as all memory addresses that have the same row address.

To read in fast page mode, all the steps from 1 to 7 of a standard read cycle are performed.

Then OE and CAS are switched high, but RAS remains low.

Then steps 3 to 7 (providing a new column address, asserting CAS and OE) are performed for each new memory location to be read.

[Figure: a Fast Page Mode read cycle on an asynchronous DRAM.]

Extended Data Out DRAMs (EDO DRAM)

The process to read multiple locations in an EDO DRAM is very similar to Fast Page Mode.

The difference is that the output drivers are not disabled when CAS goes high.

This distinction allows the data from the current read cycle to be present at the outputs while the next cycle begins.

As a result, faster read cycle times are allowed.

[Figure: an Extended Data Out read cycle on an asynchronous DRAM.]

Synchronous DRAMs (SDRAM)

A Synchronous DRAM (SDRAM) has a clock input. It operates in a fashion similar to fast page mode and EDO DRAM. However, consecutive data is output synchronously on a clock edge (rising or falling), instead of on command by CAS.

The number of data elements output (the length of the burst) is programmable up to the maximum size of the row.

The clock period of an SDRAM is typically one order of magnitude shorter than the access time for an individual access.

DDR SDRAM

A Double Data Rate (DDR) SDRAM is an SDRAM that allows data transfers on both the rising and the falling edge of the clock.

Thus the effective data transfer rate of a DDR SDRAM is two times the data transfer rate of a standard SDRAM with the same clock frequency.
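The factor of two can be made concrete with a small calculation. The sketch below is illustrative only: the function name is made up, and the 100 MHz clock and 8-byte bus in the usage note are hypothetical values, not figures from the slides.

```c
/* Peak transfer rate in MB/s (1 MB = 10^6 bytes here).
   clock_mhz:           bus clock frequency in MHz
   bus_bytes:           bytes moved per transfer
   transfers_per_clock: 1 for a standard SDRAM, 2 for a DDR SDRAM */
static double peak_rate_mbs(double clock_mhz, int bus_bytes,
                            int transfers_per_clock)
{
    return clock_mhz * bus_bytes * transfers_per_clock;
}
```

For example, with an assumed 100 MHz clock and an 8-byte bus, peak_rate_mbs(100.0, 8, 1) gives 800 MB/s for a standard SDRAM and peak_rate_mbs(100.0, 8, 2) gives 1600 MB/s for a DDR SDRAM.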



The Rambus DRAM (RDRAM)

Multiple memory arrays (banks).

Rambus DRAMs are synchronous and transfer data on both edges of the clock.

SDRAM Memory Systems

Complex circuits for RAS/CAS/OE.

Each DIMM is connected in parallel with the memory controller. (DIMM = Dual In-line Memory Module)

Often requires buffering.

Needs the whole clock cycle to establish valid data.

Making the bus wider is mechanically complicated.

RDRAM Memory Systems



Bus Structure

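[Figure: the CPU chip (register file, ALU, bus interface) connects through the system bus to the I/O bridge. The memory bus connects the I/O bridge to main memory. The I/O bus connects the I/O bridge to the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.]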

DMA Request

[Figure: the bus structure (CPU with register file, ALU, and bus interface; I/O bridge; main memory; USB controller, graphics adapter, disk controller, and disk), illustrating a DMA request.]

DMA = Direct Memory Access

DMA Transfer

[Figure: the bus structure (CPU with register file, ALU, and bus interface; I/O bridge; main memory; USB controller, graphics adapter, disk controller, and disk), illustrating the DMA transfer.]

DMA Completion Notification

[Figure: the bus structure (CPU with register file, ALU, and bus interface; I/O bridge; memory bus; main memory; USB controller, graphics adapter, disk controller, and disk). When the DMA transfer completes, the CPU is notified with an interrupt.]

Locality

We say that a computer program exhibits good locality if the program tends to reference data that is near other recently referenced data, or data that has itself been referenced recently.

Because a program might do one of these things but not the other, the principle of locality is separated into two flavors:

Temporal locality: a memory location that is referenced once is likely to be referenced multiple times in the near future.

Spatial locality: if a memory location is referenced once, then nearby locations are likely to be referenced in the near future.

Examples

In the Sampler function below, RandInt returns a randomly selected integer within the specified interval. Which program has better locality?

    int SumVec(int v[], int N)
    {
        int i;
        int sum = 0;

        for (i = 0; i < N; i = i + 1)
            sum += v[i];              /* visits v[0], v[1], ... in order */
        return sum;
    }

    int RandInt(int lo, int hi);      /* assumed to be defined elsewhere */

    int Sampler(int v[], int N, int K)
    {
        int i, j;
        int sum = 0;

        for (i = 0; i < K; i = i + 1)
        {
            j = RandInt(0, N-1);      /* random index into v */
            sum += v[j];
        }
        return sum/K;
    }


Memory Hierarchy

L0: Registers. CPU registers hold words retrieved from cache memory.

L1: On-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.

L2: Off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from memory.

L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.

L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.

L5: Remote secondary storage (distributed file systems, Web servers).

Toward L0: smaller, faster, and costlier (per byte) storage devices.

Toward L5: larger, slower, and cheaper (per byte) storage devices.

Bryant/O’Hallaron, p. 483

Caching Principle

Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (in the figure, blocks 4, 9, 14, and 3).

Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks (in the figure, blocks 0 to 15).

Data is copied between levels in block-sized transfer units.

Cache Misses

Cold misses, or compulsory misses, occur the first time that a data item is referenced.

Conflict misses occur when two memory references have to occupy the same cache line. They can occur even when the remainder of the cache is not in use.

Capacity misses occur when the data the program is actively using does not fit in the cache, so there are no more free lines.
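The difference between cold and conflict misses can be seen in a small simulation. The sketch below is only an illustration and makes assumptions that are not on the slides: a hypothetical cache with four lines, each holding one block, and a placement rule that puts block number n in line n mod 4. All names are made up for the example.

```c
#include <stdbool.h>

enum { NUM_LINES = 4 };          /* hypothetical cache: 4 lines, 1 block each */

typedef struct {
    bool valid[NUM_LINES];       /* does the line hold a block? */
    int  block[NUM_LINES];       /* block number held by each line */
} Cache;

/* Reference one block (block >= 0); returns true on a hit. On a miss the
   block is loaded into line (block mod NUM_LINES), evicting the old one. */
static bool cache_access(Cache *c, int block)
{
    int line = block % NUM_LINES;
    if (c->valid[line] && c->block[line] == block)
        return true;
    c->valid[line] = true;
    c->block[line] = block;
    return false;
}

/* Count the misses caused by a sequence of block references,
   starting from an empty (cold) cache. */
static int count_misses(const int *refs, int n)
{
    Cache c = {0};
    int misses = 0;
    for (int i = 0; i < n; i++)
        if (!cache_access(&c, refs[i]))
            misses++;
    return misses;
}
```

With this placement rule, the reference sequence 0, 8, 0, 8 misses every time: the first two references are cold misses, and the last two are conflict misses, because blocks 0 and 8 compete for line 0 while the other three lines stay empty. The sequence 0, 1, 0, 1 has only the two cold misses.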



L1 and L2 Bus System

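[Figure: the CPU chip contains the register file, the ALU, the L1 cache, and the bus interface. A cache bus connects the bus interface to the L2 cache, the system bus connects it to the I/O bridge, and the memory bus connects the I/O bridge to main memory.]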


Cache Organization

[Figure: the cache is an array of sets, Set 0, Set 1, ..., Set S-1. Each set holds E lines, and each line consists of a valid bit, a tag, and a cache block of bytes numbered 0, 1, ..., B-1.]

S = 2^s sets

E lines per set

B = 2^b bytes per cache block

t tag bits per line

1 valid bit per line

Cache size: C = B x E x S data bytes

Address Partition

Address (bits m-1 down to 0): Tag (t bits), Set index (s bits), Block offset (b bits)

Tag: compared with tags in the cache to find a match.

Set index: used to find the set where the data might be found in the cache.

Block offset: selects which word, inside the block, is referenced.
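The partition of an address into its three fields can be sketched in C. This is a minimal example assuming 32-bit addresses (m = 32); the struct and function names are hypothetical, and the second helper simply evaluates the cache size formula C = B x E x S.

```c
#include <stdint.h>

typedef struct {
    uint32_t tag;
    uint32_t set_index;
    uint32_t block_offset;
} AddressFields;

/* Split a 32-bit address into tag, set index, and block offset for a
   cache with 2^s sets and 2^b bytes per block (requires s + b < 32). */
static AddressFields split_address(uint32_t addr, unsigned s, unsigned b)
{
    AddressFields f;
    f.block_offset = addr & ((1u << b) - 1u);          /* low b bits */
    f.set_index    = (addr >> b) & ((1u << s) - 1u);   /* next s bits */
    f.tag          = addr >> (s + b);                  /* remaining t bits */
    return f;
}

/* Cache size in data bytes: C = B x E x S, with B = 2^b and S = 2^s. */
static uint32_t cache_size(unsigned s, unsigned E, unsigned b)
{
    return (1u << b) * E * (1u << s);
}
```

For instance, with s = 7 and b = 5 (128 sets of 32-byte blocks, values chosen only for illustration), the address 0x80004058 splits into tag 0x80004, set index 2, and block offset 0x18.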




Multi-Level Cache Organization

[Figure: the CPU contains the registers (Regs) and separate L1 i-cache and L1 d-cache, followed by the L2 unified cache, main memory, and disk.]

Writing Cache-Conscious Programs

Problem: Write C code for a function that computes the sum of the elements of a two-dimensional array, a[M][N], of integers.

    int SumArray(int M, int N, int a[M][N])

    /* The dimensions are passed before the array so that a[M][N] is a
       valid C99 parameter declaration; int a[][] does not compile. */

    int SumArrayRows(int M, int N, int a[M][N])
    {
        int i, j;
        int sum = 0;

        for (i = 0; i < M; i++)        /* one row at a time */
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int SumArrayCols(int M, int N, int a[M][N])
    {
        int i, j;
        int sum = 0;

        for (j = 0; j < N; j++)        /* one column at a time */
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

SumArrayRows Data Access Order

[Figure: memory layout of a, stored in row-major order with 4-byte ints and 6 columns per row: a[0][0] at 0x8000 4000, a[0][1] at 0x8000 4004, ..., a[0][5] at 0x8000 4014, a[1][0] at 0x8000 4018, ..., a[3][4] at 0x8000 4058, and so on. SumArrayRows visits the elements in exactly this order, so consecutive accesses are 4 bytes apart.]





SumArrayCols Data Access Order


[Figure: memory layout of a, stored in row-major order with 4-byte ints and 6 columns per row: a[0][0] at 0x8000 4000, a[0][1] at 0x8000 4004, ..., a[1][0] at 0x8000 4018, ..., a[3][4] at 0x8000 4058, and so on. SumArrayCols visits the elements one column at a time (a[0][0], a[1][0], a[2][0], ...), so consecutive accesses are one full row (24 bytes) apart.]
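The addresses in the layout follow from row-major storage, which can be checked with a short calculation. The sketch below assumes 4-byte ints and the 6-column array of the figure; the function names are hypothetical.

```c
#include <stdint.h>

/* Byte address of a[i][j] for an int array with n_cols columns stored
   in row-major order, assuming 4-byte ints as in the figure. */
static uint32_t element_address(uint32_t base, unsigned n_cols,
                                unsigned i, unsigned j)
{
    return base + 4u * (i * n_cols + j);
}

/* Distance in bytes between consecutive accesses of the inner loop. */
static uint32_t row_order_stride(void)
{
    return 4u;                 /* a[i][j] -> a[i][j+1] */
}

static uint32_t col_order_stride(unsigned n_cols)
{
    return 4u * n_cols;        /* a[i][j] -> a[i+1][j] */
}
```

With base 0x80004000 and 6 columns, element_address gives 0x80004018 for a[1][0] and 0x80004058 for a[3][4], matching the layout. The row-wise traversal moves 4 bytes per access, while the column-wise traversal moves 24 bytes per access, which gives it poorer spatial locality.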





Read Bandwidth

The rate at which a program reads data from the memory system is called the read throughput or the read bandwidth.

The read throughput of a program depends on the memory hierarchy level from which the data is retrieved.

The read throughput is measured in bytes per second, or more commonly in Mbytes/s.

We can write a program that forces the data to come from the various levels in the hierarchy to estimate the read throughput.

Measuring Read Bandwidth

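    #define MAXELEMS (1 << 21)     /* assumed array size; not given on the slide */
    int data[MAXELEMS];            /* the array to be traversed */

    void test(int elems, int stride)
    {
        int i;
        int result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result;   /* to prevent compiler from optimizing away the loop */
    }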
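Turning one run of test into a throughput figure takes two more pieces: the number of bytes the loop reads and the time it took. The sketch below covers only the arithmetic; how the running time is measured is left open, the function names are hypothetical, and 1 MB is taken as 10^6 bytes.

```c
/* Number of bytes read by test(elems, stride): one int for each visited
   index 0, stride, 2*stride, ... below elems.
   Assumes elems >= 0 and stride > 0. */
static long bytes_read(int elems, int stride)
{
    long accesses = ((long)elems + stride - 1) / stride;
    return accesses * (long)sizeof(int);
}

/* Read throughput in MB/s (1 MB = 10^6 bytes), given the measured
   running time in seconds (must be greater than zero). */
static double read_throughput_mbs(long bytes, double seconds)
{
    return (double)bytes / seconds / 1e6;
}
```

For example, test(10, 3) visits indices 0, 3, 6, and 9, so bytes_read(10, 3) is 4 * sizeof(int). Increasing the stride reads fewer bytes from each cache line, which is why throughput falls as the stride grows.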


Pentium III Xeon Memory Mountain

[Figure: memory mountain surface plot of read throughput (MB/s, 0 to 1200) as a function of stride (words, s1 to s15) and working set size (bytes, 2k to 8m). Machine: Pentium III Xeon, 550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache. Annotations: ridges of temporal locality (L1, L2, Mem) and slopes of spatial locality.]


Temporal Locality (stride = 1)

[Figure: read throughput (MB/s, 0 to 1200) versus working set size (bytes, from 8m down to 1k), showing the main memory region, the L2 cache region, and the L1 cache region.]

Spatial Locality Slope (size = 256 KB)

[Figure: read throughput (MB/s, 0 to 800) versus stride (words, s1 to s16). Annotation: one access per cache line.]
