
Post on 13-Jan-2016


Memory/Storage Architecture Lab

Computer Architecture

Memory Hierarchy


Technology Trends

[Chart: Trac (row access time) and Tcac (column access time) by year, 1980–2007; vertical axis 0–300]

Year   Capacity   $/GB
1980   64 Kbit    $1,500,000
1983   256 Kbit   $500,000
1985   1 Mbit     $200,000
1989   4 Mbit     $50,000
1992   16 Mbit    $15,000
1996   64 Mbit    $10,000
1998   128 Mbit   $4,000
2000   256 Mbit   $1,000
2004   512 Mbit   $250
2007   1 Gbit     $50

Memory Hierarchy

Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

Burks, Goldstine, and von Neumann, 1946


[Figure: the memory hierarchy, with the CPU at the top above Level 1, Level 2, …, Level n. Moving down the hierarchy, the size of the memory at each level increases and cost decreases; moving up, speed and bandwidth increase.]

Memory Technology (Big Picture)

[Figure: the processor (control + datapath) connected to a chain of progressively larger memories. The memory closest to the processor is the fastest, smallest, and highest cost per byte; the farthest is the slowest, biggest, and lowest cost.]

Memory Technology (Real-world Realization)

[Figure: registers and on-chip caches inside the processor, backed by off-chip level caches (SRAM), main memory (DRAM), and secondary storage (disk).]

            Register   Cache      Main Memory   Disk Memory
Speed       <1 ns      <5 ns      50–70 ns      5–20 ms
Size        100 B      KB–MB      MB–GB         GB–TB
Management  Compiler   Hardware   OS            OS

Memory Hierarchy

An optimization resulting from a perfect match between memory technology and two types of program locality:

- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.

Goal: to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.
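As an illustrative sketch (not from the slides), the two localities can be made visible by mapping byte addresses onto cache-block numbers: a sequential scan keeps landing in the same few blocks (spatial locality), while repeated use of one word lands in a single block (temporal locality). The 64-byte block size is an assumption.

```python
# Hypothetical sketch: map byte addresses to 64-byte cache-block numbers.
BLOCK_SIZE = 64  # assumed block size in bytes

# Spatial locality: a sequential scan of 64 four-byte words (addresses 0..252)
sequential = [addr // BLOCK_SIZE for addr in range(0, 256, 4)]
# Temporal locality: eight repeated reads of the same word at address 128
repeated = [addr // BLOCK_SIZE for addr in [128] * 8]

print(len(sequential), len(set(sequential)))  # 64 accesses touch only 4 distinct blocks
print(len(repeated), len(set(repeated)))      # 8 accesses touch 1 distinct block
```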


Temporal and Spatial Localities

Source: Glass & Cao (1997 ACM SIGMETRICS)


Memory Hierarchy Terminology

Hit: accessed data is found in the upper level
- Hit rate = fraction of accesses found in the upper level
- Hit time = time to access the upper level

Miss: accessed data is found only in a lower level
- The processor waits until the data is fetched from the next level, then restarts/continues the access
- Miss rate = 1 − hit rate
- Miss penalty = time to get the block from the lower level + time to replace it in the upper level

Hit time << miss penalty, so average memory access time << worst-case access time:

Average memory access time = hit time + miss rate × miss penalty

Data are transferred between levels in units of blocks.
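The average-access-time formula can be expressed as a small helper; the numbers in the example call are made up for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g., a 1-cycle hit, 5% miss rate, 100-cycle miss penalty:
print(amat(1, 0.05, 100))  # 6.0 cycles on average, far below the 101-cycle worst case
```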


(CPU) Cache

Upper level: SRAM (small, fast, expensive)
Lower level: DRAM (large, slow, cheap)

Goal: to provide a "virtual" memory technology that has the access time of SRAM with the size and cost of DRAM.

Additional benefits:
- Reduced memory bandwidth consumed by the processor, leaving more memory bandwidth for I/O
- No need to change the ISA

Direct-mapped Cache

Each memory block is mapped to a single cache block. The mapped cache block is determined by:

(memory block address) mod (number of cache blocks)
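In code, the mapping rule looks like this (a sketch; the block size and block count are assumptions matching the example on the next slide):

```python
BLOCK_SIZE = 4      # bytes per block (assumed)
NUM_BLOCKS = 1024   # cache blocks (assumed: 4 KB / 4 B)

def cache_index(addr):
    block_address = addr // BLOCK_SIZE    # which memory block the byte lives in
    return block_address % NUM_BLOCKS     # memory block address mod number of cache blocks

print(cache_index(0), cache_index(8188), cache_index(16384))  # 0 1023 0
```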


Direct-Mapped Cache Example

Consider a direct-mapped cache with a block size of 4 bytes and a total capacity of 4 KB (1024 blocks), i.e., one word per block.

- The 2 lowest address bits specify the byte within a block.
- The next 10 address bits specify the block's index within the cache.
- The 20 highest address bits are the unique tag for the memory block.
- A valid bit specifies whether the block is an accurate copy of memory.

This organization exploits temporal locality.
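The 20/10/2 address split described above can be sketched as a helper (field widths follow the slide; the function name is mine):

```python
def split_address(addr):
    """Split a 32-bit address for a direct-mapped cache with 4-byte blocks
    and 1024 blocks: 20-bit tag | 10-bit index | 2-bit byte offset."""
    byte_offset = addr & 0x3          # 2 lowest bits: byte within the block
    index = (addr >> 2) & 0x3FF       # next 10 bits: block index in the cache
    tag = addr >> 12                  # 20 highest bits: unique tag
    return tag, index, byte_offset

print(split_address(8188))   # (1, 1023, 0)
print(split_address(16384))  # (4, 0, 0)
```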


On cache read

On a cache hit, the CPU proceeds normally.

On a cache miss (handled completely by hardware):
- Stall the CPU pipeline.
- Fetch the missed block from the next level of the hierarchy.
- On an instruction cache miss, restart the instruction fetch.
- On a data cache miss, complete the data access.

On cache write

Write-through
- Always write the data into both the cache and main memory.
- Simple, but slow, and increases memory traffic (requires a write buffer).

Write-back
- Write the data into the cache only, and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer).
- Fast, but complex to implement, and causes a consistency problem.
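A minimal sketch of the write-back bookkeeping (one cache line; class and field names are illustrative, not from the slides): a write touches only the cache and sets the dirty bit, and memory is updated only when the dirty line is evicted.

```python
class WriteBackLine:
    """One cache line with the dirty-bit bookkeeping used by write-back."""
    def __init__(self):
        self.valid, self.dirty, self.tag, self.data = False, False, None, None

    def write(self, tag, data):
        self.valid, self.dirty, self.tag, self.data = True, True, tag, data
        # note: main memory is NOT updated here

    def evict(self, memory):
        if self.valid and self.dirty:
            memory[self.tag] = self.data   # write the dirty block back only now
        self.valid = self.dirty = False

memory = {0: "old"}
line = WriteBackLine()
line.write(0, "new")
print(memory[0])       # still "old": the write went to the cache only
line.evict(memory)
print(memory[0])       # "new": the dirty block is written back on replacement
```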


Write allocation

What should happen on a write miss?

Alternatives for write-through:
- Allocate on miss: fetch the block.
- Write around: don't fetch the block, since programs often write a whole block before reading it (e.g., on initialization).

For write-back:
- Usually fetch the block.

Memory Reference Sequence

Look at the following sequence of memory references for the previous direct-mapped cache:

0, 4, 8188, 0, 16384, 0

Index  Valid  Tag   Data
0      0      XXXX  XXXX
1      0      XXXX  XXXX
2      0      XXXX  XXXX
3      0      XXXX  XXXX
…      …      …     …
1021   0      XXXX  XXXX
1022   0      XXXX  XXXX
1023   0      XXXX  XXXX

Cache initially empty.

After Reference 1

Reference 1 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      0      XXXX                  XXXX
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX

Miss: place the block at index 0.

After Reference 2

Reference 2 (address 4) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000001 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   0      XXXX                  XXXX

Miss: place the block at index 1.

After Reference 3

Reference 3 (address 8188) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000001 1111111111 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Miss: place the block at index 1023.

After Reference 4

Reference 4 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Hit to the block at index 0.

After Reference 5

Reference 5 (address 16384) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000100 0000000000 00 (same index as address 0!)

Index  Valid  Tag                   Data
0      1      00000000000000000100  Memory bytes 16384…16387 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Miss: replace the block at index 0.

After Reference 6

Reference 6 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 00000000000000000000 0000000000 00 (same index again!)

Index  Valid  Tag                   Data
0      1      00000000000000000000  Memory bytes 0…3 (copy)
1      1      00000000000000000000  Memory bytes 4…7 (copy)
2      0      XXXX                  XXXX
3      0      XXXX                  XXXX
…      …      …                     …
1021   0      XXXX                  XXXX
1022   0      XXXX                  XXXX
1023   1      00000000000000000001  Memory bytes 8188…8191 (copy)

Miss again: replace the block at index 0. Total: 1 hit and 5 misses.
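The six-reference walkthrough can be checked with a short direct-mapped simulator (a sketch with names of my choosing; the parameters match the slides: 4-byte blocks, 1024 blocks):

```python
def simulate_direct_mapped(addresses, block_size=4, num_blocks=1024):
    """Count (hits, misses) for a direct-mapped cache; tags[i] holds block i's tag."""
    tags = [None] * num_blocks   # None means the valid bit is clear
    hits = misses = 0
    for addr in addresses:
        block_addr = addr // block_size
        index = block_addr % num_blocks    # memory block address mod number of cache blocks
        tag = block_addr // num_blocks
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag              # fetch and (re)place the block
    return hits, misses

print(simulate_direct_mapped([0, 4, 8188, 0, 16384, 0]))  # (1, 5)
```

Addresses 0 and 16384 map to the same index with different tags, so they keep evicting each other, which is exactly where the five misses come from.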


Exploiting Spatial Locality: Block Size Larger Than One Word

A 16 KB direct-mapped cache with 256 blocks of 64 B (16 words) each.


Miss Rate vs. Block Size


Set-Associative Caches

Allow multiple entries per index to improve hit rates.

- An n-way set-associative cache allows up to n conflicting references to be cached:
  - n is the number of cache blocks in each set
  - n comparisons are needed to search all blocks in the set in parallel
  - When there is a conflict, which block is replaced? (This was easy for direct-mapped caches: there's only one entry!)
- Fully-associative caches: a single (very large!) set allows a memory location to be placed in any cache block.
- Direct-mapped caches are essentially 1-way set-associative caches.

For a fixed cache capacity, higher associativity leads to higher hit rates, because more combinations of memory blocks can be present in the cache. Set associativity optimizes cache contents, but at what cost?

Cache Organization Spectrum


Implementation of Set Associative Cache


Cache Organization Example

[Figure: four organizations of an 8-block cache.
One-way set associative (direct mapped): blocks 0–7, each a single (tag, data) entry.
Two-way set associative: sets 0–3, each holding two (tag, data) entries.
Four-way set associative: sets 0–1, each holding four (tag, data) entries.
Eight-way set associative (fully associative): one set holding all eight (tag, data) entries.]

Cache Block Replacement Policy

Direct-mapped caches
- No replacement policy is needed, since each memory block can be placed in only one cache block.

N-way set-associative caches
- Each memory block can be placed in any of the n cache blocks in its mapped set.
- A Least Recently Used (LRU) replacement policy is typically used to select the block to be replaced among the blocks in the mapped set.
- LRU replaces the block that has not been used for the longest time.
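A sketch of LRU for one set, keeping the set's tags in a list ordered from least to most recently used (names are illustrative, not from the slides):

```python
def lru_access(set_blocks, tag, n_ways):
    """Access `tag` in one cache set kept in LRU order (front = least recent).
    Returns True on a hit. On a miss with a full set, evicts the LRU block."""
    if tag in set_blocks:
        set_blocks.remove(tag)       # hit: move to the most-recently-used position
        set_blocks.append(tag)
        return True
    if len(set_blocks) == n_ways:    # miss in a full set:
        set_blocks.pop(0)            # evict the least recently used block
    set_blocks.append(tag)
    return False

s = []
for t in ["A", "B", "A", "C"]:       # accesses to one 2-way set
    lru_access(s, t, 2)
print(s)  # ['A', 'C']: B was least recently used when C arrived, so B was evicted
```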


Miss Rate vs. Set Associativity


Memory Reference Sequence

Look again at the following sequence of memory references, now for a 2-way set-associative cache of the same capacity with a block size of two words (8 bytes):

0, 4, 8188, 0, 16384, 0

This sequence had 5 misses and 1 hit for the direct-mapped cache.

Sets 0…255, each with two ways of (Valid, Tag, Data); all sets initially invalid (cache empty).

After Reference 1

Reference 1 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
(all other sets invalid)

Miss: place in the first block of set 0.

After Reference 2

Reference 2 (address 4) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 100 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
(all other sets invalid)

Hit to the first block of set 0: address 4 lies in the same 8-byte block as address 0.

After Reference 3

Reference 3 (address 8188) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000011 11111111 100 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Miss: place in the first block of set 255 (the 8-byte block containing byte 8188).

After Reference 4

Reference 4 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: invalid
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Hit to the first block of set 0.

After Reference 5

Reference 5 (address 16384) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000001000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: V=1, tag 000000000000000001000, Memory bytes 16384…16391 (copy)
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Miss: place in the second block of set 0, so no replacement is needed.

After Reference 6

Reference 6 (address 0) of the sequence 0, 4, 8188, 0, 16384, 0:
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set 0:    way 0: V=1, tag 000000000000000000000, Memory bytes 0…7 (copy);  way 1: V=1, tag 000000000000000001000, Memory bytes 16384…16391 (copy)
Set 255:  way 0: V=1, tag 000000000000000000011, Memory bytes 8184…8191 (copy);  way 1: invalid
(all other sets invalid)

Hit to the first block of set 0. Total: 3 hits and 3 misses.
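The 2-way result (3 hits and 3 misses, versus 1 hit and 5 misses direct mapped) can likewise be checked with a small LRU set-associative simulator (a sketch with names of my choosing; the parameters match the slides: 8-byte blocks, 256 sets, 2 ways):

```python
def simulate_set_assoc(addresses, block_size=8, num_sets=256, ways=2):
    """Count (hits, misses) for an LRU set-associative cache; each set is a
    list of tags ordered from least- to most-recently used."""
    sets = [[] for _ in range(num_sets)]
    hits = misses = 0
    for addr in addresses:
        block_addr = addr // block_size
        index = block_addr % num_sets
        tag = block_addr // num_sets
        blocks = sets[index]
        if tag in blocks:
            hits += 1
            blocks.remove(tag)          # refresh the LRU position
        else:
            misses += 1
            if len(blocks) == ways:
                blocks.pop(0)           # evict the least recently used block
        blocks.append(tag)
    return hits, misses

print(simulate_set_assoc([0, 4, 8188, 0, 16384, 0]))  # (3, 3)
```

The extra hits come from the larger block (addresses 0 and 4 share a block, a spatial-locality win) and from the second way, which lets blocks 0 and 16384 coexist in set 0.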


Improving Cache Performance

Cache performance is determined by:

Average memory access time = hit time + (miss rate × miss penalty)

Decrease hit time
- Make the cache smaller, but the miss rate increases
- Use direct mapping, but the miss rate increases

Decrease miss rate
- Make the cache larger, but this can increase hit time
- Add associativity, but this can increase hit time
- Increase the block size, but this increases the miss penalty

Decrease miss penalty
- Reduce the transfer-time component of the miss penalty
- Add another level of cache
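Adding another cache level shrinks the miss penalty seen by L1: an L1 miss now costs an L2 access plus L2's own misses. A sketch with made-up numbers (the function name and the cycle counts are illustrative):

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    """AMAT with an L2 cache: L1's miss penalty is itself an AMAT for L2.
    l2_miss_rate is the local miss rate (misses per L2 access)."""
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_penalty
    return l1_hit + l1_miss_rate * l1_miss_penalty

# Illustrative numbers (cycles): without L2 the full 100-cycle memory penalty
# would apply to every L1 miss, giving 1 + 0.05 * 100 = 6.0 cycles.
print(amat_two_level(1, 0.05, 10, 0.2, 100))  # 2.5
```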


Current Cache Organizations

Intel Nehalem
- L1 I-cache (per core): 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a
- L1 D-cache (per core): 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
- L2 unified cache (per core): 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a

AMD Opteron X4
- L1 I-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles
- L1 D-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles
- L2 unified cache (per core): 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 2 MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles

n/a: data not available

Cache Coherence Problem

Suppose two CPU cores share a physical address space and use write-through caches.

Time step  Event                 CPU A's cache  CPU B's cache  Memory
0                                                              0
1          CPU A reads X         0                             0
2          CPU B reads X         0              0              0
3          CPU A writes 1 to X   1              0              1

After step 3, CPU B's cache still holds the stale value 0: the two caches are incoherent.

Snoopy Protocols

Write-invalidate protocol
- On a write to shared data, an invalidate is sent to all caches, which snoop the bus and invalidate any copies.

Write-broadcast protocol
- On a write to shared data, the write is broadcast on the bus; processors snoop and update their copies.

Write serialization: the bus serializes requests, since the bus is a single point of arbitration.

Write-Invalidate Protocol

A cache gets exclusive access to a block when the block is to be written:
- It broadcasts an invalidate message on the bus.
- A subsequent read in another cache misses, and the owning cache supplies the updated value.

CPU activity          Bus activity       CPU A's cache  CPU B's cache  Memory
                                                                       0
CPU A reads X         Cache miss for X   0                             0
CPU B reads X         Cache miss for X   0              0              0
CPU A writes 1 to X   Invalidate for X   1                             0
CPU B reads X         Cache miss for X   1              1              1
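The table above can be mimicked with a toy write-invalidate simulation (a sketch; the class and function names are mine, and the protocol is simplified to a single block X with write-back caches, so memory is updated only when an owning cache supplies the block):

```python
class Cache:
    def __init__(self):
        self.value = None          # None means "not present / invalidated"

def read(cpu, others, memory):
    if cpu.value is None:          # cache miss: snoop for an owning cache first
        owner = next((c for c in others if c.value is not None), None)
        if owner is not None:
            memory[0] = owner.value    # owner supplies the block and updates memory
        cpu.value = memory[0]
    return cpu.value

def write(cpu, others, value):
    for c in others:               # broadcast an invalidate on the bus
        c.value = None
    cpu.value = value              # write-back: memory is not updated yet

memory = [0]
a, b = Cache(), Cache()
read(a, [b], memory)         # A misses, loads 0
read(b, [a], memory)         # B misses, loads 0
write(a, [b], 1)             # A writes 1 and invalidates B's copy
print(read(b, [a], memory))  # B misses again; A supplies 1 and memory becomes 1
```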


Summary

Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality:
- Temporal locality
- Spatial locality

The goal is to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.

Cache memory is an instance of a memory hierarchy:
- It exploits both temporal and spatial locality
- Direct-mapped caches are simple and fast but have higher miss rates
- Set-associative caches have lower miss rates but are more complex and slower
- Multilevel caches are becoming increasingly popular
- Cache coherence protocols ensure consistency among multiple caches