
Page 1: memory

Memory System

Based on slides from Jones & Bartlett

Page 2: memory


6.1 Introduction

• Memory lies at the heart of the stored-program computer.

• In previous chapters, we studied the components from which memory is built and the ways in which memory is accessed by various ISAs.

• In this chapter, we focus on memory organization. A clear understanding of these ideas is essential for the analysis of system performance.

Page 3: memory


6.2 Types of Memory

• There are two typical kinds of main memory: random access memory (RAM) and read-only memory (ROM).

• There are two typical types of RAM: dynamic RAM (DRAM) and static RAM (SRAM).

• Dynamic RAM consists of memory cells that store data as charge on capacitors, which leaks away over time. Thus DRAM must be refreshed every few milliseconds to prevent data loss.

• DRAM is cheaper than SRAM owing to its simpler design.

Page 4: memory


6.2 Types of Memory

• SRAM consists of circuits similar to D flip-flops; its contents remain intact until changed or until power is turned off.

• SRAM is faster and does not need to be refreshed as DRAM does. It can be used to build both main memory and cache memory, which we will discuss in detail later.

• Technology has evolved into many variations of memory cell and chip designs, some allowing overlapping of accesses.

• ROM is used to store permanent or semi-permanent data that persists even while the power is turned off.

Page 5: memory


6.3 The Memory Hierarchy

• Generally speaking, faster memory is more expensive than slower memory.

• To provide the best performance at the lowest cost, memory is organized in a hierarchical fashion.

• Small, fast storage elements are kept in the CPU; larger, slower main memory is accessed through the system bus.

• Larger, (almost) permanent storage in the form of disk and tape drives is physically farther away from the CPU.

Page 6: memory


6.3 The Memory Hierarchy

• This storage organization can be thought of as a pyramid:

Page 7: memory


6.3 The Memory Hierarchy

• To access a particular piece of data from ‘memory’, the CPU first sends a request to its nearest memory, usually cache.

• If the data is not in cache, then main memory is queried. If the data is not in main memory, then the request goes to disk and the program is temporarily suspended by the operating system.

• Once the data is located, then the data, and a number of its nearby data elements (in a block) are fetched into cache memory.
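The lookup procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides; the level names, the sample address, and the dictionary representation are made-up placeholders.

# Illustrative sketch: search the memory hierarchy from the level nearest the CPU outward.
def find_in_hierarchy(address, levels):
    # levels is an ordered list of (name, contents) pairs, nearest level (cache) first
    for name, contents in levels:
        if address in contents:                 # hit at this level
            return name, contents[address]
    raise LookupError("address not resident")   # would trigger a disk read / page fault

# Hypothetical example: the word is only in main memory, so the cache
# is checked first and misses, then main memory hits.
cache = {}                                      # empty cache, guaranteed miss
main_memory = {0x1AA: "data"}
print(find_in_hierarchy(0x1AA, [("cache", cache), ("main memory", main_memory)]))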

Page 8: memory


6.3 The Memory Hierarchy

• This leads us to some general terminology.
– A hit occurs when data is found at a given memory level.
– A miss occurs when data is not found at a given memory level.
– The hit rate is the percentage of accesses in which data is found at a given memory level.
– The miss rate is the percentage of accesses in which it is not. Miss rate = 1 - hit rate.
– The hit time is the time required to access data at a given memory level (when a hit occurs).
– The miss penalty is the additional time required to process a miss, including the time it takes to replace a block of memory plus the time it takes to deliver the data to the processor.

Page 9: memory


6.3 The Memory Hierarchy

• An entire block of data is copied, not just the requested item, because the principle of locality asserts that once a byte is accessed, it is likely that a nearby data element will be needed soon.

• There are two forms of locality:
– Temporal locality: recently accessed information tends to be accessed again in the near future.
– Spatial locality: accesses tend to cluster in a region of memory.

Page 10: memory


6.4 Cache Memory

• The purpose of cache memory is to speed up accesses by buffering recently and frequently used data closer to the CPU, exploiting temporal and spatial locality, instead of having to access that data in main memory.

• Although cache is much smaller than main memory, its access time is a fraction of that of main memory.

Page 11: memory


6.4 Cache Memory

• Both the main memory and the cache are partitioned into a set of fixed size blocks.

• To avoid confusion with memory blocks, we use the term cache line to refer to a block in cache.

• Cache mapping refers to the mapping of memory blocks to cache lines.

• Generally, a cache line can be used to buffer any one of a set of memory blocks. Each cache line has a tag register that identifies which memory block is currently buffered in the line.

Page 12: memory


6.4 Cache Memory

• The simplest cache mapping scheme is direct mapped cache.

• In a direct mapped cache consisting of N lines of cache, block X of main memory is mapped to cache line Y = X mod N.

• Thus, if we have 4 lines of cache, line 2 may hold blocks 2, 6, 10, 14, . . . of main memory.

• Once a block of memory is copied into a line in cache, a valid bit is set for the cache line to let the system know that the line contains valid data.

What could happen if there were no valid bit?
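As a small sketch (mine, not the deck's), the direct-mapping rule Y = X mod N can be checked in Python with the 4-line example above:

# Sketch: direct-mapped placement, Y = X mod N, for a 4-line cache.
N = 4                                  # number of cache lines
for block in (2, 6, 10, 14):           # blocks from the example above
    print("memory block", block, "maps to cache line", block % N)   # all map to line 2

# The per-line valid bit is what distinguishes "line 2 currently holds block 2"
# from "line 2 has never been filled"; without it, leftover garbage in a line
# could be mistaken for valid data and returned to the CPU.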

Page 13: memory


6.4 Cache Memory

• The diagram below is a schematic of what cache looks like.

• Line 0 contains block 0 from main memory, identified with the tag 00000000. Line 1 contains a memory block from memory that is identified with the tag 11110101.

• Line 1 is buffering memory block 1111010101 (binary) = 3D5 (hex).
• The other two lines are not valid.

[Diagram: cache lines, each with a valid bit, tag, and data block]

Page 14: memory


6.4 Cache Memory

• The size of each field into which a memory address is divided depends on the size of the cache.

• Suppose our memory consists of 2^14 words, the cache has 16 = 2^4 lines, and each line holds 8 = 2^3 words.
– Thus memory is divided into 2^14 / 2^3 = 2^11 blocks.

• For our field sizes, we need 4 bits for the cache line number, 3 bits for the word number, and the remaining 14 - 4 - 3 = 7 bits form the tag:

[Diagram: 14-bit address divided into a 7-bit tag, 4-bit line, and 3-bit word field]

Page 15: memory


6.4 Cache Memory

• Suppose a program generates the address 1AA (hex). In 14-bit binary, this number is 00000110101010.

• The first 7 bits of this address go in the tag field, the next 4 bits go in the line field, and the final 3 bits indicate the word within the line/block.

• The (7+4) tag and line bits together identify the memory block.
• Memory block # 35 (hex) is mapped into cache line # 5.

[Diagram: address 1AA split into tag 0000011, line 0101, word 010]
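A minimal Python sketch (not from the slides) that reproduces the field split above for the 14-bit address 1AA, assuming the 7-bit tag, 4-bit line, and 3-bit word fields of this example:

# Sketch: split a 14-bit address into tag | line | word = 7 | 4 | 3 bits.
addr = 0x1AA
word_bits, line_bits = 3, 4

word  = addr & ((1 << word_bits) - 1)                  # lowest 3 bits
line  = (addr >> word_bits) & ((1 << line_bits) - 1)   # next 4 bits
tag   = addr >> (word_bits + line_bits)                # remaining 7 bits
block = addr >> word_bits                              # tag and line together = block number

print(f"tag={tag:07b} line={line:04b} word={word:03b}")   # tag=0000011 line=0101 word=010
print(f"block number = {block:#x}, cache line = {line}")  # block number = 0x35, cache line = 5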

Page 16: memory


6.4 Cache Memory

• If subsequently the program generates the address 1AB, it will find the data it is looking for in line 0101, word 011.

• However, if the program instead generates the address 3AB, the block loaded for address 1AA would be evicted from the cache and replaced by the block associated with the 3AB reference.


Page 17: memory


6.4 Cache Memory

• Suppose a program generates a series of memory references such as: 1AB, 3AB, 1AB, 3AB, . . . The cache will continually evict and replace blocks.

• The theoretical advantage offered by the cache is lost in this extreme case.

• This is the main disadvantage of direct mapped cache; if the locality involves multiple blocks that are mapped into the same line, continual cache misses may result.

• Other cache mapping schemes are designed to prevent this kind of thrashing.
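The alternating 1AB/3AB pattern above is easy to reproduce in a few lines; this is a sketch (not the deck's code) assuming the same 7/4/3 field split as before:

# Sketch: thrashing in a direct-mapped cache when two addresses share a line.
word_bits, line_bits, num_lines = 3, 4, 16
resident = {}                                   # cache line -> tag it currently holds

def access(addr):
    line = (addr >> word_bits) % num_lines
    tag  = addr >> (word_bits + line_bits)
    hit  = resident.get(line) == tag
    resident[line] = tag                        # on a miss, the new block evicts the old one
    return "hit" if hit else "miss"

print([access(a) for a in (0x1AB, 0x3AB, 0x1AB, 0x3AB)])   # every reference misses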

Page 18: memory


Clicker 6.1

Suppose:
Main Memory is 1M words.
Cache is 1K words.
Block Size is 16 words.
Memory address = <tag, line #, word #>
Cache address = <line #, word #>

How many cache lines are there?

A: 16
B: 32
C: 64
D: 128

Page 19: memory


Clicker 6.2

How many memory blocks are there?

A: 16K
B: 32K
C: 64K
D: 128K

Page 20: memory


Clicker 6.3

Which cache line is used to buffer memory block # 600?

A: Line # 0
B: Line # 20
C: Line # 40
D: Line # 60
E: None of the above

Page 21: memory


Clicker 6.4

What is the length of each field in the memory address (written as <tag, line #, word #>)?

A: <6, 10, 4>
B: <10, 6, 6>
C: <10, 10, 4>
D: <10, 6, 4>
E: None of the above

Page 22: memory


Clicker 6.5

What is the cache line (#) used to buffer the memory address FFF00 (in hex)?

A: Line # 16
B: Line # 32
C: Line # 48
D: Line # 64
E: None of the above

Page 23: memory


Clicker 6.6

Suppose a program accesses memory locations FFF00, FFF01, ….. FFFFF sequentially. How many cache misses will occur?

A: 16
B: 32
C: 128
D: 256
E: None of the above

Page 24: memory


Clicker 6.7

Continuing with the previous sequence of 256 memory accesses (FFF00 to FFFFF), what is the resulting hit ratio?

A: 9/10
B: 8/9
C: 7/8
D: 3/4
E: None of the above

Page 25: memory


6.4 Cache Memory

• Instead of mapping a memory block to a specific cache line, we could allow a block to go anywhere in the cache.

• In this way, cache would have to fill up before any block eviction is needed.

• This is how fully associative cache works.
• As a result, a memory address contains only two fields: tag and word. The tag is actually the memory block #.

Page 26: memory


6.4 Cache Memory

• Suppose, as before, we have 14-bit memory addresses and a cache with 16 blocks, each block of size 8. The field format of a memory reference is:

• When the cache is searched, all tags are searched in parallel to retrieve the data quickly.

• This requires special, costly hardware, so fully associative mapping is rarely used for large caches.
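What a fully associative lookup does can be sketched as follows (my illustration, not the deck's code); real hardware compares every tag at once, while this loop does the same comparisons sequentially:

# Sketch: fully associative lookup. The tag is simply the memory block number,
# and every line's tag must be compared (in hardware, all in parallel).
def fa_lookup(address, lines, word_bits=3):
    block = address >> word_bits                         # tag = block number
    for line in lines:                                   # hardware does these compares at once
        if line["valid"] and line["tag"] == block:
            return line["data"][address & ((1 << word_bits) - 1)]
    return None                                          # miss: any line may receive the block

# Hypothetical contents: one valid line holding block 0x35.
lines = [{"valid": True,  "tag": 0x35, "data": list(range(8))},
         {"valid": False, "tag": 0,    "data": [0] * 8}]
print(fa_lookup(0x1AA, lines))                           # hit, returns word 2 of the block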

Page 27: memory


Clicker 6.8

In a fully associative memory, the memory address is <tag, word #> = <16 bits, 4 bits>.

How many comparisons with tag are needed in checking a memory address for cache hit?

A: 1
B: 16
C: 64K
D: None of the above

Page 28: memory


6.4 Cache Memory

• You will recall that a direct mapped cache evicts a block whenever another memory reference needs the line that the block occupies.

• With fully associative cache, we have no such mapping, thus we must devise an algorithm to determine which block to evict from the cache.

• The block that is evicted is the victim block.
• There are a number of ways to pick a victim; we will discuss them shortly.

Page 29: memory


6.4 Cache Memory

• Set associative cache combines the ideas of direct mapped cache and fully associative cache to form a compromised solution.

• An N-way set associative cache mapping is like direct mapped cache in that a cache line can be the target (buffer) of a proper subset of the memory blocks.

• An alternative view is that each memory block can be mapped into a set of N cache lines. As a result, each memory access needs to be checked against only the N cache lines of one set, rather than against a single line as in direct mapping or against all of the cache lines as in fully associative mapping.

Page 30: memory


6.4 Cache Memory

• The number of cache blocks per set in set associative cache varies according to overall system design.

• For example, a 2-way set associative cache can be conceptualized as shown in the schematic below.

• Each set may buffer two different memory blocks.
[Diagram: 2-way set associative cache, two lines per set]

Page 31: memory


6.4 Cache Memory

• In set associative cache mapping, a memory reference is divided into three fields: tag, set, and word, as shown below.

• As with direct-mapped cache, the word field chooses the word within the cache block, and the tag field (together with the set field) uniquely identifies the memory block.

• The set field determines the set to which the memory block maps. Note that tag + set = block # (of memory).

Page 32: memory


6.4 Cache Memory

• Suppose we have a main memory of 2^14 bytes.
• This memory is mapped to a 2-way set associative cache having 16 lines, where each line contains 8 words.

• Since this is a 2-way cache, each set consists of 2 lines, and there are 8 sets.

• Thus, we need 3 bits for the set, 3 bits for the word, giving 8 leftover bits for the tag:
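A small sketch (not from the slides) for the 2-way example above: with 8 sets of 2 lines, an address splits into an 8-bit tag, a 3-bit set, and a 3-bit word.

# Sketch: set-associative address split for the example's parameters
# (2^14 addresses, 16 lines, 8 words per line, 2 ways -> 8 sets).
def split_set_assoc(addr, set_bits=3, word_bits=3):
    word = addr & ((1 << word_bits) - 1)
    set_number = (addr >> word_bits) & ((1 << set_bits) - 1)
    tag = addr >> (word_bits + set_bits)
    return tag, set_number, word            # tag and set together give the block number

print(split_set_assoc(0x1AA))               # (6, 5, 2): block 0x35 maps to set 5,
                                            # and may occupy either line of that set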

Page 33: memory


Clicker 6.9

Consider a 4-way set-associative memory mapping system where the memory address = <tag, set #, word #> = <10 bits, 4 bits, 4 bits>.

How many comparisons are needed for checking a memory access for cache hit?

A: 1K
B: 16
C: 4
D: None of the above

Page 34: memory


Clicker 6.10

Consider the direct-mapped cache (address = <10,6,4>) and the 4-way set-associative cache (address = <10, 4, 4>).

1. The set-associative cache will always have a higher hit ratio for the same program.

2. The hit ratio of both caches will improve by increasing the block size (without changing the cache size).

A: 1 and 2 are true
B: Only 1 is true
C: Only 2 is true
D: Neither is true

Page 35: memory


6.4 Cache Memory

• With fully associative and set associative cache, a replacement policy is invoked when it becomes necessary to evict a block from cache.

• An optimal replacement policy would require knowledge of the future, so as to evict the block whose next reference is farthest away in the future.

• It is impossible to implement an optimal replacement algorithm because this future knowledge is not available in general.

Page 36: memory


6.4 Cache Memory

• The replacement policy that we choose depends upon the locality that we are trying to optimize; usually, we are interested in temporal locality.

• A least recently used (LRU) algorithm keeps track of the last time that a block was accessed and evicts the block that has been unused for the longest period of time.

• The disadvantage of this approach is its implementation complexity: LRU has to maintain an access history for each block, which ultimately slows down the cache.

• Spatial locality is exploited by using blocks as the basic units of mapping.
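As an illustration only (not the deck's code), an LRU victim choice for one cache set can be sketched with Python's OrderedDict, which remembers the order in which blocks were touched:

# Sketch: least-recently-used (LRU) replacement within one cache set.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()           # block number -> data, least recent first

    def access(self, block):
        if block in self.blocks:              # hit: move to most-recently-used position
            self.blocks.move_to_end(block)
            return "hit"
        if len(self.blocks) >= self.ways:     # miss with a full set: evict the LRU victim
            self.blocks.popitem(last=False)
        self.blocks[block] = None
        return "miss"

s = LRUSet(ways=2)
print([s.access(b) for b in (1, 2, 1, 3, 2)])   # ['miss', 'miss', 'hit', 'miss', 'miss']

Keeping this ordering up to date on every access is exactly the bookkeeping overhead mentioned above.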

Page 37: memory


6.4 Cache Memory

• First-in, first-out (FIFO) is a popular cache replacement policy.

• In FIFO, the block that has been in the cache the longest is evicted, regardless of when it was last used. Notice that FIFO is not equivalent to LRU.

• A random replacement policy does what its name implies: It picks a block at random and replaces it with a new block.

• Random replacement can certainly evict a block that will be needed often or needed soon, but the randomness avoids thrashing.

Page 38: memory


6.4 Cache Memory

• The performance of hierarchical memory is measured by its effective access time (EAT).

• EAT is a weighted average that takes into account the hit ratio and relative access times of successive levels of memory.

• The EAT for a two-level memory is given by:

EAT = H × T_C + (1 - H) × T_M

where H is the cache hit rate, and T_C and T_M are the access times in the case of a cache hit and in the case of a cache miss, respectively.

Equivalently, we can rewrite the above as:

EAT = T_C + (1 - H) × T_P

where T_P is the miss penalty (the additional time in the case of a cache miss) = T_M - T_C.
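The two equivalent forms of the formula can be checked numerically; this is a sketch using the worked example from the later page (hit rate 99%, T_C = 10 ns, T_M = 200 ns):

# Sketch: effective access time (EAT) computed both ways.
H, Tc, Tm = 0.99, 10.0, 200.0        # hit rate, hit-case time (ns), miss-case time (ns)
Tp = Tm - Tc                         # miss penalty

eat1 = H * Tc + (1 - H) * Tm         # weighted-average form
eat2 = Tc + (1 - H) * Tp             # hit-time-plus-penalty form
print(round(eat1, 1), round(eat2, 1))   # 11.9 11.9 (ns), the same value from either form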

Page 39: memory


Clicker 6.11

Consider the following parameters:
main memory access time = 10 ns
cache hit time = 2 ns
cache miss penalty = 20 ns
hit ratio = 85%

What is the memory speedup provided by the cache?

A: 5
B: 4
C: 2
D: None of the above

Page 40: memory


Clicker 6.12

For the same memory/cache system as the previous, what is the cache hit ratio required to double the previous speedup (of 2)?

A: Impossible
B: between 98% and 99%
C: between 99% and 99.5%
D: between 99.5% and 100%
E: between 85% and 98%

Page 41: memory


6.4 Cache Memory

• For example, consider a system with a main memory access time of 200 ns supported by a cache having a 10 ns access time and a hit rate of 99%. If we assume T_M = 200 ns as well, then the EAT is:

0.99(10 ns) + 0.01(200 ns) = 9.9 ns + 2 ns = 11.9 ns.

• This equation for determining the effective access time can be extended to any number of memory levels, as we will see in later sections.

• Usually T_M > the main memory access time, because a whole block is transferred on a miss, not just a single word.

Page 42: memory


6.4 Cache Memory

• Cache replacement policies must also take into account dirty blocks, those blocks that have been updated while they were in the cache [as a result, the copy in memory differs from the copy in cache].

• Dirty blocks must be written back to memory. A write policy determines how this will be done.

• There are two common types of write policies, write through and write back.

• Write through updates cache and main memory simultaneously on every write.

• Write back updates main memory when the block is evicted.

Page 43: memory


6.4 Cache Memory

• Write back cache updates memory only when the block is selected for replacement.

• The disadvantage of write through is that memory must be updated with each cache write, which may tie up the system bus and memory unnecessarily (when a location is updated multiple times).

• The advantage of write back is that memory traffic is minimized, but its disadvantage is potential inconsistency between data in memory and data in cache, which causes problems when memory is shared among processors and I/O devices.
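A rough sketch (mine, not from the slides) contrasting the two policies for a single cached block; the dirty flag is the bookkeeping a write-back cache uses to know the block must be written back at eviction:

# Sketch: write-through vs. write-back for one cached block.
memory = {0x1AA: 0}
cache  = {"addr": 0x1AA, "value": 0, "dirty": False}

def write(value, policy):
    cache["value"] = value
    if policy == "write-through":
        memory[cache["addr"]] = value     # memory is updated on every write
    else:                                 # write-back: only mark the block dirty
        cache["dirty"] = True

def evict():
    if cache["dirty"]:                    # write-back pays the memory cost only here
        memory[cache["addr"]] = cache["value"]
        cache["dirty"] = False

write(1, "write-back"); write(2, "write-back")
print(memory[0x1AA])                      # 0: memory is stale until the block is evicted
evict()
print(memory[0x1AA])                      # 2: a single write-back covers both updates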

Page 44: memory


6.4 Cache Memory

• The cache we have been discussing is called a unified or integrated cache, where both instructions and data are stored in the same cache.

• Many modern systems employ separate caches for data and instructions. These are called instruction cache and data cache.

• The separation of data from instructions provides better locality and concurrency, at the extra cost of separate management.

Page 45: memory


6.4 Cache Memory

• Cache performance can also be improved by adding a small associative cache to hold blocks that have been evicted recently. This is called a victim cache. A victim cache can be accessed upon cache miss before memory is accessed.

• A trace cache is a variant of an instruction cache that holds decoded instructions for program branches, giving faster access to instructions from branch targets. This allows more efficient processing of program loops, especially in the context of a pipelined processor.

Page 46: memory


6.4 Cache Memory

• Most of today’s small systems employ multilevel cache hierarchies.

• The levels of cache form their own small memory hierarchy.

• Level 1 cache (small size) is situated on the processor itself.
– Access time is typically about 1 ns.

• Level 2 cache (medium size) may be on the motherboard or on an expansion card.
– Access time is usually around a few ns.

Page 47: memory


6.4 Cache Memory

• In systems that employ three levels of cache, the Level 2 cache is placed on the same die as the CPU (reducing access time to about 10 ns).

• Accordingly, the Level 3 cache (2MB to 256MB) refers to cache that is situated between the processor and main memory.

• Once the number of cache levels is determined, the next thing to consider is whether data (or instructions) can exist in more than one cache level.

Page 48: memory


6.4 Cache Memory

• If the cache system uses an inclusive cache, the same data may be present at multiple levels of cache.

• Strictly inclusive caches guarantee that all data in a smaller cache also exists at the next higher level.

• Exclusive caches permit only one copy of the data.
• The tradeoffs in choosing one over the other involve weighing the variables of access time, memory size (cost), and consistency in use.

Page 49: memory


Program Performance Analysis

Consider the MARIE program:

S1,   Load Sum
      AddI X
      Store Sum
      Load X
      Add One
      Store X
      Load C
      Subt One
      Store C
      Skipcond 400
      Jump S1
      Halt
Sum,  Dec 0
One,  Dec 1
C,    Dec 128
X,    Dec 128

Page 50: memory


Program Performance Analysis

Assumptions:
Cache size = 128 words
Block/line size = 8 words
Memory access time = 10 ns
Cache hit time = 1 ns
Miss penalty = 30 ns

Analysis:
# Cache lines = 128 / 8 = 16.
Code is stored in memory blocks # 0 to # 1, buffered in cache lines # 0 to # 1.
Data is stored in memory blocks # 16 to # 31, buffered in cache lines # 0 to # 15.
# Instructions executed per iteration = 10.
# Memory operands accessed per iteration = 10.

Page 51: memory


Program Performance Analysis

Cache hit/miss tracing for the first eight iterations:

Instruction     Instruction fetch           Data access
Load Sum        Miss (Block 0 -> Line 0)*   Miss (Block 1 -> Line 1)*
AddI X                                      Miss (Block 16 -> Line 0)
Store Sum       Miss (Block 0 -> Line 0)
Load X
Add One
Store X
Load C
Subt One
Store C
Skipcond 400
Jump S1

* only in the first iteration

Page 52: memory


Program Performance Analysis

Cache hit/miss tracing for the second set of eight iterations:

Instruction     Instruction fetch           Data access
Load Sum
AddI X                                      Miss (Block 17 -> Line 1)
Store Sum                                   Miss (Block 1 -> Line 1)
Load X
Add One
Store X
Load C
Subt One
Store C
Skipcond 400
Jump S1

Page 53: memory


Program Performance Analysis

Cache hit/miss tracing for the 17th iteration (and similarly the 25th, 33rd, ...):

Instruction     Instruction fetch           Data access
Load Sum
AddI X                                      Miss (Block 18 -> Line 2)
Store Sum
Load X
Add One
Store X
Load C
Subt One
Store C
Skipcond 400
Jump S1

Page 54: memory


Program Performance Analysis

Total number of cache misses = (2 + 2*8) [first 8 iterations] + 2*8 [second 8 iterations] + 14 [rest] = 48

Miss ratio = 48 / (20*128) = 3/160

Speedup = T_M / [ T_C + (1 - r) T_P ] = 10 / [1 + 9/16] = 6.4, where r is the hit ratio.
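The arithmetic above can be replayed directly; a small sketch using only the numbers from these slides:

# Sketch: reproduce the miss count, miss ratio, and speedup from the slides.
misses     = (2 + 2 * 8) + 2 * 8 + 14      # first 8 iter + second 8 iter + rest = 48
accesses   = 20 * 128                      # 10 fetches + 10 operands per iteration
miss_ratio = misses / accesses             # 3/160
hit_ratio  = 1 - miss_ratio

Tc, Tp, Tm = 1.0, 30.0, 10.0               # ns: cache hit time, miss penalty, memory access time
eat        = Tc + (1 - hit_ratio) * Tp     # 1 + (3/160) * 30 = 1.5625 ns
speedup    = Tm / eat                      # 10 / 1.5625 = 6.4

print(misses, round(miss_ratio, 5), round(speedup, 1))   # 48 0.01875 6.4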

Page 55: memory


Clicker 6.13

For the previous example, how would the miss ratio change if the initial content of C is 1K instead?

A: not much change
B: increase noticeably
C: decrease noticeably

Page 56: memory


Clicker 6.14

For the same example, how would the miss ratio change if the block size is changed from 8 to 16?

A: not much change
B: increase noticeably
C: decrease noticeably

Page 57: memory


Clicker 6.15

For the same example, if 2-way set-associative mapping is used, how will the miss ratio change?

A: not much change
B: increase noticeably
C: decrease noticeably