Memory - University of Arizona (ece569a/Readings/ppt/memory.pdf), 64 pages

Page 1:

Memory Consistency

Page 2:

Memory Consistency

Page 3:

Memory Consistency

Reads and writes to shared memory face a consistency problem; we need to achieve controlled consistency of memory events. Shared-memory behavior is determined by:

Program order
Memory access order

Challenges: modern processors reorder operations; compiler optimizations (scalar replacement, instruction rescheduling) do so as well.

Page 4:

Basic Concept

On a multiprocessor: concurrent instruction streams (threads) run on different processors. Memory events performed by one process may create data to be used by another. Events: reads and writes.

A memory consistency model specifies how the memory events initiated by one process should be observed by other processes: event ordering.

It declares which memory accesses are allowed, and which process should wait for a later access when processes compete.

Page 5:

Uniprocessor vs. Multiprocessor Model

Page 6:

Understanding Program Order

Initially X = 2

P1                    P2
.....                 .....
r0 = Read(X)          r1 = Read(X)
r0 = r0 + 1           r1 = r1 + 1
Write(r0, X)          Write(r1, X)
.....                 .....

Possible execution sequences:

  P1: r0 = Read(X)        P2: r1 = Read(X)
  P2: r1 = Read(X)        P2: r1 = r1 + 1
  P1: r0 = r0 + 1         P2: Write(r1, X)
  P1: Write(r0, X)        P1: r0 = Read(X)
  P2: r1 = r1 + 1         P1: r0 = r0 + 1
  P2: Write(r1, X)        P1: Write(r0, X)

  Result: X = 3           Result: X = 4
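The same race can be written as two POSIX threads. This is an illustrative sketch, not from the slides; the thread body mirrors the Read / increment / Write sequence above, and because the accesses to X are unsynchronized the final value depends on how the two threads interleave.

#include <pthread.h>
#include <stdio.h>

static int X = 2;                     /* initially X = 2, as above */

static void *increment(void *arg) {
    int r = X;                        /* r = Read(X) */
    r = r + 1;                        /* r = r + 1   */
    X = r;                            /* Write(r, X) */
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, increment, NULL);
    pthread_create(&p2, NULL, increment, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("X = %d\n", X);            /* may print 3 or 4 */
    return 0;
}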

Page 7:

Interleaving

Program orders of individual instruction streams may need to be modified because of interaction among them

Finding the optimum global memory order is an NP-hard problem.

P1:  a. A = 1;    b. Print B, C;
P2:  c. B = 1;    d. Print A, C;
P3:  e. C = 1;    f. Print A, B;

A, B, C are shared variables (initially 0), held in shared memory that the processors reach through a switch.

Page 8:

Example

Concatenate the values printed by P1, P2, and P3 (in program order) into 6-tuple binary strings: 64 output combinations are conceivable.
  (a,b,c,d,e,f) => 001011  (in-order execution)
  (a,c,e,b,d,f) => 111111  (in-order execution)
  (b,d,f,e,a,c) => 000000  (out-of-order execution)
There are 6! = 720 possible permutations of the six statements.

P1:  a. A = 1;    b. Print B, C;
P2:  c. B = 1;    d. Print A, C;
P3:  e. C = 1;    f. Print A, B;

A, B, C are shared variables (initially 0) in shared memory, accessed through a switch (same setup as Page 7).
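A small sketch (not from the slides) that checks the counting argument: it enumerates every interleaving that respects the three program orders, i.e. every sequentially consistent execution, and records which of the 64 possible 6-bit outputs actually occur. The statement names a..f follow the figure above.

#include <stdio.h>
#include <string.h>

static int A, B, C;                       /* shared variables, initially 0 */
static char out[7] = "000000";            /* outputs of b, d, f            */
static int  seen[64];                     /* which 6-bit outputs occurred  */

static void step(char s) {                /* execute one statement a..f    */
    switch (s) {
    case 'a': A = 1; break;
    case 'c': B = 1; break;
    case 'e': C = 1; break;
    case 'b': out[0] = '0' + B; out[1] = '0' + C; break;   /* Print B,C */
    case 'd': out[2] = '0' + A; out[3] = '0' + C; break;   /* Print A,C */
    case 'f': out[4] = '0' + A; out[5] = '0' + B; break;   /* Print A,B */
    }
}

static void explore(int pc0, int pc1, int pc2) {
    static const char prog[3][2] = { {'a','b'}, {'c','d'}, {'e','f'} };
    int pc[3] = { pc0, pc1, pc2 };
    if (pc0 == 2 && pc1 == 2 && pc2 == 2) {       /* complete interleaving */
        int v = 0;
        for (int k = 0; k < 6; k++) v = 2 * v + (out[k] - '0');
        seen[v] = 1;
        return;
    }
    for (int i = 0; i < 3; i++) {
        if (pc[i] < 2) {
            int sA = A, sB = B, sC = C; char so[7]; memcpy(so, out, 7);
            step(prog[i][pc[i]]);                 /* take one step of Pi    */
            explore(pc[0] + (i == 0), pc[1] + (i == 1), pc[2] + (i == 2));
            A = sA; B = sB; C = sC; memcpy(out, so, 7);   /* undo the step  */
        }
    }
}

int main(void) {
    explore(0, 0, 0);
    int n = 0;
    for (int v = 0; v < 64; v++) n += seen[v];
    printf("%d of 64 outputs reachable under sequential consistency\n", n);
    return 0;
}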

Page 9:

Mutual exclusion problem

The mutual exclusion problem in concurrent programming: allow two threads to share a single-use resource without conflict, using only shared memory for communication, while avoiding the strict alternation of a naive turn-taking algorithm.

Page 10:

Definition

If two processes attempt to enter a critical section at the same time, allow only one process in, based on whose turn it is. If one process is already in the critical section, the other process will wait for the first process to exit. How would you implement this while guaranteeing mutual exclusion, freedom from deadlock, and freedom from starvation?

Page 11:

Solution: Dekker’s Algorithm

This is done with two flags, flag[0] and flag[1], which indicate an intention to enter the critical section, and a turn variable which indicates who has priority between the two processes.

Page 12:

Shared variables:
    flag[0] := false
    flag[1] := false
    turn := 0        // or 1

P0:
    flag[0] := true
    while flag[1] = true {
        if turn ≠ 0 {
            flag[0] := false
            while turn ≠ 0 { }
            flag[0] := true
        }
    }
    // critical section
    ...
    turn := 1
    flag[0] := false
    // remainder section

P1:
    flag[1] := true
    while flag[0] = true {
        if turn ≠ 1 {
            flag[1] := false
            while turn ≠ 1 { }
            flag[1] := true
        }
    }
    // critical section
    ...
    turn := 0
    flag[1] := false
    // remainder section

Page 13:

Disadvantages

Limited to two processes. Uses busy waiting instead of process suspension. Modern CPUs execute their instructions out of order, and even memory accesses can be reordered.

Page 14:

Peterson's Algorithm

Shared variables:
    flag[0] = 0; flag[1] = 0;
    int turn;

P0:
    flag[0] = 1;
    turn = 1;
    while (flag[1] == 1 && turn == 1) {
        // busy wait
    }
    // critical section
    ...
    // end of critical section
    flag[0] = 0;

P1:
    flag[1] = 1;
    turn = 0;
    while (flag[0] == 1 && turn == 0) {
        // busy wait
    }
    // critical section
    ...
    // end of critical section
    flag[1] = 0;
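On a machine that reorders memory accesses, the plain-variable version above can fail. A hedged sketch (not from the slides) of the usual fix, using C11 sequentially consistent atomics and two POSIX threads:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int flag[2];        /* intent to enter, initially 0 */
static atomic_int turn;           /* whose turn it is to wait     */
static int counter;               /* data protected by the lock   */

static void lock(int self) {
    int other = 1 - self;
    atomic_store(&flag[self], 1);
    atomic_store(&turn, other);
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                         /* busy wait */
}

static void unlock(int self) {
    atomic_store(&flag[self], 0);
}

static void *worker(void *arg) {
    int self = (int)(long)arg;
    for (int i = 0; i < 100000; i++) {
        lock(self);
        counter++;                /* critical section */
        unlock(self);
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %d\n", counter);   /* expect 200000 */
    return 0;
}

With sequentially consistent atomics, the store to flag[self] cannot be reordered past the later load of flag[other], which is exactly the ordering the algorithm depends on.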

Page 15:

Lamport's bakery algorithm

The analogy is a bakery with a numbering machine. The 'customers' are threads, identified by the letter i, obtained from a global variable. More than one thread might get the same number.

// declaration and initial values of global variables
Entering: array [1..NUM_THREADS] of bool = {false};
Number:   array [1..NUM_THREADS] of integer = {0};

lock(integer i) {
    Entering[i] = true;
    Number[i] = 1 + max(Number[1], ..., Number[NUM_THREADS]);
    Entering[i] = false;
    for (j = 1; j <= NUM_THREADS; j++) {
        // Wait until thread j receives its number:
        while (Entering[j]) { /* nothing */ }
        // Wait until all threads with smaller numbers or with the same
        // number, but with higher priority, finish their work:
        while ((Number[j] != 0) && ((Number[j], j) < (Number[i], i))) {
            /* nothing */
        }
    }
}

unlock(integer i) {
    Number[i] = 0;
}

Thread(integer i) {
    while (true) {
        lock(i);
        // The critical section goes here...
        unlock(i);
        // non-critical section...
    }
}

Page 16:

Models

Strict Consistency: a read always returns the most recent write to the same address.

Sequential Consistency: The result of any execution appears as the interleaving of individual programs strictly in sequential program order

Processor Consistency: Writes issued by each processor are in program order, but writes from different processors can be out of order (Goodman)

Weak Consistency: Programmer uses synch operations to enforce sequential consistency (Dubois)

Reads from each processor are not restricted. More opportunities for pipelining.

Page 17:

Relationship to Cache Coherence Protocol

Cache coherence protocol must observe the constraints imposed by the memory consistency model

Example: a read hit in the cache. Reading without waiting for the completion of a previous write may violate sequential consistency.

The cache coherence protocol provides a mechanism to propagate the newly written value. The memory consistency model places an additional constraint on when the value can be propagated to a given processor.

Page 18:

Latency Tolerance

Scalable systems: distributed shared memory architecture.
Access to remote memory: long latency.
Processor speed vs. the memory and interconnect.

Need for latency reduction, avoidance, and hiding.

Page 19:

Latency Avoidance

Organize user applications at the architectural, compiler, or application level to achieve program/data locality. Possible when applications exhibit:

Temporal or spatial locality

How do you enhance locality?

Page 20:

Locality Enhancement

Architectural support: cache coherency protocols, memory consistency models, fast message passing, etc.

User support: High Performance Fortran, where the program instructs the compiler how to allocate the data (example?)

Software support: the compiler performs certain transformations. Example?

Page 21:

Latency Reduction

What if locality is limited, or data access patterns change dynamically?

For example: sorting algorithms. We need latency reduction mechanisms.

Target the communication subsystem: interconnect, network interface, fast communication software.

• Cluster: TCP, UDP, etc

Page 22:

Latency Hiding

Hide communication latency within computation. Overlapping techniques:

Prefetching techniques
  • Hide read latency

Distributed coherent caches
  • Reduce cache misses
  • Shorten the time to retrieve a clean copy

Multiple-context processors
  • Switch from one context to another when a long-latency operation is encountered (hardware-supported multithreading)

Page 23:

Memory Delays

SMP: memory delays are high in multiprocessors due to added contention for shared resources such as a shared bus and memory modules.

Distributed: delays are even more pronounced in distributed-memory multiprocessors, where memory requests may need to be satisfied across an interconnection network.

By masking some or all of these significant memory latencies, prefetching can be an effective means of speeding up multiprocessor applications.

Page 24:

Data Prefetching

Overlapping computation with memory accesses

Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference.

Page 25:

Cache Hierarchy

A popular latency-reducing technique, but it is still common for scientific programs to spend more than half their run time stalled on memory requests.

This is partially a result of the “on demand” fetch policy: data is fetched into the cache from main memory only after the processor has requested a word and found it absent from the cache.

Page 26:

Why do scientific applications exhibit poor cache utilization?

Is something wrong with the principle of locality? The traversal of large data arrays is often at the heart of this problem. Temporal locality is limited in array computations: once an element has been used to compute a result, it is often not referenced again before it is displaced from the cache to make room for additional array elements.

Sequential array access patterns exhibit a high degree of spatial locality, but many other types of array access patterns do not.

For example, in a language which stores matrices in column-major order, a row-wise traversal of a matrix will result in consecutively referenced elements being widely separated in memory. Such strided reference patterns result in low spatial locality if the stride is greater than the cache block size. In this case, only one word per cache block is actually used while the remainder of the block remains untouched, even though cache space has been allocated for it.

Page 27:

[Timing figure: computation and memory references satisfied within the cache hierarchy vs. main memory access time, for memory references r1, r2, and r3 that are not in the cache.]

Page 28:

Challenges

Cache pollution: for prefetched data to arrive early enough to hide all of the memory latency, it must be held in the processor cache for some period of time before it is used by the processor. During this time, the prefetched data are exposed to the cache replacement policy and may be evicted from the cache before use. Moreover, the prefetched data may displace data in the cache that is currently in use by the processor.

Memory bandwidth: back to the figure. With no prefetch, the three memory requests occur within the first 31 time units of program startup; with prefetch, these requests are compressed into a period of 19 time units.

By removing processor stall cycles, prefetching effectively increases the frequency of memory requests issued by the processor. Memory systems must be designed to match this higher bandwidth to avoid becoming saturated and nullifying the benefits of prefetching.

Page 29:

Spatial Locality

Block transfer is a form of prefetching (1960s); software prefetching came later (1980s).

Page 30:

Binding Prefetch

Non-blocking load instructions: these instructions are issued in advance of the actual use to take advantage of the parallelism between the processor and the memory subsystem. Rather than loading data into the cache, however, the specified word is placed directly into a processor register.

The value of the prefetched variable is bound to a named location at the time the prefetch is issued.

Page 31:

Software-Initiated Data Prefetching

Some form of fetch instruction is required; it can be as simple as a load into a processor register.

Fetches are non-blocking memory operations, allowing prefetches to bypass other outstanding memory operations in the cache.

Fetch instructions cannot cause exceptions. The hardware required to implement software-initiated prefetching is modest.

Page 32:

Prefetch Challenges

Prefetch scheduling: the judicious placement of fetch instructions within the target application. It is not possible to precisely predict when to schedule a prefetch so that data arrives in the cache at the exact moment it will be requested by the processor; uncertainties that are not predictable at compile time require careful consideration when statically scheduling prefetch instructions. Fetch instructions may be added by the programmer or by the compiler during an optimization pass. Programming effort?

Page 33:

Suitable spots for “Fetch”

Most often used within loops responsible for large array calculations. Such loops are common in scientific codes, exhibit poor cache utilization, and have predictable array referencing patterns.

Page 34:

Example:

Assume a four-word cache block; a representative version of the unrolled prefetching loop is sketched below.
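This sketch is not the slide's code (which is not reproduced in this transcript). It assumes an inner-product loop over double arrays a and b with N a multiple of 4, and maps the generic fetch() of the text onto the GCC/Clang __builtin_prefetch intrinsic; unrolling by four issues one prefetch per four-word cache block.

#define fetch(p) __builtin_prefetch(p)    /* non-blocking, non-faulting prefetch */

double inner_product(const double *a, const double *b, int N) {
    double ip = 0.0;
    for (int i = 0; i < N; i += 4) {
        fetch(&a[i + 4]);                 /* prefetch the next block of a */
        fetch(&b[i + 4]);                 /* prefetch the next block of b */
        ip += a[i]     * b[i];
        ip += a[i + 1] * b[i + 1];
        ip += a[i + 2] * b[i + 2];
        ip += a[i + 3] * b[i + 3];
    }
    return ip;
}

Both issues listed below are visible in this version: the first iteration still misses on a[0..3] and b[0..3], and the final iteration prefetches past the end of both arrays.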

Issues:
Cache misses during the first iteration
Unnecessary prefetches in the last iteration of the unrolled loop

How do we solve these two issues? Software pipelining.

Page 35:

Assumptions

Implicit assumption: prefetching one iteration ahead of the data's actual use is sufficient to hide the latency.

What if the loop contains a small computational body? Define the prefetch distance: initiate prefetches d iterations before the data is referenced. How do you determine d?

• Let l be the average cache miss latency, measured in processor cycles, and s be the estimated cycle time of the shortest possible execution path through one loop iteration, including the prefetch overhead. Then

    d = ⌈ l / s ⌉

Page 36:

Revisiting the example

Let us assume an average miss latency of l = 100 processor cycles and a loop iteration time of s = 45 cycles.

Then d = ⌈100/45⌉ = 3, so the loop must handle a prefetch distance of three.
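Continuing the hedged sketch from Page 34 (same assumptions: inner-product loop, fetch() mapped to __builtin_prefetch, four-word blocks, N a multiple of 4 and at least 12), software pipelining splits the loop into a prolog, a steady-state body that prefetches d = 3 blocks (12 elements) ahead, and an epilog that issues no prefetches:

#define fetch(p) __builtin_prefetch(p)

double inner_product(const double *a, const double *b, int N) {
    double ip = 0.0;
    int i;
    for (i = 0; i < 12; i += 4) {          /* prolog: prefetch the first 3 blocks   */
        fetch(&a[i]);
        fetch(&b[i]);
    }
    for (i = 0; i < N - 12; i += 4) {      /* main loop: prefetch 12 elements ahead */
        fetch(&a[i + 12]);
        fetch(&b[i + 12]);
        ip += a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
    }
    for (; i < N; i += 4)                  /* epilog: last 3 blocks, no prefetches  */
        ip += a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
    return ip;
}

This removes the first-iteration misses (covered by the prolog) and the out-of-bounds prefetches (the epilog issues none).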

Page 37:

Case Study

Given a distributed shared-memory multiprocessor, let's define a remote access cache (RAC).

Assume that the RAC is located at the network interface of each node. Motivation: prefetched remote data could be accessed at a speed comparable to that of local memory, while the processor cache hierarchy is reserved for demand-fetched data.

Which is better: having a RAC, or prefetching data directly into the processor cache hierarchy?

Despite significantly increasing cache contention and reducing overall cache space, the latter approach results in higher cache hit rates, which is the dominant performance factor.

Page 38:

Case Study

Transfer of individual cache blocks across the interconnection network of a multiprocessor yields low network efficiency.

What if we propose transferring prefetched data in larger units?

Method: a compiler schedules a single prefetch command before the loop is entered rather than software pipelining prefetches within a loop.

Large blocks of remote memory used within the loop body are transferred and prefetched into local memory to prevent excessive cache pollution.

Issues: this is a binding prefetch, since data stored in a processor's local memory are not exposed to any coherency policy. That imposes constraints on the use of prefetched data which, in turn, limits the amount of remote data that can be prefetched.

Page 39:

What about besides the “loops”?

Prefetching is normally restricted to loops containing array accesses whose indices are linear functions of the loop indices, because the compiler must be able to predict memory access patterns when scheduling prefetches. Such loops are relatively common in scientific codes but far less so in general applications.

Irregular data structures: it is difficult to reliably predict when a particular datum will be accessed, and once a cache block has been accessed there is less of a chance that several successive cache blocks will also be requested when data structures such as graphs and linked lists are used. Comparatively high temporal locality results in high cache utilization, thereby diminishing the benefit of prefetching.

Page 40:

What is the overhead of fetch instructions?

Fetch instructions require extra execution cycles, and fetch source addresses must be calculated and stored in the processor to avoid recalculation for the matching load or store instruction.
How: register space.
Problems:
• The compiler will have less register space to allocate to other active variables; fetch instructions increase register pressure.
• It gets worse when the prefetch distance is greater than one or there are multiple prefetch addresses.
Code expansion may degrade instruction cache performance.

Software-initiated prefetching is done statically, so it is unable to detect when a prefetched block has been prematurely evicted and needs to be re-fetched.

Page 41:

Hardware-Initiated Data Prefetching

Prefetching capabilities without the need for programmer or compiler intervention. No changes to existing executables.

instruction overhead completely eliminated.

can take advantage of run-time information to potentially make prefetching more effective.

Page 42:

Cache Blocks

Typically, data is fetched from main memory into the processor cache in units of cache blocks.

Multiple-word cache blocks are themselves a form of data prefetching. Large cache blocks: effective prefetching vs. cache pollution.

What is the complication for SMPs with private caches? False sharing: two or more processors wish to access different words within the same cache block, and at least one of the accesses is a store. Cache coherence traffic is generated to ensure that the changes made to a block by a store operation are seen by all processors caching the block.
• Unnecessary traffic
• Increasing the cache block size increases the likelihood of such occurrences
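A minimal sketch of false sharing (not from the slides; names and sizes are illustrative): two threads update different counters that happen to share a cache block, so every store generates coherence traffic even though the threads never access the same word.

#include <pthread.h>
#include <stdio.h>

/* Two 8-byte counters packed next to each other almost certainly share
 * one cache block; uncommenting the padding gives each its own block. */
struct counter {
    volatile long count;
    /* char pad[56]; */
};
static struct counter counters[2];

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].count++;       /* different words, same cache block */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counters[0].count, counters[1].count);
    return 0;
}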

How do we take advantage of spatial locality without introducing some of the problems associated with large cache blocks?

Page 43:

Sequential prefetching

The one-block-lookahead (OBL) approach initiates a prefetch for block b+1 when block b is accessed.

How is it different from doubling the block size?

prefetched blocks are treated separately with regard to the cache replacement and coherency policies.

Page 44:

OBL: Case Study

Assume that a large block contains one word which is frequently referenced and several other words which are not in use, and that an LRU replacement policy is used. What is the implication?

the entire block will be retained even though only a portion of the block’s data is actually in use.

How do we solve this? Replace the large block with two smaller blocks; one of them could then be evicted to make room for more active data. The use of smaller cache blocks also reduces the probability of false sharing.

Page 45:

OBL implementations

Based on “what type of access to block b initiates the prefetch of b+1”

Prefetch on miss: initiates a prefetch for block b+1 whenever an access to block b results in a cache miss. If b+1 is already cached, no memory access is initiated.

Tagged prefetch: associates a tag bit with every memory block. The bit is used to detect when a block is demand-fetched or when a prefetched block is referenced for the first time; in either case, the next sequential block is fetched.

Which one is better in terms of reducing the miss rate: prefetch on miss or tagged prefetch?

Page 46:

Prefetch on miss vs. tagged prefetch: accessing three contiguous blocks with a strictly sequential access pattern (figure not reproduced in this transcript).
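A small sketch (not from the slides) that replays a strictly sequential block stream under both policies, with an idealized cache in which every prefetch completes before the next access; it reproduces the usual conclusion that prefetch on miss still misses on every other block while tagged prefetch misses only on the first.

#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS 16

int main(void) {
    int miss_pom = 0, miss_tag = 0;

    /* Prefetch on miss: fetch b+1 only when the access to b misses. */
    bool c1[NBLOCKS + 1] = { false };
    for (int b = 0; b < NBLOCKS; b++) {
        if (!c1[b]) { miss_pom++; c1[b] = true; c1[b + 1] = true; }
    }

    /* Tagged: also fetch b+1 on the first reference to a prefetched block. */
    bool c2[NBLOCKS + 1] = { false }, tagged[NBLOCKS + 1] = { false };
    for (int b = 0; b < NBLOCKS; b++) {
        if (!c2[b]) {                               /* demand miss            */
            miss_tag++;
            c2[b] = true;
            c2[b + 1] = true; tagged[b + 1] = true; /* prefetch next block    */
        } else if (tagged[b]) {                     /* first use of prefetched */
            tagged[b] = false;
            c2[b + 1] = true; tagged[b + 1] = true;
        }
    }

    printf("prefetch on miss: %d misses, tagged: %d misses (%d accesses)\n",
           miss_pom, miss_tag, NBLOCKS);            /* 8 vs 1 for this run    */
    return 0;
}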

Page 47:

Shortcoming of the OBL

prefetch may not be initiated far enough in advance of the actual use to avoid a processor memory stall.

A sequential access stream resulting from a tight loop, for example, may not allow sufficient time between the use of blocks b and b+1 to completely hide the memory latency.

Page 48:

How do you solve this shortcoming?

Increase the number of blocks prefetched after a demand fetch from one to d.

As each prefetched block, b, is accessed for the first time, the cache is interrogated to check if blocks b+1, ... b+d are present in the cache

What if d=1? What kind of prefetching is this?

Tagged

Page 49:

Another technique with d-prefetch

d prefetched blocks are brought into a FIFO stream buffer before being brought into the cache.

As each buffer entry is referenced, it is brought into the cache while the remaining blocks are moved up in the queue and a new block is prefetched into the tail position. If a miss occurs in the cache and the desired block is also not found at the head of the stream buffer, the buffer is flushed.

Advantage: prefetched data are not placed directly into the cache, which avoids cache pollution.

Disadvantage: it requires that prefetched blocks be accessed in a strictly sequential order to take advantage of the stream buffer.

Page 50:

Tradeoffs of d-prefetching?

Good: increasing the degree of prefetching reduces miss rates in sections of code that show a high degree of spatial locality.

Bad: additional traffic and cache pollution are generated by sequential prefetching during program phases that show little spatial locality.

What if we are able to vary d?

Page 51:

Adaptive sequential prefetching

d is matched to the degree of spatial locality exhibited by the program at a particular point in time; a prefetch efficiency metric is periodically calculated.

Prefetch efficiency: the ratio of useful prefetches to total prefetches, where a useful prefetch is one whose prefetched block results in a cache hit.

d is initialized to one, incremented whenever the efficiency exceeds a predetermined upper threshold, and decremented whenever the efficiency drops below a lower threshold. If d = 0, prefetching is turned off. A sketch of the adjustment logic follows.
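A hedged sketch of the adjustment step (not from the slides; the interval boundaries, threshold values, and maximum degree are assumptions):

#define D_MAX  8
#define UPPER  0.75         /* assumed upper efficiency threshold */
#define LOWER  0.40         /* assumed lower efficiency threshold */

static int  d = 1;          /* current degree of prefetching      */
static long useful, issued; /* prefetch counters for the interval */

/* Called at the end of each measurement interval. */
void adjust_degree(void) {
    double efficiency = issued ? (double)useful / (double)issued : 0.0;
    if (efficiency > UPPER && d < D_MAX)
        d++;                /* strong spatial locality: prefetch more        */
    else if (efficiency < LOWER && d > 0)
        d--;                /* mostly wasted prefetches: back off; d == 0
                               disables prefetching entirely                 */
    useful = issued = 0;
}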

Which one is better, adaptive or tagged prefetching? Consider the miss ratio vs. memory traffic and contention.

Page 52:

Sequential Prefetching: Summary

Does sequential prefetching require changes to existing executables? What about the hardware complexity? Which variant offers both simplicity and performance?

Tagged prefetching. Compared to software-initiated prefetching, what might be the problem?

Hardware sequential prefetching tends to generate more unnecessary prefetches. Non-sequential access patterns, such as scalar references or array accesses with large strides, will result in unnecessary prefetch requests because they do not exhibit the spatial locality upon which sequential prefetching is based. To enable prefetching of strided and other irregular data access patterns, several more elaborate hardware prefetching techniques have been proposed.

Page 53:

Prefetching with arbitrary strides

Reference Prediction Table

State: initial, transient, steady
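The figure with the RPT fields and state transitions is not reproduced here. A simplified sketch (an assumption, not the slides' exact state machine): each entry is indexed by the PC of a load or store and tracks its last address, last observed stride, and one of the three states named above; a prefetch of address + stride is issued while the entry is in the steady state.

typedef enum { INITIAL, TRANSIENT, STEADY } rpt_state;

typedef struct {
    unsigned long tag;        /* PC of the load/store             */
    unsigned long prev_addr;  /* last effective address it issued */
    long          stride;     /* last observed stride             */
    rpt_state     state;
} rpt_entry;

/* Update an entry on a new effective address; return 1 if a prefetch
 * of (addr + stride) should be issued. */
int rpt_update(rpt_entry *e, unsigned long addr) {
    int correct = (addr == e->prev_addr + e->stride);
    switch (e->state) {
    case INITIAL:   e->state = correct ? STEADY : TRANSIENT; break;
    case TRANSIENT: e->state = correct ? STEADY : INITIAL;   break;
    case STEADY:    if (!correct) e->state = INITIAL;        break;
    }
    if (!correct) e->stride = addr - e->prev_addr;   /* learn new stride */
    e->prev_addr = addr;
    return e->state == STEADY;
}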

Page 54:

RPT Entries State Transition

Page 55:

Matrix Multiplication

Assume starting addresses a = 10000, b = 20000, c = 30000, and a one-word cache block.
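The loop nest itself is not shown in this transcript; a sketch of the loop the example presumably refers to (the dimension N and the exact index roles are assumptions). In the inner loop, a[i][j] repeats (stride 0), b[i][k] walks along a row (stride of one word), and c[k][j] walks down a column (stride of N words), which is what the RPT entries track.

#define N 64    /* assumed matrix dimension */

void matmul(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)          /* inner loop */
                a[i][j] = a[i][j] + b[i][k] * c[k][j];
}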

After the first iteration of inner loop

Page 56:

Matrix Multiplication

After the second iteration of inner loop

Hits/misses?

Page 57:

Matrix Multiplication

After the third iteration

b and c are hits, provided that a prefetch distance of one is enough.

Page 58:

RPT Limitations

The prefetch distance is limited to one loop iteration. Loop entrance: miss. Loop exit: unnecessary prefetch.

How can we solve this? Use a longer distance:

    prefetch address = effective address + (stride × distance)

with a lookahead program counter (LA-PC).

Page 59:

Summary

Prefetches should be timely, useful, and introduce little overhead.

They should reduce secondary effects in the memory system. Strategies are diverse, and no single strategy provides optimal performance.

Page 60:

Summary

Prefetching schemes are diverse. To help categorize a particular approach it is useful to answer three basic questions concerning the prefetching mechanism:

1) When are prefetches initiated? 2) Where are prefetched data placed? 3) What is the unit of prefetch?

Page 61:

Software vs Hardware Prefetching

Prefetch instructions actually increase the amount of work done by the processor. Hardware-based prefetching techniques do not require the use of explicit fetch instructions.

The hardware monitors the processor in an attempt to infer prefetching opportunities. There is no instruction overhead, but hardware schemes generate more unnecessary prefetches than software-initiated schemes, since they must speculate on future memory accesses without the benefit of compile-time information.
• Cache pollution
• Consumed memory bandwidth

Page 62:

Conclusions

Prefetches can be initiated either by an explicit fetch operation within a program (software-initiated) or by logic that monitors the processor's referencing pattern (hardware-initiated).

Prefetches must be timely. If issued too early, there is a chance that the prefetched data will displace other useful data or be displaced itself before use. If issued too late, the data may not arrive before the actual memory reference and stalls are introduced.

Prefetches must be precise. The software approach issues prefetches only for data that is likely to be used; hardware schemes tend to fetch more data unnecessarily.

Page 63:

Conclusions

The decision of where to place prefetched data in the memory hierarchy: the data must be placed in a higher level of the memory hierarchy to provide a performance benefit.

The majority of schemes place prefetched data in some type of cache memory.

Prefetching data into processor registers is binding, and additional constraints must be imposed on the use of the data.

Finally, multiprocessor systems can introduce additional levels into the memory hierarchy which must be taken into consideration.

Page 64:

Conclusions

Data can be prefetched in units of single words, cache blocks or larger blocks of memory.

determined by the organization of the underlying cache and memory system.

Uniprocessors and SMPs: cache blocks are appropriate.

Distributed-memory multiprocessors: larger memory blocks, to amortize the cost of initiating a data transfer across an interconnection network.