Post on 29-Dec-2015
Lecture 4. Memory Consistency Models
Prof. Taeweon SuhComputer Science Education
Korea University
COM503 Parallel Computer Architecture & Programming
Korea Univ
Memory Consistency
• What do you expect from the following code?

Processor 1              Processor 2
A = 1                    while (flag == 0);
flag = 1                 print A

• The program orders of P1's and P2's accesses to different locations are neither implied nor enforced by coherence
  Coherence only requires that the new value of A eventually become visible to processor P2, not necessarily before the new value of flag is observed
• Note that x86 CPUs are superscalar with OOO (Out-Of-Order) execution
• What would you do if you want "print A" to print "1"?
Demo

#include <stdio.h>
#include <omp.h>

int main()
{
    int a, b;
    int a_tmp, b_tmp;

    a = 0; b = 0;
    #pragma omp parallel num_threads(2) shared(a, b)
    {
        //printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());
        #pragma omp single nowait
        {
            a = 1;
            b = 2;
            //while(1);
        }
        #pragma omp single nowait
        {
            a_tmp = a;
            b_tmp = b;
            printf("A = %d, B = %d\n", a_tmp, b_tmp);
            //while(1);
        }
    }
    return 0;
}
Memory Consistency
• Use a barrier

Processor 1              Processor 2
A = 1                    Barrier (b1)
Barrier (b1)             print A

• A barrier is often built using reads and writes to ordinary shared variables (e.g., b1 above) rather than a special barrier operation
• Coherence does not say anything at all about the order among these accesses
• It would be interesting to see how OpenMP (or Pthreads) implements barriers at a low level
• But CPUs typically provide barrier instructions (such as sfence, lfence, and mfence in x86)
Memory Consistency
• So, clearly we need something more than coherence to give a shared address space a clear semantics
  That is, an ordering model that programmers can use to reason about the possible results, and hence the correctness, of their programs
• A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e., to become visible to the processors) with respect to one another
  It includes operations to the same location or to different locations, by the same process or by different processes, so in this sense memory consistency subsumes coherence

Processor 1              Processor 2
A = 1                    Barrier (b1)
Barrier (b1)             print A
Programmer’s Abstraction of Memory Subsystem
(Figure: processors P1, P2, …, Pn issue memory references through a "switch" to a single shared memory)

Processors issue memory references in program order
The "switch" is randomly set after each memory reference
Interleaving the partial (program) orders of the different processes may yield a large number of possible total orders
Sequential Consistency
• Sequential consistency (SC)
  Formalized by Lamport in 1979
  "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program"
• Implementing SC requires that the system (s/w and h/w) follow 2 constraints
  Program order requirement: memory operations of a process must appear to become visible (to itself and others) in program order
  Write atomicity: all writes (to any location) should appear to all processors to have occurred in the same order
Sequential Consistency
/* Assume the initial values of A and B are 0 */

Processor 1              Processor 2
(1a) A = 1               (2a) print B
(1b) B = 2               (2b) print A

• What values of A and B do you expect to be printed on P2?
  (A, B) = (1, 2)? (A, B) = (1, 0)? (A, B) = (0, 0)? (A, B) = (0, 2)?
• Under SC, the result (0, 2) for (A, B) would not be allowed, since it would then appear that the writes of A and B by P1 executed out of program order
• The execution order 1b, 2a, 2b, 1a is not sequentially consistent
How to Impose Constraint?
• In practice, to constrain compiler optimizations, multithreaded and parallel programs annotate the variables or memory references that are used to preserve orders
• A particularly stringent example is the use of the volatile qualifier in a variable declaration
  It prevents the variable from being register-allocated, and it prevents any memory operation on the variable from being reordered with respect to operations before or after it in program order
Reordering Impact Example
• How would reordering the memory operations affect semantics in a parallel program running on a multiprocessor, and in a threaded program in which the two processes are interleaved on the same processor?

Processor 1              Processor 2
A = 1                    while (flag == 0);
flag = 1                 print A

• The compiler may reorder the writes to A and flag with no impact on a sequential program
• But it violates our intuition for both parallel programs and multithreaded uniprocessor programs
• For many compilers, these reorderings can be avoided by declaring the variable flag to be of type volatile int (instead of int)
Problems with SC
• The SC model provides intuitive semantics to the programmer
  The program order and a consistent interleaving across processes can be quite easily implemented
• However, its drawback is that it restricts many of the performance optimizations that modern uniprocessor compilers and microprocessors employ
  With the high cost of memory access latency, computer systems achieve higher performance by reordering or overlapping multiple memory or communication operations from a processor
  Preserving the sufficient conditions for SC does not allow for much reordering or overlap in hardware
  With SC, the compiler cannot reorder memory accesses even if they are to different locations, disallowing critical performance optimizations such as code motion, common-subexpression elimination, software pipelining, and even register allocation
Reality Check
• Unfortunately, many of the optimizations commonly employed in both compilers and processors violate the SC property
• Explicitly parallel programs use uniprocessor compilers, which are concerned only with preserving dependences to the same location
  So compilers routinely reorder accesses to different locations within a process, and a processor may in fact issue accesses out of the program order seen by the programmer
  Advanced compiler optimizations can change the order in which different memory locations are accessed or can even eliminate memory operations
  • Common-subexpression elimination, constant propagation, register allocation, and loop transformations such as loop splitting, loop reversal, and blocking
Example: Register Allocation
• How can register allocation lead to a violation of SC even if the hardware satisfies SC?

Original code:
P1                       P2
B = 0                    A = 0
A = 1                    B = 1
u = B                    v = A

After register allocation:
P1                       P2
r1 = 0                   r2 = 0
A = 1                    B = 1
u = r1                   v = r2
B = r1                   A = r2

• The result (u, v) = (0, 0) is disallowed under SC, yet the register-allocated version always produces it
• A uniprocessor compiler might easily perform these optimizations in each process
• They are valid for sequential programs since the reordered accesses are to different locations
Problems with SC
• Providing SC at the programmer's interface implies supporting SC at lower-level interfaces
  If the sufficient conditions for SC are met, a processor waits for an access to complete before issuing the next one
  • So, most of the latency suffered by memory references is directly seen by processors as stall time
  • Although a processor may continue executing non-memory instructions while a single outstanding memory reference is being serviced, the expected benefit from such overlap is tiny, since even without ILP (Instruction-Level Parallelism) every third instruction on average is a memory reference
• So, we need to do something about this performance problem

Programmer's interface: we focus mainly on the consistency model as seen by the programmer, that is, at the interface between the programmer and the rest of the system composed of the compiler, operating system, and hardware. For example, a processor may preserve all program orders presented to it among memory operations, but if the compiler has already reordered operations, then programmers can no longer reason with the simple model exported by the hardware
Solutions?
• One approach is to preserve SC at the programmer's interface but find ways to hide the long stalls from the processor
  1st technique
  • The compiler does not reorder memory operations, but latency tolerance techniques such as data prefetching or multithreading are used to overlap data transfers with one another or with computation
  • But the actual read and write operations are not issued before previous ones complete in program order
Solutions?
2nd technique
• The compiler reorders operations as long as it can guarantee that SC will not be violated in the results
  Compiler algorithms have been developed for this (Shasha and Snir 1988; Krishnamurthy and Yelick 1994, 1995)
• At the hardware level
  Memory operations are issued and executed out of program order, but are guaranteed to become visible to other processors in program order
  • This approach is well suited to dynamically scheduled processors that use an instruction lookahead buffer to find independent instructions to issue
  • Instructions are inserted into the lookahead buffer in program order
  • They are guaranteed to retire from the lookahead buffer in program order
  Speculative execution, such as
  • Branch prediction
  • Speculative reads: values returned by reads are used even before they are known to be correct; later, roll back if they are incorrect
• Or, change the memory consistency model itself!
Relaxed Consistency Models
• A completely different way to overcome the performance limitations imposed by SC is to change the memory consistency model itself
  That is, not to guarantee such strong ordering constraints to the programmer, but still retain semantics that are intuitive enough to be useful
  The intuition behind the relaxed models is that SC is usually too conservative
  Many of the orders it preserves are not really needed to satisfy a programmer's intuition in most situations
• By relaxing the ordering constraints, these relaxed consistency models allow the compiler to reorder accesses before presenting them to the hardware, at least to some extent
• At the hardware level, they allow multiple memory accesses from the same process not only to be outstanding at a time, but even to complete or become visible out of order, thus allowing much of the latency to be overlapped and hidden from the processor
Example
• Writes to variables A and B by P1 can be reordered without affecting the results
  All we must ensure is that both of them complete before the variable flag is set to 1
• Reads of variables A and B can be reordered at P2 once flag has been observed to change to the value 1
• Even with these reorderings, the results look just like those of an SC execution

Ordering under SC:
P1                       P2
A = 1                    while (flag == 0);
B = 1                    u = A
flag = 1                 v = B

Ordering necessary for correct program semantics: only the orders from A = 1 and B = 1 to flag = 1, and from the exit of the while loop to the reads u = A and v = B, must be preserved
Reality Check
• It would be wonderful if system software or hardware could automatically detect which program orders are critical to maintaining SC semantics and allow the others to be violated for higher performance (Shasha and Snir 1988)
• However, the problem is intractable (in fact, undecidable) for general programs, and inexact solutions are often too conservative to be very useful
Relaxed Consistency Model
• A relaxed consistency model requires specifying 2 things
  Which program orders among memory operations are guaranteed to be preserved by the system, including whether write atomicity will be maintained
  If not all program orders are guaranteed to be preserved by default, what mechanisms the system provides for a programmer to enforce order explicitly when desired
• As should be clear by now, the compiler and the hardware have their own system specifications, but we focus on the specification that the two together (the system as a whole) present to the programmer
• For a processor architecture, the specification it exports governs the reorderings that it allows, and it also provides the order-preserving primitives
  It is often called the processor's memory model
Relaxed Consistency Model
• A programmer may use the consistency model to reason about correctness and insert the appropriate order-preserving mechanisms
  However, this is a very low-level interface for a programmer
  Parallel programming is challenging enough without having to think about reorderings and write atomicity
• What the programmer wants is a methodology for writing "safe" programs
  So, this is a contract: if the program follows certain high-level rules or provides enough program annotations (such as synchronization), then any system on which the program runs will always guarantee a sequentially consistent execution, regardless of the default orderings permitted by the system specifications
Relaxed Consistency Model
• The programmer’s responsibility is to use the rules and annotations, which hopefully does not involve reasoning at the level of potential orderings
• The system’s responsibility is to use the rules and annotations as constraints to maintain the illusion of sequential consistency
Ordering Specifications
• TSO (Total Store Ordering): Sindhu, Frailong, and Cekleov 1991; Sun Microsystems
• PC (Processor Consistency): Goodman 1989 and Gharachorloo 1990; Intel Pentium
• PSO (Partial Store Ordering): Sindhu, Frailong, and Cekleov 1991; Sun Microsystems
• WO (Weak Ordering): Dubois, Scheurich, and Briggs 1986
• RC (Release Consistency): Gharachorloo 1990
• RMO (Relaxed Memory Ordering): Weaver and Germond 1994; Sun Sparc V8 and V9
• Digital Alpha (Sites 1992) and IBM/Motorola PowerPC (May et al. 1994) models
1. Relaxing the Write-to-Read Program Order
• The main motivation is to allow the hardware to hide the latency of write operations
  While a write miss is still in the write buffer and not yet visible to other processors, the processor can issue and complete reads that hit in its cache
• The models in this class (TSO and PC) preserve the programmer's intuition quite well, for the most part, even without any special operations
  TSO and PC allow a read to bypass an earlier incomplete write in program order
  TSO and PC preserve the ordering of writes in program order
  • But PC does not guarantee write atomicity
Write Atomicity
• Write atomicity ensures that nothing a processor does after it has seen the new value produced by a write (e.g., another write that it issues) becomes visible to other processes before they too have seen the new value of that write
  All writes (to any location) should appear to all processors to have occurred in the same order
• Write serialization says that writes to the same location should appear to all processors to have occurred in the same order
Write Atomicity Example
• This example illustrates the importance of write atomicity for sequential consistency
Processor 1              Processor 2              Processor 3
A = 1;                   while (A == 0);          while (B == 0);
                         B = 1;                   print A;

What happens if P2 writes B before it is guaranteed that P3 has seen the new value of A?
Example Code Sequences
• Is SC guaranteed in TSO and PC?

(a)
P1                       P2
A = 1;                   while (flag == 0);
flag = 1;                print A;

(b)
P1                       P2
A = 1;                   print B;
B = 1;                   print A;

(c)
P1                       P2                       P3
A = 1;                   while (A == 0);          while (B == 0);
                         B = 1;                   print A;

(d)
P1                       P2
A = 1;                   B = 1;
print B;                 print A;

A popular software-only mutual exclusion algorithm called Dekker's algorithm (which is used in the absence of hardware support for atomic read-modify-write operations) relies on the property that A and B will not both be read as 0 in (d)
How to Ensure SC Semantics?
• To ensure SC semantics when desired (e.g., to port a program written under SC assumptions to a TSO or PC system), we need mechanisms to enforce 2 types of extra orderings
  A read does not complete before an earlier write in program order (applies to both TSO and PC)
  • Sun's Sparc V9 provides memory barrier (MEMBAR) or fence instructions of different flavors that can ensure any desired ordering
    A MEMBAR prevents any read that follows it in program order from issuing before all writes that precede it have completed
  • On architectures that do not provide memory barrier instructions, it is possible to achieve this effect by substituting an atomic read-modify-write operation or sequence for the original read
    A read-modify-write is treated as being both a read and a write, so it cannot be reordered with respect to previous writes in these models
  Write atomicity for a read operation (applies to PC)
  • Replacing a read with a read-modify-write also guarantees write atomicity at that read on machines supporting the PC model
    Refer to Adve et al., 1993, referenced in the textbook
2. Relaxing the W-R and W-W Program Orders
• This class allows writes and reads to bypass earlier writes (to different locations)
  It enables multiple write misses to be fully overlapped and to become visible out of program order
  Sun Sparc's Partial Store Ordering (PSO) model belongs to this category
• The only additional instruction we need over TSO is one that enforces w-w ordering in a process's program order
  In Sun's Sparc V9, this can be achieved by using a MEMBAR instruction
  Sun's Sparc V8 provides a special instruction called store barrier (STBAR) to achieve this
3. Relaxing All Program Orders
• No program orders are guaranteed by default
  These models are particularly well matched to superscalar processors whose implementations allow proceeding past read misses to other memory locations
  Prominent models in this category
  • Weak ordering (WO): WO is the seminal model
  • Release consistency (RC)
  • Sparc V9 relaxed memory ordering (RMO)
  • Digital Alpha model
  • IBM PowerPC model
Weak Ordering (WO)
• The motivation for WO is quite simple
  Most parallel programs use synchronization operations to coordinate accesses to data when necessary
  Between synchronization operations, they do not rely on the order of accesses being preserved

P1, P2, …, Pn:

...
Lock(TaskQ)
newTask->next = Head;
if (Head != NULL) Head->prev = newTask;
Head = newTask;
UnLock(TaskQ)
...
Illustration of WO
(Figure: Block 1 of reads/writes, then Sync (Acquire), then Block 2 of reads/writes, then Sync (Release), then Block 3 of reads/writes)

Read, write, and read-modify-write operations in blocks 1, 2, and 3 can be arbitrarily reordered within their own block
Weak Ordering (WO)
• The intuitive semantics are not violated by any program reorderings as long as synchronization operations are not reordered with respect to data accesses
• Sufficient conditions to ensure a WO system
  Before a synchronization operation is issued, the processor waits for all previous operations in program order to have completed
  Similarly, memory accesses that follow the synchronization operation are not issued until the synchronization operation completes
• When synchronization operations are infrequent, as in many parallel programs, WO typically provides considerable reordering freedom to the hardware and compiler
Release Consistency (RC)
• Improvement over WO
• An acquire can be reordered with respect to memory accesses in block 1
  The purpose of an acquire is to delay the memory accesses in block 2 until the acquire completes
  There is no reason to wait for block 1 to complete before the acquire can be issued
• A release can be reordered with respect to memory accesses in block 3
  The purpose of a release is to grant access to the new data that were modified before the release in program order
  There is no reason to delay block 3 until the release has completed

(Figure: the same three blocks of reads/writes separated by Sync (Acquire) and Sync (Release) as in the WO illustration, labeled 1, 2, and 3)
Memory Barriers of Commercial Processors
• Processors provide specific instructions, called memory barriers or fences, that can be used to enforce orderings
  Synchronization operations (or acquires and releases) cause the compiler to insert the appropriate special instructions, or the programmer can insert these instructions directly
• Alpha supports 2 kinds of fence instructions: the memory barrier (MB) and the write memory barrier (WMB)
  The MB fence is like a synchronization operation in WO
  • It waits for all previously issued memory accesses to complete before issuing any new accesses
  The WMB fence imposes program order only between writes
  • Thus, a read issued after a WMB can still bypass a write access issued before the WMB
Memory Barriers of Commercial Processors
• The Sparc V9 RMO provides a fence or MEMBAR instruction with 4 flavor bits associated with it
  Each bit indicates a particular type of ordering to be enforced between previous and following load-store operations
  • The 4 possibilities are R-R, R-W, W-R, and W-W
  • Any combination of these bits can be set, offering a variety of ordering choices
• The IBM PowerPC model provides only a single fence instruction, called SYNC, that is equivalent to Alpha's MB fence
Characteristics of Various Systems

(Tables comparing the ordering characteristics of various commercial systems)
Programmer’s Interface
• A program running "correctly" on a system with TSO (with enough memory barriers) will not necessarily work "correctly" on a system with WO
• Programmer
  Programmers ensure that all synchronization operations are explicitly labeled or identified
  • For example, LOCK, UNLOCK, and BARRIER
• System (compiler and hardware)
  The compiler or run-time library translates these synchronization operations into the appropriate order-preserving operations (memory barriers or fences)
  Then the system (compiler plus hardware) guarantees sequentially consistent executions even though it may reorder operations between synchronization operations
Backup Slides
A Typical Memory Hierarchy
• Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
(Figure: a typical memory hierarchy — on-chip components: CPU core with register file, ITLB, DTLB, L1I (instruction cache), L1D (data cache), and an L2 (second-level) cache; off chip: main memory (DRAM) and secondary storage (disk), from higher (faster, smaller) levels to lower (slower, larger) levels)

Note that the cache coherence hardware updates or invalidates only the memory and the caches (not the registers of the CPU)
The Memory Hierarchy: Why Does It Work?
• Temporal Locality (locality in time)
  If a memory location is referenced, then it will tend to be referenced again soon
  Keep most recently accessed data items closer to the processor
• Spatial Locality (locality in space)
  If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
  Move blocks consisting of contiguous words closer to the processor
Example of Locality

int A[100], B[100], C[100], D;
int i;
for (i = 0; i < 100; i++) {
    C[i] = A[i] * B[i] + D;
}

(Figure: memory layout of arrays A, B, and C and variable D, with consecutive elements grouped into cache lines (blocks))
Slide from Prof. Sean Lee at Georgia Tech
Volatile
• When would you use a variable declaration with volatile, for example, in C?
True Sharing & False Sharing

Impact of Cache Line Size

Synchronization