Post on 29-Dec-2015
Lecture 4. Memory Consistency Models
Prof. Taeweon SuhComputer Science Education
Korea University
COM503 Parallel Computer Architecture & Programming
Korea Univ
Memory Consistency
• What do you expect from the following code?

Processor 1              Processor 2
A = 1                    while (flag == 0);
flag = 1                 print A

• The program orders of P1's and P2's accesses to different locations are neither implied nor enforced by coherence
  Coherence only requires that the new value of A eventually become visible to processor P2, not necessarily before the new value of flag is observed
• Note that x86 CPUs are superscalar with OOO (Out-Of-Order) execution
• What would you do if you want "print A" to print "1"?
Demo

#include <stdio.h>
#include <omp.h>

int main()
{
    int a, b;
    int a_tmp, b_tmp;

    a = 0; b = 0;
    #pragma omp parallel num_threads(2) shared(a, b)
    {
        //printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());
        #pragma omp single nowait
        {
            a = 1;
            b = 2;
            //while(1);
        }
        #pragma omp single nowait
        {
            a_tmp = a;
            b_tmp = b;
            printf("A = %d, B = %d\n", a_tmp, b_tmp);
            //while(1);
        }
    }
    return 0;
}
Memory Consistency
• Use a barrier

Processor 1              Processor 2
A = 1                    Barrier (b1)
Barrier (b1)             print A

• A barrier is often built using reads and writes to ordinary shared variables (e.g., b1 above) rather than a special barrier operation
• Coherence does not say anything at all about the order among these accesses
• It would be interesting to see how OpenMP (or Pthreads) implements barriers at a low level
• But CPUs typically provide barrier instructions (such as sfence, lfence, and mfence in x86)
Memory Consistency
• So, clearly we need something more than coherence to give a shared address space a clear semantics
  That is, an ordering model that programmers can use to reason about the possible results, and hence the correctness, of their programs
• A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e., to become visible to the processors) with respect to one another
  It includes operations to the same location or to different locations, by the same process or by different processes, so in this sense memory consistency subsumes coherence

Processor 1              Processor 2
A = 1                    Barrier (b1)
Barrier (b1)             print A
Programmer’s Abstraction of Memory Subsystem
(Figure: processors P1, P2, …, Pn issue memory references through a "switch" to a single shared memory)

Processors issue memory references in program order
The "switch" is randomly set after each memory reference
Interleaving the partial (program) orders of the different processes may yield a large number of possible total orders
Sequential Consistency
• Sequential consistency (SC)
  Formalized by Lamport in 1979
  "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program"
• Implementing SC requires that the system (s/w and h/w) follow 2 constraints
  Program order requirement: memory operations of a process must appear to become visible (to itself and others) in program order
  Write atomicity: all writes (to any location) should appear to all processors to have occurred in the same order
Sequential Consistency
/* Assume the initial values of A and B are 0 */

Processor 1              Processor 2
(1a) A = 1               (2a) print B
(1b) B = 2               (2b) print A

• What values of A and B do you expect to be printed on P2?
  (A, B) = (1, 2)? (A, B) = (1, 0)? (A, B) = (0, 0)? (A, B) = (0, 2)?
• Under SC, the result (0, 2) for (A, B) would not be allowed, since it would then appear that the writes of A and B by P1 executed out of program order
• The execution order 1b, 2a, 2b, 1a is not sequentially consistent
How to Impose Constraint?
• In practice, to constrain compiler optimizations, multithreaded and parallel programs annotate the variables or memory references that are used to preserve orders
• A particularly stringent example is the use of the volatile qualifier in a variable declaration
  It prevents the variable from being register-allocated, and it prevents any memory operation on the variable from being reordered with respect to operations before or after it in program order
Reordering Impact Example
• How would reordering the memory operations affect semantics in a parallel program running on a multiprocessor, and in a threaded program in which the two processes are interleaved on the same processor?

Processor 1              Processor 2
A = 1                    while (flag == 0);
flag = 1                 print A

• The compiler may reorder the writes to A and flag with no impact on a sequential program
• But it violates our intuition for both parallel programs and multithreaded uniprocessor programs
• For many compilers, these reorderings can be avoided by declaring the variable flag to be of type volatile int (instead of int)
Problems with SC
• The SC model provides intuitive semantics to the programmer
  The program order and a consistent interleaving across processes can be quite easily implemented
• However, its drawback is that it restricts many of the performance optimizations that modern uniprocessor compilers and microprocessors employ
  With the high cost of memory access latency, computer systems achieve higher performance by reordering or overlapping multiple memory or communication operations from a processor
  Preserving the sufficient conditions for SC does not allow for much reordering or overlap in hardware
  With SC, the compiler cannot reorder memory accesses even if they are to different locations, disallowing critical performance optimizations such as code motion, common-subexpression elimination, software pipelining, and even register allocation
Reality Check
• Unfortunately, many of the optimizations commonly employed in both compilers and processors violate the SC property
• Explicitly parallel programs use uniprocessor compilers, which are concerned only with preserving dependences to the same location
  So compilers routinely reorder accesses to different locations within a process, and a processor may in fact issue accesses out of the program order seen by the programmer
  Advanced compiler optimizations can change the order in which different memory locations are accessed or can even eliminate memory operations
  • Common-subexpression elimination, constant propagation, register allocation, and loop transformations such as loop splitting, loop reversal, and blocking
Example: Register Allocation
• How can register allocation lead to a violation of SC even if the hardware satisfies SC?

Original code:
P1                       P2
B = 0                    A = 0
A = 1                    B = 1
u = B                    v = A

After register allocation:
P1                       P2
r1 = 0                   r2 = 0
A = 1                    B = 1
u = r1                   v = r2
B = r1                   A = r2

• The result (u, v) = (0, 0) is disallowed under SC, yet the register-allocated version always produces it
• A uniprocessor compiler might easily perform these optimizations in each process
• They are valid for sequential programs since the reordered accesses are to different locations
Problems with SC
• Providing SC at the programmer's interface implies supporting SC at lower-level interfaces
  If the sufficient conditions for SC are met, a processor waits for an access to complete before issuing the next one
  • So, most of the latency suffered by memory references is directly seen by processors as stall time
  • Although a processor may continue executing non-memory instructions while a single outstanding memory reference is being serviced, the expected benefit from such overlap is tiny, since even without ILP (Instruction-Level Parallelism) every third instruction on average is a memory reference
• So, we need to do something about this performance problem

Programmer's interface: we focus mainly on the consistency model as seen by the programmer, that is, at the interface between the programmer and the rest of the system composed of the compiler, operating system, and hardware. For example, a processor may preserve all program orders presented to it among memory operations, but if the compiler has already reordered operations, then programmers can no longer reason with the simple model exported by the hardware
Solutions?
• One approach is to preserve SC at the programmer's interface but find ways to hide the long stalls from the processor
  1st technique
  • The compiler does not reorder memory operations, but latency tolerance techniques such as data prefetching or multithreading are used to overlap data transfers with one another or with computation
  • But the actual read and write operations are not issued before previous ones complete in program order
Solutions?
2nd technique
• The compiler reorders operations as long as it can guarantee that SC will not be violated in the results
  Compiler algorithms have been developed for this (Shasha and Snir 1988; Krishnamurthy and Yelick 1994, 1995)
• At the hardware level
  Memory operations are issued and executed out of program order, but are guaranteed to become visible to other processors in program order
  • This approach is well suited to dynamically scheduled processors that use an instruction lookahead buffer to find independent instructions to issue
  • Instructions are inserted into the lookahead buffer in program order
  • They are guaranteed to retire from the lookahead buffer in program order
  Speculative execution, such as
  • Branch prediction
  • Speculative reads: values returned by reads are used even before they are known to be correct; later, roll back if they are incorrect
• Or, change the memory consistency model itself!
Relaxed Consistency Models
• A completely different way to overcome the performance limitations imposed by SC is to change the memory consistency model itself
  That is, not to guarantee such strong ordering constraints to the programmer, but still retain semantics that are intuitive enough to be useful
  The intuition behind the relaxed models is that SC is usually too conservative
  Many of the orders it preserves are not really needed to satisfy a programmer's intuition in most situations
• By relaxing the ordering constraints, these relaxed consistency models allow the compiler to reorder accesses before presenting them to the hardware, at least to some extent
• At the hardware level, they allow multiple memory accesses from the same process not only to be outstanding at a time, but even to complete or become visible out of order, thus allowing much of the latency to be overlapped and hidden from the processor
Example
• Writes to variables A and B by P1 can be reordered without affecting the results
  All we must ensure is that both of them complete before the variable flag is set to 1
• Reads of variables A and B can be reordered at P2 once flag has been observed to change to the value 1
• Even with these reorderings, the results look just like those of an SC execution

Ordering under SC:
P1                       P2
A = 1                    while (flag == 0);
B = 1                    u = A
flag = 1                 v = B

Ordering necessary for correct program semantics: only the orders from A = 1 and B = 1 to flag = 1, and from the exit of the while loop to the reads u = A and v = B, must be preserved
Reality Check
• It would be wonderful if system software or hardware could automatically detect which program orders are critical to maintaining SC semantics and allow the others to be violated for higher performance (Shasha and Snir 1988)
• However, the problem is intractable (in fact, undecidable) for general programs, and inexact solutions are often too conservative to be very useful
Relaxed Consistency Model
• A relaxed consistency model requires specifying 2 things
  Which program orders among memory operations are guaranteed to be preserved by the system, including whether write atomicity will be maintained
  If not all program orders are guaranteed to be preserved by default, what mechanisms the system provides for a programmer to enforce order explicitly when desired
• As should be clear by now, the compiler and the hardware have their own system specifications, but we focus on the specification that the two together (the system as a whole) present to the programmer
• For a processor architecture, the specification it exports governs the reorderings that it allows, and it also provides the order-preserving primitives
  It is often called the processor's memory model
Relaxed Consistency Model
• A programmer may use the consistency model to reason about correctness and insert the appropriate order-preserving mechanisms
  However, this is a very low-level interface for a programmer
  Parallel programming is challenging enough without having to think about reorderings and write atomicity
• What the programmer wants is a methodology for writing "safe" programs
  So, this is a contract: if the program follows certain high-level rules or provides enough program annotations (such as synchronization), then any system on which the program runs will always guarantee a sequentially consistent execution, regardless of the default orderings permitted by the system specifications
Relaxed Consistency Model
• The programmer’s responsibility is to use the rules and annotations, which hopefully does not involve reasoning at the level of potential orderings
• The system’s responsibility is to use the rules and annotations as constraints to maintain the illusion of sequential consistency
Ordering Specifications
• TSO (Total Store Ordering): Sindhu, Frailong, and Cekleov 1991; Sun Microsystems
• PC (Processor Consistency): Goodman 1989 and Gharachorloo 1990; Intel Pentium
• PSO (Partial Store Ordering): Sindhu, Frailong, and Cekleov 1991; Sun Microsystems
• WO (Weak Ordering): Dubois, Scheurich, and Briggs 1986
• RC (Release Consistency): Gharachorloo 1990
• RMO (Relaxed Memory Ordering): Weaver and Germond 1994; Sun Sparc V8 and V9
• Digital Alpha (Sites 1992) and IBM/Motorola PowerPC (May et al. 1994) models
1. Relaxing the Write-to-Read Program Order
• The main motivation is to allow the hardware to hide the latency of write operations
  While a write miss is still in the write buffer and not yet visible to other processors, the processor can issue and complete reads that hit in its cache
• The models in this class (TSO and PC) preserve the programmer's intuition quite well, for the most part, even without any special operations
  TSO and PC allow a read to bypass an earlier incomplete write in program order
  TSO and PC preserve the ordering of writes in program order
  • But PC does not guarantee write atomicity
Write Atomicity
• Write atomicity ensures that nothing a processor does after it has seen the new value produced by a write (e.g., another write that it issues) becomes visible to other processes before they too have seen the new value of that write
  All writes (to any location) should appear to all processors to have occurred in the same order
• Write serialization says that writes to the same location should appear to all processors to have occurred in the same order
Write Atomicity Example
• This example illustrates the importance of write atomicity for sequential consistency
Processor 1              Processor 2              Processor 3
A = 1;                   while (A == 0);          while (B == 0);
                         B = 1;                   print A;

What happens if P2 writes B before it is guaranteed that P3 has seen the new value of A?
Example Code Sequences
• Is SC guaranteed in TSO and PC?

(a)
P1                       P2
A = 1;                   while (flag == 0);
flag = 1;                print A;

(b)
P1                       P2
A = 1;                   print B;
B = 1;                   print A;

(c)
P1                       P2                       P3
A = 1;                   while (A == 0);          while (B == 0);
                         B = 1;                   print A;

(d)
P1                       P2
A = 1;                   B = 1;
print B;                 print A;

A popular software-only mutual exclusion algorithm called Dekker's algorithm (which is used in the absence of hardware support for atomic read-modify-write operations) relies on the property that A and B will not both be read as 0 in (d)
How to Ensure SC Semantics?
• To ensure SC semantics when desired (e.g., to port a program written under SC assumptions to a TSO or PC system), we need mechanisms to enforce 2 types of extra orderings
  A read does not complete before an earlier write in program order (applies to both TSO and PC)
  • Sun's Sparc V9 provides memory barrier (MEMBAR) or fence instructions of different flavors that can ensure any desired ordering
    A MEMBAR prevents any read that follows it in program order from issuing before all writes that precede it have completed
  • On architectures that do not provide memory barrier instructions, it is possible to achieve this effect by substituting an atomic read-modify-write operation or sequence for the original read
    A read-modify-write is treated as being both a read and a write, so it cannot be reordered with respect to previous writes in these models
  Write atomicity for a read operation (applies to PC)
  • Replacing a read with a read-modify-write also guarantees write atomicity at that read on machines supporting the PC model
    Refer to Adve et al., 1993, referenced in the textbook
2. Relaxing the W-R and W-W Program Orders
• This class allows writes and reads to bypass earlier writes (to different locations)
  It enables multiple write misses to be fully overlapped and to become visible out of program order
  Sun Sparc's Partial Store Ordering (PSO) model belongs to this category
• The only additional instruction we need over TSO is one that enforces w-w ordering in a process's program order
  In Sun's Sparc V9, this can be achieved by using a MEMBAR instruction
  Sun's Sparc V8 provides a special instruction called store barrier (STBAR) to achieve this
3. Relaxing All Program Orders
• No program orders are guaranteed by default
  These models are particularly well matched to superscalar processors whose implementations allow proceeding past read misses to other memory locations
  Prominent models in this category
  • Weak ordering (WO): WO is the seminal model
  • Release consistency (RC)
  • Sparc V9 relaxed memory ordering (RMO)
  • Digital Alpha model
  • IBM PowerPC model
Weak Ordering (WO)
• The motivation for WO is quite simple
  Most parallel programs use synchronization operations to coordinate accesses to data when necessary
  Between synchronization operations, they do not rely on the order of accesses being preserved

P1, P2, …, Pn:

...
Lock(TaskQ)
newTask->next = Head;
if (Head != NULL) Head->prev = newTask;
Head = newTask;
UnLock(TaskQ)
...
Illustration of WO
(Figure: Block 1 of reads/writes, then Sync (Acquire), then Block 2 of reads/writes, then Sync (Release), then Block 3 of reads/writes)

Read, write, and read-modify-write operations in blocks 1, 2, and 3 can be arbitrarily reordered within their own block
Weak Ordering (WO)
• The intuitive semantics are not violated by any program reorderings as long as synchronization operations are not reordered with respect to data accesses
• Sufficient conditions to ensure a WO system
  Before a synchronization operation is issued, the processor waits for all previous operations in program order to have completed
  Similarly, memory accesses that follow the synchronization operation are not issued until the synchronization operation completes
• When synchronization operations are infrequent, as in many parallel programs, WO typically provides considerable reordering freedom to the hardware and compiler
Release Consistency (RC)
• Improvement over WO
• An acquire can be reordered with respect to memory accesses in block 1
  The purpose of an acquire is to delay the memory accesses in block 2 until the acquire completes
  There is no reason to wait for block 1 to complete before the acquire can be issued
• A release can be reordered with respect to memory accesses in block 3
  The purpose of a release is to grant access to the new data that were modified before the release in program order
  There is no reason to delay block 3 until the release has completed

(Figure: the same three blocks of reads/writes separated by Sync (Acquire) and Sync (Release) as in the WO illustration, labeled 1, 2, and 3)
Memory Barriers of Commercial Processors
• Processors provide specific instructions, called memory barriers or fences, that can be used to enforce orderings
  Synchronization operations (or acquires and releases) cause the compiler to insert the appropriate special instructions, or the programmer can insert these instructions directly
• Alpha supports 2 kinds of fence instructions: the memory barrier (MB) and the write memory barrier (WMB)
  The MB fence is like a synchronization operation in WO
  • It waits for all previously issued memory accesses to complete before issuing any new accesses
  The WMB fence imposes program order only between writes
  • Thus, a read issued after a WMB can still bypass a write access issued before the WMB
Memory Barriers of Commercial Processors
• The Sparc V9 RMO provides a fence or MEMBAR instruction with 4 flavor bits associated with it
  Each bit indicates a particular type of ordering to be enforced between previous and following load-store operations
  • The 4 possibilities are R-R, R-W, W-R, and W-W
  • Any combination of these bits can be set, offering a variety of ordering choices
• The IBM PowerPC model provides only a single fence instruction, called SYNC, that is equivalent to Alpha's MB fence
Characteristics of Various Systems

(Tables comparing the ordering characteristics of various commercial systems)
Programmer’s Interface
• A program running "correctly" on a system with TSO (with enough memory barriers) will not necessarily work "correctly" on a system with WO
• Programmer
  Programmers ensure that all synchronization operations are explicitly labeled or identified
  • For example, LOCK, UNLOCK, and BARRIER
• System (compiler and hardware)
  The compiler or run-time library translates these synchronization operations into the appropriate order-preserving operations (memory barriers or fences)
  Then the system (compiler plus hardware) guarantees sequentially consistent executions even though it may reorder operations between synchronization operations
Backup Slides
A Typical Memory Hierarchy
• Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
(Figure: a typical memory hierarchy — on-chip components: CPU core with register file, ITLB, DTLB, L1I (instruction cache), L1D (data cache), and an L2 (second-level) cache; off chip: main memory (DRAM) and secondary storage (disk), from higher (faster, smaller) levels to lower (slower, larger) levels)

Note that the cache coherence hardware updates or invalidates only the memory and the caches (not the registers of the CPU)
The Memory Hierarchy: Why Does It Work?
• Temporal Locality (locality in time)
  If a memory location is referenced, then it will tend to be referenced again soon
  Keep most recently accessed data items closer to the processor
• Spatial Locality (locality in space)
  If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
  Move blocks consisting of contiguous words closer to the processor
Example of Locality

int A[100], B[100], C[100], D;
int i;
for (i = 0; i < 100; i++) {
    C[i] = A[i] * B[i] + D;
}

(Figure: memory layout of arrays A, B, and C and variable D, with consecutive elements grouped into cache lines (blocks))
Slide from Prof. Sean Lee at Georgia Tech
Volatile
• When would you use a variable declaration with volatile, for example, in C?
True Sharing & False Sharing

Impact of Cache Line Size

Synchronization