OpenMP, Cache Coherence
Instructor: Sruthi Veeragandham
Review of Last Lecture (1/3)
• Sequential software is slow software
– SIMD and MIMD are the only path to higher performance
• Multithreading increases utilization; multicore adds more processors (MIMD)
• OpenMP as simple parallel extension to C
– Small, so easy to learn, but not very high level
– It’s easy to get into trouble (more today!)
7/24/2018 CS61C Su18 - Lecture 20
Review of Last Lecture (2/3)
• Synchronization in RISC-V:
• Atomic swap:
amoswap.w.aq rd,rs2,(rs1)
amoswap.w.rl rd,rs2,(rs1)
– swaps the memory value at M[R[rs1]] with the register value in R[rs2]
– atomic because this is done in one instruction
• Another option: lr (load reserve) & sc (store conditional)
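The swap-based lock from last lecture can be sketched in C. This sketch assumes the GCC/Clang `__atomic` builtins are available; on RISC-V the exchange lowers to an `amoswap` instruction:

```c
/* Sketch of a test-and-set spinlock built on an atomic swap.
 * Assumes GCC/Clang __atomic builtins (not shown in lecture);
 * on RISC-V the exchange compiles down to amoswap.w. */
static void lock_acquire(int *lock) {
    /* Atomically swap in 1; keep spinning until the old value was 0,
     * i.e. until we were the thread that grabbed the free lock. */
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0)
        ; /* spin */
}

static void lock_release(int *lock) {
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

Because the swap is a single atomic instruction, two threads can never both observe the lock as free and both acquire it.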
Review of Last Lecture (3/3)
• for: shares iterations of a loop across the threads
• sections: each section is executed by a separate thread
• single: serializes the execution of a thread
• These are defined within a parallel section
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Administrivia
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Break
• Coherence Protocol: MOESI
Matrix Multiplication
[Figure: C = A × B, where element cij is the dot product of row ai* of A and column b*j of B]
![Page 7: OpenMP, Cache Coherencecs61c/resources/su... · •OpenMP as simple parallel extension to C –Small, so easy to learn, but not very high level –It’s easy to get into trouble](https://reader036.vdocument.in/reader036/viewer/2022081616/5fe8f45a81b5703d35547d37/html5/thumbnails/7.jpg)
Advantage: Code simplicity
Disadvantage: Blindly marches throughmemory (how does this affect the cache?)
Naïve Matrix Multiply
for (i=0; i<N; i++) for (j=0; j<N; j++)
for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j];
77/24/2018 CS61C Su18 - Lecture 20
C[i*N+j] += A[i*N+k] * B[k*N+j];
Matrix Multiply in OpenMP
start_time = omp_get_wtime();
#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<Mdim; i++){
  for (j=0; j<Ndim; j++){
    tmp = 0.0;
    for (k=0; k<Pdim; k++){
      /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
      tmp += *(A+(i*Pdim+k)) * *(B+(k*Ndim+j));
    }
    *(C+(i*Ndim+j)) = tmp;
  }
}
run_time = omp_get_wtime() - start_time;

Outer loop spread across N threads; inner loops run inside a single thread

Why is there no data race here?
● Different threads work on different ranges of i, so each write goes to a distinct element of C
● Never reducing to a single value (every write is unique)
Naïve Matrix Multiply
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    for (k=0; k<N; k++)
      C[i*N+j] += A[i*N+k] * B[k*N+j];

Question: What if cache block size > N?
Block Size > N
[Figure: C = A × B with a cache block longer than a full row: we won’t use the last half of the block!]
![Page 11: OpenMP, Cache Coherencecs61c/resources/su... · •OpenMP as simple parallel extension to C –Small, so easy to learn, but not very high level –It’s easy to get into trouble](https://reader036.vdocument.in/reader036/viewer/2022081616/5fe8f45a81b5703d35547d37/html5/thumbnails/11.jpg)
Question: What if cache block size > N?
—We wouldn’t be using all the data in the blocks that were put in the cache for matrix C and A!
What about if cache block size < N?
Naïve Matrix Multiply
for (i=0; i<N; i++) for (j=0; j<N; j++)
for (k=0; k<N; k++) C[i*N+j] += A[i*N+k] * B[N*k+j];
117/24/2018 CS61C Su18 - Lecture 20
Block Size < N
[Figure: C = A × B with cache blocks shorter than a row: must pull in two blocks per row instead of one!]
Cache Blocking
• Increase the number of cache hits by using up as much of each cache block as possible
– For an N x N matrix multiplication:
• Instead of striding by the dimensions of the matrix, stride by the blocksize
• When N is not perfectly divisible by the blocksize, chunk up the data into block-sized pieces and handle the remainder as a tail case
• You’ve already done this in lab 7 – really try to understand it!
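A minimal sketch of cache blocking for the naïve multiply above. N and BLOCKSIZE here are illustrative values, and the sketch assumes BLOCKSIZE divides N evenly (the tail case described above is omitted):

```c
#include <string.h>

#define N 8          /* illustrative matrix dimension */
#define BLOCKSIZE 4  /* illustrative blocksize; assumed to divide N */

/* Cache-blocked matrix multiply: stride by BLOCKSIZE instead of N so
 * each tile of A, B, and C is fully reused while it sits in the cache. */
void matmul_blocked(const double *A, const double *B, double *C) {
    memset(C, 0, N * N * sizeof(double));
    for (int ii = 0; ii < N; ii += BLOCKSIZE)
        for (int jj = 0; jj < N; jj += BLOCKSIZE)
            for (int kk = 0; kk < N; kk += BLOCKSIZE)
                /* multiply one BLOCKSIZE x BLOCKSIZE tile */
                for (int i = ii; i < ii + BLOCKSIZE; i++)
                    for (int j = jj; j < jj + BLOCKSIZE; j++) {
                        double tmp = C[i*N + j];
                        for (int k = kk; k < kk + BLOCKSIZE; k++)
                            tmp += A[i*N + k] * B[k*N + j];
                        C[i*N + j] = tmp;
                    }
}
```

The result is identical to the naïve version; only the order of the memory accesses changes, so each cached block is fully used before it is evicted.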
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Administrivia
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Break
• Coherence Protocol: MOESI
OpenMP Reduction
• Reduction: specifies that 1 or more variables that are private to each thread are subject of a reduction operation at the end of the parallel region:
reduction(operation:var)
– Operation: operation to perform on the variables (var) at the end of the parallel region
– Var: variable(s) on which to perform scalar reduction
#pragma omp for reduction(+ : nSum)
for (i = START ; i <= END ; ++i)
  nSum += i;
Sample use of reduction
double compute_sum(double *a, int a_len) {
  double sum = 0.0;
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < a_len; i++) {
    sum += a[i];
  }
  return sum;
}
Administrivia
• HW6 Released! Due 7/30
• Midterm 2 is tomorrow in lecture!
– Covering up to Performance
– There will be discussion after MT2 :(
– Check out Piazza for more logistics
• Proj4 Released soon!
• Guerrilla session is now Sunday 2-4pm, @Cory 540AB
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Administrivia
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Meet the Staff
• Coherence Protocol: MOESI
OpenMP Pitfalls
• We can’t just throw pragmas on everything and expect a performance increase ☹
– Might not change speed much, or might break code!
– Must understand application and use wisely
• Discussed here:
1) Data dependencies
2) Sharing issues (private/non-private variables)
3) Updating shared values
4) Parallel overhead
OpenMP Pitfall #1: Data Dependencies
• Consider the following code:
a[0] = 1;
for(i=1; i<5000; i++)
  a[i] = i + a[i-1];
• There are dependencies between loop iterations!
– Splitting this loop between threads does not guarantee in-order execution
– Out-of-order loop execution will result in undefined behavior (i.e. likely wrong result)
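One fix, when it exists, is to rewrite the recurrence so each iteration depends only on i. For this particular loop the closed form a[i] = 1 + i(i+1)/2 removes the loop-carried dependency, so the loop parallelizes safely. A sketch (function name is illustrative):

```c
/* Parallel-safe version of:  a[0] = 1;  a[i] = i + a[i-1];
 * Each a[i] is computed independently from the closed form
 * a[i] = 1 + i*(i+1)/2, so iterations can run in any order. */
void fill_closed_form(int *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 1 + i * (i + 1) / 2;
}
```

When no closed form exists, the loop genuinely cannot be parallelized this way and must stay serial (or use a parallel-scan algorithm).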
OpenMP Pitfall #2: Sharing Issues
• Consider the following loop:
#pragma omp parallel for
for(i=0; i<n; i++){
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}
• temp is a shared variable! Fixed by making it private:
#pragma omp parallel for private(temp)
for(i=0; i<n; i++){
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}
![Page 22: OpenMP, Cache Coherencecs61c/resources/su... · •OpenMP as simple parallel extension to C –Small, so easy to learn, but not very high level –It’s easy to get into trouble](https://reader036.vdocument.in/reader036/viewer/2022081616/5fe8f45a81b5703d35547d37/html5/thumbnails/22.jpg)
OpenMP Pitfall #3: Updating Shared Variables Simultaneously
• Now consider a global sum:
for(i=0; i<n; i++)sum = sum + a[i];
• This can be done by surrounding the summation by a critical/atomic section or reduction clause:
#pragma omp parallel for reduction(+:sum){
for(i=0; i<n; i++) sum = sum + a[i];
}– Compiler can generate highly efficient code for reduction
227/24/2018 CS61C Su18 - Lecture 20
OpenMP Pitfall #4: Parallel Overhead
• Spawning and releasing threads results in significant overhead
• Better to have fewer but larger parallel regions
– Parallelize over the largest loop that you can (even though it will involve more work to declare all of the private variables and eliminate dependencies)
OpenMP Pitfall #4: Parallel Overhead
start_time = omp_get_wtime();
for (i=0; i<Mdim; i++){
  for (j=0; j<Ndim; j++){
    tmp = 0.0;
    #pragma omp parallel for reduction(+:tmp)
    for( k=0; k<Pdim; k++){
      /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
      tmp += *(A+(i*Pdim+k)) * *(B+(k*Ndim+j));
    }
    *(C+(i*Ndim+j)) = tmp;
  }
}
run_time = omp_get_wtime() - start_time;

Poor choice of loop to parallelize: the parallel region is entered once per (i, j) pair, so there is far too much thread-spawning overhead.
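Moving the pragma to the outermost loop (as on the earlier Matrix Multiply in OpenMP slide) spawns threads once for the whole multiply. A sketch, using the same layout assumptions as that slide: A is Mdim×Pdim, B is Pdim×Ndim, C is Mdim×Ndim:

```c
/* Threads are created once, at the outer i loop; each thread computes
 * whole rows of C, so tmp, j, k are per-thread and writes are disjoint. */
void matmul_outer(const double *A, const double *B, double *C,
                  int Mdim, int Ndim, int Pdim) {
    #pragma omp parallel for
    for (int i = 0; i < Mdim; i++) {
        for (int j = 0; j < Ndim; j++) {
            double tmp = 0.0;
            for (int k = 0; k < Pdim; k++)
                tmp += A[i*Pdim + k] * B[k*Ndim + j];
            C[i*Ndim + j] = tmp;
        }
    }
}
```

Declaring tmp, j, k inside the loop body makes them automatically private, avoiding the explicit private(...) clause.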
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Administrivia
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Break
• Coherence Protocol: MOESI
Where are the caches?
[Figure: Processor 0 and Processor 1, each with control, datapath (PC, registers, ALU), both issuing memory accesses through the I/O-memory interfaces to a single shared memory]
Multiprocessor Caches
• Memory is a performance bottleneck
– Even with just one processor
– Caches reduce bandwidth demands on memory
• Each core has a local private cache
– Cache misses access shared common memory
[Figure: multiple processors, each with a private cache, connected via an interconnection network to shared memory and I/O]
Shared Memory and Caches
• What if Processors 1 and 2 read Memory[1000] (value 20)?
[Figure: caches 1 and 2 each now hold a copy of address 1000 with value 20]
Shared Memory and Caches
• Now: Processor 0 writes Memory[1000] with 40
[Figure: cache 0 and memory hold 1000: 40, but caches 1 and 2 still hold the stale value 1000: 20. Problem?]
Keeping Multiple Caches Coherent
• Architect’s job: keep cache values coherent with shared memory
• Idea: on cache miss or write, notify other processors via the interconnection network
– If reading, many processors can have copies
– If writing, invalidate all other copies
• Write transactions from one processor “snoop” the tags of other caches using the common interconnect
– Invalidate any “hits” to the same address in other caches
Shared Memory and Caches
• Example, now with cache coherence:
– Processors 1 and 2 read Memory[1000]
– Processor 0 writes Memory[1000] with 40
[Figure: Processor 0’s write invalidates the copies of address 1000 in the other caches; cache 0 and memory hold 1000: 40]
Question: Which statement is TRUE about multiprocessor cache coherence?
(A) Using write-through caches removes the need for cache coherence
(B) Every processor store instruction must check the contents of other caches
(C) Most processor load and store accesses only need to check in the local private cache
(D) Only one processor can cache any memory location at one time
Break
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Administrivia
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Break
• Coherence Protocol: MOESI
How Does HW Keep $ Coherent?
• Simple protocol: MSI
• Each cache tracks the state of each block in the cache:
– Modified: up-to-date data, changed (dirty), OK to write
• no other cache has a copy
• copy in memory is out-of-date
• must respond to read requests by other processors
– Shared: up-to-date data, not allowed to write
• other caches may have a copy
• copy in memory is up-to-date
– Invalid: data in this block is “garbage”
MSI Protocol: Current Processor
[State diagram: Invalid → Shared on a read miss (get block from memory); Invalid → Modified on a write miss (get block from memory, WB/write-allocate); Shared → Modified on a write hit; read hits stay in Shared; read and write hits stay in Modified]
MSI Protocol: Response to Other Processors
(A “probe” is a memory request from another processor, observed by snooping)
[State diagram: Modified → Shared on a probe read hit (write block back to memory); Modified → Invalid on a probe write hit (write block back to memory); Shared → Invalid on a probe write hit; Shared stays Shared on a probe read hit; probe misses leave the state unchanged]
How to keep track of which state a block is in?
• Already have valid bit + dirty bit
• Introduce a new bit called the “shared” bit

State    | Valid Bit | Dirty Bit | Shared Bit
Modified |     1     |     1     |     0
Shared   |     1     |     0     |     1
Invalid  |     0     |     X     |     X
(X = doesn’t matter)
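The table above can be read as a small decoder. A sketch (the enum and function names are illustrative, not part of any real protocol implementation):

```c
typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state;

/* Decode (valid, dirty, shared) bits into an MSI state, per the table:
 * Modified = 1/1/0, Shared = 1/0/1, Invalid = 0/X/X. */
msi_state decode_msi(int valid, int dirty, int shared) {
    if (!valid)
        return MSI_INVALID;   /* valid = 0: dirty/shared don't matter */
    if (dirty && !shared)
        return MSI_MODIFIED;  /* valid = 1, dirty = 1, shared = 0 */
    return MSI_SHARED;        /* valid = 1, dirty = 0, shared = 1 */
}
```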
MSI Example
(Two processors, each with a private cache, share memory over an interconnection network; both start with Block 0 Invalid)
• Processor 0 – Block 0 Read: P0’s copy Invalid → Shared
• Processor 1 – Block 0 Read: P1’s copy Invalid → Shared (both Shared)
• Processor 0 – Block 0 Write: P0’s copy Shared → Modified; P1’s copy Shared → Invalid
• Processor 1 – Block 0 Read: P0’s copy Modified → Shared (block written back, memory updated); P1’s copy becomes Shared
• Here there’s only one other processor, but we would have to check & invalidate the data in *every* other processor
Compatibility Matrix
• Each block in each cache is in one of the following states:
– Modified (in cache)
– Shared (in cache)
– Invalid (not in cache)
Compatibility Matrix: allowed states for a given cache block in any pair of caches

    |  M  |  S  |  I
  M |     |     | ✅
  S |     | ✅  | ✅
  I | ✅  | ✅  | ✅
Problem: Writing to Shared is Expensive
• If a block is in Shared, we need to check whether other caches have the data (so we can invalidate it) before we can write
• If a block is in Modified, we don’t need to check other caches before we write
– Why? Only one cache can have the data if it is Modified
Performance Enhancement 1: Exclusive State
• New state: Exclusive
• Exclusive: up-to-date data, OK to write (change to Modified)
– no other cache has a copy
– copy in memory is up-to-date
– no write to memory if the block is replaced
– supplies data on a read instead of going to memory
• Now, if a block is in Shared, at least 1 other cache must contain it:
– Shared: up-to-date data, not allowed to write
• other caches definitely have a copy
• copy in memory is up-to-date
MESI Protocol: Current Processor
[State diagram: Invalid → Exclusive on a read miss when first to cache the data; Invalid → Shared on a read miss when other cache(s) have the data; Invalid → Modified on a write miss (WB/write-allocate); Exclusive → Modified on a write hit; Shared → Modified on a write hit; read hits do not change state]
MESI Protocol: Response to Other Processors
[State diagram: Modified → Shared on a probe read hit (write block back to memory); Modified → Invalid on a probe write hit (write block back to memory); Exclusive → Shared on a probe read hit; Exclusive → Invalid on a probe write hit; Shared → Invalid on a probe write hit; probe misses leave the state unchanged]
How to keep track of state block is in?
• New entry in truth table: Exclusive
X = doesn’t matter45
Valid Bit Dirty Bit Shared Bit
Modified 1 1 0
Exclusive 1 0 0
Shared 1 0 1
Invalid 0 X X
Problem: Expensive to Share Modified
• In MSI and MESI, if we want to share a block that is in Modified:
1. Modified data is written back to memory
2. Modified block → Shared
3. Block that wants the data → Shared
• Writing to memory is expensive! Can we avoid it?
Performance Enhancement 2: Owned State
• Owned: up-to-date data, read-only (like Shared, but you can write if you invalidate the shared copies first; your state then changes to Modified)
– Other caches have a shared copy (Shared state)
– Data in memory is not up-to-date
– Owner supplies data on a probe read instead of going to memory
• Shared: up-to-date data, not allowed to write
– other caches definitely have a copy
– copy in memory may be up-to-date
Common Cache Coherency Protocol: MOESI (snoopy protocol)
• Each block in each cache is in one of the following states:
– Modified (in cache)
– Owned (in cache)
– Exclusive (in cache)
– Shared (in cache)
– Invalid (not in cache)
Compatibility Matrix: allowed states for a given cache block in any pair of caches

    |  M  |  O  |  E  |  S  |  I
  M |     |     |     |     | ✅
  O |     |     |     | ✅  | ✅
  E |     |     |     |     | ✅
  S |     | ✅  |     | ✅  | ✅
  I | ✅  | ✅  | ✅  | ✅  | ✅
MOESI Protocol: Current Processor
[State diagram: Invalid → Exclusive on a read miss when no other cache has the data; Invalid → Shared on a read miss when other cache(s) have the data; Invalid → Modified on a write miss (WB/write-allocate); Exclusive → Modified on a write hit; Shared → Modified on a write hit; Owned → Modified on a write hit; read hits do not change state]
MOESI Protocol: Response to Other Processors
[State diagram: Modified → Owned on a probe read hit; Modified → Invalid on a probe write hit; Owned → Invalid on a probe write hit; Exclusive → Shared on a probe read hit; Exclusive → Invalid on a probe write hit; Shared → Invalid on a probe write hit; probe read hits leave Owned and Shared unchanged]
How to keep track of which state a block is in?
• New entry in the truth table: Owned

State     | Valid Bit | Dirty Bit | Shared Bit
Modified  |     1     |     1     |     0
Owned     |     1     |     1     |     1
Exclusive |     1     |     0     |     0
Shared    |     1     |     0     |     1
Invalid   |     0     |     X     |     X
(X = doesn’t matter)
MOESI Example
(Two processors, each with a private cache, share memory over an interconnection network; both start with Block 0 Invalid)
• Processor 0 – Block 0 Read: P0’s copy Invalid → Exclusive (no other cache has it)
• Processor 0 – Block 0 Write: P0’s copy Exclusive → Modified. Block 0 was in the Exclusive state, so we don’t have to check whether the block is in other caches and needs to be invalidated.
• Processor 1 – Block 0 Read: P0’s copy Modified → Owned and supplies the data; P1’s copy becomes Shared
• Processor 0 – Block 0 Write: P0’s copy Owned → Modified; P1’s copy is invalidated
• Processor 1 – Block 0 Read: P0’s copy Modified → Owned; P1’s copy becomes Shared
• With Owned, Block 0 is never unnecessarily evicted & we don’t waste bandwidth writing to memory
CS61C L20 Thread Level Parallelism II (53) Garcia, Spring 2014 © UCB
Cache Coherence Tracked by Block
• Suppose:
– Block size is 32 bytes
– P0 is reading and writing variable X, P1 is reading and writing variable Y
– X is in location 4000, Y in 4012 (the same block!)
• What will happen?
[Figure: one 32-byte data block with tag 4000, covering addresses 4000–4028, cached by both Cache 0 and Cache 1]
False Sharing
• The block ping-pongs between two caches even though the processors are accessing disjoint variables
– This effect is called false sharing
• How can you prevent it?
– “Place” data on different blocks
– Reduce block size
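“Placing data on different blocks” is typically done by padding per-thread data out to the block size. A sketch, assuming a 64-byte cache block (CACHELINE, NTHREADS, and the struct name are illustrative):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define NTHREADS 4
#define CACHELINE 64  /* assumed cache block size in bytes */

/* Each thread's counter gets its own cache block, so updates by
 * different threads never ping-pong a shared block (no false sharing). */
struct padded_counter {
    long value;
    char pad[CACHELINE - sizeof(long)];
};

long sum_0_to_n_minus_1(long n) {
    struct padded_counter counts[NTHREADS] = {{0}};
    #pragma omp parallel num_threads(NTHREADS)
    {
    #ifdef _OPENMP
        int t = omp_get_thread_num();
    #else
        int t = 0;  /* serial fallback when compiled without -fopenmp */
    #endif
        #pragma omp for
        for (long i = 0; i < n; i++)
            counts[t].value += i;  /* each thread writes only its own block */
    }
    long total = 0;
    for (int t = 0; t < NTHREADS; t++)
        total += counts[t].value;
    return total;
}
```

Without the pad array, all four counters would share one or two blocks and every increment could invalidate the other threads’ copies.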
False Sharing vs. Real Sharing
• If the same piece of data is being used by 2 caches, ping-ponging is inevitable
– This is not false sharing
• Would the miss occur if the block size were only 1 word?
– Yes: true sharing
– No: false sharing
[Figure: the same 32-byte data block with tag 4000, cached by both Cache 0 and Cache 1]
Understanding Cache Misses: The 3Cs
• Compulsory (cold start or process migration, 1st reference):
– First access to a block in memory, impossible to avoid
– Solution: block size ↑ (MP ↑; very large blocks could cause MR ↑)
• Capacity:
– Cache cannot hold all blocks accessed by the program
– Solution: cache size ↑ (may cause access/HT ↑)
• Conflict (collision):
– Multiple memory locations map to the same cache location
– Solutions: cache size ↑, associativity ↑ (may cause access/HT ↑)
“Fourth C”: Coherence Misses
• Misses caused by coherence traffic with other processors
• Also known as communication misses, because they represent data moving between processors working together on a parallel program
• For some parallel programs, coherence misses can dominate total misses
Summary
• Synchronization via hardware primitives:
– RISC-V does it with load reserve + store conditional or amoswap
• OpenMP as simple parallel extension to C
– Synchronization accomplished with critical/atomic/reduction
– Pitfalls can reduce speedup or break program logic
• Cache coherence implements shared memory even with multiple copies in multiple caches
– False sharing is a concern