Multithreading Issues, Cache Coherency
Instructor: Sean Farhat
Review of Last Lecture (1/3)
• Sequential software is slow software
– SIMD and MIMD are the only paths to higher performance
• Multithreading increases utilization; multicore adds more processors (MIMD)
• OpenMP as simple parallel extension to C
– Small, so easy to learn, but not very high level
– It’s easy to get into trouble (more today!)
8/3/20 CS 61C Su20 - Lecture 23 2
Review of Last Lecture (2/3)
• Synchronization in RISC-V:
• Atomic Swap:
amoswap.w.aq rd,rs2,(rs1)
amoswap.w.rl rd,rs2,(rs1)
– swaps the memory value at M[R[rs1]] with the register value in R[rs2]
– atomic because this is done in one instruction
Review of Last Lecture (3/3)
8/3/20 CS 61C Su20 - Lecture 23 4
• These directives are defined within a parallel section:
– for: shares iterations of a loop across the threads
– sections: each section is executed by a separate thread
– critical: serializes the execution of a thread
Parallel Hello World
#include <stdio.h>
#include <omp.h>
int main () {
int nthreads, tid;
/* Fork team of threads with private var tid */
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num(); /* get thread id */
printf("Hello World from thread = %d\n", tid);
/* Only master thread does this */
if (tid == 0) {
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
} /* All threads join master and terminate */
}
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Coherence Protocol: MOESI
Matrix Multiplication
[Diagram: C = A × B; element cij is the dot product of row ai* of A and column b*j of B]
Naïve Matrix Multiply
Advantage: Code simplicity
Disadvantage: Blindly marches through memory (how does this affect the cache?)
for (i=0; i<N; i++)
for (j=0; j<N; j++)
for (k=0; k<N; k++)
C[i][j] += A[i][k] * B[k][j];
C[i*N+j] += A[i*N+k] * B[k*N+j];
What’s actually happening behind the scenes (view the matrices as 1D arrays)
Matrix Multiply in OpenMP
start_time = omp_get_wtime();
#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<Mdim; i++){
for (j=0; j<Ndim; j++){
tmp = 0.0;
for( k=0; k<Pdim; k++){
/* C(i,j) = sum(over k) A(i,k) * B(k,j)*/
tmp += *(A+(i*Pdim+k)) * *(B+(k*Ndim+j));
}
*(C+(i*Ndim+j)) = tmp;
}
}
run_time = omp_get_wtime() - start_time;
Outer loop spread across N threads; inner loops inside a single thread
Why is there no data race here?
● Different threads work on different ranges of i, so their writes touch disjoint memory
● Each thread writes to different rows of C
● There is no reduction to a single value (every write is unique)
Question: What if cache block size > N?
Naïve Matrix Multiply
for (i=0; i<N; i++)
for (j=0; j<N; j++)
for (k=0; k<N; k++)
C[i*N+j] += A[i*N+k] * B[N*k+j];
Block Size > N
[Diagram: C = A × B with a cache block longer than a row of the matrix; we won’t use the last half of the block!]
Question: What if cache block size > N?
—We wouldn’t be using all the data in the blocks that were put in the cache for matrix C and A!
What about if cache block size < N?
Naïve Matrix Multiply
for (i=0; i<N; i++)
for (j=0; j<N; j++)
for (k=0; k<N; k++)
C[i*N+j] += A[i*N+k] * B[N*k+j];
Block Size < N
[Diagram: C = A × B with a cache block shorter than a row of the matrix; must pull in two blocks instead of one!]
Cache Blocking
• Increase the number of cache hits you get by using up as much of each cache block as possible
– For an N x N matrix multiplication:
• Instead of striding by the dimensions of the matrix, stride by the blocksize
• When N is not perfectly divisible by the blocksize, chunk up the data into blocksize pieces as much as possible and handle the remainder as a tail case
• You’ve already done this in the cache lab; really try to understand it!
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Coherence Protocol: MOESI
Synchronization Problems
double compute_sum(double *a, int a_len) {
double sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < a_len; i++) {
sum += a[i];
}
return sum;
}
Problem: data race with sum!
Solution 1: Put body of loop in critical section
New Problem: Code is now serial! Each thread
must wait its turn to add to sum
Solution 2: Separate “sum”s for each thread!
Combine at the end
OpenMP Reduction
• reduction(operation:var): specifies that 1 or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region
– operation: performed on the variables (var) at the end of the parallel region
– var: variable(s) on which to perform scalar reduction
#pragma omp for reduction(+ : nSum)
for (i = START; i <= END; ++i)
    nSum += i;
Sample use of reduction
double compute_sum(double *a, int a_len) {
double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
for (int i = 0; i < a_len; i++) {
sum += a[i];
}
return sum;
}
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Coherence Protocol: MOESI
OpenMP Pitfalls
• We can’t just throw pragmas on everything and expect a performance increase ☹
– Might not change speed much, or might even break the code!
– Must understand application and use wisely
• Discussed here:
1) Data dependencies
2) Sharing issues (private/non-private variables)
3) Updating shared values
4) Parallel overhead
OpenMP Pitfall #1: Data Dependencies
• Consider the following code:
a[0] = 1;
for(i=1; i<5000; i++)
    a[i] = i + a[i-1];
• There are dependencies between loop iterations!
– Splitting this loop between threads does not guarantee in-order execution
– Out-of-order loop execution will result in undefined behavior (i.e., likely a wrong result)
OpenMP Pitfall #2: Sharing Issues
• Consider the following loop:
#pragma omp parallel for
for(i=0; i<n; i++){
    temp = 2.0*a[i];
    a[i] = temp;
    b[i] = c[i]/temp;
}
• temp is a shared variable! Fix by declaring it private:
#pragma omp parallel for private(temp)
for(i=0; i<n; i++){
    temp = 2.0*a[i];
    a[i] = temp;
    b[i] = c[i]/temp;
}
Each thread accesses
different elements of a, b,
and c, but the same temp
OpenMP Pitfall #3: Updating Shared Variables Simultaneously
• Now consider a global sum:
for(i=0; i<n; i++)
sum = sum + a[i];
• This can be done by surrounding the summation by a critical/atomic section or reduction clause:
#pragma omp parallel for reduction(+:sum)
for(i=0; i<n; i++)
    sum = sum + a[i];
– Compiler can generate highly efficient code for reduction
Reads and writes to
sum interleaved
among threads
OpenMP Pitfall #4: Parallel Overhead
• Spawning and releasing threads results in significant overhead
• Better to have fewer but larger parallel regions
– Parallelize over the largest loop that you can (even though it will involve more work to declare all of the private variables and eliminate dependencies)
OpenMP Pitfall #4: Parallel Overhead
start_time = omp_get_wtime();
for (i=0; i<Mdim; i++){
    for (j=0; j<Ndim; j++){
        tmp = 0.0;
        #pragma omp parallel for reduction(+:tmp)
        for( k=0; k<Pdim; k++){
            /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
            tmp += *(A+(i*Pdim+k)) * *(B+(k*Ndim+j));
        }
        *(C+(i*Ndim+j)) = tmp;
    }
}
run_time = omp_get_wtime() - start_time;
Too much overhead in thread generation to have this statement run this frequently.
Poor choice of loop to parallelize.
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Coherence Protocol: MOESI
Where are the caches?
[Diagram: Processor 0 and Processor 1, each with Control and a Datapath (PC, Registers, ALU), both making memory accesses to a shared Memory with Input/Output over the I/O-Memory interfaces]
Recall: multithreading could be spread across multiple cores
Multiprocessor Caches
• Memory is a performance bottleneck
– Even with just one processor
– Caches reduce bandwidth demands on memory
• Each core has a local private cache
– Cache misses access shared common memory
[Diagram: three processors, each with a private cache, connected via an interconnection network to shared Memory and I/O]
Shared Memory and Caches
• What if?
– Processors 1 and 2 read Memory[1000] (value 20)
[Diagram: caches 1 and 2 each now hold 1000: 20; Memory[1000] = 20]
Shared Memory and Caches
• Now:
– Processor 0 writes Memory[1000] with 40
[Diagram: cache 0 and memory now hold 1000: 40, but caches 1 and 2 still hold the stale value 1000: 20]
Problem?
Keeping Multiple Caches Coherent
• Architect’s job: keep cache values coherent with shared memory
• Idea: on cache miss or write, notify other processors via interconnection network
– If reading, many processors can have copies
– If writing, invalidate all other copies
• Write transactions from one processor “snoop” tags of other caches using common interconnect
– Invalidate any “hits” to same address in other caches
Shared Memory and Caches
• Example, now with cache coherence
– Processors 1 and 2 read Memory[1000]
– Processor 0 writes Memory[1000] with 40
[Diagram: Processor 0’s write invalidates the copies of 1000 in caches 1 and 2; cache 0 and memory now hold 1000: 40]
Question: Which statement is TRUE about multiprocessor cache coherence?
(A) Using write-through caches removes the need for cache coherence
(B) Every processor store instruction must check the contents of other caches
(C) Most processor load and store accesses only need to check in the local private cache
(D) Only one processor can cache any memory location at one time
Agenda
• OpenMP Directives
– Workshare for Matrix Multiplication
– Synchronization
• Common OpenMP Pitfalls
• Multiprocessor Cache Coherence
• Coherence Protocol: MOESI
How Does HW Keep $ Coherent?
• Simple protocol: MSI
• Each cache tracks the state of each block in the cache:
– Modified: up-to-date data, changed (dirty), OK to write
• no other cache has a copy
• copy in memory is out-of-date
• must respond to read requests by other processors by updating memory
– Shared: up-to-date data, not allowed to write
• other caches may have a copy
• copy in memory is up-to-date
– Invalid: data in this block is “garbage”
MSI Protocol: Current Processor
[State diagram: Invalid → Shared on a Read Miss (get block from memory); Shared → Modified on a write; Read Hits loop on Shared and Modified; Write Hits loop on Modified]
MSI Protocol: Response to Other Processors
[State diagram: Modified → Shared on a Probe Read Hit; Modified and Shared → Invalid on a Probe Write Hit]
A probe is a memory request from another processor. A probe hit means a probe finds its requested block in another cache.
How to keep track of which state a block is in?
• Already have valid bit + dirty bit
• Introduce a new bit called the “shared” bit

| State    | Valid Bit | Dirty Bit | Shared Bit  |
|----------|-----------|-----------|-------------|
| Modified | 1         | 1         | 0           |
| Shared   | 1         | 0         | X (for now) |
| Invalid  | 0         | X         | X           |

X = doesn’t matter
MSI Example
[Diagram: Processor 0 and Processor 1, each with a cache, connected via an interconnection network to memory holding Block 0 and Block 1. All blocks start Invalid.]
• Processor 0 -- Block 0 Read: P0’s copy becomes Shared
• Processor 1 -- Block 0 Read: P1’s copy becomes Shared
• Processor 0 -- Block 0 Write: P0’s copy Shared → Modified; P1’s copy Shared → Invalid
• Processor 1 -- Block 0 Read: memory is updated; P0’s copy Modified → Shared; P1’s copy becomes Shared
Here there’s only one other processor, but we would have to check & invalidate the data in *every* other processor.
Compatibility Matrix
• Each block in each cache is in one of the following states:
– Modified (in cache)
– Shared (in cache)
– Invalid (not in cache)
• Compatibility Matrix: allowed states for a given cache block in any pair of caches

|   | M | S | I |
|---|---|---|---|
| M | ✗ | ✗ | ✓ |
| S | ✗ | ✓ | ✓ |
| I | ✓ | ✓ | ✓ |
Problem: Writing to Shared is Expensive
• If a block is in Shared and we want to write, we need to check whether other caches have the data (so we can invalidate it)
• If a block is in Modified and we want to write, we don’t need to check other caches
– Why? Only one cache can have the data if it is Modified
Performance Enhancement 1: Exclusive State
• New state: Exclusive
• Exclusive: up-to-date data, OK to write (change to Modified)
– no other cache has a copy
– copy in memory is up-to-date
– no write to memory if block replaced
– supplies data on read instead of going to memory
• Now, if a block is in Shared, at least 1 other cache must contain it:
– Shared: up-to-date data, not allowed to write
• other caches definitely have a copy
• copy in memory is up-to-date
• MESI is also known as the Illinois protocol (since it was developed at UIUC!)
MESI Protocol: Current Processor
[State diagram: Invalid → Exclusive on a Read Miss when first to cache the data; Invalid → Shared on a Read Miss when other cache(s) have the data; a Write Hit moves a block to Modified; Read Hits loop on Shared, Exclusive, and Modified]
MESI Protocol: Response to Other Processors
[State diagram: Modified and Exclusive → Shared on a probe read hit; all states → Invalid on a probe write hit]
How to keep track of which state a block is in?
• New entry in the truth table: Exclusive

| State     | Valid Bit | Dirty Bit | Shared Bit |
|-----------|-----------|-----------|------------|
| Modified  | 1         | 1         | 0          |
| Exclusive | 1         | 0         | 0          |
| Shared    | 1         | 0         | 1          |
| Invalid   | 0         | X         | X          |

X = doesn’t matter
Problem: Expensive to Share Modified
• In MSI and MESI, if we want to share a block in Modified:
1. Modified data is written back to memory
2. Modified block → Shared
3. Block that wants the data → Shared
• Writing to memory is expensive! Can we avoid it?
Performance Enhancement 2: Owned State
• Owned: up-to-date data, read-only (like Shared, but you can write if you invalidate the shared copies first; your state then changes to Modified)
– Other caches have a shared copy (Shared state)
– Data in memory is not up-to-date
– Owner supplies data on a probe read instead of going to memory
– Combination of Modified and Shared
• Shared: up-to-date data, not allowed to write
– other caches definitely have a copy
– copy in memory may be up-to-date
Common Cache Coherency Protocol: MOESI (snoopy protocol)
• Each block in each cache is in one of the following states:
– Modified (in cache)
– Owned (in cache)
– Exclusive (in cache)
– Shared (in cache)
– Invalid (not in cache)
• Compatibility Matrix: allowed states for a given cache block in any pair of caches

|   | M | O | E | S | I |
|---|---|---|---|---|---|
| M | ✗ | ✗ | ✗ | ✗ | ✓ |
| O | ✗ | ✗ | ✗ | ✓ | ✓ |
| E | ✗ | ✗ | ✗ | ✗ | ✓ |
| S | ✗ | ✓ | ✗ | ✓ | ✓ |
| I | ✓ | ✓ | ✓ | ✓ | ✓ |
MOESI Protocol: Current Processor
[State diagram: Invalid → Exclusive on Read Miss Exclusive (no other cache has the data); Invalid → Shared on Read Miss Shared (another cache has it); Write Hits move a block to Modified; Read Hits loop on Shared, Exclusive, Owned, and Modified]
MOESI Protocol: Response to Other Processors
[State diagram: Modified → Owned and Exclusive → Shared on a Probe Read Hit; all states → Invalid on a Probe Write Hit]
How to keep track of which state a block is in?
• New entry in the truth table: Owned

| State     | Valid Bit | Dirty Bit | Shared Bit |
|-----------|-----------|-----------|------------|
| Modified  | 1         | 1         | 0          |
| Owned     | 1         | 1         | 1          |
| Exclusive | 1         | 0         | 0          |
| Shared    | 1         | 0         | 1          |
| Invalid   | 0         | X         | X          |

X = doesn’t matter
MOESI Example
[Diagram: Processor 0 and Processor 1, each with a cache, connected via an interconnection network to memory holding Block 0 and Block 1. All blocks start Invalid; Block 0 moves through Exclusive, Modified, Owned, and Shared as the processors take turns accessing it.]
• Processor 0 -- Block 0 Read
• Processor 1 -- Block 0 Read
• Processor 0 -- Block 0 Write
• Processor 1 -- Block 0 Read
• Processor 0 -- Block 0 Write
When a block is in the Exclusive state, we don’t have to check if the block is in other caches and needs to be invalidated.
Block 0 is never unnecessarily evicted & we don’t waste bandwidth writing to memory.
Cache Coherence Tracked by Block
• Suppose:
– Block size is 32 bytes
– P0 reading and writing variable X, P1 reading and writing variable Y
– X in location 4000, Y in 4012
• What will happen?
[Diagram: Cache 0 and Cache 1 each hold the same 32-byte data block (tag 4000) spanning locations 4000-4028]
False Sharing
• Block ping-pongs between two caches even though processors are accessing disjoint variables
– Effect called false sharing
• How can you prevent it?
– Want to “place” data on different blocks
– Reduce block size
False Sharing vs. Real Sharing
• If the same piece of data is being used by 2 caches, ping-ponging is inevitable
– This is not false sharing
• Would cache invalidation occur if the block size was only 1 word?
– Yes: true sharing
– No: false sharing
[Diagram: Cache 0 and Cache 1 each hold the same 32-byte data block (tag 4000) spanning locations 4000-4028]
Understanding Cache Misses: The 3Cs
• Compulsory (cold start or process migration, 1st reference):
– First access to a block in memory; impossible to avoid
– Solution: block size ↑ (MP ↑)
• Capacity:
– Cache cannot hold all blocks accessed by the program
– Solution: cache size ↑ (may cause access/HT ↑)
• Conflict (collision):
– Multiple memory locations map to the same cache location
– Solutions: cache size ↑, associativity ↑ (may cause access/HT ↑)
“Fourth C”: Coherence Misses
• Misses caused by coherence traffic with other processors
• Also known as communication misses because they represent data moving between processors working together on a parallel program
• For some parallel programs, coherence misses can dominate total misses
Summary
• OpenMP as simple parallel extension to C
– Synchronization accomplished with critical/atomic/reduction
– Pitfalls can reduce speedup or break program logic
• Cache coherence implements shared memory even with multiple copies in multiple caches
– False sharing a concern