CS 152 Computer Architecture & Engineering
CS 152 Computer Architecture & Engineering
Section 7, Spring 2010
Andrew Waterman
University of California, Berkeley
Mystery Die
• RISC II: 41K transistors, 4-micron NMOS @ 12 MHz
• 2.2x faster than VAX-11/780 (1500 TTL chips @ 5 MHz)
Agenda
• Quiz 2 Post-Mortem
  – Mean: 53.1
  – Standard Deviation: 9.0
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• N=1024. Store/load miss rate for 4KB 2-way cache w/LRU replacement?
• LRU => no conflicts between loads/stores
• Loads are unit-stride with no reuse
• All load misses are compulsory => load miss rate 1/8 (one miss per 8-word line)
• All stores miss because of capacity misses
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• What about FIFO replacement?
• Stores and loads could now conflict. When?
• Stores always use set i/8 % 64
• Loads always use set j/8 % 64
• Conflicts occur when these are equal
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• Is Write-Allocate a good idea for this code?
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• Is Write-Allocate a good idea for this code?
• On every store miss, 32 bytes of data are read into the cache and then discarded, so no
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• Is Write-Back a good idea for this code?
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• Is Write-Back a good idea for this code?
• For Write-Allocate, bad: 32 bytes are fetched on the miss and 32 bytes written back for each 4-byte store (64 bytes of traffic total)
• Otherwise, OK, except the Write-Through alternative had a write buffer, which will dramatically reduce miss penalty
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• If cache were fully associative, how could we improve code’s performance?
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• If cache were fully associative, how could we improve code’s performance?
• Block the transpose
• FA makes this easier; lots of solutions
• Here’s one; let T = 8 (words in a cache line):

for(i = 0; i < N; i+=T)
  for(j = 0; j < N; j++)
    for(k = 0; k < T; k++)
      B[j*N+(i+k)] = A[(i+k)*N+j];
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• What about TLB misses?
• 4KB pages, 1024-entry DM TLB
• Compulsory misses first
• 2 matrices * (1024^2 words)/(1024 words/page) = 2048
Quiz 2, Q1
for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];
• What about TLB misses?
• 4KB pages, 1024-entry DM TLB
• Now consider some iteration 0 ≤ i < N-1
• After iteration i, TLB[i] = Ai, and TLB[k] = Bk, k ≠ i
• During iteration i+1, the store to Bi will miss
• Then the store to Bi+1 will miss, kicking out Ai+1
• The next load to Ai+1 will miss
• 3 conflicts/iteration
• 3072 + 2048 = 5120 misses total
Quiz 2, Q2
• Basic idea of microtags: SA caches put the tag check on the critical path (data-out)
• Reduce the critical path by using a subset of the tag to select the way
• In this cache, microtag check -> data-out remains the critical path, but is 1/6 faster
Quiz 2, Q2
• AMAT = hit time + miss rate * miss penalty
• Hit time is not multiplied by hit rate
• You have to pay the hit time even on a miss
Quiz 2, Q2
• Microtag uniqueness affects conflict misses
• Increases them compared to 4-way SA
• But still much better than DM
• Otherwise, why would we build a microtagged cache? Just use DM
Quiz 2, Q2
• Aliasing question was unintentionally tricky: microtags are a red herring
• The aliasing problem is the same as for any virtually-indexed, physically-tagged cache with index+offset ≤ page offset
• Aliases always map to the same set, which would be fine for DM, but with SA they can live in different ways
Quiz 2, Q2
• Aliasing question was unintentionally tricky: microtags are a red herring
• The aliasing problem is the same as for any virtually-indexed, physically-tagged cache with index+offset ≤ page offset
• Simple fix: on a miss, you already have the physical tag and all physical tags in the set
• Iff there’s a match, there’s an alias
Quiz 2, Q3
• 2x associativity, capacity & line size constant
  – Increases hit time due to data-out muxing
  – Reduces conflict misses
• Halving line size (associativity & #sets constant)
  – Reduces hit time (capacity down)
  – Increases miss rate (same reason)
  – Reduces miss penalty (shorter lines, less to fetch)
Quiz 2, Q3
• Physical -> virtual cache
  – Hit time reduced (the only real reason to do this is to remove the TLB from the hit path)
  – Effect on miss rate ambiguous
  – More misses for aliases
  – More misses for context switches w/o ASIDs
  – Fewer misses due to address space contiguity
  – Increased miss penalty, because the TLB lookup is moved to the miss path, and for anti-aliasing
Quiz 2, Q3
• Write buffer
  – Reduces both store miss penalty and hit time
• HW prefetching
  – HW isn’t on the hit path, so no effect on hit time
  – Reduces miss rate (main reason)
  – A prefetch-buffer hit is considered a “slow hit”, not a miss
  – Reduces miss penalty (prefetches can be in-flight when a miss occurs)