
Page 1: CS 152 Computer Architecture & Engineering

CS 152 Computer Architecture & Engineering

Andrew Waterman

University of California, Berkeley

Section 7, Spring 2010

Page 2: CS 152 Computer Architecture & Engineering

Mystery Die

Page 3: CS 152 Computer Architecture & Engineering

Mystery Die

Page 4: CS 152 Computer Architecture & Engineering

Mystery Die

• RISC II: 41K transistors, 4 micron NMOS @ 12 MHz

• 2.2x faster than VAX 11-780 (1500 TTL chips @ 5MHz)

Page 5: CS 152 Computer Architecture & Engineering

Agenda

• Quiz 2 Post-Mortem
– Mean: 53.1
– Standard Deviation: 9.0

Page 6: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• N=1024. Store/Load miss rate for 4KB 2-way cache w/LRU replacement?
• LRU => no conflicts between loads and stores
• Loads are unit-stride with no reuse
• All load misses are compulsory => load miss rate 1/8
• All stores miss because of capacity misses => store miss rate 1
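
As a rough sanity check, here is a minimal sketch of the arithmetic behind those miss rates (assuming 4-byte words and 32-byte lines, which is what the later slides use):

    /* Back-of-the-envelope check of the miss counts above (a sketch; assumes
       4-byte words and 32-byte cache lines, i.e. 8 words per line).          */
    #include <stdio.h>

    int main(void) {
        long N = 1024;
        long words_per_line = 8;              /* 32-byte line / 4-byte word   */
        long loads  = N * N;                  /* one load of A per iteration  */
        long stores = N * N;                  /* one store to B per iteration */

        /* Loads of A are unit-stride with no reuse: one compulsory miss per line. */
        long load_misses  = loads / words_per_line;       /* miss rate 1/8 */

        /* Stores to B stride by N words, so every store touches a new line
           (or one long since evicted): every store misses.               */
        long store_misses = stores;                       /* miss rate 1   */

        printf("load miss rate  = %.3f\n", (double)load_misses  / loads);
        printf("store miss rate = %.3f\n", (double)store_misses / stores);
        return 0;
    }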

Page 7: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• What about FIFO replacement?
• Stores and loads could now conflict. When?
• Stores always use set i/8 % 64
• Loads always use set j/8 % 64
• Conflicts occur when these are equal
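
The set-index claims can be spelled out in a small sketch (assuming 4-byte words, 32-byte lines and 64 sets, i.e. 4 KB / 32 B lines / 2 ways, and that each array's base is set-aligned, which is a simplifying assumption):

    #include <stdio.h>

    /* Set index for a word offset within an array, assuming 4-byte words,
       32-byte lines (8 words per line) and 64 sets.  Offsets are relative
       to the array's base, which we assume is set-aligned.                 */
    static int set_of(long word_offset) { return (int)(word_offset / 8 % 64); }

    int main(void) {
        long N = 1024;
        /* Loads of A[i*N+j] fall in set (j/8) % 64 and stores to B[j*N+i]
           fall in set (i/8) % 64, because i*N/8 and j*N/8 are multiples of 64. */
        for (long i = 0; i < 16; i++)
            for (long j = 0; j < 16; j++)
                if (set_of(i * N + j) == set_of(j * N + i))
                    printf("i=%2ld j=%2ld: load and store share set %d\n",
                           i, j, set_of(i * N + j));
        return 0;
    }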


Page 9: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Allocate a good idea for this code?

Page 10: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Allocate a good idea for this code?
• On every store miss, 32 bytes of data are read into the cache and then discarded, so no
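
To put a number on the wasted fill traffic (a quick sketch; assumes 4-byte words, 32-byte lines and N = 1024):

    #include <stdio.h>

    int main(void) {
        long N = 1024;
        long store_misses = N * N;              /* every store misses            */
        long fill_bytes   = store_misses * 32;  /* 32-byte line fetched per miss */
        printf("useless fill traffic: %ld MB\n", fill_bytes >> 20);  /* ~32 MB   */
        return 0;
    }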

Page 11: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Back a good idea for this code?

Page 12: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Back a good idea for this code?
• With Write-Allocate, bad: each 4-byte store eventually writes back a full 32-byte dirty line on top of the 32-byte fill, for 64 bytes of traffic per store
• Otherwise OK, except that the Write-Through alternative had a write buffer, which dramatically reduces the miss penalty
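
A rough store-traffic comparison of the two combinations (a sketch; only store traffic is counted, same line-size and word-size assumptions as above):

    #include <stdio.h>

    int main(void) {
        long N = 1024, stores = N * N;
        /* Write-allocate + write-back: every store misses, fetches a 32-byte
           line and later writes the whole dirty line back -> 64 B per store. */
        long wb_bytes = stores * (32 + 32);
        /* No-write-allocate + write-through (with a write buffer): only the
           4 stored bytes travel to memory.                                   */
        long wt_bytes = stores * 4;
        printf("write-back traffic:    %ld MB\n", wb_bytes >> 20);  /* ~64 MB */
        printf("write-through traffic: %ld MB\n", wt_bytes >> 20);  /*  ~4 MB */
        return 0;
    }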

Page 13: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• If cache were fully associative, how could we improve code’s performance?

Page 14: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• If cache were fully associative, how could we improve code’s performance?
• Block the transpose
• FA makes this easier; lots of solutions
• Here’s one; let BS = 8 (words in a cache line)

for(i = 0; i < N; i += BS)
  for(j = 0; j < N; j++)
    for(k = 0; k < BS; k++)
      B[j*N+(i+k)] = A[(i+k)*N+j];
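
A further refinement along the same lines (not from the slides): tile both loop dimensions, so every line of A and of B that is brought in is fully used before eviction. With 32-byte lines, an 8x8 tile needs only 8 lines of each array, which fits comfortably in the 128-line fully associative cache.

    /* Tiled transpose: a sketch of blocking both loops; tile size matches
       the 8-word cache line assumed throughout the quiz.                  */
    void transpose_tiled(int *B, const int *A, int N) {
        const int BS = 8;                       /* words per cache line */
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        B[j*N + i] = A[i*N + j];
    }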

Page 15: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• What about TLB misses?
• 4KB pages, 1024-entry DM TLB
• Compulsory misses first
• 2 matrices * (1024^2 words) / (1024 words/page) = 2048

Page 16: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• What about TLB misses?
• 4KB pages, 1024-entry DM TLB
• Now consider some iteration 0 ≤ i < N-1
• After iteration i, TLB[i] = A_i, and TLB[k] = B_k for k ≠ i
• During iteration i+1, the store to B_i will miss
• Then the store to B_i+1 will miss, kicking out A_i+1
• The next load to A_i+1 will miss
• 3 conflicts/iteration
• 3072 + 2048 misses total
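
To see why pages of A and pages of B collide in a direct-mapped TLB, here is a minimal sketch (it assumes A and B are contiguous 4 MB arrays, so their base virtual page numbers differ by exactly 1024; the base VPNs themselves are hypothetical):

    #include <stdio.h>

    int main(void) {
        long PAGES_PER_MATRIX = 1024;        /* 4 MB matrix / 4 KB page */
        long TLB_ENTRIES      = 1024;        /* direct-mapped           */
        long vpn_A = 0, vpn_B = vpn_A + PAGES_PER_MATRIX;  /* hypothetical bases */

        for (long i = 0; i < 4; i++) {
            long a_entry = (vpn_A + i) % TLB_ENTRIES;  /* page holding row i of A    */
            long b_entry = (vpn_B + i) % TLB_ENTRIES;  /* page holding column i of B */
            printf("row %ld of A -> TLB[%ld], page %ld of B -> TLB[%ld]\n",
                   i, a_entry, i, b_entry);            /* same entry: they conflict  */
        }
        return 0;
    }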

Page 17: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Basic idea of microtags: SA caches put the tag check on the critical path (data-out)
• Reduce the critical path by using a subset of the tag to select the way
• In this cache, microtag check -> data-out remains the critical path, but 1/6 faster
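
A sketch of the mechanism (the 8-bit microtag width and the struct layout are made up; the slides only say that a subset of the tag selects the way, and that microtags are unique within a set):

    /* Microtag way selection in a 4-way set (a sketch).  The narrow microtag
       compare picks a way and drives data out early; the full tag compare
       completes later, off the data-out path.                               */
    typedef struct {
        unsigned microtag[4];   /* low bits of each way's tag            */
        unsigned full_tag[4];   /* complete tag, checked in parallel     */
        int      data[4];       /* one word per way, for illustration    */
    } set_t;

    int microtag_read(const set_t *s, unsigned tag, int *hit) {
        unsigned utag = tag & 0xFF;          /* hypothetical 8-bit microtag  */
        int way = -1;
        for (int w = 0; w < 4; w++)          /* microtags are unique per set, */
            if (s->microtag[w] == utag)      /* so at most one way matches    */
                way = w;
        *hit = (way >= 0) && (s->full_tag[way] == tag);  /* full check later  */
        return (way >= 0) ? s->data[way] : 0;            /* data out early    */
    }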

Page 18: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• AMAT = hit time + miss rate * miss penalty
• Hit time is not multiplied by hit rate
• You have to pay the hit time even on a miss
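
A tiny worked example of the formula (the numbers are made up):

    #include <stdio.h>

    int main(void) {
        double hit_time     = 2.0;   /* cycles, hypothetical */
        double miss_rate    = 0.05;  /* hypothetical         */
        double miss_penalty = 40.0;  /* cycles, hypothetical */
        /* Hit time is paid on every access, hit or miss; it is NOT scaled
           by the hit rate.                                                */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 2 + 0.05*40 = 4.0 cycles */
        return 0;
    }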

Page 19: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Microtag uniqueness affects conflict misses
• Increases them compared to 4-way SA
• But still much better than DM
• Otherwise, why would we build a microtagged cache? Just use DM

Page 20: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Aliasing question was unintentionally tricky: microtags are a red herring

• The aliasing problem is just the same as for any virtually-indexed, physically-tagged cache with index+offset ≤ page offset
• Aliases always map to the same set, which would be fine for DM, but with SA they can live in different ways

Page 21: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Aliasing question was unintentionally tricky: microtags are a red herring

• The aliasing problem is just the same as for any virtually-indexed, physically-tagged cache with index+offset ≤ page offset
• Simple fix: on a miss, you already have the physical tag and all physical tags in the set

• Iff there’s a match, there’s an alias
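
One way to picture that fix (a sketch; the field names are made up, and four ways are assumed to match the 4-way cache in this question):

    /* On a miss in a virtually-indexed, physically-tagged cache, the physical
       tag returned by the TLB is compared against the physical tags already
       stored in the indexed set; a match means that way holds an alias of
       the missing block.                                                    */
    typedef struct { unsigned ptag[4]; int valid[4]; } cache_set_t;

    int find_alias(const cache_set_t *set, unsigned miss_ptag) {
        for (int w = 0; w < 4; w++)
            if (set->valid[w] && set->ptag[w] == miss_ptag)
                return w;      /* alias found: invalidate or merge this way */
        return -1;             /* no alias in the set                       */
    }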

Page 22: CS 152 Computer Architecture & Engineering

Quiz 2, Q3

• 2x associativity, capacity & line size constant
– Increases hit time due to data-out muxing
– Reduces conflict misses
• Halving line size (associativity & #sets constant)
– Reduces hit time (capacity down)
– Increases miss rate (same reason)
– Reduces miss penalty (shorter lines, less to fetch)

Page 23: CS 152 Computer Architecture & Engineering

Quiz 2, Q3

• Physical -> virtual cache
– Hit time reduced (the only real reason to do this is to remove the TLB from the hit path)
– Effect on miss rate ambiguous
– More misses for aliases
– More misses for context switches w/o ASIDs
– Fewer misses due to address space contiguity
– Increased miss penalty, because TLB lookup is moved to the miss path, and for anti-aliasing

Page 24: CS 152 Computer Architecture & Engineering

Quiz 2, Q3

• Write buffer
– Reduces both store miss penalty and hit time
• HW prefetching
– HW isn’t on the hit path, so no effect on hit time
– Reduces miss rate (main reason)
– A prefetch buffer hit is considered a “slow hit”, not a miss
– Reduces miss penalty (prefetches can be in-flight when a miss occurs)