
Page 1: CS 152 Computer Architecture & Engineering

CS 152 Computer Architecture & Engineering

Andrew Waterman

University of California, Berkeley

Section 7, Spring 2010

Page 2: CS 152 Computer Architecture & Engineering

Mystery Die

Page 3: CS 152 Computer Architecture & Engineering

Mystery Die

Page 4: CS 152 Computer Architecture & Engineering

Mystery Die

• RISC II: 41K transistors, 4 micron NMOS @ 12 MHz

• 2.2x faster than VAX 11-780 (1500 TTL chips @ 5MHz)

Page 5: CS 152 Computer Architecture & Engineering

Agenda

• Quiz 2 Post-Mortem
– Mean: 53.1
– Standard Deviation: 9.0

Page 6: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• N=1024. Store/Load miss rate for 4KB 2-way cache w/LRU replacement?
• LRU => no conflicts between loads and stores
• Loads are unit-stride with no reuse
• All load misses are compulsory => load miss rate 1/8
• All stores miss because of capacity misses => store miss rate 1
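
As a rough sanity check, here is a minimal sketch of the arithmetic behind those miss rates (assuming 4-byte words and 32-byte lines, which is what the later slides use):

    /* Back-of-the-envelope check of the miss counts above (a sketch; assumes
       4-byte words and 32-byte cache lines, i.e. 8 words per line).          */
    #include <stdio.h>

    int main(void) {
        long N = 1024;
        long words_per_line = 8;              /* 32-byte line / 4-byte word   */
        long loads  = N * N;                  /* one load of A per iteration  */
        long stores = N * N;                  /* one store to B per iteration */

        /* Loads of A are unit-stride with no reuse: one compulsory miss per line. */
        long load_misses  = loads / words_per_line;       /* miss rate 1/8 */

        /* Stores to B stride by N words, so every store touches a new line
           (or one long since evicted): every store misses.               */
        long store_misses = stores;                       /* miss rate 1   */

        printf("load miss rate  = %.3f\n", (double)load_misses  / loads);
        printf("store miss rate = %.3f\n", (double)store_misses / stores);
        return 0;
    }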

Page 7: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• What about FIFO replacement?
• Stores and loads could now conflict. When?
• Stores always use set i/8 % 64
• Loads always use set j/8 % 64
• Conflicts occur when these are equal
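
The set-index claims can be spelled out in a small sketch (assuming 4-byte words, 32-byte lines and 64 sets, i.e. 4 KB / 32 B lines / 2 ways, and that each array's base is set-aligned, which is a simplifying assumption):

    #include <stdio.h>

    /* Set index for a word offset within an array, assuming 4-byte words,
       32-byte lines (8 words per line) and 64 sets.  Offsets are relative
       to the array's base, which we assume is set-aligned.                 */
    static int set_of(long word_offset) { return (int)(word_offset / 8 % 64); }

    int main(void) {
        long N = 1024;
        /* Loads of A[i*N+j] fall in set (j/8) % 64 and stores to B[j*N+i]
           fall in set (i/8) % 64, because i*N/8 and j*N/8 are multiples of 64. */
        for (long i = 0; i < 16; i++)
            for (long j = 0; j < 16; j++)
                if (set_of(i * N + j) == set_of(j * N + i))
                    printf("i=%2ld j=%2ld: load and store share set %d\n",
                           i, j, set_of(i * N + j));
        return 0;
    }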


Page 9: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Allocate a good idea for this code?

Page 10: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Allocate a good idea for this code?
• On every store miss, 32 bytes of data are read into the cache and then discarded, so no
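
To put a number on the wasted fill traffic (a quick sketch; assumes 4-byte words, 32-byte lines and N = 1024):

    #include <stdio.h>

    int main(void) {
        long N = 1024;
        long store_misses = N * N;              /* every store misses            */
        long fill_bytes   = store_misses * 32;  /* 32-byte line fetched per miss */
        printf("useless fill traffic: %ld MB\n", fill_bytes >> 20);  /* ~32 MB   */
        return 0;
    }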

Page 11: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Back a good idea for this code?

Page 12: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• Is Write-Back a good idea for this code?
• With Write-Allocate, bad: each 4-byte store eventually writes back a full 32-byte dirty line on top of the 32-byte fill, for 64 bytes of traffic per store
• Otherwise OK, except that the Write-Through alternative had a write buffer, which dramatically reduces the miss penalty
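
A rough store-traffic comparison of the two combinations (a sketch; only store traffic is counted, same line-size and word-size assumptions as above):

    #include <stdio.h>

    int main(void) {
        long N = 1024, stores = N * N;
        /* Write-allocate + write-back: every store misses, fetches a 32-byte
           line and later writes the whole dirty line back -> 64 B per store. */
        long wb_bytes = stores * (32 + 32);
        /* No-write-allocate + write-through (with a write buffer): only the
           4 stored bytes travel to memory.                                   */
        long wt_bytes = stores * 4;
        printf("write-back traffic:    %ld MB\n", wb_bytes >> 20);  /* ~64 MB */
        printf("write-through traffic: %ld MB\n", wt_bytes >> 20);  /*  ~4 MB */
        return 0;
    }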

Page 13: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• If cache were fully associative, how could we improve code’s performance?

Page 14: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• If cache were fully associative, how could we improve code’s performance?
• Block the transpose
• FA makes this easier; lots of solutions
• Here’s one; let BS = 8 (words in a cache line)

for(i = 0; i < N; i += BS)
  for(j = 0; j < N; j++)
    for(k = 0; k < BS; k++)
      B[j*N+(i+k)] = A[(i+k)*N+j];
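
A further refinement along the same lines (not from the slides): tile both loop dimensions, so every line of A and of B that is brought in is fully used before eviction. With 32-byte lines, an 8x8 tile needs only 8 lines of each array, which fits comfortably in the 128-line fully associative cache.

    /* Tiled transpose: a sketch of blocking both loops; tile size matches
       the 8-word cache line assumed throughout the quiz.                  */
    void transpose_tiled(int *B, const int *A, int N) {
        const int BS = 8;                       /* words per cache line */
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        B[j*N + i] = A[i*N + j];
    }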

Page 15: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• What about TLB misses?
• 4KB pages, 1024-entry DM TLB
• Compulsory misses first
• 2 matrices * (1024^2 words) / (1024 words/page) = 2048

Page 16: CS 152 Computer Architecture & Engineering

Quiz 2, Q1

for(i = 0; i < N; i++)
  for(j = 0; j < N; j++)
    B[j*N+i] = A[i*N+j];

• What about TLB misses?
• 4KB pages, 1024-entry DM TLB
• Now consider some iteration 0 ≤ i < N-1
• After iteration i, TLB[i] = A_i, and TLB[k] = B_k for k ≠ i
• During iteration i+1, the store to B_i will miss
• Then the store to B_i+1 will miss, kicking out A_i+1
• The next load to A_i+1 will miss
• 3 conflicts/iteration
• 3072 + 2048 misses total
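
To see why pages of A and pages of B collide in a direct-mapped TLB, here is a minimal sketch (it assumes A and B are contiguous 4 MB arrays, so their base virtual page numbers differ by exactly 1024; the base VPNs themselves are hypothetical):

    #include <stdio.h>

    int main(void) {
        long PAGES_PER_MATRIX = 1024;        /* 4 MB matrix / 4 KB page */
        long TLB_ENTRIES      = 1024;        /* direct-mapped           */
        long vpn_A = 0, vpn_B = vpn_A + PAGES_PER_MATRIX;  /* hypothetical bases */

        for (long i = 0; i < 4; i++) {
            long a_entry = (vpn_A + i) % TLB_ENTRIES;  /* page holding row i of A    */
            long b_entry = (vpn_B + i) % TLB_ENTRIES;  /* page holding column i of B */
            printf("row %ld of A -> TLB[%ld], page %ld of B -> TLB[%ld]\n",
                   i, a_entry, i, b_entry);            /* same entry: they conflict  */
        }
        return 0;
    }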

Page 17: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Basic idea of microtags: SA caches put the tag check on the critical path (data-out)
• Reduce the critical path by using a subset of the tag to select the way
• In this cache, microtag check -> data-out remains the critical path, but 1/6 faster
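
A sketch of the mechanism (the 8-bit microtag width and the struct layout are made up; the slides only say that a subset of the tag selects the way, and that microtags are unique within a set):

    /* Microtag way selection in a 4-way set (a sketch).  The narrow microtag
       compare picks a way and drives data out early; the full tag compare
       completes later, off the data-out path.                               */
    typedef struct {
        unsigned microtag[4];   /* low bits of each way's tag            */
        unsigned full_tag[4];   /* complete tag, checked in parallel     */
        int      data[4];       /* one word per way, for illustration    */
    } set_t;

    int microtag_read(const set_t *s, unsigned tag, int *hit) {
        unsigned utag = tag & 0xFF;          /* hypothetical 8-bit microtag  */
        int way = -1;
        for (int w = 0; w < 4; w++)          /* microtags are unique per set, */
            if (s->microtag[w] == utag)      /* so at most one way matches    */
                way = w;
        *hit = (way >= 0) && (s->full_tag[way] == tag);  /* full check later  */
        return (way >= 0) ? s->data[way] : 0;            /* data out early    */
    }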

Page 18: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• AMAT = hit time + miss rate * miss penalty
• Hit time is not multiplied by hit rate
• You have to pay the hit time even on a miss
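
A tiny worked example of the formula (the numbers are made up):

    #include <stdio.h>

    int main(void) {
        double hit_time     = 2.0;   /* cycles, hypothetical */
        double miss_rate    = 0.05;  /* hypothetical         */
        double miss_penalty = 40.0;  /* cycles, hypothetical */
        /* Hit time is paid on every access, hit or miss; it is NOT scaled
           by the hit rate.                                                */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 2 + 0.05*40 = 4.0 cycles */
        return 0;
    }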

Page 19: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Microtag uniqueness affects conflict misses
• Increases them compared to 4-way SA
• But still much better than DM
• Otherwise, why would we build a microtagged cache? Just use DM

Page 20: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Aliasing question was unintentionally tricky: microtags are a red herring

• The aliasing problem is just the same as for any virtually-indexed, physically-tagged cache with index+offset ≤ page offset
• Aliases always map to the same set, which would be fine for DM, but with SA they can live in different ways

Page 21: CS 152 Computer Architecture & Engineering

Quiz 2, Q2

• Aliasing question was unintentionally tricky: microtags are a red herring

• The aliasing problem is just the same as for any virtually-indexed, physically-tagged cache with index+offset ≤ page offset
• Simple fix: on a miss, you already have the physical tag and all physical tags in the set

• Iff there’s a match, there’s an alias
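
One way to picture that fix (a sketch; the field names are made up, and four ways are assumed to match the 4-way cache in this question):

    /* On a miss in a virtually-indexed, physically-tagged cache, the physical
       tag returned by the TLB is compared against the physical tags already
       stored in the indexed set; a match means that way holds an alias of
       the missing block.                                                    */
    typedef struct { unsigned ptag[4]; int valid[4]; } cache_set_t;

    int find_alias(const cache_set_t *set, unsigned miss_ptag) {
        for (int w = 0; w < 4; w++)
            if (set->valid[w] && set->ptag[w] == miss_ptag)
                return w;      /* alias found: invalidate or merge this way */
        return -1;             /* no alias in the set                       */
    }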

Page 22: CS 152 Computer Architecture & Engineering

Quiz 2, Q3

• 2x associativity, capacity & line size constant
– Increases hit time due to data-out muxing
– Reduces conflict misses
• Halving line size (associativity & #sets constant)
– Reduces hit time (capacity down)
– Increases miss rate (same reason)
– Reduces miss penalty (shorter lines, less to fetch)

Page 23: CS 152 Computer Architecture & Engineering

Quiz 2, Q3

• Physical -> virtual cache
– Hit time reduced (the only real reason to do this is to remove the TLB from the hit path)
– Effect on miss rate ambiguous
– More misses for aliases
– More misses for context switches w/o ASIDs
– Fewer misses due to address space contiguity
– Increased miss penalty, because TLB lookup is moved to the miss path, and for anti-aliasing

Page 24: CS 152 Computer Architecture & Engineering

Quiz 2, Q3

• Write buffer
– Reduces both store miss penalty and hit time
• HW prefetching
– HW isn’t on the hit path, so no effect on hit time
– Reduces miss rate (main reason)
– A prefetch buffer hit is considered a “slow hit”, not a miss
– Reduces miss penalty (prefetches can be in-flight when a miss occurs)