today

38
University of Washington Today HW3 extension Phew! Lab 4? Finish up caches, exceptional control flow 1

Upload: dudley

Post on 12-Jan-2016


TRANSCRIPT

Page 1: Today

University of Washington

Today
  HW3 extension
  Phew! Lab 4?
  Finish up caches, exceptional control flow

Page 2: Today

Cache Associativity

[Figure: the same 8 cache blocks (0 to 7) organized four ways]

1-way (direct mapped): 8 sets, 1 block each
2-way: 4 sets, 2 blocks each
4-way: 2 sets, 4 blocks each
8-way (fully associative): 1 set, 8 blocks
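The four organizations differ only in how many sets the 8 blocks are divided into. A minimal sketch, assuming the usual placement rule that a block goes to set (block address mod number of sets); the function name set_index and its parameters are illustrative and do not come from the slides:

```c
#include <assert.h>

/* Hypothetical helper: which set does a memory block map to?
 * total_blocks = number of block slots in the cache (8 in the slide),
 * ways         = associativity (1, 2, 4, or 8 in the slide). */
unsigned set_index(unsigned block_addr, unsigned total_blocks, unsigned ways)
{
    assert(ways > 0 && total_blocks % ways == 0);
    unsigned num_sets = total_blocks / ways;   /* 8, 4, 2, or 1 sets */
    return block_addr % num_sets;
}
```

For example, block 13 lands in set 5 when the cache is direct mapped, in set 1 for both the 2-way and the 4-way organization, and in set 0 (the only set) when the cache is fully associative.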


Page 5: Today

Types of Cache Misses

Cold (compulsory) miss
  Occurs on first access to a block

Conflict miss
  Most hardware caches limit blocks to a small subset (sometimes just one) of the available cache slots
    if one (e.g., block i must be placed in slot (i mod size)): direct-mapped
    if more than one: n-way set-associative (where n is a power of 2)
  Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot
    e.g., in a direct-mapped cache with 8 slots, referencing blocks 0, 8, 0, 8, ... would miss every time

Capacity miss
  Occurs when the set of active cache blocks (the working set) is larger than the cache (just won’t fit)
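The three kinds of miss can be seen by replaying a short trace of block addresses against a small simulated cache. This is only a sketch under stated assumptions: the cache holds 8 blocks, replacement within a set is LRU (the slides do not name a policy), block addresses are non-negative, and the name count_misses is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOTAL_BLOCKS 8   /* assumed cache capacity, in blocks */

/* Count misses for a trace of block addresses on an 8-block cache with the
 * given associativity, using LRU replacement within each set. */
int count_misses(const int *trace, size_t len, int ways)
{
    assert(ways > 0 && TOTAL_BLOCKS % ways == 0);
    int num_sets = TOTAL_BLOCKS / ways;
    int tag[TOTAL_BLOCKS];         /* block address stored in each slot */
    long last_used[TOTAL_BLOCKS];  /* time of last access; -1 means empty */
    int misses = 0;

    for (int s = 0; s < TOTAL_BLOCKS; s++)
        last_used[s] = -1;

    for (size_t t = 0; t < len; t++) {
        int base = (trace[t] % num_sets) * ways;   /* first slot of the set */
        int hit = -1, victim = base;
        for (int w = base; w < base + ways; w++) {
            if (last_used[w] >= 0 && tag[w] == trace[t])
                hit = w;
            if (last_used[w] < last_used[victim])
                victim = w;        /* an empty slot, else the LRU slot */
        }
        if (hit < 0) {             /* miss: bring the block in */
            misses++;
            tag[victim] = trace[t];
            hit = victim;
        }
        last_used[hit] = (long)t;
    }
    return misses;
}
```

With the trace 0, 8, 0, 8 and ways = 1, every access misses (conflict misses after the first two cold misses). With ways = 2, only the two cold misses remain. Cycling through blocks 0 to 8 twice with ways = 8 misses on all 18 accesses, because nine blocks do not fit in eight slots (capacity misses).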

Page 6: Today

Intel Core i7 Cache Hierarchy

Processor package
  Core 0 ... Core 3, each with: Regs, L1 d-cache, L1 i-cache, L2 unified cache
  L3 unified cache (shared by all cores)
Main memory

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 11 cycles
L3 unified cache: 8 MB, 16-way, access: 30-40 cycles
Block size: 64 bytes for all caches.
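The sizes above determine how many sets each cache has, and therefore how an address splits into offset and index bits. A small sketch of that arithmetic, assuming the standard relation sets = capacity / (block size * associativity); the helper names are illustrative:

```c
#include <assert.h>

/* Number of sets = capacity / (block size * associativity). */
unsigned long num_sets(unsigned long capacity_bytes, unsigned long block_bytes,
                       unsigned long ways)
{
    assert(block_bytes > 0 && ways > 0);
    return capacity_bytes / (block_bytes * ways);
}

/* log2 of a power of two: the number of address bits needed to select
 * one of x items. */
unsigned log2_pow2(unsigned long x)
{
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}
```

With the numbers on the slide, L1 has 64 sets (6 offset bits for the 64-byte block, 6 index bits), L2 has 512 sets, and L3 has 8192 sets.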

Page 7: Today

What about writes?

Multiple copies of data exist:
  L1, L2, possibly L3, main memory

What is the main problem with that?

Page 8: Today

What about writes?

Multiple copies of data exist:
  L1, L2, possibly L3, main memory

What to do on a write-hit?
  Write-through (write immediately to memory)
  Write-back (defer write to memory until line is evicted)
    Need a dirty bit to indicate if line is different from memory or not

What to do on a write-miss?
  Write-allocate (load into cache, update line in cache)
    Good if more writes to the location follow
  No-write-allocate (just write immediately to memory)

Typical caches:
  Write-back + Write-allocate, usually
  Write-through + No-write-allocate, occasionally
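To make the write-back + write-allocate combination concrete, here is a toy model with a single cache line in front of a 16-word memory. Everything in it (the struct, the names store, memory, and mem_writes, the one-word line) is an assumption made for illustration, not a description of real hardware.

```c
#include <assert.h>
#include <stdbool.h>

#define MEM_WORDS 16

/* One cache line holding a single word (a deliberate simplification). */
struct line { bool valid; bool dirty; int addr; int data; };

int memory[MEM_WORDS];
int mem_writes;              /* counts writes that actually reach memory */

/* Store under a write-back + write-allocate policy. */
void store(struct line *l, int addr, int value)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    if (!l->valid || l->addr != addr) {    /* write miss */
        if (l->valid && l->dirty) {        /* evicting a dirty line: flush it */
            memory[l->addr] = l->data;
            mem_writes++;
        }
        l->valid = true;                   /* write-allocate: load the line */
        l->addr = addr;
        l->data = memory[addr];
    }
    l->data = value;                       /* write-back: update cache only */
    l->dirty = true;                       /* line now differs from memory */
}
```

Three stores to the same address cause no memory writes at all; the single write to memory happens only when a store to a different address evicts the dirty line. That is the case where write-allocate pays off: more writes to the same location follow.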

Page 9: Today


Where else is caching used?

Page 10: Today

Software Caches are More Flexible

Examples
  File system buffer caches, web browser caches, etc.

Some design differences
  Almost always fully-associative
    so, no placement restrictions
    index structures like hash tables are common (for placement)
  Often use complex replacement policies
    misses are very expensive when disk or network involved
    worth thousands of cycles to avoid them
  Not necessarily constrained to single “block” transfers
    may fetch or write-back in larger units, opportunistically

Page 11: Today

Optimizations for the Memory Hierarchy

Write code that has locality
  Spatial: access data contiguously
  Temporal: make sure access to the same data is not too far apart in time

How to achieve?
  Proper choice of algorithm
  Loop transformations
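A simple loop transformation that affects spatial locality is loop interchange. The sketch below sums the same 2D array in two loop orders; the array size N and the function names are arbitrary choices for illustration. C stores arrays in row-major order, so the first version reads memory contiguously and the second jumps N elements between accesses.

```c
#define N 64   /* array dimension, chosen arbitrarily for this sketch */

/* Fill a[i][j] with i*N + j so the sum is easy to check. */
void fill(int a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i * N + j;
}

/* Stride-1 traversal: visits elements in the order they sit in memory. */
long sum_rows(int a[N][N])
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-N traversal: same result, but consecutive accesses are
 * N * sizeof(int) bytes apart. */
long sum_cols(int a[N][N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same sum; only the order of memory accesses, and therefore the cache behavior on large arrays, differs.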

Page 12: Today

Example: Matrix Multiplication

[Figure: c = a * b; element (i, j) of c is row i of a times column j of b]

  c = (double *) calloc(n*n, sizeof(double));

  /* Multiply n x n matrices a and b */
  void mmm(double *a, double *b, double *c, int n) {
      int i, j, k;
      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++)
              for (k = 0; k < n; k++)
                  c[i*n + j] += a[i*n + k]*b[k*n + j];
  }

Page 13: Today

Cache Miss Analysis

Assume:
  Matrix elements are doubles
  Cache block = 64 bytes = 8 doubles
  Cache size C << n (much smaller than n)

First iteration:
  n/8 + n = 9n/8 misses (omitting matrix c)
  (n/8 for the row of a, read with stride 1; n for the column of b, read with stride n)

Afterwards in cache (schematic):
  [Figure: the row of a and an 8-wide strip of b around the column just read]

Page 14: Today

Cache Miss Analysis

Assume:
  Matrix elements are doubles
  Cache block = 64 bytes = 8 doubles
  Cache size C << n (much smaller than n)

Other iterations:
  Again: n/8 + n = 9n/8 misses (omitting matrix c)

Total misses:
  9n/8 * n^2 = (9/8) * n^3

[Figure: row of a of length n times an 8-wide strip of b]

Page 15: Today

Blocked Matrix Multiplication

  c = (double *) calloc(n*n, sizeof(double));

  /* Multiply n x n matrices a and b, one B x B block at a time
     (B is the block size; n is assumed to be a multiple of B) */
  void mmm(double *a, double *b, double *c, int n) {
      int i, j, k, i1, j1, k1;
      for (i = 0; i < n; i+=B)
          for (j = 0; j < n; j+=B)
              for (k = 0; k < n; k+=B)
                  /* B x B mini matrix multiplications */
                  for (i1 = i; i1 < i+B; i1++)
                      for (j1 = j; j1 < j+B; j1++)
                          for (k1 = k; k1 < k+B; k1++)
                              c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];
  }

[Figure: c = a * b computed on blocks of size B x B, with row i1 and column j1 marked inside the blocks]
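Blocking only reorders the additions, so it should produce the same product as the plain triple loop. The following sketch checks that on small inputs. It is a test harness written for illustration, not part of the lecture: the block size is passed as a parameter named bs instead of a constant B, n is assumed to be a multiple of bs, and the helper blocked_matches_naive is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Straightforward triple loop, as in the unblocked version. */
void mmm_naive(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Blocked version; bs plays the role of the block size B. */
void mmm_blocked(const double *a, const double *b, double *c, int n, int bs)
{
    assert(bs > 0 && n % bs == 0);   /* assumption: n is a multiple of bs */
    for (int i = 0; i < n; i += bs)
        for (int j = 0; j < n; j += bs)
            for (int k = 0; k < n; k += bs)
                /* bs x bs mini matrix multiplication */
                for (int i1 = i; i1 < i + bs; i1++)
                    for (int j1 = j; j1 < j + bs; j1++)
                        for (int k1 = k; k1 < k + bs; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

/* Returns 1 if both versions produce identical results, 0 if they differ,
 * and -1 if memory allocation fails. */
int blocked_matches_naive(int n, int bs)
{
    size_t count = (size_t)n * (size_t)n;
    double *a  = malloc(count * sizeof *a);
    double *b  = malloc(count * sizeof *b);
    double *c1 = calloc(count, sizeof *c1);
    double *c2 = calloc(count, sizeof *c2);
    int same = -1;

    if (a && b && c1 && c2) {
        /* Small integer values keep every sum exact in double arithmetic,
         * so the two results can be compared with ==. */
        for (size_t x = 0; x < count; x++) {
            a[x] = (double)(x % 7);
            b[x] = (double)(x % 5);
        }
        mmm_naive(a, b, c1, n);
        mmm_blocked(a, b, c2, n, bs);
        same = 1;
        for (size_t x = 0; x < count; x++)
            if (c1[x] != c2[x])
                same = 0;
    }
    free(a);
    free(b);
    free(c1);
    free(c2);
    return same;
}
```

Calling blocked_matches_naive(16, 4) multiplies two 16 x 16 matrices both ways and reports whether the results agree.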

Page 16: Today

Cache Miss Analysis

Assume:
  Cache block = 64 bytes = 8 doubles
  Cache size C << n (much smaller than n)
  Three blocks fit into cache: 3B^2 < C

First (block) iteration:
  B^2/8 misses for each block
  2n/B * B^2/8 = nB/4 (omitting matrix c)

Afterwards in cache (schematic):
  [Figure: a row of n/B blocks of a times a column of n/B blocks of b; block size B x B]

Page 17: Today

Cache Miss Analysis

Assume:
  Cache block = 64 bytes = 8 doubles
  Cache size C << n (much smaller than n)
  Three blocks fit into cache: 3B^2 < C

Other (block) iterations:
  Same as first iteration
  2n/B * B^2/8 = nB/4

Total misses:
  nB/4 * (n/B)^2 = n^3/(4B)

[Figure: n/B blocks per row and column; block size B x B]

Page 18: Today

Summary

No blocking: (9/8) * n^3
Blocking: 1/(4B) * n^3

If B = 8, difference is 4 * 8 * 9 / 8 = 36x
If B = 16, difference is 4 * 16 * 9 / 8 = 72x

Suggests largest possible block size B, but limit 3B^2 < C!

Reason for dramatic difference:
  Matrix multiplication has inherent temporal locality:
    Input data: 3n^2, computation 2n^3
    Every array element used O(n) times!
  But program has to be written properly
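The 36x and 72x figures follow directly from dividing the two miss estimates, which gives 9B/2 regardless of n. A short sketch that evaluates both estimates; the function names are illustrative, and the formulas are the ones from the analysis (matrix c omitted):

```c
#include <assert.h>

/* Estimated misses without blocking: (9/8) * n^3. */
double naive_misses(double n)
{
    return 9.0 / 8.0 * n * n * n;
}

/* Estimated misses with B x B blocking: n^3 / (4B). */
double blocked_misses(double n, double B)
{
    assert(B > 0);
    return n * n * n / (4.0 * B);
}
```

For n = 1024, naive_misses(n) / blocked_misses(n, 8) evaluates to 36 and the same ratio with B = 16 evaluates to 72, matching the slide.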

Page 19: Today

Cache-Friendly Code

Programmer can optimize for cache performance
  How data structures are organized
  How data are accessed
    Nested loop structure
    Blocking is a general technique

All systems favor “cache-friendly code”
  Getting absolute optimum performance is very platform specific
    Cache sizes, line sizes, associativities, etc.
  Can get most of the advantage with generic code
    Keep working set reasonably small (temporal locality)
    Use small strides (spatial locality)
    Focus on inner loop code

Page 20: Today

Intel Core i7 Cache Hierarchy

Processor package
  Core 0 ... Core 3, each with: Regs, L1 d-cache, L1 i-cache, L2 unified cache
  L3 unified cache (shared by all cores)
Main memory

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 11 cycles
L3 unified cache: 8 MB, 16-way, access: 30-40 cycles
Block size: 64 bytes for all caches.

Page 21: Today

The Memory Mountain

[Figure: read throughput (MB/s, 0 to 7000) plotted against stride (s1 to s32, x8 bytes) and working set size (4K to 64M bytes); regions labeled L1, L2, L3, Mem]

Intel Core i7
  32 KB L1 i-cache
  32 KB L1 d-cache
  256 KB unified L2 cache
  8M unified L3 cache
  All caches on-chip

Page 22: Today

Roadmap

C:
  car *c = malloc(sizeof(car));
  c->miles = 100;
  c->gals = 17;
  float mpg = get_mpg(c);
  free(c);

Java:
  Car c = new Car();
  c.setMiles(100);
  c.setGals(17);
  float mpg = c.getMPG();

Assembly language:
  get_mpg:
      pushq %rbp
      movq %rsp, %rbp
      ...
      popq %rbp
      ret

Machine code:
  01110100000110001000110100000100000000101000100111000010110000011111101000011111

OS: [Figure]

Computer system: [Figure]

Topics:
  Data & addressing
  Integers & floats
  Machine code & C
  x86 assembly programming
  Procedures & stacks
  Arrays & structs
  Memory & caches
  Exceptions & processes
  Virtual memory
  Memory allocation
  Java vs. C

Page 23: Today

Control Flow

So far, we’ve seen how the flow of control changes as a single program executes

A CPU executes more than one program at a time, though, so we also need to understand how control flows across the many components of the system

Exceptional control flow is the basic mechanism used for:
  Transferring control between processes and OS
  Handling I/O and virtual memory within the OS
  Implementing multi-process applications like shells and web servers
  Implementing concurrency

Page 24: Today

Control Flow

Processors do only one thing:
  From startup to shutdown, a CPU simply reads and executes (interprets) a sequence of instructions, one at a time
  This sequence is the CPU’s control flow (or flow of control)

[Figure: physical control flow over time: <startup>, inst1, inst2, inst3, ..., instn, <shutdown>]

Page 25: Today

Altering the Control Flow

Up to now: two ways to change control flow: ... which ones?

Page 26: Today

Altering the Control Flow

Up to now: two ways to change control flow:
  Jumps (conditional and unconditional)
  Call and return
  Both react to changes in program state

Processor also needs to react to changes in system state
  Like?

Page 27: Today

Altering the Control Flow

Up to now: two ways to change control flow:
  Jumps (conditional and unconditional)
  Call and return
  Both react to changes in program state

Processor also needs to react to changes in system state
  user hits “Ctrl-C” at the keyboard
  user clicks on a different application’s window on the screen
  data arrives from a disk or a network adapter
  instruction divides by zero
  system timer expires

Can jumps and procedure calls achieve this?

Page 28: Today

Altering the Control Flow

Up to now: two ways to change control flow:
  Jumps (conditional and unconditional)
  Call and return
  Both react to changes in program state

Processor also needs to react to changes in system state
  user hits “Ctrl-C” at the keyboard
  user clicks on a different application’s window on the screen
  data arrives from a disk or a network adapter
  instruction divides by zero
  system timer expires

Can jumps and procedure calls achieve this?
  Jumps and calls are not sufficient: the system needs mechanisms for “exceptional” control flow!

Page 29: Today

Exceptional Control Flow

Exists at all levels of a computer system

Low level mechanisms
  Exceptions
    change in processor’s control flow in response to a system event (i.e., change in system state, user-generated interrupt)
    Combination of hardware and OS software

Higher level mechanisms
  Process context switch
  Signals (you’ll hear about these in CSE451 and CSE466)
  Implemented by either:
    OS software
    C language runtime library

Page 30: Today

Exceptions

An exception is a transfer of control to the operating system (OS) in response to some event (i.e., change in processor state)
  Examples: div by 0, page fault, I/O request completes, Ctrl-C

[Figure: an event occurs in the user process at instruction I_current (followed by I_next); the exception transfers control to the OS, where the exception handler processes it and then either returns to I_current, returns to I_next, or aborts]

How does the system know where to jump to in the OS?

Page 31: Today

Interrupt Vectors

Each type of event has a unique exception number k
k = index into exception table (a.k.a. interrupt vector)
Handler k is called each time exception k occurs

[Figure: exception table with entries 0, 1, 2, ..., n-1 (the exception numbers); entry k points to the code for exception handler k]
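Conceptually, the exception table is an array of handler addresses indexed by the exception number. A user-level sketch of that idea using an array of function pointers follows; the real table is set up by the OS and consulted by the hardware, so the names here (exception_table, dispatch, the two handlers) and the table size of 256 are assumptions for illustration. The numbers 0 and 14 are the IA32 divide error and page fault entries.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_EXCEPTIONS 256   /* assumed table size for this sketch */

typedef void (*handler_t)(void);

static handler_t exception_table[NUM_EXCEPTIONS];
static int last_handled = -1;   /* records which handler ran, for demonstration */

static void divide_error_handler(void) { last_handled = 0; }
static void page_fault_handler(void)   { last_handled = 14; }

/* Fill in the table entries for the two handlers defined above. */
void install_handlers(void)
{
    exception_table[0]  = divide_error_handler;
    exception_table[14] = page_fault_handler;
}

/* What happens conceptually when exception k occurs:
 * index the table with k and call the handler found there. */
int dispatch(int k)
{
    assert(k >= 0 && k < NUM_EXCEPTIONS);
    if (exception_table[k] == NULL)
        return -1;              /* no handler installed in this sketch */
    exception_table[k]();
    return 0;
}
```

After install_handlers(), dispatch(14) runs the page fault handler and dispatch(0) runs the divide error handler, each found by a single array lookup.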

Page 32: Today

Asynchronous Exceptions (Interrupts)

Caused by events external to the processor
  Indicated by setting the processor’s interrupt pin(s)
  Handler returns to “next” instruction

Examples:
  I/O interrupts
    hitting Ctrl-C on the keyboard
    clicking a mouse button or tapping a touchscreen
    arrival of a packet from a network
    arrival of data from a disk
  Hard reset interrupt
    hitting the reset button on front panel
  Soft reset interrupt
    hitting Ctrl-Alt-Delete on a PC

Page 33: Today

Synchronous Exceptions

Caused by events that occur as a result of executing an instruction:

Traps
  Intentional: transfer control to OS to perform some function
  Examples: system calls, breakpoint traps, special instructions
  Returns control to “next” instruction

Faults
  Unintentional but possibly recoverable
  Examples: page faults (recoverable), segment protection faults (unrecoverable), integer divide-by-zero exceptions (unrecoverable)
  Either re-executes faulting (“current”) instruction or aborts

Aborts
  Unintentional and unrecoverable
  Examples: parity error, machine check
  Aborts current program

Page 34: Today

Trap Example: Opening File

User calls: open(filename, options)
Function open executes system call instruction int

  0804d070 <__libc_open>:
    . . .
    804d082:  cd 80    int  $0x80
    804d084:  5b       pop  %ebx
    . . .

OS must find or create file, get it ready for reading or writing
Returns integer file descriptor

[Figure: the user process executes int, the exception transfers control to the OS, the OS opens the file and returns to the pop instruction]
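From the programmer's side, the trap is hidden inside the library function. A minimal sketch of a caller, assuming a POSIX system; the wrapper name can_open_for_reading is made up for illustration, and the library on a given system may use a different trap instruction than the int $0x80 shown in the disassembly.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 1 if the file could be opened for reading, 0 otherwise. */
int can_open_for_reading(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);   /* the library call traps into the OS */
    if (fd < 0)
        return 0;                    /* open failed; errno holds the reason */
    close(fd);                       /* release the file descriptor */
    return 1;
}
```

The integer returned by open is the file descriptor mentioned above; a negative value signals that the OS could not open the file.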

Page 35: Today

Fault Example: Page Fault

User writes to memory location
That portion (page) of user’s memory is currently on disk

  int a[1000];
  int main(void)
  {
      a[500] = 13;
  }

  80483b7:  c7 05 10 9d 04 08 0d    movl $0xd,0x8049d10

Page handler must load page into physical memory
Returns to faulting instruction: mov is executed again!
Successful on second try

[Figure: the movl in the user process raises a page fault exception; the OS creates the page, loads it into memory, and returns to the movl]

Page 36: Today

Fault Example: Invalid Memory Reference

  int a[1000];
  int main(void)
  {
      a[5000] = 13;
  }

  80483b7:  c7 05 60 e3 04 08 0d    movl $0xd,0x804e360

Page handler detects invalid address
Sends SIGSEGV signal to user process
User process exits with “segmentation fault”

[Figure: the movl in the user process raises a page fault exception; the OS detects the invalid address and signals the process]
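The signal delivered to the process can be observed from a parent process. The sketch below assumes a POSIX system on which address 0 is not mapped; it stores to that address in a child process (an out-of-bounds array index is not guaranteed to hit an unmapped page) and reports which signal ended the child. The function name signal_from_invalid_store is hypothetical.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs an invalid store in a child process. Returns the number of the signal
 * that terminated the child, 0 if it was not killed by a signal, or -1 if
 * fork or waitpid failed. */
int signal_from_invalid_store(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Read the address through a volatile so the compiler cannot
         * assume anything about its value. */
        volatile uintptr_t addr = 0;
        volatile int *bad = (volatile int *)addr;
        *bad = 13;       /* faults; the OS sends a signal to this process */
        _exit(0);        /* reached only if the store did not fault */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a typical Linux system the returned value equals SIGSEGV, which is what a shell reports as "segmentation fault".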

Page 37: Today

Exception Table IA32 (Excerpt)

Exception Number    Description                 Exception Class
0                   Divide error                Fault
13                  General protection fault    Fault
14                  Page fault                  Fault
18                  Machine check               Abort
32-127              OS-defined                  Interrupt or trap
128 (0x80)          System call                 Trap
129-255             OS-defined                  Interrupt or trap

http://download.intel.com/design/processor/manuals/253665.pdf

Page 38: Today

Summary

Exceptions
  Events that require non-standard control flow
  Generated externally (interrupts) or internally (traps and faults)
  After an exception is handled, one of three things may happen:
    Re-execute the current instruction
    Resume execution with the next instruction
    Abort the process that caused the exception