Copyright 2004 Koren & Krishna ECE655/Ckptg .1
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing - ECE 655
Checkpointing I
Failure During Program Execution
Computers today are much faster, but applications are more complicated
Applications which still take a long time:
(1) Database updates
(2) Fluid-flow simulation - needed for weather and climate modeling
(3) Optimization - optimal deployment of resources by industry (e.g., airlines)
(4) Astronomy - N-body simulations and modeling of the universe
(5) Biochemistry - study of protein folding
When execution time is very long - both probability of failure during execution and cost of failure become significant
Cost of Program Failure - Example
Program takes T hours to execute
System suffers transient failures - λ = rate of failures per hour
Failure is instantaneous but all prior work is lost
E - expected total execution time, including any computational work lost due to failures
If there are no failures during execution - a case which has probability e^{-λT} - conditional expected total execution time = T
The probability that a failure occurs at τ hours into the execution is λ e^{-λτ} dτ
In this case, τ hours are wasted, the program has to be restarted, and an additional expected time E is needed to complete the execution - conditional expected total execution time = τ + E
Cost of Failure - Calculation
Averaging over all cases:
E = T e^{-λT} + ∫₀^T (τ + E) λ e^{-λτ} dτ = E(1 - e^{-λT}) + (1 - e^{-λT})/λ
The solution for E is E = (e^{λT} - 1)/λ
A measure of the overhead (compared to T):
γ = E/T - 1 = (e^{λT} - 1)/(λT) - 1
γ depends only on the product λT - the expected number of failures during the program execution time
γ increases very fast (exponentially) with λT
Preferably - no need to start from the beginning with every failure - checkpointing
Checkpointing - Definition
A checkpoint is a snapshot of the entire state of the process at the moment it was taken - all the information needed to restart the process from that point
Checkpoint saved on stable storage of sufficient reliability
Most commonly used - Disks: can hold data even if power is interrupted (but no physical damage to disk); can hold enormous quantities of data very cheaply
Checkpoints can be very large - tens or hundreds of megabytes
RAM with a battery backup is also used as stable storage
No medium is perfectly reliable - reliability must be sufficiently high for the application at hand
Overhead and Latency of Checkpoint
Checkpoint Overhead is the increase in the execution time of the application due to taking a checkpoint
Checkpoint Latency is the time needed to save the checkpoint
In a simple system - overhead and latency are identical
If part of the checkpointing can be overlapped with application execution - latency may be substantially greater than overhead
If a process checkpoints by writing its state into an internal buffer, the CPU continues execution while the checkpoint is written from buffer to disk
Checkpointing Latency - Example
for (i=0; i<1000000; i++)
    if (f(i) < min) { min = f(i); imin = i; }

for (i=0; i<100; i++)
    for (j=0; j<100; j++)
        c[i][j] += i*j/min;

First part - compute the smallest value of some function f(i) for 0 ≤ i < 1000000
Second part - multiplication followed by division
Size of Checkpoint in Example
Checkpointing latency depends on checkpoint size - can vary from program to program and even during the execution of one program
A checkpoint in the first part can be small - only program counter and the variables min and imin - most registers are not relevant
A checkpoint taken in the second part must include the array c[i][j] computed so far
In general: The size of checkpoint is program-dependent; as small as a few kilobytes or as large as several gigabytes
Issues in Checkpointing
How many checkpoints should we have?
At which points in the execution of a program should we checkpoint?
How can we reduce checkpointing overhead?
How do we checkpoint distributed systems in which there may or may not be a central controller?
At what level (kernel/user/application) should we checkpoint?
How transparent to the user should the checkpointing process be?
Checkpointing at the Kernel Level
Checkpointing procedures included in the kernel - transparent to the user; no changes to program
When system restarts after failure - kernel responsible for managing recovery operation
Every operating system takes checkpoints when a process is preempted - process state is recorded so that execution can resume from the interrupted point without loss of computational work
Most operating systems have little or no checkpointing for fault tolerance
Checkpointing at the User Level
In this approach, a user-level library is provided to do the checkpointing
In order to checkpoint, application programs are linked to this library
Like kernel-level checkpointing, this approach generally requires no changes to the application code - but the application must be explicitly linked with the user-level library
The user-level library also manages recovery from failure
Checkpointing at the Application Level
Application is responsible for all checkpointing functions - code for checkpointing and recovery is part of application
Greatest control over checkpointing process - but expensive to implement and debug
User and application levels do not have access to information held at the kernel level (e.g., kernel thread state) - they cannot assign a particular process id to a recovering process
User and application levels may not be allowed to checkpoint parts of file system - may instead store names and pointers to files
Latency and Overhead - Analytic Model
Overhead is the part of the checkpointing not done in parallel with the application execution
Latency - time between t_start (when the checkpointing operation starts) and t_end (when it ends)
Overhead has a greater impact on performance than latency
Checkpoint represents the state of the system at t_start
Overhead is the part of [t_start, t_end] during which the application is blocked from executing due to checkpointing (CPU is busy checkpointing)
Denoting the overhead by t_c - the overhead interval is [t_start, t_start + t_c]
Model
(Timeline figure omitted: boxes denote latency; the shaded part of each box is the overhead)
If a failure occurs in [t_start, t_end] - the checkpoint being taken is useless and the system must roll back to the previous checkpoint
Example - if a failure occurs in the interval [t3, t5] - we roll back to the preceding checkpoint - the state of the process at time t0
t_r - average recovery time - time spent in a faulty state plus time to recover to a functional state (e.g., to complete rebooting the processor)
If a transient failure occurs at time τ, the process becomes active again at the expected time τ + t_r
Analytic Model - Additional Notations
I_inter - inter-checkpoint interval - time between the completions of the i-th and (i+1)-st checkpoints
E_inter - expected length of I_inter
T - amount of time spent executing the application over this period - if no failures, I_inter = T + t_c
t_l = latency = t_end - t_start
If a failure occurs τ hours into I_inter - lost work includes:
work done during τ
useful work done during [t_start, t_end] = t_l - t_c
average time of t_r to recover and restart computations
Total amount of extra time due to a failure at τ hours into the interval I_inter is τ + t_l - t_c + t_r
Intercheckpoint Interval - A First-Order Approximation
Assumption - at most one failure strikes the system between successive checkpoints - a good approximation if T + t_c is small compared to 1/λ - the average time between failures
Expected time between two successive checkpoints - look at two cases:
Case 1 - no failure between successive checkpoints - length T + t_c - probability of this case is e^{-λ(T+t_c)}
Case 2 - one failure - approximate probability 1 - e^{-λ(T+t_c)}
Additional time due to the failure = τ + t_l + t_r - t_c
Average value of τ = (T + t_c)/2
Expected additional time = (T + t_c)/2 + t_l + t_r - t_c
Length of Intercheckpoint Interval
Contribution of case 1: e^{-λ(T+t_c)} (T + t_c)
Contribution of case 2: (1 - e^{-λ(T+t_c)}) [(T + t_c) + (T + t_c)/2 + t_l - t_c + t_r]
Adding up both cases:
E_inter = (T + t_c) + (1 - e^{-λ(T+t_c)}) [(T + t_c)/2 + t_l - t_c + t_r]
Sensitivity of E_inter to t_l and t_c:
t_l enters only through the failure term - ∂E_inter/∂t_l = 1 - e^{-λ(T+t_c)}, small when failures are rare - while t_c also lengthens every interval directly
E_inter is more sensitive to the overhead t_c than to the latency t_l
t_c should be kept low, even if t_l increases as a result
Reducing Overhead - Buffering
Processor writes the checkpoint into main memory and then returns to executing the application
Direct memory access (DMA) is used to copy the checkpoint from main memory to disk
DMA requires CPU involvement only at the beginning and end of the operation
Refinement - copy-on-write buffering
Copying portions of the process state that are unchanged since the last checkpoint is a waste of time
If the process does not update its main-memory pages too often - most of the work involved in copying the pages to a buffer area can be avoided
Copy on Write Buffering
Most memory systems provide memory protection bits (per page of physical main memory) indicating: (page) is read-write, read-only, or inaccessible
When checkpoint is taken, protection bits of pages belonging to the process are set to read-only
The application continues running while the checkpointed pages are transferred to disk
If the application attempts to update a page, an access violation is triggered
The system then buffers the page and sets its permission back to read-write
The buffered page is later copied to disk
Reducing Checkpointing Overhead - Memory Exclusion
Two types of variables that do not need to be checkpointed - those that have not been updated and those that are “dead”
A dead variable is one whose present value will never again be used by the program
Two kinds of dead variables: those that will never again be referenced by the program, and those for which the next access will be a write
The challenge is to accurately identify such variables
Identifying Dead Variables
The address space of a process has four segments: code, global data, heap, stack
Finding dead variables in the code segment is easy - self-modifying code is no longer used, so the code segment in memory is read-only and need not be checkpointed
The stack segment is equally easy: contents of addresses held in locations below the stack pointer are obviously dead (the virtual address space usually has the stack segment at the top, growing downwards)
Heap segment - many languages allow the programmer to explicitly allocate and deallocate memory (e.g., the malloc() and free() calls in C) - the contents of the free list are dead by definition
Some user-level checkpointing packages (e.g., libckpt) provide the programmer with procedure calls that specify regions of memory to be excluded from, or included in, future checkpoints, and calls (e.g., checkpoint_here()) that trigger a checkpoint
Reducing Latency
Checkpoint compression - less has to be written to disk
How much is gained through compression depends on:
Extent of the compression - application-dependent - can vary between 0 and 50%
Work required to execute the compression algorithm - done by the CPU - adds to the checkpointing overhead as well as to the latency
In simple sequential checkpointing where tc = tl - compression may be beneficial
In more efficient systems where tc << tl - usefulness of this approach is questionable and must be carefully assessed before use