p ortable f ault-t oleran t file i/osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfac kno...

61

Upload: others

Post on 26-Feb-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Portable Fault-Tolerant File I/ObyIgor B. LyubashevskiySubmitted to the Department of Electrical Engineering and Computer Sciencein partial ful�llment of the requirements for the degrees ofMaster of Engineering in Electrical Engineering and Computer ScienceandBachelor of Science in Computer Science and Engineeringat theMASSACHUSETTS INSTITUTE OF TECHNOLOGYJune 1998c Igor B. Lyubashevskiy, MCMXCVIII. All rights reserved.The author hereby grants to MIT permission to reproduce and distribute publiclypaper and electronic copies of this thesis document in whole or in part.Author : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :Department of Electrical Engineering and Computer ScienceJune 1, 1998Certi�ed by : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :Volker StrumpenPostdoctoral AssociateThesis SupervisorCerti�ed by : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :Charles E. LeisersonProfessorThesis SupervisorAccepted by : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :Arthur C. SmithChairman, Department Committee on Graduate Students

Page 2: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Portable Fault-Tolerant File I/ObyIgor B. LyubashevskiySubmitted to the Department of Electrical Engineering and Computer Scienceon June 1, 1998, in partial ful�llment of therequirements for the degrees ofMaster of Engineering in Electrical Engineering and Computer ScienceandBachelor of Science in Computer Science and EngineeringAbstractThe ftIO system provides portable and fault-tolerant �le I/O by enhancing the func-tionality of the ANSI C �le system without changing its application programmerinterface and without depending on system-speci�c implementations of the standard�le operations. The ftIO-system is an extension of the porch compiler and its runtimesystem. The porch compiler automatically generates code to save the internal stateof a program in a portable checkpoint. These portable checkpoints can be recoveredon a binary incompatible architecture. The ftIO system ensures that the state ofstable storage is recovered along with the internal state of a program. The porchcompiler automatically generates code to save bookkeeping information about ftIO'stransactional �le operations in portable checkpoints. We developed a new algorithmfor supporting transactional �le operations based on a private-copy/copy-on-write ap-proach. In this thesis, we discuss design choices for the ftIO system, describe our newalgorithm, provide experimental data for our prototype, and outline opportunities forfuture work with ftIO.Thesis Supervisor: Volker StrumpenTitle: Postdoctoral AssociateThesis Supervisor: Charles E. LeisersonTitle: Professor

2

Page 3: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

AcknowledgmentsI would like to express my sincere gratitude to Dr. Volker Strumpen whose day-to-day involvement for the last two years made this project possible. He was involvedin everything from showing me how to use LaTEX to countless brainstorming sessions.At the end, I made him su�er through all the drafts of this thesis.I would also like to thank Prof. Charles Leiserson for his support, guidance, di-rection, and invaluable feedback. His sense of humor and Wednesday-night pizzaprovided the needed stimulus to keep working.Likewise, I would like to note Edward Kogan, Dimitriy Katz, and Ilya Lisanskyfor proof reading an early version of this write-up and for their feedback.I am greateful to Anne Hunter for her encouragement and motivation.Last, but not least, I would like to express my appreciation to my girfriend, MarinaFedotova, and all my relatives for their care, support, and understanding.This work has been funded in part by DARPA grant N00014-97-1-0985.

3

Page 4: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Contents1 Introduction 82 The porch Compiler 133 The ftIO System Design Alternatives 163.1 Undo-Log Approach : : : : : : : : : : : : : : : : : : : : : : : : : : : 183.2 Private-Copy Approach : : : : : : : : : : : : : : : : : : : : : : : : : : 193.3 Comparison of Implementations : : : : : : : : : : : : : : : : : : : : : 234 The ftIO Algorithm 274.1 The ftIO Finite Automaton : : : : : : : : : : : : : : : : : : : : : : : 304.2 Append Optimization : : : : : : : : : : : : : : : : : : : : : : : : : : : 365 Experimental Results 386 Related Work 437 Conclusion and Future Research 45A More Experimental Results 48

4

Page 5: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

List of Figures4-1 The \live" subset of of transition diagram for the ftIO �nite automaton. 304-2 Complete transition diagram of the ftIO �nite automaton. : : : : : : 324-3 Execution phases. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 334-4 Recovery phases if failure occurred during normal execution or duringporch checkpointing (top), and if failure occurred during ftIO commit(bottom). : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 354-5 Transition diagram of the ftIO �nite automaton including append op-timization. : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 375-1 Performance of reading (left) and appending (right) 10Mbytes of datafrom and to a �le. Checkpoints are taken after about 106 �le operationsin the experiments marked \with checkpointing" and those marked\with append optimization." : : : : : : : : : : : : : : : : : : : : : : : 395-2 Runtimes of random �le accesses with spatial locality. The variationof the number of blocks simulates the behavior of a shadow-block im-plementation with di�erent block sizes. : : : : : : : : : : : : : : : : : 405-3 Runtimes of 2D short-range molecular dynamics code for 5; 000 parti-cles. The checkpointing interval is 20 minutes. : : : : : : : : : : : : : 42A-1 Sun Sparcstation4. Sequential read/write : : : : : : : : : : : : : : : : 49A-2 Sun Sparcstation10. Sequential read/write : : : : : : : : : : : : : : : 49A-3 Sun Sparcstation20. Sequential read/write : : : : : : : : : : : : : : : 49A-4 Sun UltraSparc1. Sequential read/write : : : : : : : : : : : : : : : : : 50A-5 Sun UltraSparc2. Sequential read/write : : : : : : : : : : : : : : : : : 505

Page 6: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

A-6 Sun Ultra-Enterprise. Sequential read/write : : : : : : : : : : : : : : : 50A-7 Sun Sparcstation4. Random access : : : : : : : : : : : : : : : : : : : 51A-8 Sun Sparcstation10. Random access : : : : : : : : : : : : : : : : : : : 52A-9 Sun Sparcstation20. Random access : : : : : : : : : : : : : : : : : : : 52A-10 Sun UltraSparc1. Random access : : : : : : : : : : : : : : : : : : : : 52A-11 Sun UltraSparc2. Random access : : : : : : : : : : : : : : : : : : : : 53A-12 Sun Ultra-Enterprise. Random access : : : : : : : : : : : : : : : : : : 53A-13 Sun Sparcstation4. Molecular dynamics : : : : : : : : : : : : : : : : : 54A-14 Sun Sparcstation10. Molecular dynamics : : : : : : : : : : : : : : : : 55A-15 Sun Sparcstation20. Molecular dynamics : : : : : : : : : : : : : : : : 55A-16 Sun UltraSparc1. Molecular dynamics : : : : : : : : : : : : : : : : : 56A-17 Sun UltraSparc2. Molecular dynamics : : : : : : : : : : : : : : : : : 56A-18 Sun Ultra-Enterprise. Molecular dynamics : : : : : : : : : : : : : : : 57

6

Page 7: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

List of Tables3.1 Time and space overheads for undo-log , copy-on-write, and shadow-block implementations. : : : : : : : : : : : : : : : : : : : : : : : : : : 243.2 Worst case time and space overheads for undo-log , copy-on-write, andshadow-block implementations. : : : : : : : : : : : : : : : : : : : : : : 25

7

Page 8: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Chapter 1IntroductionMuch work has been done on the development of tools that allow programmers toquickly and e�ortlessly add fault tolerance to their existing code. Two general ap-proaches have received the most attention: fault tolerance through redundancy andfault tolerance through checkpointing. Using hardware redundancy is the more costlyapproach, which can easily double (or triple) the cost of hardware required to com-plete the task. The bene�t of redundant systems is that they are able to providecontinuous availability. Checkpointing systems, on the other hand, do not requireany additional hardware. They guarantee that, upon restart, a failed process can berecovered from the latest checkpoint and continue making progress with little worklost due to a failure. Checkpointing systems do not provide continuous availability,though.There are two classes of checkpointing systems: system-level checkpointing andapplication-level checkpointing [13]. System-level checkpointing comprises hardwareand/or OS support for taking periodic checkpoints of the running process. Most ofthose systems perform a periodic code dump to save the state of the memory andprocessor registers. Checkpoints of those systems, however, are often many timeslarger than the \live" state of the process. Also, the saved checkpoints are not usefulin heterogeneous environments, because a failed process cannot be recovered on abinary-incompatible architecture.With application-level checkpointing, a user process saves periodic checkpoints of8

Page 9: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

itself. Because the application knows more about the data that need to be saved aswell as their types, it can save only \live" data, as well as produce checkpoints thatcan be used to recover the process on any supported architecture. The porch compiler[18, 22] has been developed to automatically transform C programs into semanticallyequivalent C programs that are capable of saving and recovering from portable check-points. This technology can be exploited to provide fault tolerance in heterogeneoussystems and may also be used for debugging or retrospective diagnostics [5]. To thelatter end, a portable checkpoint can be used to restart or inspect a program on abinary-incompatible system architecture.The problem of most checkpointing systems is that they only allow for checkpoint-ing the internal state of a process. It is not su�cient, however, to recover the stack,heap, and data segments upon restarting the process. Most applications make use ofstable storage, and the data in stable storage also need to be recovered. For example,suppose that a process reads a value from a �le, increments it, and stores it back.Further, suppose that the process had saved a checkpoint before reading that value,and then failed some time after storing the incremented value. What will happenwhen the process is restarted, if only the internal state of the process was saved dur-ing the checkpoint? Upon recovery, the process will perform the same operations itexecuted after it saved the checkpoint|reading the value from the �le, incrementingit, and storing the result back in the �le. Clearly, this behavior is incorrect since thevalue is now incremented twice, and the state of the �le is inconsistent with the stateof the application. This example demonstrates the need for checkpointing systemsthat not only recover the internal state of a process but also revert the stable storageto a consistent state.This thesis presents an extension to the porch compiler and its runtime system(RTS for short) to support portable fault-tolerant �le I/O. My studies are basedon my prototype implementation of the ftIO system. The ftIO system enhances thefunctionality of ANSI C �le operations [10, 20] without changing its application pro-grammer interface and without depending on a system-speci�c implementation of �leoperations. As the result, application programs using formatted �le I/O can be made9

Page 10: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

fault-tolerant by precompiling them with porch. Object code can be generated there-after with a conventional C compiler. System-speci�c C library implementations canbe used without modi�cations, because porch generates code based on the standardANSI C function prototypes.The problem we are tackling can be decomposed into three subproblems:1. Providing transactional �le operations for fault tolerance.2. Checkpointing the state maintained by transactional �les operations in a portablemanner, allowing for recovery on a binary-incompatible machine.3. Maintaining the data stored in a �le in a machine-independent format.When all three functionalities are present, we can save a checkpoint of a process,including the �les it accesses, and recover the computation on a binary incompati-ble machine. The current implementation of the ftIO system provides the �rst twofunctionalities. Maintaining �le data in a machine-independent fashion can be imple-mented by extending the porch compiler. The support for architecture-independentbinary �le I/O requires future research and is discussed in Section 7.We provide transactional �le operations to support atomic transactions. We con-sider an atomic transaction to be the execution of a program between two consecutivecheckpoints. The program either commits its state during checkpointing or aborts atsome point during the execution, in which case it can be recovered from the lastcheckpoint. Problems arise if a program writes to a �le and aborts before the writehas been committed. The reason is that the �les are generally kept in nonvolatilestorage. A value written before a crash is likely to remain in the �le and will thereforeerroneously be visible after recovery.To avoid such incorrect behavior, I designed and integrated a new protocol with theporch compiler. This protocol is based on the copy-on-write approach [19], in whichthe entire �le is copied upon the �rst write operation. Subsequent �le operations areperformed on the �le replica. During checkpointing, the modi�cations are committedby simply replacing the original �le with its replica. The protocol is based on a singleglobal bit of information to ensure fault tolerance.10

Page 11: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

The ftIO system maintains bookkeeping information about the copy and the orig-inal �le to preserve consistency with the remaining state of the application. Thisbookkeeping information comprises the protocol state of the �le, e.g. clean or dirty,the �le position, the mode in which the �le was opened, an optional user bu�er,the bu�ering policy (e.g. full, line, none), the ungetch state (up to 1 character, asrequired by ANSI C), and \meta information" such as the existence of the �le andits name. It is important to checkpoint this information in a portable manner tofacilitate recovery of an application on a binary incompatible machine. We utilizethe porch compiler to automatically generate code for checkpointing the bookkeepingdata in a machine-independent format by precompiling the ftIO runtime system itselfwith porch. This approach allows me to implement the runtime portion of ftIO as ashallow layer of wrapper routines for the complete set of �le operations de�ned in theANSI C standard.The porch compiler and the ftIO system have been developed for the design offault-tolerant systems based on the notion of portable checkpoints. This technologymay be valuable for a variety of applications. Besides the most obvious use forproviding fault tolerance in a network of binary incompatible machines, the technologycan, for example, be used in the design of embedded systems. We envision the useof porch in embedded systems that use the C language for software development.Embedded systems are often built around commodity processors for which C (cross-)compilers and linkers exist or are relatively easy to port. The ftIO system requiresthat a subset of the ANSI C �le operations is implemented, which is usually the caseanyway.Once a C compiler and I/O library are available, porch can be used in single-processor systems to capture the state of a computation, including the �les it accesses,in a portable checkpoint. This checkpoint could be inspected on a binary incompat-ible machine, for example by using a debugger on a workstation. The capability ofporch to automatically generate code for converting data representations can simplifythe design of multiprocessor systems because machine-independent �le data can beexchanged across binary incompatible processors without explicitly coding for porta-11

Page 12: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

bility. I believe that porch and ftIO could be a viable aid for software developmentand debugging as well as simplifying the software for fault-tolerant heterogeneoussystems.This thesis is organized as follows. I �rst give a brief review of the porch compilertechnology in Chapter 2. In Chapter 3, I discuss design alternatives for the imple-mentation of ftIO. In Chapter 4, I present the ftIO algorithm, which is in principleindependent of the porch technology. Experimental results of the ftIO prototype arepresented in Chapter 5, and more results available in Appendix A. I present anoverview of related work in Chapter 6. Finally, I conclude with a discussion of thepossibilities for future research in Chapter 7.

12

Page 13: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Chapter 2The porch CompilerThe porch compiler [18, 22] is a source-to-source compiler that translates C pro-grams into equivalent C programs capable of saving and recovering from portablecheckpoints. Portable checkpoints capture the state of a computation in a machine-independent format, called Universal Checkpoint Format|UCF. The code for savingand recovering as well as converting the state to and from UCF is generated automat-ically by porch. This chapter presents an overview of the porch compiler and providesa brief summary of the techniques used to provide fault tolerance.The porch compiler technology solves three key technical problems to makingcheckpoints portable.Stack environment portability The stack environment is deeply embedded in asystem, formed by hardware support, operating system and programming lan-guage design. A key design decision to implement porch as a source-to-sourcecompiler has been the necessity to avoid coping with the system-speci�c statesuch as program counter or stack layout. It is not clear whether this low-level system state could be converted across binary incompatible machines. In-stead, porch generates machine-independent source code to save and recoverfrom checkpoints. At the C language level, variables can be accessed by theirnames without worrying about low-level details such as register allocation orstack layout done by the native compiler.13

Page 14: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Data representation conversion Two issues of data representations are of con-cern: bit-level representations and data layout. Basic data types are stored indi�erent formats at the bit level. The most prominant formats are little en-dian and big endian. Furthermore, di�erent system designs require di�erentmemory alignments of basic data types. These determine the layout of complexdata types such as structures. Consequently, all basic data types and the lay-out of complex data types are translated into a machine-independent format.The porch compiler generates code to facilitate the corresponding conversionsautomatically.Pointer portability The portability of pointers is ensured by translating them intomachine-independent o�sets within the portable checkpoint. Since the targetaddress of a pointer is not known in general at compile-time, porch is supportedby its runtime system to perform the pointer translation during checkpointingand recovery.To enable code generation, potential checkpoint locations are identi�ed in a Cprogram by inserting a call to the library function checkpoint(). For these potentialcheckpoint locations, porch generates code to save and recover the computation's statefrom portable checkpoints.The porch compiler generates code for programs that compile and run on sev-eral target architectures without the programmer modifying the source code. Thisstrategy does not prevent system-speci�c coding such as conditionally compiled orhardware-speci�c code fragments, but requires structuring a program appropriately.The means of hiding system-speci�c details from porch are structuring the programinto multiple translation units and functions. The porch compiler employs interpro-cedural analysis to instrument only those functions that are on the call path to apotential checkpoint location. The functions that are not may safely be system spe-ci�c. A reasonable convention would be to group all system-speci�c functions anddata into a translation unit separate from the portable code. Only the portable por-tion of the code would be precompiled with porch. The nonportable portion could14

Page 15: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

be di�erent for any particular system, because it would not a�ect the state of thecomputation during checkpointing.

15

Page 16: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Chapter 3The ftIO System DesignAlternativesIn this chapter we discuss alternative implementations of transactional �le operationsin the context of checkpointed applications. We study the performance of ftIO byestimating the space and runtime overheads of these implementations. As the result ofthis study, we conclude that the most reasonable implementation for the ftIO systemis the copy-on-write implementation of the private-copy approach.The task of the ftIO system is analogous to the task of any transaction system.Indeed, we can think of all �le operations between two checkpoints as belonging toone large transaction. Hence, an implementation of ftIO is in fact an implementationof a transaction system [6], where a transaction consists of the computation executedbetween subsequent checkpoints. Transactions are atomic operations. Either all op-erations within a transaction are executed, or none are. Instead of implementing ageneral transaction system, however, several factors allow us to argue for a simpli-�ed version. The two obvious ones are the absence of nested transactions and theavailability of the internal state of the system during recovery (due to porch [18, 22]).Transaction systems are generally classi�ed as undo-log systems or private-copysystems [23]; the latter also called side-�le [6] systems. Undo-log systems maintaina log of undo records, with an undo record for each operation within a transaction.If a failure occurs, the log is used to undo the modi�cations of an un�nished trans-16

Page 17: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

action. Private-copy systems maintain a private copy of all the data accessed by theoperations of a transaction and update the original data only when the transactioncommits. There are several implementations to optimize the private-copy approach.They di�er in what is copied and when. Usually the trade-o� is between perfor-mance, storage overhead, and complexity of the implementation. The design choiceusually depends on the size of the transactions, the amount of data involved in thetransaction, the performance and space requirements for the program (e.g., real-timeresponse, limited stable storage), and the stable storage technology.In the design of the ftIO system there were three important goals:1. The support of heterogeneous computing environments.2. The ability to generate fault-tolerant code automatically.3. The e�ciency (speed) of programs using the ftIO system.I ensure platform independence of ftIO by implementing it as a set of shallowwrappers around the I/O functions from the standard C library. Since the ftIOruntime code is written entirely in the ANSI C language without any extensions (evenwithout UNIX extensions), it can be used without modi�cations on any architecture.Due to a rich set of analysis and transformation capabilities of the porch compiler,I am able to automate the translation of ANSI C programs for use with the ftIOsystem. That translation is completely transparent to the programmer and does notrequire any code to be changed manually.In the following sections we identify the most e�cient implementation of the ftIOsystem. To do so, we discuss the undo-log approach, the private-copy approach,and three possible implementations of the latter one, namely the copy-on-write, theshadow-block , and the twin-di� implementations. We conclude this chapter with acomparison of the performance and space overheads of di�erent implementations.17

Page 18: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

3.1 Undo-Log ApproachThis section describes the undo-log approach, a popular approach for implementingtransaction systems. The assumption behind this approach is that the system failsrarely, and most transactions commit normally. Therefore, undo-log systems make allchanges directly to the �les that the changes a�ect and keep a separate log to undothe changes. If a transaction commits normally, the undo-log is simply discarded. Inthis section we discuss the performance and implementation of the undo-log approachin the context of checkpointed applications and conclude that this approach is notwell-suited for the ftIO system.The undo-log approach works well for small transactions. Large transactions,however, impose a serious time and space penalty. Since the undo-log approachrequires logging of undo operations for every write operation, the cost of maintainingthe undo-log is proportional to the number of write operations between subsequentcheckpoints. For n writes between checkpoints, the runtime overhead is �(n) forcommitting the associated undo records. The overhead includes reading the originalvalue at the write location and writing the undo record to the undo-log. The spaceoverhead due to n writes is also of order �(n). Therefore, the runtime and spaceoverheads are not bound by the size of the �les. Instead, they depend on the timeinterval between checkpoints.The other problem with the undo-log approach is that one must ensure that allundo records are committed ( ushed) to stable storage before the corresponding �leoperations are committed to the disk �les. If this is not the case, a system failure aftera �le operation is committed to the disk, but before a corresponding undo operation iswritten, can make recovery impossible. Bu�ering �le opeartions in memory is a verye�ective �le I/O optimization implemented by most ANSI C libraries. However, toimplement the undo-log approach in ANSI C, one would have to sacri�ce the bene�tsof bu�ering �le operations by ushing every undo record to disk before performing theactual write operation. Undo records cannot be bu�ered in memory, since there is noguarantee that they will be ushed to disk before the corresponding write operations.18

Page 19: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

It is conceivable that one could reimplement ANSI C �le bu�ering in ftIO to ensurethat bu�ers with undo records are ushed before the corresponding bu�ers with writes.However, very often memory bu�ering of �le I/O is also implemented on the level ofthe operating system. Ensuring that an operating system ushes bu�ers with undorecords before it ushes the bu�ers with writes would require functionality beyondthat provided by ANSI C.The last potential problem with the undo-log approach is that support for renam-ing and removing �les is not straightforward. Because all �le operations are performedon the original �les, remove and rename operations would remove and rename the ac-tual �les. Hence, undoing rename and remove operations is not straightforward withthe undo-log approach.3.2 Private-Copy ApproachThis section discusses the private-copy approach, an alternative to the undo-log ap-proach. Private-copy systems, unlike undo-log systems, do not modify �les until atransaction is committed. Instead, they keep a separate private copy of the data andperform all operations on that copy. During the commit phase of the transaction, theoriginal is reconciled with the copy. In this section, we discuss implementations andsystem trade-o�s that a�ect the performance of the private-copy approach in prac-tice. Namely, we will discuss three implementations of the private-copy approach:copy-on-write, shadow-block , and twin-di� . We will �nd that copy-on-write is thepreferred implementation, since it attains the best worst-case performance and spaceoverhead.The space overhead of a private copy is bound by the size of the original �leN plus the potential bookkeeping overhead. The runtime overhead of the private-copy approach is bound by O(N). The constant factor hidden by O depends on theindividual implementation. Note that both space and runtime overheads depend onthe �le size N in the private-copy approach.Since the amount of data that must be copied is at most the total size of all �les19

Page 20: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

a�ected by the changes, private-copy systems are likely to perform better than undo-log systems in the context of relatively large transactions and relatively small �les.Additionally, since the private copy is an ordinary �le, ANSI C �le bu�ering can befully utilized during the execution of the program.Another advantage of private-copy systems is that it is easier to maintain thetransactional properties of operations such as renaming, removing, and creating new�les. Indeed, since private-copy systems do not modify the original �les, undoingrenames, removals, and �le creations involves no more than discarding private copiesof the a�ected �les.Copy-on-write ImplementationCopy-on-write was �rst developed for virtual memory management in the Mach op-erating system [19]. It can be applied in the context of �le I/O, where it may beviewed as an implementation technique of the private-copy approach. In fact, ourftIO system is based on a copy-on-write implementation.Copy-on-write is an implementation of the private-copy approach where a privatecopy of a �le is made upon the �rst write operation after the last checkpoint or afterthe program begins execution. All subsequent operations are performed on the replicauntil the next checkpoint is saved, when the original �le is replaced with the replica.Copy-on-write is an optimization of the private-copy in the sense that a copy is onlymade when the �rst write occurs, as opposed to copying at the very �rst access, whichmay be an open or read.The copy-on-write implementation su�ers runtime overhead only during the �rstwrite operation after a checkpoint. All other �le operations incur virtually no over-head, with the exception of ftIO-related bookkeeping. During the commit or recoveryphase, the original �le is replaced by the replica via a single rename operation. Anatomic rename operation is used to this end, as provided by POSIX-compliant oper-ating systems [16].Copy-on-write incurs no overhead if a �le is accessed by read operations only. Ifthere are write accesses, copy-on-write performs best for environments with relatively20

Page 21: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

infrequent checkpoints in the presence of a large number of write operations whosemodifying accesses are scattered across the entire address range of a �le. In thiscase, write accesses are expected to amortize the copy overhead, because infrequentblock-copying of �les utilizes disk technology and memory bandwidth very e�ciently.During the commit phase, the original is simply replaced with the replica via asingle rename operation. Also, the �les that were marked to be removed during thetransaction are actually removed during commit.Shadow-block ImplementationReplicating �les may be considered prohibitive if either runtime is critical, such as inreal time applications, or if the capacity of non-volatile storage is limited, as is oftenthe case in embedded systems design. Moreover, one can argue that if a programexhibits spatial locality in �le I/O, only parts of the �le would be modi�ed during atransaction. Hence, a straightforward algorithm might create replicas of the modi�edblocks only, instead of copying the entire �le. This block-based algorithm is knownas the shadow-block implementation [23], or shadow-page algorithm in [6]. It is anoptimization of the copy-on-write implementation that is suited for use in operatingsystems. Files on the disk are typically fragmented into disk blocks, and the �le systemmaintains bookkeeping information about the fragmentation. In such an environment,rather than copying an entire �le upon the �rst write, only the block accessed by awrite operation is copied. Each block of a �le can be copied, committed, and recoveredindependently.The shadow-block implementation is not well suited for use in ftIO, however,because ftIO is implemented at user level for portability reasons. At user level, thefragmentation of �les is invisible. Besides adding another level of fragmentation atuser level, a shadow-block implementation in ftIO would cause additional overheadduring the commit phase, as we discuss now.Fragmenting a �le at the user level into several independent blocks prevents usfrom using a single rename operation to commit the changes. Instead, replicas of thechanged blocks must be copied back into the original �le (master �le). Hence, during21

Page 22: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

the course of a single transaction, each updated block is copied twice|�rst when thereplica of the block is made, and second when the changes to the block are committed.In the worst case, when all blocks of a �le must be copied during a single transaction,the performance penalty of the shadow-block implementation doubles the penalty ofthe copy-on-write implementation and su�ers some overhead for the bookkeeping andadded complexity.Like the undo-log approach, the shadow-block implementation works best for smalltransactions, involving programs that exhibit strong spatial locality and use large�les, because only few blocks are likely to be updated during a single transaction.In a larger transaction, however, more blocks are likely to be modi�ed, and theperformance is more likely to resemble the worst-case performance.Twin-Di� ImplementationAnother alternative for implementing ftIO is the twin-di� technique, popular forbuilding distributed shared memory systems [26]. There, memory consistency mustbe provided at the smaller unit of basic data types rather than at the block level or,in case of distributed shared memory systems, at the level of virtual-memory pages.A typical twin-di� implementation maintains replicas of pages (twins) in memory.During the commit phase, the number of copies is minimized by reconciling only thedi� erences|the modi�ed data|between the original page and the replica.The twin-di� implementation seems appealing, because it allows for copying onlyblocks modi�ed and minimizes the number of copies during the commit phase, whichhas been identi�ed as the crucial disadvantage of the shadow-block implementation.Computing the di�s, however, requires reading both the original page and the replica.Then, relatively small, generally noncontiguous data items (the di�s) are written backinto the �le.Due to the performance characteristics of the modern I/O systems, the twin-di�implementation is not well-suited for ftIO. First of all, during the commit operation,both the original �le and the replica need to be read to compute the di�. Due todelayed disk writes, however, reads on modern architectures are more expensive than22

Page 23: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

writes. Additionally, due to the block-oriented nature of disk I/O, there is usuallyno performance gain in writing only a small portion of a block instead of the wholeblock. Finally, since the commit phase of the twin-di� approach is not atomic, andthe di�s are kept in memory, ensuring a correct recovery from crashes during commitoperation is hard.3.3 Comparison of ImplementationsThis section compares space and runtime overheads of the undo-log , copy-on-write,and shadow-block implementations discussed above. We are primarily interested inthe worst-case analysis, since it provides upper bounds on the overheads. We will alsoshow a comparative analysis of the copy-on-write and shadow-block implementations.Finally, we will show that for the modern disk I/O systems that are optimized forblock transfer of data, the copy-on-write implementation is almost always preferable.Below are the de�nitions of the variables used in the following discussion:n The number of write operations performed on a �leN The number of bytes in the �leb The number of blocks in the �lef The ratio of the modi�ed blocks to the number of blocks in the �lets Startup time required for reading or writing data (moving disk head)d Incremental time (beyond startup time) required to read or write one byteD Incremental time required to read or write the entire �le (D = Nd)S Space overheadT Time overheadFor simplicity, we assume that read and write times (ts; d;D) are the same. Vari-ables b and f apply only to the shadow-block implementation of copy-on-write ap-proach. Since f is a fraction of the modi�ed blocks, we have 1=b � f � 1, given thatat least one write operation has been performed on the �le, and therefore at least oneblock is modi�ed. 23

Page 24: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

The overheads for n write operations for the di�erent implementations are pre-sented in Table 3.1. For the undo-log implementation, the space overhead is propor-tional to the number of write operations, since each write generates a new undo record.The time overhead is n(ts + �(d)) since each write involves writing a separate undorecord. The space overhead of copy-on-write is N for the copy of a �le, and the timeoverhead is 2(ts +D) to copy the whole �le upon the �rst write. The space overheadof the shadow-block implementation is proportional to the number of blocks modi�ed,and the time overhead is the overhead for copying each modi�ed block twice, once togenerate a replica and once to commit the relpica.Implementation Space overhead S Time overhead Tungo-log �(n) n(ts +�(d))copy-on-write N 2(ts +D)shadow-block fN 4f(b � ts +D)Table 3.1: Time and space overheads for undo-log , copy-on-write, and shadow-blockimplementations.Table 3.2 presents the worst-case space and time overheads for the three imple-mentations. The �(n) space overhead of the undo-log implementation is not boundby the �le size N but by the number of �le operations between checkpoints. In par-ticular, n � N represents a possible scenario. In contrast, the space overhead ofcopy-on-write is always S = N . Also, the runtime overhead T = n(ts + �(d)) forcommitting n undo records could be substantially larger than T = 2(ts +D) for thecopy-on-write implementation. The worst case senario, for which f = 1 when allblocks are modi�ed, yields a higher runtime overhead for the shadow-block imple-mentation than for the copy-on-write implementation: 4(b � ts+D) > 2(ts+D), whileboth implementations imply S = N .Besides the worst-case analysis, it is instructive to compare the overheads of thecopy-on-write and shadow-block implementations, as there exists a ratio f � of modi�edblocks, where both implementations incur the same runtime overhead. Comparing f �to f can help choose the best implementation for a particular problem. Speci�cally,24

Page 25: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Implementation Space overhead S Time overhead Tungo-log �(n) n(ts +�(d))copy-on-write N 2(ts +D)shadow-block when f = 1 N 4(b � ts +D)Table 3.2: Worst case time and space overheads for undo-log , copy-on-write, andshadow-block implementations.for f > f �, the copy-on-write implementation is faster, whereas for f < f �, theshadow-block implementation is faster. We �nd thatT = 4f �(bts +D) = 2(ts +D) for f � = 12 ts +Dbts +D:Depending on the disk technology used and the �le size, there are two boundary cases:1. If the startup time ts is an insigni�cant part of the time to write one blockof data ts + D=b, then b � ts � D, and both implementations incur the sameruntime overhead for f � = 1=2. In this case, the copy-on-write implementationwould be preferred only if at least half of the blocks are modi�ed during thetime between two checkpoints.2. If the startup time is large, or the �le is small, then ts � D and thus f � = 1=(2b).Since f � 1=b by de�nition, then f � < f for all problems. Therefore, copy-on-write always incures less runtime overhead.An important observation is that the larger the time between checkpoints is,the larger f becomes, and the larger the overhead the shadow-block implementationincurs. It is not clear, however, what the optimal number of blocks b for a particularproblem is. On one hand, the runtime overhead T = 2f(b � ts + D) for the shadow-blok implementation grows as b increases, but the fraction of modi�ed blocks f maydecrease with larger b, thereby reducing T . According to the experimental data (seeSection 5), however, the di�erence in performance between a user-level shadow-blockand copy-on-write implementations is likely to be insigni�cant. As a result of this25

Page 26: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

discussion, we used the copy-on-write implementation in the ftIO system, which isalso substantially simpler than the other alternatives.

26

Page 27: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Chapter 4The ftIO AlgorithmThis chapter describes the algorithm underlying the ftIO system. In Section 4.1, weintroduce the ftIO �nite automaton and describe its operation. We also discuss theftIO commit operation and the ftIO recovery phase. In Section 4.2, we conclude bypresenting an e�ective optimization to the ftIO's copy-on-write implementation forthe case when �les are opened in the append mode.The ftIO algorithm is a private-copy/copy-on-write design for checkpointing sys-tems. Checkpointing systems are characterized by alternating phases of normal ex-ecution and checkpointing unless a failure occurs. The ftIO algorithm is based on a�nite automaton built upon a set of ftIO states for each �le accessed by the appli-cation. State transitions occur if certain �le operations are executed. All ftIO �leoperations are implemented as wrappers around the standard ANSI C �le operations.These wrapper functions maintain for each �le a data structure that contains its state.The state maintained for each �le includes the name of the �le and its replica, theftIO state (see below), the mode in which the �le is opened, the bu�ering policy forthe �le, the location and size of an optional user's bu�er, a push-back character, eofand error ags, and the current position in the �le.

27

Page 28: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

ftIO StatesEach �le is associated with three orthogonal ftIO states. A �le may be dirty or clean,it may be open or closed , and it may be live or dead .clean/dirty A �le is clean if it has not been modi�ed since the last checkpoint (or sincethe beginning of the execution, before the �rst checkpoint); otherwise, it is dirty .By this de�nition, �les that do not exist (and have not been removed since thelast checkpoint) are clean. In a private-copy implementation, no replicas existfor clean �les.open/closed A �le can be either open or closed . By default, all �les are consideredclosed, unless they are explicitly opened. In particular, �les that do not existare closed by this de�nition.live/dead A �le can be scheduled for removal, in which case it is dead. Otherwise, itis alive.Besides the �le-speci�c states, a global boolean ftIO state is maintained.FIN (for �nished) is a bit used to record the success of the ftIO commit operation.It will be explained in detail in Section 4.1.ftIO File OperationsWe distinguish ANSI C �le operations from ftIO �le operations. The latter de�nethe input alphabet of the ftIO �nite automaton as listed below:create creates and opens a �le. If the �le exists already, it is erased before beingcreated again. This ftIO �le operation corresponds to the ANSI C �le operationfopen with opening mode "w", or "a" if the �le does not already exist.createTmp creates and opens a temporary �le, which is to be removed automaticallywhen closed. By de�nition, this �le has a unique name. The correspondingANSI C operation is tmpfile. 28

Page 29: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

open opens an existing �le. The corresponding ANSI C �le operation is fopen withopening mode "r", or "a" if the �le exists.write modi�es the contents of an open �le, including appending data to that �le. Thecorresponding ANSI C �le operations are fprintf, vfprintf, fputc, fputs,putc, and fwrite.close closes an open �le. This operation corresponds to the ANSI C �le operationfclose.remove removes a closed �le. This operation corresponds to the ANSI C �le operationremove.renameSrc (to B) moves an existing and closed �le to location B. This operation cor-responds to a part of ANSI C operation rename and is applied to the �le that isthe source of rename operation. The de�nition of the input alphabet for ftIO's�nite automaton requires splitting the ANSI C �le operation rename into twoparts, renameSrc and renameDst, because a rename a�ects the source and thedestination �les di�erently.renameDst (from A) replaces a closed (possibly nonexisting) �le with a �le from loca-tion A. If this �le exists, it is overwritten with A. This operation correspondsto that part of the ANSI C operation rename, which a�etcs the destination �leof the rename operation.A note concerning portability: ANSI C does not de�ne the behavior of renameif the destination �le exists. According to the UNIX speci�cation, however,the rename operation atomically replaces the destination �le with the source�le [16]. The ftIO system implements the UNIX-like behavior for rename andrequires the UNIX-like behavior from the native C library.Note that ANSI C operations that read data, check for an error or the end of �le,change the current position in a �le, and change the bu�ering mode do not have anequivalent ftIO operation. Those ANSI C operations do not change the state of a �leand therefore do not cause a state transition of the ftIO �nite automaton.29

Page 30: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

4.1 The ftIO Finite AutomatonIn this section we explore the ftIO �nite automaton. We start by describing thethe operation of the ftIO system for live �les. Then, we present the complete �niteautomaton by broadening the context to include the dead �les and the support fortemporary �les as well as �le operations remove and rename. During the discussion ofthe ftIO commit and recovery phases, we introduce the global FIN bit and investigatethe idempotency of the commit operation. We conclude with an informal argumentabout the correctness of the ftIO algorithm.Figure 4-1 shows the transition diagram of the ftIO �nite automaton based ona subset of the ftIO states and �le operations that a�ect live �les only. This subsetdoes not contain �le operations remove, renameSrc, renameDst, and createTmp. Thestart state of the automaton is clean & closed; the accepting states after a commitoperation are shown in black.commit

&commit

&

& writewrite

commit

commit

closed

createopen close open close

open

clean

open

dirty

dirty

closed

clean&

Figure 4-1: The \live" subset of of transition diagram for the ftIO �nite automaton.Opening and closing �les does not a�ect the initial clean/dirty state. Open �lescan be written, causing the �le to become dirty. In our copy-on-write implementationof ftIO, a clean �le is copied upon a write, and the write, as well as all future writes,are performed on the replica. Upon �le creation the dirty state is set.If a �le is clean, no replica of the �le exists, and the �le has not been modi�ed sincethe last checkpoint. Therefore, all operations are performed on the actual �le until30

Page 31: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

a write operation is performed on that �le. Upon the �rst write operation, a privatecopy of the �le is generated, and the �le becomes dirty. Henceforth, all operationsare performed on the replica of the �le until the next checkpoint is saved, when thechanges to the �le are committed. During the commit operation, the original �le isreplaced by its replica, and the �le becomes clean again.Temporary Files, Removal, and RenamingANSI C supports renaming and removing �les. Additionally, it de�nes temporary�les, which are removed automatically when closed. The live/dead state is introducedin ftIO to support these features. The dead state indicates that a �le, including itsreplica, must be removed during the commit phase, provided the �le is in a closedstate. If the �le is open during the commit phase, then the �le is treated as if it werealive.Figure 4-2 shows the complete transition diagram of ftIO's �nite automaton. Itincorporates three additional states besides the four \live" states to support the ftIOoperations remove, renameSrc, renameDst, and createTmp.The states \clean & open & dead" and \dirty & open & dead" are used for temporary�les (created with createTmp) while they are open. All �les that are removed, renamed,or were created as temporary and are closed are in the \closed & dead" state. Filesin state \closed & dead" can be clean or dirty.When a temporary �le is created, that �le is created as dead. Consequently, whenthe temporary �le is closed, it will be removed during the commit operation. While itis open, the live/dead state is ignored, and the temporary �le is treated as an ordinary�le.The live/dead state enables checkpointing removed and renamed �les. If a removeoperation is applied to a �le (it must be closed to be removed), the �le's state becomesdead. When a �le is marked dead, the fopen and freopen �le operations treat that�le as if it does not exist, because the �le is e�ectively removed. The �le will beactually removed, however, only during the commit phase.If a rename operation is invoked (both source and destination �les must be closed31

Page 32: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

clean

cleanclosed

live&

&

open close close

write

writewrite

create

write

renameSrc

createTmp

remove,

closed&

renameDst

createopen,

renameSrc

remove,renameDstrenameDst

create

close

close

commit

commit

commit

commit

commitcommit

&

&open

dead

dead

clean

live&

&open

&

&

open&

&live

live

closed

dirty

dirty

open&

&dead

dirty

commit

Figure 4-2: Complete transition diagram of the ftIO �nite automaton.for that), the source �le's state becomes dead. We also need to ensure that thereplica of the destination �le contains the data from the source �le. Therefore, weeither rename the replica of the source �le to be the replica of the destination �le, ifthe source �le is dirty, or copy the source �le to become the replica of the destination�le, if the source �le was clean. Also, the destination �le is marked dirty because ithas been modi�ed, and the destination �le is marked live, if it is dead before.CheckpointingWe separate the checkpointing process into two phases. First, the porch RTS saves thevolatile state, including register and memory values. Second, the ftIO system commitsthe changes made to the �les. The two phases are accompanied by a third action,the maintenance of the FIN bit. Figure 4-3 shows the sequence of the checkpointingphases. 32

Page 33: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

FIN=false timeFIN=trueFIN=true

checkpointingnormal execution normal executionftIOporch

commitFigure 4-3: Execution phases.Upon checkpointing, the volatile state of a process is saved by the porch RTS inUCF format in a temporary checkpoint �le. This checkpoint �le includes the FINbit, explained in detail below. When the volatile state is gathered in the temporarycheckpoint �le, it replaces the previous checkpoint �le by means of the atomic renameoperation of the ANSI Standard C library [16], thereby committing the checkpoint.After the porch RTS has saved the checkpoint, the ftIO system performs commitoperations for all �les. The commit operation a�ects �les in the following way:For each dirty �le, the original �le is replaced with its replica, and the �lestate transitions to clean. If a �le is dead and closed, however, both theactual �le and the replica (if one exists) are removed irrespective of theclean/dirty state.The FIN bit ensures the atomicity of the ftIO commit operation. It occupies a bitin the checkpoint. The porch RTS initializes a new temporary checkpoint �le withthe FIN bit value set to false. At the end of the ftIO commit phase, the FIN bit isset to true in the checkpoint �le. This happens in the actual checkpoint �le, whichwas created during the porch checkpointing phase and contains the internal state ofthe program. In Figure 4-3 the FIN bit represents a global state, which is de�ned asthe value of the FIN bit in the last checkpoint �le. Therefore, the FIN bit becomesfalse in Figure 4-3 only after the porch checkpointing phase, when the new checkpoint�le with the FIN bit set to false is committed by replacing the previous checkpoint.Also note that writing the FIN bit does not require an atomic disk-write operation.Instead, we only must ensure that the FIN bit has been written to disk before the�rst write operation after checkpointing. 33

Page 34: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Failure Cases and RecoveryFigure 4-3 above shows the two checkpointing phases embedded in periods of normalexecution. Failures may strike during any of these phases. There are, however, onlytwo distinguished failure cases, which simpli�es reasoning about the correctness ofthe ftIO algorithm.1. If the application aborts due to a failure during the normal execution or duringthe porch checkpointing phase, recovery is based on the previous checkpoint.Since all modifying �le operations have been performed on private copies, theoriginal �les are untouched. Because none of the �le modi�cations have beencommitted yet, recovery from the previous checkpoint involves only discardingthe replicas.2. If the application aborts during the ftIO-commit phase, only a subset of the�les may have been committed before the failure. In this case, the previouscheckpoint has been replaced with the new checkpoint already, and recovery willrestore the computation from the new checkpoint. Files are recovered simplyby executing the ftIO commit operations during the ftIO recovery phase beforereturning control to the application. Since committing �les involves no morethan replacing the original �les with their replicas, those �les that have not beencommitted due to a failure can be committed before the application continues.Files that have been committed already remain untouched. The ftIO commitcode does not even have to be changed because rename and remove operationsare idempotent. These operations are idempotent because only the �rst call torename or remove will succeed, and the following calls fail quietly. When therecovery phase is �nished, the FIN bit is set to true in the checkpoint �le, justas it would be set after the commit phase.Recovery, analogously to the checkpointing process, is split into two phases:1. During the porch recovery, the volatile state of a computation is restored, andthe FIN bit is read from the checkpoint �le, which determines the mode of the34

Page 35: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

ftIO recovery.2. Files are recovered during the ftIO recovery phase.

FIN=trueFIN=false time

FIN=true time

normal execution

normal execution

(noop)

(commit)recoveryporch

recoveryporch ftIO recovery

ftIO recoveryFigure 4-4: Recovery phases if failure occurred during normal execution or duringporch checkpointing (top), and if failure occurred during ftIO commit (bottom).The FIN bit serves to distinguish the two modes of recovery. Figure 4-4 shows thevalues of the FIN bit for both modes; cf. Figure 4-3. In the following, we argue infor-mally why the ftIO algorithm works due to the FIN bit. Recall from the descriptionabove that the FIN bit is initialized to false within a new temporary checkpoint �le.It is only changed to true when the ftIO commit phase is �nished.The value of FIN in the checkpoint is true upon recovery, if the checkpointingprocess succeeded. In this case the ftIO recovery phase is a noop, as shown at thetop of Figure 4-4. Only if a failure occurs during the ftIO commit phase, does theFIN bit in the checkpoint remain false. In this case, �les are committed during theftIO recovery phase by executing the commit operation (see page 33). Since renameand remove are idempotent, the subset of �les that have been committed before thecrash will not be a�ected during recovery.The only failure case remaining to be discussed is a failure during recovery, cf.Figure 4-4. No special treatment is required for this case. Since during recovery eitherthe idempotent ftIO commit operation is executed or a noop, the recovery phase cansafely be executed again to recover from a failure during recovery. No distinction isnecessary as to whether the failure occurs during the porch recovery phase or the ftIO35

Page 36: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

recovery phase.To ensure that all �les are committed correctly after the program has ended,porch saves the very last checkpoint after the main routine in the application codereturns. That last checkpoint does not contain any state from the application, butonly the state of ftIO. Hence, if the process fails during the ftIO commit operation,the program commits all �les after recovery and terminates.4.2 Append OptimizationAccording to [17], many long-running scienti�c applications store data by appendingthem to an output �le. The analysis in Section 3.3 shows that the append scenariopresents the best case for the shadow-block implementation, since only a few blocksat the end of the �le are modi�ed during the time between checkpoints. It is straight-forward, however, to optimize the copy-on-write implementation for this case. Iimplemented the append optimization for ftIO that avoids copying a �le upon the�rst write and performs therefore at least as well as the shadow-block implementationin the append scenario. The experimental results, presented in Section 5, con�rm theclaim for a signi�cant performance improvement due to the append optimization.The append optimization is based on the idea that instead of generating a replica ofthe original �le upon the �rst write, a temporary �le is created, and all writes beforethe next checkpoint are appended to this temporary �le. During the ftIO commitoperation, this temporary �le is appended to the original �le. Since the original �leis not modi�ed until the next checkpoint, discarding the temporary �le is su�cientto recover from failures that occurred during execution.Unlike renaming or removing replicas, as necessary to commit in the unoptimizedcases, appending the temporary �le to the original is not an idempotent operation.The ftIO recovery described in Section 4.1, however, requires the commit operationto be idempotent. The ftIO system stores the size of the original �le in the checkpointand truncates the original �le to the recorded size before appending the temporary �leto ensure idempotency. Hence, if a failure occurs while appending the temporary �le,36

Page 37: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

recovery involves �rst undoing the aborted append and then reexecuting the commitoperation for the �le.If a �le, opened in append mode, is closed and then reopened in a di�erent mode,a replica is generated by concatenating the original �le with the temporary �le con-taining the appended data.

clean

cleanclosed

live&

& &

&closed

dirty

live

open&

&live

dirty

&

&closed

dirty

live

clean

&

&open

live

write

renameSrcremove

,

closed&

renameDst

renameSrc

remove,renameDstclose

commit

commit

commit

&

&open

dead

dead

open&

&dead

dirty

commit

commit

open close

create

write

createTmp

createopen,

commit

clean

live&

&open open

&

&live

dirtycommit

renameDst

write

commit

writewritecommit

open

(app

end)

clos

e

close

write close

open

(app

end)

commit

open,create

renameDst

create

close

remove,

renameSrc

Figure 4-5: Transition diagram of the ftIO �nite automaton including append opti-mization.Figure 4-5 shows the transition diagram for ftIO'a �nite automaton that includesthe append optimization. The grey states and dashed transitions are added to the�nite automaton presented in Figure 4-2. No other changes are necessary.37

Page 38: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Chapter 5Experimental ResultsWe designed three synthetic benchmarks to exhibit the performance characteristicsof our copy-on-write implementation of ftIO: a read benchmark, a write benchmark,and a random write-access with spatial locality. We also present the performance ofa molecular dynamics code to show that the overhead of checkpointing a scienti�capplication that writes a large amount of simulation results to a �le is reasonablysmall. All runtimes presented in this chapter were gathered on a Sun Sparcstation10with checkpointing to local disk. Appendix A presents data for other machines.Sequential Access: read and writeFigure 5-1 presents accumulated runtimes of the read benchmark and the write bench-mark in the presence of checkpointing compared to the original, not porchi�ed code.The read benchmark reads consecutive bytes from a �le. The write benchmark writes(appends) bytes to a �le. For both benchmarks, the interval between checkpointsis 2 seconds to exaggerate the checkpointing overhead. The runtimes reported areaccumulated over the bytes read or written. Therefore, the runtime correspondingto a particular number of reads or writes includes the number of checkpoints savedduring the period of executing that particular number of �le operations.Figure 5-1 shows that the overhead of the checkpointed version of the read -benchmark is due to two factors. First, checkpointing the program's state introduces38

Page 39: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

0 1 2 3 4 5 6 7 8 9 10 11Number of Reads (�106)051015202530354045505560

Runtime[s].................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

...........................................................................................................................................................................................with checkpointingoriginal code (not porchi�ed) 0 1 2 3 4 5 6 7 8 9 10 11Number of Writes (�106)0510

15202530354045505560

Runtime[s]......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

....................................................................................................................................................................................................................................................................................................................................................................................

...........................................................................................................................................................................

..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

with checkpointing with appendoptimizationoriginal code (not porchi�ed)

Figure 5-1: Performance of reading (left) and appending (right) 10Mbytes of datafrom and to a �le. Checkpoints are taken after about 106 �le operations in the exper-iments marked \with checkpointing" and those marked \with append optimization."small jumps when saving the internal state. Second, the overhead of the �les operationwrappers leads to a slightly larger slope of the runtime curves between checkpoints.Since no write operations are performed in the read benchmark, no cost is incurreddue to replicating the �le.The checkpointed version of the write benchmark, however, exhibits the expectedoverhead due to �le replication upon the �rst write after a checkpoint has been saved.Each step in that curve corresponds to the overhead of �le replication. The monoton-ically increasing step size of the checkpointed version without append optimizationis a result of the growing �le size. Clearly, �le replication dominates the checkpoint-ing overhead. The curve of the checkpointed version with the append optimizationexhibits a constant step size, which corresponds to checkpointing the program stateand copying the appended data from a temporary �le to the original �le. The stepsize is constant, because the same amount of bytes is appended between subsequentcheckpoints, taken at equal intervals.39

Page 40: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Random AccessFigure 5-2 presents runtimes of our third benchmark. This benchmark analyzes theperformance of ftIO with �le operations randomly distributed over the address rangeof a �le, and where �le accesses exhibit spatial locality. Spatial locality is incorporatedby computing the location of a �le access to be normally distributed around thelocation of the previous �le access with a standard deviation of 10:24KBytes. Thisvalue corresponds to 0:1% of the total �le size of 10MBytes. Our benchmark performs106 �le operations, 90% of which are read operations, and 10% are write operations.Ten checkpoints are taken at equal intervals during the execution of the benchmark.

1 2 4 8 16 32 64 128 256Number of Blocks01002003004005006007008009001000Runtime[s].................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................� �� �� �� �� �� �� �� �? ?? ?? ?? ?? ?? ?? ?? ?with checkpointing

original code (not porchi�ed)

Figure 5-2: Runtimes of random �le accesses with spatial locality. The variation ofthe number of blocks simulates the behavior of a shadow-block implementation withdi�erent block sizes.Figure 5-2 includes runtimes of the copy-on-write implementation for di�erentnumbers of blocks. We simulate a shadow-block implementation by splitting the10MByte �le into �les (blocks) of equal size and accessing these blocks with our ran-dom strategy as if they occupied a contiguous address space. This simulation modelsa lower bound for the shadow-block implementation. By physically partitioning thedata set into separate �les we avoid copying of blocks during the commit operation.Instead, each modi�ed �le (block) is committed by renaming.The runtimes in Figure 5-2 indicate that the copy-on-write implementation ismore reasonable than the shadow-block implementation. As the number of blocks40

Page 41: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

increases from 1 to 32, performance improves by 8% only. For larger numbers ofblocks, performance degrades, primarily due to the cost of the rename operations.Another observation involves a comparison of the runtimes in Figure 5-2 withthose in Figure 5-1. Accessing a �le randomly (106 accesses) is almost three ordersof magnitude more expensive than contiguous read or write accesses, even with thespatial locality of the accesses. This allows us to argue that modern disk I/O systemshave a high startup time for disk reads and writes.Finally, the performance of the benchmark without checkpointing in Figure 5-2improves as the number of �les (blocks) increases. We attribute this behavior to the�le caching mechanisms in the Solaris operating system, which seems to be optimizedfor smaller �les.Molecular Dynamics CodeFigure 5-3 presents the cumulative runtime for a 2D short-range molecular dynam-ics code, written by Greg Johnson, Rich Brower, and Volker Strumpen. The codeoutputs the simulation results, including potential, kinetic, and total energy of theparticle system (a total of 53 bytes) every 10 iterations to a �le. The period betweencheckpoints is set to 20 minutes. The program generates approximately 400 bytes ofdata between checkpoints, and the total size of the output �le after 60; 000 iterationsis 150 Kbytes.Presented are the accumulated runtimes for the original code and the checkpointedcode with the append optimization. We observe that the overhead of checkpointing,including the ftIO overhead, is less then 6% for any number of iterations.The small steps in the runtime curve \with checkpointing" correspond to thecheckpointing overhead, which is approximately 60 seconds each on Sun Sparcsta-tion10. Hence, even as the �le size increases, the append optimization keeps thecommit cost constant. The runtime overhead of each checkpoint is 75% due tocheckpointing the internal state of the program by the porch RTS [22]. The di�erenceof the slopes of the two curves in the regions between checkpoints corresponds to the41

Page 42: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

0 5 10 15 20 25 30 35 40 45 50 55 60Number of Iterations (�103)050100150200250300

Runtime[min].......................................................................................................................................................................................................................................................................................................................................

....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

...................................................................................................................................................................................................................

.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointing

original code

Figure 5-3: Runtimes of 2D short-range molecular dynamics code for 5; 000 particles.The checkpointing interval is 20 minutes.runtime overhead of the ftIO bookkeeping. The ftIO overhead per �le operation isinsigni�cant, because the di�erence of the slopes is marginal.

42

Page 43: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Chapter 6Related WorkI are not aware of any checkpointing system that implements portable, fault-tolerant�le I/O. According to Gray [6, p. 723], \a large variety of methods exists for movingdata back and forth between main memory and external storage." To my knowledge,the ftIO algorithm is new in that it uses only an atomic rename operation and a singlestate bit to implement transactional �le operations.The libfcp library [24] provides support for transactional �le operations. Thislibrary implements the undo-log approach, using a two-phase commit protocol tocommit a transaction. Besides requiring explicit denotation of a program to mark thestart and end of a transaction, libfcp does not provide portability. In combination withlibckp [24, 25] and libft [8, 24], libfcp can be used to checkpoint the state of persistentstorage in the context of checkpointed applications in homogoneous environments.Paxton [14] presents a system that uses a combination of the shadow-block andundo-log techniques to support transactional �le I/O. He uses the shadow-block tech-nique, implementated at the level of the operating system, to store the data modi�edduring a transaction. An undo-log (intentions log in [14]) is kept to facilitate book-keeping of the shadow-block data structures. This system is not designed for check-pointing and is not portable, since it requires modi�cation of the operating system.Eden is a transaction-based �le system [9], which is not, however, designed forfault tolerance. It supports concurrent transactions by maintaining multiple versionsof a �le. In the presence of only one process, Eden's algorithm is similar to ftIO's43

Page 44: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

copy-on-write implementation since it creates a replica for all modi�ed �les during atransaction. The implementation of Eden is by no means portable, since it is tightlyintegrated with the operating system.A number of checkpointing systems exist for homogeneous environments, such aslibckpt [15], a supplement of the Condor system [11, 12], and the CLIP system for IntelParagon [4]. Not only are these systems tied to a particular machine architecture,they also do not support transactional �le operations. By saving only the name ofthe �le and the current �le position in their checkpoints, these systems limit theapplications to read-only or write-only �le I/O.There are two approaches to migration across binary incompatible machines, theTui system [21] and the work on Dynamic Recon�guration [7]. Neither of these sys-tems supports transactional �le operations. The Dome system [1] provides architecture-independent user-level checkpointing. However, it does not support transactional �leoperations either.The work on I/O benchmarking by Chen and Patterson [2, 3] inspired the develop-ment of our synthetic benchmarks for transactional �le-I/O. Likewise, the survey ofI/O intensive scienti�c applications [17] increased our understanding of the trade-o�sin the ftIO design.

44

Page 45: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Chapter 7Conclusion and Future ResearchI have presented the ftIO system, which extends the porch compiler by providingportable, fault-tolerant �le I/O. Due to the ftIO system, C programs that use format-ted �le I/O can be rendered fault-tolerant in a transparent fashion by precompilationwith porch. Furthermore, no changes are necessary to the system-speci�c C library.I have analyzed performance characteristics of the undo-log approach and theprivate-copy approach, including three implementations of the latter: copy-on-write,shadow-block , and twin-di� . As the result of the analysis I found that a copy-on-writeimplementation of the private-copy approach is the most reasonable choice for theftIO system. I have introduced the ftIO algorithm, which provides transactional �leoperations using only a single state bit. The ftIO runtime system is written entirelyin ANSI C, and the ftIO code itself is compiled with porch, thereby ensuring that acheckpoint contains all ftIO-state information in a machine-independent format.I provided experimental results measuring the performance overhead of the ftIOsystem under sequential and random access workloads and comparing it with the per-formance of the shadow-block implementation of the private-copy approach. The ex-periments show that the copy-on-write implementation is the most reasonable choicefor the ftIO system. Moreover, the performance of the molecular dynamics code indi-cates that the ftIO system incures relatively low runtime overhead for this and similarapplications.Future research can extend the ftIO system in several directions. It can be mod-45

Page 46: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

i�ed to support other �le operations besides the ones from the ANSI C library, toincorporate the support for architecture-independent binary �le I/O, and to providenew functionalities such as fault-tolerant socket I/O.Architecture-independent Binary File I/OTo allow true architecture-independent checkpointing of the program state, datawritte to a �le on one machine should be readable from the �le on a binary-incompatiblemachine after recovery. In the current ftIO prototype the only way to ensure thatdata can be read on a di�erent architecture is to use formatted �le I/O [20], whichimplies that all data must be stored in ASCII format. However, ANSI C supportsbinary �le I/O via the fread and fwrite operations, which allow a program to readand write blocks of binary data from and to the main memory.The problem of architecture-independent binary �le I/O resembles the problem ofstoring and recovering from architecture-independent checkpoints, since in both casesthe data must be converted to and from an architecture-independent intermediateformat. The two problems di�er, however, in that the data type|which is requiredfrom a proper conversion|is unknown when the binary data is read from a �le. Ienvision two ways to approach the problem. First, one can store the data togetherwith its type for binary �le I/O. Second, with the assistance of a compiler, one canattempt to infer the type of the data read. I believe that the second approach ispreferable. It avoid the overhead of storing the type information with the data.Supporting Multiple Streams to a FileANSI C [20, 10] and POSIX [16] speci�cations distinguish the notion of a streamand a �le. A �le is a data object stored on a disk, whereas a stream, according to[10], \is a source or destination of data that may be associated with a disk or otherperipheral." Hence, the ANSI C operation fopen opens a stream to a speci�ed �le.The current implementation of ftIO does not support multiple open streams to asingle �le. Adding this support, however, is not di�cult. The ftIO �nite automaton46

Page 47: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

operates on �les, not on streams. Hence, the open/closed state marks whether thereare any open streams associated with a �le independent of the number of streams.Adding support for multiple streams would require reorganizing the ftIO bookkeepingto separate the information about the state of a stream from the information aboutthe state of a �le.Supporting UNIX File I/OThe algorithm developed for ftIO is not limited to support only standard ANSI C�le operations. With small changes, I believe, ftIO can support any set of existing�le operations. Implementing UNIX �le I/O, for example, requires only changingthe way that bookkeeping information is stored since UNIX �le operations are verysimilar to the ANSI C ones.

47

Page 48: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Appendix AMore Experimental ResultsAppendix A presents experimental performance data of our benchmarks on a varietyof machines. The data show that the overhead of the ftIO system, measured as thepercentage of the runtime of the program, stays almost constant for machines of verydi�erent computational power.The �gures are organized in the order of increasing computational power. Note,however, that some of the less powerful machines, like Sparc4, have apparently supe-rior disk I/O characteristics.Sequential Access: read and writeThe �gures below present the performance of reading (left) and appending (right)10Mbytes of data from and to a �le. Ten checkpoints are taken after about 106 �leoperations (less then every 2 seconds for all machines) in the experiments marked\with checkpointing" and those marked \with append optimization."

48

Page 49: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

0 1 2 3 4 5 6 7 8 9 10 11Number of Reads (�106)051015202530354045505560

Runtime[s]............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointingoriginal code 0 1 2 3 4 5 6 7 8 9 10 11Number of Writes (�106)0510

15202530354045505560

Runtime[s]....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

..............................................................................................................................................................................................................................................................................................................................................................................................

..................................................................................................................................................................

................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

with checkpointing with appendoptimizationoriginal codeFigure A-1: Sun Sparcstation4. Sequential read/write

0 1 2 3 4 5 6 7 8 9 10 11Number of Reads (�106)051015202530354045505560

Runtime[s]...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.........................................................................................................................................................................................with checkpointingoriginal code (not porchi�ed) 0 1 2 3 4 5 6 7 8 9 10 11Number of Writes (�106)0510

15202530354045505560

Runtime[s].......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

...................................................................................................................................................................................................................................................................................................................................................................................

...........................................................................................................................................................................

..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

with checkpointing with appendoptimizationoriginal code (not porchi�ed)Figure A-2: Sun Sparcstation10. Sequential read/write

0 1 2 3 4 5 6 7 8 9 10 11Number of Reads (�106)05101520253035404550Runtime[s]

.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointingoriginal code 0 1 2 3 4 5 6 7 8 9 10 11Number of Writes (�106)0510

1520253035404550Runtime[s]

....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

......................................................................................................................................................................................................................................................................................................................................................

...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

with checkpointingwith appendoptimizationoriginal codeFigure A-3: Sun Sparcstation20. Sequential read/write49

Page 50: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

0 1 2 3 4 5 6 7 8 9 10 11Number of Reads (�106)051015202530

Runtime[s]....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

........................................with checkpointingoriginal code 0 1 2 3 4 5 6 7 8 9 10 11Number of Writes (�106)051015202530

Runtime[s]...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.......................................................................................................................................................................................................................................................................................................................................................................................................

............................................................................

.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

with checkpointingwith appendoptimizationoriginal codeFigure A-4: Sun UltraSparc1. Sequential read/write

0 1 2 3 4 5 6 7 8 9 10 11Number of Reads (�106)05

101520

Runtime[s].....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.......................................................................................................................................................................................................................with checkpointingoriginal code 0 1 2 3 4 5 6 7 8 9 10 11Number of Writes (�106)05101520

Runtime[s]......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

................................................................................................................................................................................................................................................................................................................................................................................................................

............................................................................................

....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointing

with appendoptimizationoriginal codeFigure A-5: Sun UltraSparc2. Sequential read/write

0 1 2 3 4 5 6 7 8 9 10 11Number of Reads (�106)05

101520

Runtime[s]................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.............................................................................with checkpointingoriginal code 0 1 2 3 4 5 6 7 8 9 10 11Number of Writes (�106)05101520

Runtime[s].......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

....................................................................................................................................................................................................................................................................................................................................................................................................................................

.............................................................................................................................

..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

with checkpointingwith appendoptimizationoriginal codeFigure A-6: Sun Ultra-Enterprise. Sequential read/write50

Page 51: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Random AccessThe �gures below present the runtimes of random �le accesses with spatial locality.The variation of the number of blocks simulates the behavior of a shadow-blockimplementation with di�erent block sizes. This benchmark analyzes the performanceof ftIO with �le operations randomly distributed over the address range of a �le,and where �le accesses exhibit spatial locality. Spatial locality is incorporated bycomputing the location of a �le access to be normally distributed around the locationof the previous �le access with a standard deviation of 10:24KBytes. This valuecorresponds to 0:1% of the total �le size of 10MBytes. Our benchmark performs 106�le operations, 90% of which are read operations, and 10% are write operations. Tencheckpoints are taken at equal intervals (less then every 1:5 minutes for all cases)during the execution of the benchmark.

1 2 4 8 16 32 64 128 256Number of Blocks0100200300400500600Runtime[s].......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................� �� �� �� �� �� �� �� �? ?? ?? ?? ?? ?? ?? ?? ?with checkpointingoriginal code (not porchi�ed)

Figure A-7: Sun Sparcstation4. Random access

51

Page 52: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

1 2 4 8 16 32 64 128 256Number of Blocks01002003004005006007008009001000Runtime[s].................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................� �� �� �� �� �� �� �� �? ?? ?? ?? ?? ?? ?? ?? ?with checkpointing

original code (not porchi�ed)

Figure A-8: Sun Sparcstation10. Random access

1 2 4 8 16 32 64 128 256Number of Blocks0100200300400500Runtime[s].................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................� �� �� �� �� �� �� �� �? ?? ?? ?? ?? ?? ?? ?? ?with checkpointing

original code (not porchi�ed)

Figure A-9: Sun Sparcstation20. Random access

1 2 4 8 16 32 64 128 256Number of Blocks050100150200250300Runtime[s] ...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

..................................� �� �� �� �� �� �� �� �? ?? ?? ?? ?? ?? ?? ?? ?with checkpointing

original code (not porchi�ed)Figure A-10: Sun UltraSparc1. Random access52

Page 53: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

1 2 4 8 16 32 64 128 256Number of Blocks050100150200250300Runtime[s] .............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.......................� �� �� �� �� �� �� �� �? ?? ?? ?? ?? ?? ?? ?? ?with checkpointing

original code (not porchi�ed)Figure A-11: Sun UltraSparc2. Random access

1 2 4 8 16 32 64 128 256Number of Blocks050100150200Runtime[s] ...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................� �� �� �� �� �� �� �� �? ?? ?? ?? ?? ?? ?? ?? ?with checkpointing

original code (not porchi�ed)Figure A-12: Sun Ultra-Enterprise. Random access

53

Page 54: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Molecular Dynamic CodeThe �gures below present the cumulative runtimes for a 2D short-range moleculardynamics code for 5; 000 particles. The code outputs the simulation results (53 bytes)every 10 iterations to a �le, and checkpoints are taken every 20 minutes. The programgenerates approximately 400 bytes of data between checkpoints, and the total size ofthe output �le after 60; 000 iterations is 150 Kbytes. From the data below, we can seethat the performance overhead of the ftIO system is relatively low. In fact, it rangesfrom approximately 8% on Sun Sparkstation4 to as low as 2% on Sun UltraSparc1.

0 5 10 15 20 25 30 35 40 45 50 55 60Number of Iterations (�103)0255075

100125150175200Runtime[min]

.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointing

original code

Figure A-13: Sun Sparcstation4. Molecular dynamics

54

Page 55: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

0 5 10 15 20 25 30 35 40 45 50 55 60Number of Iterations (�103)050100150200250300

Runtime[min].......................................................................................................................................................................................................................................................................................................................................

....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

...................................................................................................................................................................................................................

.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointing

original code

Figure A-14: Sun Sparcstation10. Molecular dynamics

0 5 10 15 20 25 30 35 40 45 50 55 60Number of Iterations (�103)0255075

100125150175200Runtime[min]

...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointing

original code

Figure A-15: Sun Sparcstation20. Molecular dynamics55

Page 56: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

0 5 10 15 20 25 30 35 40 45 50 55 60Number of Iterations (�103)01020304050607080

Runtime[min].................................................................................................................................................................................................................................................................................................................

............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.................................................................................................................................................................................................................................................................................

...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointing

original code

Figure A-16: Sun UltraSparc1. Molecular dynamics

0 5 10 15 20 25 30 35 40 45 50 55 60Number of Iterations (�103)010203040506070Runtime[min]

.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointingoriginal code

Figure A-17: Sun UltraSparc2. Molecular dynamics56

Page 57: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

0 5 10 15 20 25 30 35 40 45 50 55 60Number of Iterations (�103)01020304050607080

Runtime[min].......................................................................................................................................................................................................................................................................................................................................................

............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

...........................................................................................................................................................

.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................with checkpointing

original code

Figure A-18: Sun Ultra-Enterprise. Molecular dynamics

57

Page 58: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Bibliography[1] Adam Beguelin, Erik Seligman, and Peter Stephan. Application level fault tol-erance in heterogeneous networks of workstations. Journal of Parallel and Dis-tributed Computing, 43(2):147{155, June 1997.[2] P. M. Chen and D. A. Patterson. A new approach to I/O performanceevaluation|self-scaling I/O benchmarks, predicted I/O performance. ACMTransactions on Computer Systems, 12, 4:309{339, 1994.[3] Peter M. Chen and David A. Patterson. Storage Performance|Metrics andBenchmarks. Proceedings of the IEEE, 81(8):1151{1165, August 1993.[4] Yuqun Chen, James S. Plank, and Kai Li. CLIP: A checkpointing tool formessage-passing parallel programs. Technical Report TR-543-97, Princeton Uni-versity, Computer Science Department, May 1997.[5] Stuart I. Feldman and Channing B. Brown. Igor: A System for Program Debug-ging via Reversible Execution. ACM SIGPLAN Notices, Workshop on Paralleland Distributed Debugging, 24(1):112{123, January 1989.[6] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Tech-niques. Morgan Kaufmann, San Mateo, CA, 1993.[7] Christine Hofmeister. Dynamic Recon�guration. PhD thesis, Computer ScienceDepartment, University of Maryland, College Park, 1993.[8] Y. Huang and C. Kintala. Software implemented fault tolerance: Technologiesand experience. In Jean-Claude Laprie, editor, Digest of Papers|23rd Inter-58

Page 59: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

national Symposium on Fault-Tolerant Computing, pages 2{9, Toulouse, France,June 1993.[9] W. H. Jessop, J. D. Noe, D. M. Jacobson, J. L. Baer, and C. Pu. The EDENtransaction-based �le system. In IEEE Symp. on Reliability in Distributed Soft-ware and Database Systems, IEEE CS 2, Wiederhold(ed), Pittsburgh PA, pages163{169, July 1982.[10] B. W. Kernighan and D. M. Ritchie. The C Programming Language, 2nd edition.Prentice-Hall, 1988.[11] Michael Litzkow and Marvin Solomon. Supporting checkpointing and processmigration outside the UNIX kernel. In USENIX Association, editor, Proceedingsof the Winter 1992 USENIX Conference: January 20 | January 24, 1992,San Francisco, California, pages 283{290, Berkeley, CA, USA, Winter 1992.USENIX.[12] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpointand migration of UNIX processes in the condor distributed processing system.Technical Report CS-TR-97-1346, University of Wisconsin, Madison, April 1997.[13] A. Nangia and D. Finkel. Transaction-based fault-tolerant computing in dis-tributed systems. In Jha Niraj and Donald S. Fussell, editors, Proceedings ofthe IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, pages92{97, Amherst, MA, July 1992. IEEE Computer Society Press.[14] W. H. Paxton. A client-based transaction system to maintain data integrity.In Proceedings of the 7th ACM Symposium on Operating Systems Principles(SOSP), pages 18{23, 1979.[15] James S. Plank, Micah Beck, and Gerry Kingsley. Libckpt: Transparent Check-pointing under Unix. In USENIX Winter 1995 Technical Conference, pages213{233, New Orleans, Louisiana, January 1995.[16] Peter J. Plauger. The Standard C Library. Prentice Hall, Englewood Cli�s, 1992.59

Page 60: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

[17] James T. Poole. Preliminary survey of I/O intensive applications. TechnicalReport CCSF-38, Scalable I/O Initiative, Caltech Concurrent SupercomputingFacilities, Caltech, 1994.[18] Balkrishna Ramkumar and Volker Strumpen. Portable Checkpointing for Hetero-geneous Architectures. In Digest of Papers|27th International Symposium onFault-Tolerant Computing, pages 58{67, Seattle, Washington, June 1997. IEEEComputer Society.[19] Richard Rashid, Avadis Tevanian Jr., Michael Young, David Golub, RobertBaron, David Black, William J. Bolosky, and Jonathan Chew. Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multipro-cessor Architectures. IEEE Transactions on Computers, 37(8):896{908, August1988.[20] Herbert Schildt. The Annotated ANSI C Standard. McGraw-Hill, 1990.[21] Peter W. Smith. The Possibilities and Limitations of Heterogeneous Pro-cess Migration. PhD thesis, Department of Computer Sience, University ofBritish Columbia, October 1997. (http://www.cs.ubc.ca/spider/psmith/tui.html).[22] Volker Strumpen. Compiler Technology for Portable Checkpoints. submitted forpublication (http://theory.lcs. mit.edu/~strumpen/porch.ps.gz), 1998.[23] Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, 1992.[24] Y. M. Wang, P. Y. Chung, Y. Huang, and E. N. Elnozahy. Integrating Check-pointing with Transaction Processing. In Digest of Papers|27th InternationalSymposium on Fault-Tolerant Computing, pages 304{308, Seattle, Washington,June 1997. IEEE Computer Society.[25] Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpoint-ing and its applications. In Digest of Papers|25th International Symposium on60

Page 61: P ortable F ault-T oleran t File I/Osupertech.csail.mit.edu/papers/igor-meng-thesis.pdfAc kno wledgmen ts I w ould lik e to express m y sincere gratitude Dr. V olk er Strump en whose

Fault-Tolerant Computing, pages 22{32, Los Alamitos, June 1995. IEEE Com-puter Society.[26] Matthew J. Zekauskas, Wayne A. Sawdon, and Brian N. Bershad. Software WriteDetection for a Distributed Shared Memory. In 1st Symposium on OperatingSystems Design and Implementation, pages 87{100, Monterey, CA, November1994. USENIX.

61