Distributed system fault tolerance using message logging and checkpointing

dc.contributor.advisorZwaenepoel, Willy
dc.creatorJohnson, David Bruce
dc.date.accessioned2009-06-03T23:56:33Z
dc.date.available2009-06-03T23:56:33Z
dc.date.issued1990
dc.description.abstractFault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver, until the receiver attempts to send a new message. Implemented in the V-System, the maximum measured failure-free overhead on distributed application programs was under 16 percent, and average overhead measured 2 percent or less, depending on problem size and communication intensity. Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged form only a maximum of under 4 percent to much less than 1 percent.
dc.format.extent133 p.en_US
dc.format.mimetypeapplication/pdf
dc.identifier.callnoThesis Comp. Sci. 1990 Johnson
dc.identifier.citationJohnson, David Bruce. "Distributed system fault tolerance using message logging and checkpointing." (1990) Diss., Rice University. <a href="https://hdl.handle.net/1911/16354">https://hdl.handle.net/1911/16354</a>.
dc.identifier.digitalJohnsonDen_US
dc.identifier.urihttps://hdl.handle.net/1911/16354
dc.language.isoeng
dc.rightsCopyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subjectComputer science
dc.titleDistributed system fault tolerance using message logging and checkpointing
dc.typeThesis
dc.type.materialText
thesis.degree.departmentComputer Science
thesis.degree.disciplineEngineering
thesis.degree.grantorRice University
thesis.degree.levelDoctoral
thesis.degree.nameDoctor of Philosophy
Files
Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
JohnsonD.pdf
Size:
4.66 MB
Format:
Adobe Portable Document Format
Description: