Distributed system fault tolerance using message logging and checkpointing

Johnson, David Bruce

Distributed system fault tolerance using message logging and checkpointing

dc.contributor.advisor	Zwaenepoel, Willy
dc.creator	Johnson, David Bruce
dc.date.accessioned	2009-06-03T23:56:33Z
dc.date.available	2009-06-03T23:56:33Z
dc.date.issued	1990
dc.description.abstract	Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver, until the receiver attempts to send a new message. Implemented in the V-System, the maximum measured failure-free overhead on distributed application programs was under 16 percent, and average overhead measured 2 percent or less, depending on problem size and communication intensity. Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged form only a maximum of under 4 percent to much less than 1 percent.
dc.format.extent	133 p.	en_US
dc.format.mimetype	application/pdf
dc.identifier.callno	Thesis Comp. Sci. 1990 Johnson
dc.identifier.citation	Johnson, David Bruce. "Distributed system fault tolerance using message logging and checkpointing." (1990) Diss., Rice University. <a href="https://hdl.handle.net/1911/16354">https://hdl.handle.net/1911/16354</a>.
dc.identifier.digital	JohnsonD	en_US
dc.identifier.uri	https://hdl.handle.net/1911/16354
dc.language.iso	eng
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject	Computer science
dc.title	Distributed system fault tolerance using message logging and checkpointing
dc.type	Thesis
dc.type.material	Text
thesis.degree.department	Computer Science
thesis.degree.discipline	Engineering
thesis.degree.grantor	Rice University
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy

Files

Original bundle

Now showing 1 - 1 of 1

Name:: JohnsonD.pdf
Size:: 4.66 MB
Format:: Adobe Portable Document Format
Description:

Download

Collections

Rice University Electronic Theses and Dissertations