Manetho: Fault Tolerance in Distributed Systems Using Rollback-Recovery and Process Replication

dc.contributor.author: Elnozahy, Elmootazbellah
dc.date.accessioned: 2017-08-02T22:03:20Z
dc.date.available: 2017-08-02T22:03:20Z
dc.date.issued: 1993-10
dc.date.note: October 1993
dc.description: This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/19117
dc.description.abstract: This dissertation presents a new protocol that allows rollback-recovery and process replication to coexist in a distributed system. The protocol relies on a novel data structure called the antecedence graph, which tracks nondeterministic events during failure-free operation and provides the information needed to recreate them if a failure occurs. The rollback-recovery part of the protocol combines the low failure-free overhead of optimistic rollback-recovery with the advantages of pessimistic rollback-recovery, namely fast output commit, limited rollback, and failure containment. The process replication part of the protocol features a new multicast protocol designed specifically to support process replication. Unlike previous work, the new protocol provides high throughput and low latency in message delivery without relying on application semantics. The protocol has been implemented in the Manetho prototype. Experience with a number of long-running, compute-intensive parallel applications confirms the performance advantages of the new protocol. The implementation also features several performance optimizations that are applicable to other rollback-recovery and multicast protocols.
dc.format.extent: 122 pp
dc.identifier.citation: Elnozahy, Elmootazbellah. "Manetho: Fault Tolerance in Distributed Systems Using Rollback-Recovery and Process Replication." (1993) https://hdl.handle.net/1911/96435.
dc.identifier.digital: TR93-212
dc.identifier.uri: https://hdl.handle.net/1911/96435
dc.language.iso: eng
dc.rights: You are granted permission for the noncommercial reproduction, distribution, display, and performance of this technical report in any format, but this permission is only for a period of forty-five (45) days from the most recent time that you verified that this technical report is still available from the Computer Science Department of Rice University under terms that include this permission. All other rights are reserved by the author(s).
dc.title: Manetho: Fault Tolerance in Distributed Systems Using Rollback-Recovery and Process Replication
dc.type: Technical report
dc.type.dcmi: Text
Files
Original bundle
Name: TR93-212.pdf
Size: 5.18 MB
Format: Adobe Portable Document Format