Understanding and Improving the Efficiency of Failure Resilience for Big Data Frameworks

dc.contributor.advisor: Ng, T. S. Eugene (en_US)
dc.contributor.committeeMember: Cox, Alan L. (en_US)
dc.contributor.committeeMember: Knightly, Edward W. (en_US)
dc.contributor.committeeMember: Gkantsidis, Christos (en_US)
dc.creator: Dinu, Florin (en_US)
dc.date.accessioned: 2014-08-07T20:19:57Z (en_US)
dc.date.available: 2014-08-07T20:19:57Z (en_US)
dc.date.created: 2013-12 (en_US)
dc.date.issued: 2013-10-30 (en_US)
dc.date.submitted: December 2013 (en_US)
dc.date.updated: 2014-08-07T20:19:57Z (en_US)
dc.description.abstract: Big data processing frameworks (MapReduce, Hadoop, Dryad) are hugely popular today because they greatly simplify the management and deployment of big data analysis jobs that require many machines running in parallel. A strong selling point is their built-in failure resilience: big data frameworks can run computations to completion despite occasional failures in the system. However, the efficiency of this failure resilience has been an important but overlooked concern. The vision of this thesis is that big data frameworks should not only be failure resilient but should provide that resilience efficiently, with minimal impact on computations both under failures and during failure-free periods. To this end, the first part of the thesis presents the first in-depth analysis of the efficiency of the failure resilience provided by the popular Hadoop framework under failures. The results show that even single-machine failures can lead to large, variable, and unpredictable job running times. The thesis determines the causes of this inefficient behavior and identifies the responsible Hadoop mechanisms and their limitations. The second part of the thesis focuses on providing efficient failure resilience for computations comprised of multiple jobs. We present the design, implementation, and evaluation of RCMP, a MapReduce system based on the fundamental insight that using data replication to enable failure resilience often leads to significant and unnecessary increases in computation running time. In contrast, RCMP is designed to use job re-computation as a first-order failure resilience strategy. Re-computations under RCMP are efficient: RCMP re-computes the minimum amount of work and, uniquely, ensures that this minimum re-computation work is performed efficiently. In particular, RCMP mitigates hot-spots that affect data transfers during re-computations and ensures that the available compute-node parallelism is well leveraged. (en_US)
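As a rough, hypothetical illustration of the trade-off the abstract describes (the function names, the 0.3 replication overhead, and the per-job running times below are illustrative assumptions, not figures from the thesis), the following Python sketch compares the running time of a multi-job pipeline that pays a replication penalty on every job against one that skips replication and, after a failure, re-computes only the affected job, in the spirit of RCMP's minimum-re-computation goal:

# Hypothetical back-of-the-envelope comparison of two failure-resilience
# strategies for a pipeline of dependent jobs: replicating each job's output
# versus re-computing the affected job after a failure. All numbers and names
# are illustrative assumptions, not measurements from the thesis.

def pipeline_time_with_replication(job_times, replication_overhead=0.3):
    """Every job pays a write-amplification penalty to replicate its output."""
    return sum(t * (1 + replication_overhead) for t in job_times)

def pipeline_time_with_recomputation(job_times, failed_job_index=None):
    """Jobs run at full speed; a failure forces re-running the affected job.
    For simplicity this assumes only the failed job must be re-computed."""
    total = sum(job_times)
    if failed_job_index is not None:
        total += job_times[failed_job_index]
    return total

if __name__ == "__main__":
    jobs = [100, 80, 120, 60]  # per-job running times, arbitrary units
    print("replication, no failure    :", pipeline_time_with_replication(jobs))
    print("re-computation, no failure :", pipeline_time_with_recomputation(jobs))
    print("re-computation, job 2 fails:",
          pipeline_time_with_recomputation(jobs, failed_job_index=2))

Under these assumed numbers, re-computation is cheaper even when a failure occurs, which sketches why the thesis argues for re-computation as a first-order strategy when failures are occasional.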
dc.format.mimetype: application/pdf (en_US)
dc.identifier.citation: Dinu, Florin. "Understanding and Improving the Efficiency of Failure Resilience for Big Data Frameworks." (2013) Diss., Rice University. https://hdl.handle.net/1911/76486. (en_US)
dc.identifier.uri: https://hdl.handle.net/1911/76486 (en_US)
dc.language.iso: eng (en_US)
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. (en_US)
dc.subject: Failure (en_US)
dc.subject: Failure resilience (en_US)
dc.subject: Big data (en_US)
dc.subject: MapReduce (en_US)
dc.subject: Hadoop (en_US)
dc.subject: Systems (en_US)
dc.subject: Networking (en_US)
dc.title: Understanding and Improving the Efficiency of Failure Resilience for Big Data Frameworks (en_US)
dc.type: Thesis (en_US)
dc.type.material: Text (en_US)
thesis.degree.department: Computer Science (en_US)
thesis.degree.discipline: Engineering (en_US)
thesis.degree.grantor: Rice University (en_US)
thesis.degree.level: Doctoral (en_US)
thesis.degree.name: Doctor of Philosophy (en_US)
Files

Original bundle
Name: FlorinDinu-PhDthesis-final.pdf
Size: 2.06 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 939 B
Format: Plain Text