A storage architecture for data-intensive computing

Shafer, Jeffrey

dc.contributor.advisor	Rixner, Scott	en_US
dc.creator	Shafer, Jeffrey	en_US
dc.date.accessioned	2011-07-25T02:05:16Z	en_US
dc.date.available	2011-07-25T02:05:16Z	en_US
dc.date.issued	2010	en_US
dc.description.abstract	The assimilation of computing into our daily lives is enabling the generation of data at unprecedented rates. In 2008, IDC estimated that the "digital universe" contained 486 exabytes of data [9]. The computing industry is being challenged to develop methods for the cost-effective processing of data at these large scales. The MapReduce programming model has emerged as a scalable way to perform data-intensive computations on commodity cluster computers. Hadoop is a popular open-source implementation of MapReduce. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem --- HDFS --- is written in Java and designed for portability across heterogeneous hardware and software platforms. The efficiency of a Hadoop cluster depends heavily on the performance of this underlying storage system. This thesis is the first to analyze the interactions between Hadoop and storage. It describes how the user-level Hadoop filesystem, instead of efficiently capturing the full performance potential of the underlying cluster hardware, actually degrades application performance significantly. Architectural bottlenecks in the Hadoop implementation result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Further, HDFS implicitly makes assumptions about how the underlying native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. Methods to eliminate these bottlenecks in HDFS are proposed and evaluated both in terms of their application performance improvement and impact on the portability of the Hadoop framework. In addition to improving the performance and efficiency of the Hadoop storage system, this thesis also focuses on improving its flexibility. The goal is to allow Hadoop to coexist in cluster computers shared with a variety of other applications through the use of virtualization technology. 
The introduction of virtualization breaks the traditional Hadoop storage architecture, where persistent HDFS data is stored on local disks installed directly in the computation nodes. To overcome this challenge, a new flexible network-based storage architecture is proposed, along with changes to the HDFS framework. Network-based storage enables Hadoop to operate efficiently in a dynamic virtualized environment and furthers the spread of the MapReduce parallel programming model to new applications.	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.callno	THESIS E.E. 2010 SHAFER	en_US
dc.identifier.citation	Shafer, Jeffrey. "A storage architecture for data-intensive computing." (2010) Diss., Rice University. <a href="https://hdl.handle.net/1911/62014">https://hdl.handle.net/1911/62014</a>.	en_US
dc.identifier.uri	https://hdl.handle.net/1911/62014	en_US
dc.language.iso	eng	en_US
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.	en_US
dc.subject	Computer science	en_US
dc.subject	Applied sciences	en_US
dc.title	A storage architecture for data-intensive computing	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Electrical Engineering	en_US
thesis.degree.discipline	Engineering	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Doctoral	en_US
thesis.degree.name	Doctor of Philosophy	en_US

Files

Original bundle

Name: 3421190.PDF
Size: 6.32 MB
Format: Adobe Portable Document Format

Collections

Rice University Theses and Dissertations
ECE Theses and Dissertations