Performance Analysis and Optimization of Apache Pig

Liu, Ruoyu

dc.contributor.advisor	Cox, Alan Lee
dc.contributor.committeeMember	Ng, Eugene Tze Sing
dc.contributor.committeeMember	Mellor-Crummey, John
dc.creator	Liu, Ruoyu
dc.date.accessioned	2016-01-25T15:39:20Z
dc.date.available	2016-01-25T15:39:20Z
dc.date.created	2014-12
dc.date.issued	2014-08-08
dc.date.submitted	December 2014
dc.date.updated	2016-01-25T15:39:21Z
dc.description.abstract	Apache Pig is a language, compiler, and run-time library for simplifying the development of data-analytics applications on Apache Hadoop. Specifically, it enables developers to write data-analytics applications in a high-level, SQL-like language called Pig Latin that is automatically translated into a series of MapReduce computations. For most developers, this is both easier and faster than writing applications that use Hadoop directly. This thesis first presents a detailed performance analysis of Apache Pig running a collection of simple Pig Latin programs. In addition, it compares the performance of these programs to equivalent hand-coded Java programs that use the Hadoop MapReduce framework directly. In all cases, the hand-coded Java programs outperformed the Pig Latin programs. Depending on the program and problem size, the hand-coded Java was 1.15 to 3.07 times faster. The Pig Latin programs were slower for three reasons: (1) the overhead of translating Pig Latin into Java MapReduce jobs, (2) the overhead of converting data between the text format used in HDFS files and Pig's own internal representation, and (3) the overhead of the additional MapReduce jobs that were performed by Pig. Finally, this thesis explores a new approach to optimizing the fragment-replicated join operation in Apache Pig. In Pig's original implementation of this operation, an identical in-memory hash table is constructed and used by every Map task. In contrast, under the optimized implementation, this duplication of data is eliminated through the use of a new interprocess shared-memory hash table library. Benchmarks show that, as the problem size grows, the optimized implementation outperforms the original by a factor of two. Moreover, it is possible to run larger problems under the optimized implementation than under the original.
dc.format.mimetype	application/pdf
dc.identifier.citation	Liu, Ruoyu. "Performance Analysis and Optimization of Apache Pig." (2014) Master’s Thesis, Rice University. https://hdl.handle.net/1911/88101.
dc.identifier.uri	https://hdl.handle.net/1911/88101
dc.language.iso	eng
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject	Apache Hadoop
dc.subject	Apache Pig
dc.subject	MapReduce
dc.subject	Performance
dc.subject	Hash join
dc.title	Performance Analysis and Optimization of Apache Pig
dc.type	Thesis
dc.type.material	Text
thesis.degree.department	Computer Science
thesis.degree.discipline	Engineering
thesis.degree.grantor	Rice University
thesis.degree.level	Masters
thesis.degree.name	Master of Science

Files

Original bundle

Name: LIU-DOCUMENT-2014.pdf
Size: 258.79 KB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 5.83 KB
Format: Plain Text

Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text

Collections

Rice University Electronic Theses and Dissertations