
Browsing by Author "Jermaine, Christopher"

Now showing 1 - 16 of 16

    A Browser-based Program Execution Visualizer for Learning Interactive Programming in Python
    (2015-04-23) Tang, Lei; Warren, Joe; Rixner, Scott; Jermaine, Christopher
    Good educational programming tools help students practice programming skills and build a better understanding of basic concepts and logic. As Rice University started offering free Massive Open Online Courses (MOOCs) on the internet, we developed a web-based programming environment for teaching an introductory programming course in Python. The course is now one of the top-rated MOOC courses, a success believed to be largely due to the web-based educational programming environment. Here we introduce the thought process behind the design and then focus on the key innovations incorporated in it. The main contribution of this thesis is an entirely browser-based Python program execution visualizer that graphically presents execution information to help students understand the dynamics of program execution. In particular, this tool can also be used to visualize and debug event-driven programs. The design details and unit test infrastructure for the program execution visualizer are both introduced in this thesis.

    A Data and Platform-Aware Framework For Large-Scale Machine Learning
    (2015-04-24) Mirhoseini, Azalia; Koushanfar, Farinaz; Aazhang, Behnaam; Baraniuk, Richard; Jermaine, Christopher
    This thesis introduces a novel framework for execution of a broad class of iterative machine learning algorithms on massive and dense (non-sparse) datasets. Several classes of critical and fast-growing data, including image and video content, contain dense dependencies. Current pursuits are overwhelmed by the excessive computation, memory access, and inter-processor communication overhead incurred by processing dense data. On the one hand, solutions that employ data-aware processing techniques produce transformations that are oblivious to the overhead created on the underlying computing platform. On the other hand, solutions that leverage platform-aware approaches do not exploit the non-apparent data geometry. My work is the first to develop a comprehensive data- and platform-aware solution that provably optimizes the cost (in terms of runtime, energy, power, and memory usage) of iterative learning analysis on dense data. My solution is founded on a novel tunable data transformation methodology that can be customized with respect to the underlying computing resources and constraints. My key contributions include: (i) introducing a scalable and parametric data transformation methodology that leverages coarse-grained parallelism in the data to create versatile and tunable data representations, (ii) developing automated methods for quantifying platform-specific computing costs in distributed settings, (iii) devising optimally-bounded partitioning and distributed flow scheduling techniques for running iterative updates on dense correlation matrices, (iv) devising methods that enable transforming and learning on streaming dense data, and (v) providing user-friendly open-source APIs that facilitate adoption of my solution on multiple platforms including (multi-core and many-core) CPUs and FPGAs. Several learning algorithms such as regularized regression, cone optimization, and power iteration can be readily solved using my APIs. My solutions are evaluated on a number of learning applications including image classification, super-resolution, and denoising. I perform experiments on various real-world datasets with up to 5 billion non-zeros on a range of computing platforms including Intel i7 CPUs, Amazon EC2, IBM iDataPlex, and Xilinx Virtex-6 FPGAs. I demonstrate that my framework can achieve up to 2 orders of magnitude performance improvement in comparison with current state-of-the-art solutions.

    An Experimental Comparison of Complex Objects Implementations in Big Data Systems
    (2017-06-07) Sikdar, Sourav; Jermaine, Christopher
    Many data management and analytics systems support complex objects. Dataflow platforms such as Spark and Flink allow programmers to manipulate sets consisting of objects from a host programming language, often Java. Document databases such as MongoDB make use of hierarchical interchange formats--most popularly JSON--which embody a data model where individual records can themselves contain sets of records. Systems such as Dremel and AsterixDB allow complex nesting of data structures. The desire to support such complex objects forces a system designer to ask: how should complex objects be implemented in a modern data management system? In this thesis, over a suite of representative data management tasks, I experimentally evaluate the performance implications of a wide variety of complex object implementations. The choice of object implementation can have a profound effect on performance. For example, the same external sort performed for duplicate removal can take anywhere from half an hour to fourteen and a half hours depending upon the complex object implementation. A corollary is that a bad object implementation can doom system performance. In addition, we reaffirm the value of the classical database way of storing complex objects within a modern big data system, where there is no distinction between the in-memory and over-the-wire data representation.

    Automatic Matrix Format Exploration for Large Scale Linear Algebra
    (2020-10-30) Luo, Shangyu; Jermaine, Christopher
    The input of a linear algebra (LA) operation, such as a matrix or a vector, can be stored in multiple ways: rows/columns, strips, blocks, etc. Usually, it is very difficult for a programmer to figure out the proper format to use to make an LA computation run fast. Predicting and optimizing the runtime behavior of an LA computation is not an easy task, even when one has expert knowledge of the underlying execution engine. The situation is particularly difficult if the computation consists of thousands of operations, and those operations must be run in a distributed manner. In this paper, we argue that a parallel relational database can be used to automatically explore the formats of LA computations. More specifically, our system takes in existing code and analyzes the operations in it, explores different formats for those operations and selects the most efficient ones, and finally generates new code automatically to run those operations in their selected formats. We show that our implementation is able to find formats that perform better than the formats manually picked by an expert user of the system.
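    To make the format choice concrete, here is a small illustrative sketch (in Python with NumPy; the sizes, block shape, and variable names are assumptions for illustration, not part of the system described above) of the same matrix stored as whole rows versus fixed-size blocks:

        import numpy as np

        A = np.arange(36, dtype=float).reshape(6, 6)

        # Row format: one stored chunk per row of the matrix.
        rows = {i: A[i, :] for i in range(6)}

        # Block format: one stored chunk per 3x3 tile.
        b = 3
        blocks = {(i, j): A[i*b:(i+1)*b, j*b:(j+1)*b]
                  for i in range(2) for j in range(2)}

        # A distributed engine can ship tiles independently, e.g. a blocked
        # matrix multiply computes C[i,j] as the sum over k of
        # blocks[(i,k)] @ blocks[(k,j)].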

    Code Similarity Search in a Latent Space
    (2017-04-21) Qi, Letao; Jermaine, Christopher
    A huge database of program source code that supports fast search via code similarity would be useful for several applications, including automated program synthesis and debugging, and user-facing code search in an integrated development environment. Here, "similar" is defined with respect to a set of application-defined similarity functions. The key difficulty in realizing this goal is that standard database indexing techniques cannot be applied to the problem of querying based on arbitrary similarity functions. To address this difficulty, I propose a dictionary-based approach where I represent each piece of code by a vector of its similarities to a set of example database codes. Cosine similarity between the vector representing a query code and the vector representing a database code can then be used to measure closeness. However, the dictionary may need to be very high dimensional if the goal is to accurately index a wide variety of database codes. Hence, I explore the idea of using a projection matrix to reduce the dimensionality of the problem. One approach is to use a random projection. The other approach that I explore is learning the projection matrix by developing a machine learning algorithm that is supervised using the text/code pairs provided by StackOverflow, a question-answering website for programmers.
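    A minimal sketch of the dictionary-based representation described in this abstract (the similarity function, dictionary codes, and dimensions below are illustrative assumptions, not the thesis's actual components):

        import numpy as np

        def sim(code_a, code_b):
            # Placeholder similarity: Jaccard overlap of token sets.
            ta, tb = set(code_a.split()), set(code_b.split())
            return len(ta & tb) / max(len(ta | tb), 1)

        def to_vector(code, dictionary):
            # Represent a piece of code by its similarities to dictionary codes.
            return np.array([sim(code, d) for d in dictionary])

        def cosine(u, v):
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

        dictionary = ["def add ( a , b ) : return a + b",
                      "for i in range ( n ) : total += x [ i ]",
                      "while q : node = q . pop ( )"]

        # Random projection to reduce the dictionary dimensionality.
        rng = np.random.default_rng(0)
        P = rng.standard_normal((len(dictionary), 2)) / np.sqrt(2)

        query = to_vector("def add ( x , y ) : return x + y", dictionary) @ P
        db_code = to_vector("def plus ( a , b ) : return a + b", dictionary) @ P
        print(cosine(query, db_code))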

    Constrained Program Inference Using Metropolis-Hastings Sampling
    (2021-04-30) Chilukuri, Meghana Vasudha Oris; Jermaine, Christopher
    Conditional Program Generation (CPG) is a Sketch program generation technique that uses clues given by the user to generate the required Sketch. The CPG model generates a Sketch by sampling from the learned probability distribution P(sketch|clues). However, it cannot guarantee that the generated Sketch will incorporate all of the given clues. In practice, we find that when the CPG model assigns vanishingly low probabilities to every Sketch program in the Sketch program space that incorporates all of the given clues, it often returns a high-probability Sketch program that does not contain the clues. Such scenarios arise when the user wants to generate a novel program and gives clues that are significantly different from any set of clues the model was trained on. In this thesis, we introduce Constrained Program Inference (CPI), a method that treats constrained Sketch program generation as an inference problem, rather than a training problem. It guarantees that every generated program will incorporate all of the given clues. Our method uses the Metropolis-Hastings algorithm to treat clues as hard constraints, thus enabling CPI to generate novel programs. We find that CPI is able to produce higher-quality programs than CPG.
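    A minimal sketch of the Metropolis-Hastings idea with clues treated as hard constraints (the toy program space, proposal, and scoring model below are stand-ins, not the thesis's actual Sketch machinery):

        import math, random

        random.seed(0)
        VOCAB = ["read", "sort", "filter", "map", "write"]
        WEIGHT = {"read": 0.3, "sort": 0.1, "filter": 0.2,
                  "map": 0.2, "write": 0.2}    # toy unigram model

        def satisfies(prog, clues):            # hard constraint: every clue appears
            return all(c in prog for c in clues)

        def log_prob(prog):                    # stand-in for log P(sketch | clues)
            return sum(math.log(WEIGHT[t]) for t in prog)

        def propose(prog):                     # symmetric proposal: mutate one slot
            out = list(prog)
            out[random.randrange(len(out))] = random.choice(VOCAB)
            return out

        clues = ["read", "write"]
        state = ["read", "sort", "write"]      # must start from a feasible program
        for _ in range(10000):
            cand = propose(state)
            if not satisfies(cand, clues):     # constraint violated: reject outright
                continue
            if math.log(random.random()) < log_prob(cand) - log_prob(state):
                state = cand                   # standard Metropolis-Hastings accept
        print(state)

    Rejecting any proposal that violates a clue is equivalent to giving such programs zero posterior mass, so the chain explores only the constrained program space.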

    Declarative Relational Machine Learning Systems
    (2023-02-22) Jankov, Dimitrije; Jermaine, Christopher; Kyrillidis, Anastasios; Uribe, Cesar
    Several systems, most notably TensorFlow and PyTorch, have revolutionized how we practice machine learning (ML). They allow an ML practitioner to create complex models with great ease. In recent years there has been an explosion in the size of ML models, and it has become apparent that the systems we use today limit the data scientist to a few standard implementations like data parallelism (DP). In an ideal scenario, the ML practitioner would specify their model, and a system would take care of managing the specifics of the computations. My research explores how we can design and implement such systems. Specifically, it tries to find the right set of changes to a declarative relational system so that it can accommodate the needs of ML systems. The results of my research show that one can create scalable distributed machine learning systems that do not constrain the abilities of data scientists and enable greater productivity.

    Distributed Algorithms for Computing Very Large Thresholded Covariance Matrices
    (2014-09-26) Gao, Zekai; Jermaine, Christopher; Nakhleh, Luay; Subramanian, Devika
    Computation of covariance matrices from observed data is an important problem, as such matrices are used in applications such as PCA, LDA, and increasingly in the learning and application of probabilistic graphical models. One of the most challenging aspects of constructing and managing covariance matrices is that they can be huge, and their size makes them expensive to compute. For a p-dimensional data set with n rows, the covariance matrix will have p(p-1)/2 distinct off-diagonal entries, and the naive algorithm to compute the matrix will take O(np^2) time. For large p (greater than 10,000) and n much greater than p, this is debilitating. In this thesis, we consider the problem of computing a large covariance matrix efficiently in a distributed fashion over a large data set. We begin by considering the naive algorithm in detail, pointing out where it will and will not be feasible. We then consider reducing the time complexity using sampling-based methods to compute an approximate, thresholded version of the covariance matrix. Here “thresholding” means that all of the unimportant values in the matrix have been dropped and replaced with zeroes. Our algorithms have probabilistic bounds which imply that, with high probability, all of the top K entries in the matrix have been retained.
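    A minimal single-machine sketch of the thresholding idea (n, p, and K below are illustrative; the thesis's algorithms are distributed and sampling-based rather than exact):

        import numpy as np

        rng = np.random.default_rng(0)
        n, p, K = 10_000, 200, 50
        X = rng.standard_normal((n, p))

        C = np.cov(X, rowvar=False)            # naive O(n p^2) computation
        iu = np.triu_indices(p, k=1)           # p(p-1)/2 distinct off-diagonal entries
        vals = C[iu]
        keep = np.argsort(np.abs(vals))[-K:]   # indices of the top-K magnitudes

        T = np.zeros_like(C)                   # thresholded matrix: zero everywhere
        T[iu[0][keep], iu[1][keep]] = vals[keep]
        T = T + T.T + np.diag(np.diag(C))      # restore symmetry and variances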

    Evaluating Multihop Mobile Wireless Networks with Controllable Node Sparsity or Density
    (2015-12-04) Amiri, Keyvan; Johnson, David B.; Baraniuk, Richard G.; Jermaine, Christopher
    Simulation is the most widely used tool for evaluating the performance of multihop mobile wireless networks, yet such simulation has so far been limited by the lack of sufficiently flexible wireless mobility models for creating a wide range of different types of network scenarios of mobile nodes for use in protocol simulation. For example, the very commonly used Random Waypoint mobility model can only effectively be used in scenarios with relatively high node density, as attempting to generate sparser scenarios (e.g., placing the same number of nodes in larger and larger spaces) results in scenarios in which the network is frequently or always partitioned, with no possible multihop wireless path between many different pairs of nodes. In this thesis, I present the design and evaluation of the Random Controlled Sparse (RCS) mobility model, a new dynamic, tunable mobility model that can be controlled to generate a wide range of mobile scenarios with varying levels of node sparsity or density while avoiding network partitions. The model requires only a small set of parameters to define the desired behavior of the scenarios being generated.

    In generating a scenario, RCS itself internally operates as a separate discrete event simulator, utilizing highly efficient graph and computational geometry algorithms to control the desired sparse behavior and manage the constraints between the motions of different nodes. To further improve the performance and scalability of the model, I have also parallelized certain key parts of the scenario generation. To show the performance of the model in generating scenarios, I have evaluated its running time across a wide range of node counts and densities. I also present an evaluation of the scenarios generated, in terms of metrics such as the average number of neighbors of a node and the average minimum possible path length (hop count) between each pair of nodes, demonstrating the range of scenarios that RCS is able to produce.

    To show the usefulness of the model in revealing protocol behavior, I show the performance of DSDV, a common multihop wireless ad hoc network routing protocol, across a wide range of sparse and dense network scenarios. These results demonstrate that different degrees of node sparsity or density sometimes have surprising effects on protocol performance; simulations revealing these types of results have not generally been possible before due to the lack of suitable mobility models. Finally, to more fully show the use of the RCS model in evaluating real protocols, I present the design and evaluation of LAMP, the Local-Approximation Multicast Protocol, a new on-demand multicast routing protocol I have designed for mobile wireless ad hoc networks that delivers high performance in both sparse and dense scenarios. LAMP maintains high performance by utilizing link-layer unicast transmissions, based on a new algorithm in which each node computes a local approximation of the globally optimal multicast forwarding tree to the receivers. LAMP also introduces a new distributed protocol optimization known as anticipatory forwarding, which further improves both overhead and packet delivery latency when this local approximation deviates from the globally optimal tree. I have evaluated LAMP through detailed ns-2 simulations using scenarios from the RCS model as well as the Random Waypoint model, and compared it with ODMRP and ADMR, two existing on-demand multicast protocols that have previously been shown to perform well.

    Exploring phylogenetic hypotheses via Gibbs sampling on evolutionary networks
    (BioMed Central, 2016) Yu, Yun; Jermaine, Christopher; Nakhleh, Luay K.
    Background: Phylogenetic networks are leaf-labeled graphs used to model and display complex evolutionary relationships that do not fit a single tree. There are two classes of phylogenetic networks: data-display networks and evolutionary networks. While data-display networks are very commonly used to explore data, they are not amenable to incorporating probabilistic models of gene and genome evolution. Evolutionary networks, on the other hand, can accommodate such probabilistic models, but they are not commonly used for exploration.

    Results: In this work, we show how to turn evolutionary networks into a tool for statistical exploration of phylogenetic hypotheses via a novel application of Gibbs sampling. We demonstrate the utility of our work on two recently available genomic data sets, one from a group of mosquitoes and the other from a group of modern birds. We demonstrate that our method allows the use of evolutionary networks not only for explicit modeling of reticulate evolutionary histories, but also for exploring conflicting treelike hypotheses. We further demonstrate the performance of the method on simulated data sets, where the true evolutionary histories are known.

    Conclusion: We introduce an approach to explore phylogenetic hypotheses over evolutionary phylogenetic networks using Gibbs sampling. The hypotheses can involve reticulate and non-reticulate evolutionary processes simultaneously, as we illustrate on the mosquito and modern bird genomic data sets.
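    For readers unfamiliar with the underlying technique, here is a minimal generic Gibbs sampler on a toy bivariate normal with correlation rho; the paper's sampler operates over network hypotheses instead, and this only illustrates the draw-each-variable-from-its-conditional loop:

        import numpy as np

        rng = np.random.default_rng(0)
        rho, x1, x2 = 0.8, 0.0, 0.0
        samples = []
        for _ in range(5000):
            # Alternately draw each variable from its full conditional.
            x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # x1 | x2
            x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # x2 | x1
            samples.append((x1, x2))
        # Sample correlation after burn-in should approach rho.
        print(np.corrcoef(np.array(samples[500:]).T)[0, 1])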

    Generalizations of the Alternating Direction Method of Multipliers for Large-Scale and Distributed Optimization
    (2014-11-19) Deng, Wei; Zhang, Yin; Yin, Wotao; Jermaine, Christopher; Tapia, Richard
    Due to the dramatically increasing demand for dealing with "Big Data", efficient and scalable computational methods are highly desirable to cope with the size of the data. The alternating direction method of multipliers (ADMM), as a versatile algorithmic tool, has proven to be very effective at solving many large-scale and structured optimization problems, particularly arising from the areas of compressive sensing, signal and image processing, machine learning and applied statistics. Moreover, the algorithm can be implemented in a fully parallel and distributed manner to process huge datasets. These benefits have mainly contributed to the recent renaissance of ADMM for modern applications. This thesis makes important generalizations to ADMM to improve its flexibility and efficiency, as well as extending its convergence theory. Firstly, we allow more options of solving the subproblems either exactly or approximately, such as linearizing the subproblems, taking one gradient descent step, and approximating the Hessian. Often, when subproblems are expensive to solve exactly, it is much cheaper to compute approximate solutions to the subproblems which are still good enough to guarantee convergence. Although it may take more iterations to converge due to less accurate subproblems, the entire algorithm runs faster since each iteration takes much less time. Secondly, we establish the global convergence of these generalizations of ADMM. We further show the linear convergence rate under a variety of scenarios, which cover a wide range of applications in practice. Among these scenarios, we require that at least one of the two objective functions is strictly convex and has Lipschitz continuous gradient, along with certain full rank conditions on the constraint coefficient matrices. The derived rate of convergence also provides some theoretical guidance for optimizing the parameters of the algorithm. In addition, we introduce a simple technique to improve an existing convergence rate from O(1/k) to o(1/k). Thirdly, we introduce a parallel and multi-block extension to ADMM for solving convex separable problems with N blocks of variables. The algorithm decomposes the original problem into N smaller subproblems and solves them in parallel at each iteration. It is well suited to distributed computing and is particularly attractive for solving certain large-scale problems. We show that extending ADMM straightforwardly from the classic Gauss-Seidel setting to the Jacobi setting, from 2 blocks to N blocks, will preserve convergence if the constraint coefficient matrices are mutually near-orthogonal and have full column-rank. For general cases, we propose to add proximal terms of different kinds to the N subproblems so that they can be solved in flexible and efficient ways and the algorithm converges globally at a rate of o(1/k). We introduce a strategy for dynamically tuning the parameters of the algorithm, often leading to substantial acceleration of the convergence in practice. Numerical results are presented to demonstrate the efficiency of the proposed algorithm in comparison with several existing parallel algorithms. We also implemented our algorithm on Amazon EC2, an on-demand public computing cloud, and report its performance on very large-scale basis pursuit problems with distributed data.
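    For reference, the classic two-block ADMM iteration that this thesis generalizes can be written in scaled form, for the problem min f(x) + g(z) subject to Ax + Bz = c (standard textbook notation, assumed here):

        \begin{aligned}
        x^{k+1} &= \operatorname*{arg\,min}_{x}\ f(x) + \tfrac{\beta}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert_2^2,\\
        z^{k+1} &= \operatorname*{arg\,min}_{z}\ g(z) + \tfrac{\beta}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert_2^2,\\
        u^{k+1} &= u^{k} + Ax^{k+1} + Bz^{k+1} - c.
        \end{aligned}

    The generalizations described above replace the exact arg-min subproblems with linearized or otherwise approximate solves, and extend the two-block Gauss-Seidel sweep to N blocks updated in parallel, Jacobi style, with added proximal terms.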

    Meta Approaches to Few-shot Image Classification
    (2021-02-26) Chowdhury, Arkabandhu; Jermaine, Christopher
    Since the inception of deep Convolutional Neural Network (CNN) architectures, we have seen tremendous advances in machine image classification. However, these methods require a large amount of data, sometimes on the order of millions of examples, and often fail to generalize when the data set is small. Recently, a new paradigm called 'Few-Shot Learning' has been developed to tackle this problem. Essentially, the goal of few-shot learning is to develop techniques that can rapidly generalize to new tasks containing very few labeled samples--in extreme cases one (called one-shot) or zero (called zero-shot). In this work, I tackle few-shot learning in the application of image classification. The most common approach is known as meta-learning, or 'learning to learn', where rather than learning to solve a particular learning problem, the goal is to solve many learning problems in an attempt to learn how to learn to solve a particular type of problem. Another way to solve the problem is to re-purpose an existing learner for a new learning problem, known as transfer learning. In my thesis, I propose two novel approaches, based on meta-learning and transfer learning, to tackle few-shot (or one-shot) image classification.

    The first approach I propose is called meta-meta classification, where one uses a large set of learning problems to design an ensemble of learners, each of which has high bias and low variance and is skilled at solving a specific type of learning problem. The meta-meta classifier learns how to examine a given learning problem and combine the various learners to solve it. One type of image classification is the one-vs-all (OvA) classification problem, where only one image from the positive class is available for training along with images from a number of negative classes. I evaluate my approach on a one-shot, one-class-versus-all classification task and show that it is able to outperform traditional meta-learning as well as ensembling approaches. I evaluate my method using the popular 1,000-class ImageNet data (ILSVRC2012), the 200-class Caltech-UCSD Birds dataset, the 102-class FGVC-Aircraft dataset, and the 1,200-class Omniglot hand-written character dataset. I compare my results with a popular meta-learning algorithm, the model-agnostic meta-learner (MAML), as well as an ensemble of multiple MAML models, and show that my approach outperforms them on all of the problems.

    The second approach uses the existing concept of transfer learning, where a simple Multi-Layer Perceptron (MLP) with a hidden layer is fine-tuned on top of pre-trained CNN backbones. Surprisingly, very few works in the few-shot literature have even examined the use of an MLP for fine-tuning pre-trained models (the assumption may be that a hidden layer would provide too many parameters for few-shot learning). In order to avoid overfitting, we simply use an L2 regularizer. We argue that a diverse feature vector, built from a variety of pre-trained libraries of models trained on a diverse dataset (such as ILSVRC2012), is sufficiently capable of being re-purposed for small-data problems. We performed a series of experiments on both classification accuracy and feature behavior on multiple few-shot problems. We carefully picked the hyperparameters after validating on the Caltech-UCSD Birds dataset and did our final evaluation on the FGVC-Aircraft, FC100, Omniglot, Traffic Sign, FGVCx Fungi, QuickDraw, and VGG Flower datasets. Our experimental results showed significantly better performance compared to several baselines, such as simple ensembling and the standalone best model, as well as other competitive meta-learning techniques.
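    A minimal sketch of the second approach, an L2-regularized one-hidden-layer MLP trained on frozen pre-trained features (written in PyTorch; the dimensions, hyperparameters, and random stand-in features are illustrative assumptions, not the thesis's settings):

        import torch
        import torch.nn as nn

        d_feat, d_hid, n_classes = 2048, 512, 5          # e.g., a 5-way task
        mlp = nn.Sequential(nn.Linear(d_feat, d_hid), nn.ReLU(),
                            nn.Linear(d_hid, n_classes))
        # weight_decay supplies the L2 penalty that keeps the head from overfitting.
        opt = torch.optim.Adam(mlp.parameters(), lr=1e-3, weight_decay=1e-2)
        loss_fn = nn.CrossEntropyLoss()

        feats = torch.randn(25, d_feat)                  # stand-in for frozen
        labels = torch.randint(0, n_classes, (25,))      # backbone features
        for _ in range(100):
            opt.zero_grad()
            loss = loss_fn(mlp(feats), labels)
            loss.backward()
            opt.step()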

    Mining Natural APIs from Large Code Corpora using a Mixture of Hidden Markov Models
    (2017-10-26) Mukherjee, Rohan; Jermaine, Christopher
    A Natural API is a collection of API methods that tend to be used following certain discernible statistical patterns in real-world code. In this thesis, I present a method for learning an interpretable statistical model of such natural APIs. My model is trained on sequences of API calls produced from large software repositories through program analysis. Once trained, the model is able to recognize complex temporal dependencies between methods, including methods that technically belong to different APIs, and can be used as a proxy for formal correctness specifications. Our experiments train the model on sequences of method calls generated from over 150 million lines of Android code. We evaluate the learned model by measuring the accuracy of the specifications it learns from the corpus, by completing code with missing API calls, and by searching for code that uses APIs in a way that matches a query. Our encouraging results indicate that statistical models of API calls learned from large code corpora can have broad value in software engineering.
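    A minimal NumPy sketch of the forward algorithm, the standard recursion for scoring a call sequence under a single HMM of such a mixture (the state count, transition, and emission values below are toy assumptions):

        import numpy as np

        pi = np.array([0.6, 0.4])                # initial state distribution
        T = np.array([[0.7, 0.3],                # state transition matrix
                      [0.2, 0.8]])
        E = np.array([[0.5, 0.4, 0.1],           # per-state emissions over a
                      [0.1, 0.3, 0.6]])          # 3-method API "vocabulary"

        def log_likelihood(obs):
            # Scaled forward recursion; alpha is the filtered state posterior.
            alpha = pi * E[:, obs[0]]
            ll = np.log(alpha.sum())
            alpha /= alpha.sum()
            for o in obs[1:]:
                alpha = (alpha @ T) * E[:, o]
                ll += np.log(alpha.sum())
                alpha /= alpha.sum()
            return ll

        print(log_likelihood([0, 1, 2, 2]))      # log P(call sequence)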

    Risk analysis for data-intensive stochastic models
    (2012-12-25) Arumugam, Subramanian; Haas, Peter J.; Jampani, Ravindranath Chowdary; Jermaine, Christopher; Perez, Luis L.; Xu, Fei; International Business Machines Corporation; Rice University; University of Florida; United States Patent and Trademark Office
    A risk analysis system and method are provided. The system includes an analyzer for analyzing database instances by executing a query on each database instance and selecting a cutoff value. The analyzer also discards the sets of uncertainty data that yield query-result values below the cutoff value and retains the database instances that yield query-result values above the cutoff value as elite sets. The system also includes a cloner to replicate the elite sets, and a sampler to modify the elite sets so that each elite set is mutually statistically independent while still yielding query-result values above the cutoff value.

    Role of Context in Program Search and Synthesis
    (2021-03-01) Mukherjee, Rohan; Jermaine, Christopher
    Consider the case where a programmer has written some part of a program but has left another part (such as a method or a function body) incomplete. The goal is to use the context surrounding the missing code to automatically “figure out” the programmer's intent and suggest relevant programs. The problem is “contextualized” in the sense that the helper engine should use clues in the partially-completed program to figure out which code is most useful; the user should not be required to formulate an explicit query. To achieve this goal, I propose two approaches. The first searches for relevant programs in a database of code, and the second directly synthesizes the desired code, writing it automatically.

    In the first part of the thesis, I consider the problem of querying a database of open-source code, where the task is quickly inferring which of the codes in the database would be useful to the programmer in completing the missing method. I cast contextualized code search as a learning problem, where the goal is to learn a distribution function that computes the likelihood that each database code correctly completes the program. I propose a neural model for predicting which database code is likely to be most useful. Because it would be prohibitively expensive to apply a neural model to each code in a database of millions or billions of codes at search time, one of the technical concerns is ensuring a speedy search. I address this by learning a “reverse encoder” that reduces the problem of evaluating each database code to computing a convolution of two normal distributions.

    In the second part of the thesis, I directly synthesize the most appropriate program for the user, according to the program context, while following the semantics of a programming language. Direct synthesis ensures that the system can come up with a reasonable answer to a query even when the desired code does not exist in the database. My technical innovation in this work is to augment the grammar of the programming language with semantic annotations to guide neural model-driven synthesis. In my work, these annotations are produced by a Java compiler; the formalism I use to add them is a so-called “attribute grammar”. This method alleviates many of the problems associated with learning to synthesize programs having long-term semantic dependencies across many lines of code, by minimizing the amount of information that needs to be remembered by the neural network controlling the synthesis. Synthesizing the correct program in a particular context then reduces to finding the right sequence of production rules in the attribute grammar. The resulting neural synthesizer, guided by the Java compiler, produces programs that are much more likely to be semantically correct than programs generated without the aid of an attribute grammar.

    Very Large Scale Bayesian Machine Learning
    (2014-07-30) Cai, Zhuhua; Jermaine, Christopher; Nakhleh, Luay; Zhong, Lin
    This thesis aims to scale Bayesian machine learning (ML) to very large datasets. First, I propose a pairwise Gaussian random field model (PGRF) for high-dimensional data imputation. The PGRF is a graphical, factor-based model. Besides its high accuracy, the PGRF is more efficient and scalable than the Gaussian Markov random field model (GMRF). Experiments show that the PGRF followed by linear regression (LR) or a support vector machine (SVM) reduces the RMSE by 10% to 45% compared with mean imputation followed by the LR or SVM. Furthermore, the PGRF scales imputation to very large datasets distributed over a 100-machine cluster that could not be handled by the GMRF or Gaussian methods at all. Unfortunately, the PGRF model is hard to implement -- approximately 18,000 lines of Hadoop code and 4 months of work in distributed debugging and running. To reduce this huge amount of human effort, I designed a database system called SimSQL. SimSQL supports rich analytical methods such as Bayesian ML and scales such methods to terabytes of data distributed over 100 machines. SimSQL enlarges the analysis power of relational database systems while keeping merits such as a declarative language, transparent optimization, and automatic parallelization. SimSQL builds upon the MCDB uncertainty database and allows the definition of recursive stochastic tables, making it an ideal platform for Markov chain simulations or iterative algorithms such as PageRank. To show SimSQL's performance, I introduce an objective benchmark that compares SimSQL with Giraph, GraphLab, and Spark on five Bayesian ML problems. The results show that SimSQL provides the best programmability and competitive performance. To run a general Bayesian ML model, SimSQL takes 1X less code than Spark, 6X less than GraphLab, and 12X less than Giraph, while its time cost is within a 5X slowdown in the worst case compared with Giraph and GraphLab. In brief, I consider both modeling and inference for large-scale Bayesian ML. The goals for both are the same: scaling Bayesian ML to very large datasets, achieving better performance, and reducing the time cost of designing, implementing, and executing ML algorithms.