Browsing by Author "Jermaine, Chris"
Now showing 1 - 8 of 8
Item: Adding Vector and Matrix Support to SimSQL (2016-04-22)
Luo, Shangyu; Jermaine, Chris
In this thesis, I consider the problem of making linear algebra simple to use and efficient to run in a relational database management system. Relational database systems are widely used, and much of the data in the world is stored within them. Having linear algebra integrated into a relational database would provide strong support for tasks such as in-database analytics and in-database machine learning. Currently, when it is necessary to perform such analyses, one must either extract the data from the database and use an external tool such as MATLAB, or else use the awkward linear algebra facilities that exist within the database. In this thesis, I focus on four main contributions: (1) I add vector and matrix types to SQL, the most commonly used database programming language; (2) I design a few simple SQL language extensions to accommodate vectors and matrices; (3) I consider the problem of making vector and matrix operations efficient via integration with the database query optimizer; and (4) I conduct experiments to show the efficacy of my language extensions.
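To see why native vector and matrix types matter, consider what a purely relational encoding forces on the user: each matrix is shredded into (row, col, value) tuples, and multiplication becomes a join followed by a GROUP BY aggregation. The Python sketch below illustrates that pattern; it is an illustration of the relational encoding only, not SimSQL code.

```python
# Matrix multiply as a plain relational engine must express it, with each
# matrix shredded into (row, col, val) tuples. Illustrative only; the
# thesis's SQL extensions replace this with native vector/matrix types.
from collections import defaultdict

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]   # relation A(row, col, val)
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # relation B(row, col, val)

def join_aggregate_matmul(A, B):
    """C[i, j] = sum_k A[i, k] * B[k, j]: a join on A.col = B.row,
    then a SUM aggregate grouped by (i, j)."""
    acc = defaultdict(float)
    for (i, k, a) in A:
        for (k2, j, b) in B:
            if k == k2:                  # the join predicate
                acc[(i, j)] += a * b     # the GROUP BY ... SUM
    return sorted((i, j, v) for (i, j), v in acc.items())

print(join_aggregate_matmul(A, B))   # [(0, 0, 14.0), (0, 1, 12.0), (1, 0, 12.0)]
```

With native matrix types and a few language extensions, the same computation becomes a single typed multiply that the query optimizer can reason about directly.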
Item: Applying Machine Learning to Query Optimization (2021-06-15)
Sikdar, Sourav; Jermaine, Chris
Recent progress in machine learning (ML) and artificial intelligence (AI) has the potential to impact the design and implementation of many aspects of modern database systems. In particular, ML and AI may have a significant impact on the design of the query optimizer, which a database uses to explore the large space of semantically equivalent plans for implementing a given query, with the goal of choosing the plan with the least cost. This thesis seeks to use ML and AI to improve the state of the art in multiple areas of database query optimization. In the first part of the thesis, I consider the optimization of queries with user-defined functions (UDFs). Most modern SQL database systems and Big Data processing systems support UDFs, which make optimization difficult. The backbone of database query optimization is the collection of statistics describing the data to be processed, but when a database or Big Data computation is obscured by UDFs, good statistics are often unavailable. I propose a solution called "Multi-Step Optimization and Execution," or Monsoon. Monsoon models execution and statistics collection as a Markov decision process (MDP) and allows multiple, interleaved rounds of each. Monsoon may choose to collect statistics on the UDFs and then run a computation; or it may optimize and execute part of the plan, collect statistics on the result of the partial plan, and then re-optimize, with the process repeated as needed. Monsoon uses Monte-Carlo tree search (MCTS), a common MDP solver, to find the best execution plan for a given query. In an experimental study, I demonstrate that Monsoon can match or outperform most alternative solutions for optimizing queries with UDFs. In the second part of the thesis, I address the problem of reducing cardinality estimation errors stemming from inaccuracies in analytical cost models, a problem that has long plagued query optimizers. Traditionally, query optimizers employ static cost models that provide no mechanism for incorporating feedback regarding the quality of the resulting plans. To alleviate this problem, neural cost models that can learn from their mistakes have been proposed in recent literature. However, these neural solutions need large numbers of example queries that have already been executed over a given database to learn from, and cannot work well "out of the box." In this thesis, I treat the creation of a neural cost model as an instance of few-shot learning, where the goal is to work well with just a few training examples. Unlike other domains where little is known about the semantics of the problem, a key aspect that makes learning for query optimization amenable to few-shot learning is the availability of high-quality analytic cost models that are already known to work in many cases. The idea I explore is to build a recurrent neural network designed to mimic the classical cost model, so that it performs as well as the classical model out of the box, without any training. However, since it is a neural network, it can learn. After the model is deployed and data are observed, the model is fine-tuned on the given database and installation. Because it is already of high quality before training, it is able to adapt to the new setting using very few training queries. In an empirical study, I demonstrate that this approach outperforms both classical and modern neural cost models.
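The value of interleaving statistics collection with execution is easy to demonstrate in miniature. The sketch below frames the choice as a one-step MDP and scores each initial action with Monte-Carlo rollouts, a much-simplified stand-in for the MCTS search Monsoon actually uses; the action set, cost numbers, and uniform prior over the UDF's unknown selectivity are all invented for illustration.

```python
# Toy model of Monsoon's choice between gathering statistics on a UDF and
# executing a plan immediately. Costs, actions, and the uniform prior over
# the UDF's unknown selectivity are invented; rollouts stand in for MCTS.
import random

STATS_COST = 1.0   # price of profiling the UDF on a sample first

def plan_cost(plan, sel):
    """Hypothetical costs: filter-first wins on selective UDFs,
    join-first wins otherwise."""
    return 2.0 + 10.0 * sel if plan == "filter_first" else 12.0 - 10.0 * sel

def expected_cost(first_action, trials=20000):
    """Average cost of taking first_action, then acting greedily, over
    'possible worlds' sampled from the prior on selectivity."""
    total = 0.0
    for _ in range(trials):
        sel = random.random()                      # sampled possible world
        if first_action == "collect_stats":
            # pay to measure, then run whichever plan is truly cheaper
            total += STATS_COST + min(plan_cost("filter_first", sel),
                                      plan_cost("join_first", sel))
        else:                                      # commit to a plan blind
            total += plan_cost(first_action, sel)
    return total / trials

actions = ["collect_stats", "filter_first", "join_first"]
print({a: round(expected_cost(a), 2) for a in actions})
# collecting statistics first is cheapest in expectation here
```

In this toy setting the expected cost of paying for statistics (about 5.5) beats committing blindly to either plan (about 7.0), which is exactly the value-of-information trade-off the MDP formulation captures.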
Item: Calculating Variant Allele Fraction of Structural Variation in Next Generation Sequencing by Maximum Likelihood (2015-04-23)
Fan, Xian; Nakhleh, Luay K.; Kavraki, Lydia; Jermaine, Chris; Chen, Ken
Cancer cells are intrinsically heterogeneous: multiple clones, each with its own variants, co-exist in tumor tissues. These variants include point mutations and structural variations. Point mutations, or single nucleotide variants, are variants affecting a single base; structural variations are variations involving sequences of at least 50 bases. Approaches to estimate the number of clones and their respective percentages from point mutations have recently been proposed. However, structural variations, although involving more reads than point mutations, have not been quantitatively studied for characterizing cancer heterogeneity. In this thesis I describe a maximum likelihood approach to estimate the variant allele fraction of a putative structural variation, as a step towards the characterization of tumor heterogeneity. A software tool realizing this statistical model, BreakDown, implemented in Perl, is publicly available. I studied the performance of BreakDown on both simulated and real data, and found that BreakDown outperformed other methods, such as THetA, in estimating variant allele fractions.
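The abstract does not spell out the likelihood, but the maximum likelihood idea can be sketched under a deliberately simple assumption: the number of variant-supporting reads is binomial in the total read depth, with the variant allele fraction as the success probability. BreakDown's actual model for structural variants is more involved; the Python below only illustrates the estimation principle.

```python
# Grid-search MLE of a variant allele fraction f, assuming the count of
# variant-supporting reads is Binomial(depth, f). A deliberately simple
# stand-in for BreakDown's likelihood, which must also model read pairs
# spanning structural-variant breakpoints.
import math

def log_likelihood(f, alt, depth):
    """log P(alt | f) for alt ~ Binomial(depth, f), with 0 < f < 1."""
    return (math.lgamma(depth + 1) - math.lgamma(alt + 1)
            - math.lgamma(depth - alt + 1)
            + alt * math.log(f) + (depth - alt) * math.log(1.0 - f))

def mle_vaf(alt, depth, grid=1000):
    """Maximize the log-likelihood over a grid of candidate fractions."""
    candidates = [(i + 0.5) / grid for i in range(grid)]
    return max(candidates, key=lambda f: log_likelihood(f, alt, depth))

# 18 of 60 reads support the putative structural variation
print(mle_vaf(alt=18, depth=60))   # prints a value near 0.30
```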
Item: Computer Systems for Distributed Machine Learning (2018-01-31)
Gao, Zekai; Jermaine, Chris
My thesis considers the design and development of state-of-the-art distributed data analytics systems supporting the implementation and execution of machine learning algorithms. Specifically, I consider how to support iterative, large-scale machine learning algorithms on a prototype distributed relational database system called SimSQL. SimSQL allows a programmer to leverage the power of declarative programming and data independence to specify what a computation does, rather than how to implement it. This increases programmer productivity and means that the same implementation can be used for data sets of different sizes and complexities, and on different hardware. The thesis considers three specific problems in the context of adapting SimSQL for the implementation and execution of large-scale machine learning algorithms. First, during learning, when a user-defined function is parameterized with a data object and a statistical model used to process that object, the fully parameterized model can be huge. How do we deal with the potentially massive blowup in size during distributed learning? Second, although the idea of data independence (a fundamental design principle upon which relational database systems are built) supports the notion of "one implementation, any model/data size and compute hardware," such systems lack sufficient support for the recursive computations found in deep learning and other applications. How should such a system be modified to support these applications? Third, some key features of distributed platforms aim at more general data processing applications and are not always the best fit for large-scale machine learning and distributed linear algebra. Can we achieve higher efficiency on these platforms by avoiding common pitfalls? My thesis addresses these issues by, first, describing and studying the ubiquitous join-and-co-group pattern for user-defined function parameterization, and carefully evaluating the alternatives for implementing this pattern on top of both SimSQL and Apache Spark. Second, I enhance SimSQL to support declarative recursion via multidimensional tables, then modify the query optimization framework so that it can handle the massive query plans that result from complicated recursive computations; I benchmark the resulting system against TensorFlow and Spark. Third, I examine various performance bottlenecks associated with running large-scale machine learning applications on SimSQL, and consider three enhancements: partitioning of large vector- and matrix-type data, choice of physical plans for complicated operations, and runtime compilation.
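The join-and-co-group pattern mentioned above can be rendered in miniature: rather than pairing every data object with the entire statistical model, both relations are grouped on a key so that each object meets only the model slice it needs. The dictionary-based Python below is a hypothetical illustration of that pattern, not SimSQL or Spark code.

```python
# Miniature join-and-co-group: data objects and model parameters are both
# keyed, co-grouped, and the UDF sees only the slice of the model it needs,
# avoiding the blowup of fully parameterizing the model for every object.
# All names and the toy UDF are hypothetical.
from collections import defaultdict

data  = [("sports", "doc1"), ("politics", "doc2"), ("sports", "doc3")]
model = [("sports", [0.1, 0.9]), ("politics", [0.7, 0.3])]

def co_group(left, right):
    """Yield (key, left_values, right_values), like Spark's cogroup."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return groups.items()

def score(doc, params):        # stand-in for the user-defined function
    return doc, round(sum(params), 2)

results = [score(doc, params)
           for _, (docs, slices) in co_group(data, model)
           for doc in docs
           for params in slices]
print(results)   # each doc is scored against only its own model slice
```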
Item: Differentiable Program Learning with an Admissible Neural Heuristic (2020-08-11)
Shah, Ameesh; Jermaine, Chris; Chaudhuri, Swarat
We study the problem of learning differentiable functions expressed as programs in a domain-specific language. Such programmatic models can offer benefits such as composability and interpretability; however, learning them requires optimizing over a combinatorial space of program “architectures”. We frame this optimization problem as a search in a weighted graph whose paths encode top-down derivations of program syntax. Our key innovation is to view various classes of neural networks as continuous relaxations over the space of programs, which can then be used to complete any partial program. This relaxed program is differentiable and can be trained end-to-end, and the resulting training loss is an approximately admissible heuristic that can guide the combinatorial search. We instantiate our approach on top of the A* algorithm and an iteratively deepened branch-and-bound search, and use these algorithms to learn programmatic classifiers in three sequence classification tasks. Our experiments show that the algorithms outperform state-of-the-art methods for program learning, and that they discover programmatic classifiers that yield natural interpretations and achieve competitive accuracy.

Item: Distributed Machine Learning Scale Out with Algorithms and Systems (2020-12-04)
Yuan, Binhang; Jermaine, Chris
Machine learning (ML) is ubiquitous and has powered the recent success of artificial intelligence. However, the state of affairs with respect to distributed ML is far from ideal. TensorFlow and PyTorch simply crash when an operation's inputs and outputs cannot fit on a GPU under model parallelism, or when a model cannot fit on a single machine under data parallelism. TensorFlow code that works reasonably well on a single machine with eight GPUs procured from a cloud provider often runs slower on two machines totaling sixteen GPUs. In this thesis, I propose solutions at both the algorithm and system levels to scale out distributed ML. At the algorithm level, I propose a new method for distributed neural network learning called independent subnet training (IST). In IST, at each iteration, a neural network is decomposed into a set of subnetworks of the same depth as the original network, each of which is trained locally, before the various subnets are exchanged and the process is repeated. IST has many advantages, including reduced communication volume and frequency, an implicit extension to model parallelism, and lower memory requirements at each compute site. At the system level, I believe that proper computational and implementation abstractions will allow for the construction of self-configuring, declarative ML systems, especially when the goal is to execute tensor operations for ML in a distributed environment, or partitioned across multiple AI accelerators (ASICs). To this end, I first introduce a tensor relational algebra (TRA), which is expressive enough to encode any tensor operation that can be written in Einstein notation, and then consider how TRA expressions can be rewritten into an implementation algebra (IA) that enables effective implementation in a distributed environment, as well as how expressions in the IA can be optimized. An empirical study shows that the optimized implementations provided by the IA can match or even outperform carefully engineered HPC and ML systems for large-scale tensor manipulations and ML workflows on distributed clusters.
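The IST decomposition described above is easy to picture for a network with one hidden layer: partition the hidden units across workers, hand each worker the matching slices of both weight matrices, train the subnets independently, and write the slices back before re-partitioning. Below is a NumPy sketch under those assumptions; local training is faked with a placeholder update, and this is not the thesis's implementation.

```python
# One round of independent subnet training (IST) for y = W2 @ relu(W1 @ x):
# partition the hidden units across workers, let each worker update only
# its own slices of W1 and W2, then write the slices back and re-partition.
# local_train is a placeholder for a few local SGD iterations on a shard.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n_workers = 8, 12, 4, 3

W1 = rng.normal(size=(d_hidden, d_in))   # input -> hidden
W2 = rng.normal(size=(d_out, d_hidden))  # hidden -> output

def local_train(w1_slice, w2_slice):
    """Stand-in for local SGD on one worker's data shard."""
    return w1_slice * 0.99, w2_slice * 0.99

for round_ in range(5):
    # decompose: randomly assign each hidden unit to exactly one worker
    parts = np.array_split(rng.permutation(d_hidden), n_workers)
    for units in parts:
        # each subnet keeps the full depth, but only the rows/columns of
        # W1/W2 that touch its assigned hidden units
        W1[units, :], W2[:, units] = local_train(W1[units, :], W2[:, units])
    # exchange: slices were written back; the next round re-partitions

print("trained", W1.shape, W2.shape)
```

Because each worker touches a disjoint slice of the parameters, the round's communication is one scatter and one gather of slices rather than repeated synchronization of the full model.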
Item: Improving Peer Evaluation Quality in Massive Open Online Courses (2015-05-26)
Lu, Yanxin; Chaudhuri, Swarat; Warren, Joe; Jermaine, Chris
As several online course providers such as Coursera, Udacity, and edX emerged in 2012, Massive Open Online Courses (MOOCs) gained much attention across the globe. While MOOCs provide learning opportunities for many people, several challenges exist in the MOOC context, and one of these is how to ensure the quality of peer grading. The Interactive Programming in Python (IPP) course that Rice has offered for a number of years on Coursera has suffered from the problem of low-quality peer evaluations. In this thesis, we propose a solution that improves the quality of peer evaluations by motivating peer graders. Specifically, we want to answer the question: when students know that their own peer-grading efforts are being examined, and when they are able to grade other students' peer evaluations, do these factors motivate them to do a better job when grading assignments? We implemented a web application where students can grade peer evaluations, and we conducted a series of controlled experiments. We find a strong effect on peer evaluation quality arising simply because students know that they are being studied using software that is supposed to help with peer grading. In addition, we find strong evidence that grading peer evaluations leads students to give better peer evaluations themselves. However, the strongest effect seems to be obtained via the act of grading others' evaluations, and not from the knowledge that one's own peer evaluation will be examined.

Item: Large Scale Online Aggregation Via Distributed Systems (2014-12-04)
Pansare, Niketan; Jermaine, Chris; Nakhleh, Luay; Merényi, Erzsébet
From movie recommendations to fraud detection to personalized health care, there is a growing need to analyze huge amounts of data quickly. To deal with huge amounts of data, many analysts use MapReduce, a software framework that parallelizes computations across a compute cluster. However, due to the sheer volume of data, MapReduce is sometimes still not fast enough to perform complicated analyses. In this thesis, I address this problem by developing a statistical estimation framework on top of MapReduce to provide for interactive data analysis. I present three projects on this topic. In the first project, I extend online aggregation (OLA) to the MapReduce environment. OLA allows the user to compute an arbitrary aggregation function over a data set and output probabilistic bounds on accuracy in an online fashion. OLA in a relational database system uses classical sampling theory to estimate confidence bounds. The key difference in a large-scale distributed computing environment is the importance of block-based processing. At a random time instant, the system is likely to be in the middle of processing blocks that take longer to process; hence, blocks that take longer to process are less likely to have been taken into account when an estimate is generated. Since one might expect correlation between the processing time and the aggregated value of a block, the estimates of the aggregate can be biased. To address this inspection paradox, I propose a Bayesian model that utilizes a joint prior over the values to be aggregated and the time taken to process and schedule each block. Because the model takes timing information into account, the bias is removed. The model is implemented on Hyracks, an open-source project similar to Hadoop, the most popular implementation of MapReduce. In the second project, I implement gradient descent on top of MapReduce. Gradient descent is an optimization algorithm that finds a local minimum of a function L(w) by starting from an initial point w_0 and repeatedly stepping in the direction of the negative gradient of the function being optimized. Each computation of the gradient is referred to as an epoch, and gradient descent computes many epochs iteratively until completion. If the number of data points N is very large, it can take a long time to compute the aggregate for an epoch. Since the gradient computation is essentially a user-defined aggregate function, the OLA framework developed in the first part of my thesis can be used to speed up the algorithm in a MapReduce framework. The key technical question that must be answered is: when do we stop the OLA estimation for a given epoch? In this thesis, I propose and evaluate a new statistical model for addressing this question. Finally, I design, implement, and evaluate a particular machine learning algorithm. An extremely popular feature selection methodology is topic modeling. A topic is defined as a probability distribution over sets of words or phrases, and each document in the corpus is drawn from a mixture of these topics. A topic model for a corpus specifies the set of topics as well as the proportions in which they are present in any given document. The recent interest in topic models has been driven by the explosion of electronic, text-based data available for analysis. From web pages to emails to microblogs, text-based data are everywhere. However, not all electronically available natural language corpora are text-based. In my thesis, I consider the problem of learning topic models over spoken language. My work is motivated by our involvement with the Spoken Web (also called the World Wide Telecomm Web), which allows users in rural India to post farming-related questions and responses to an audio forum using mobile phones. I propose a new topic model that leverages the statistical algorithms used in most modern speech-to-text software. I develop an alternative version of the popular LDA topic model, called the spoken topic model (STM). This model uses a Bayesian interpretation of the output of speech-to-text software, taking the software's explicit uncertainty description (the phrases and weights) into account in a principled fashion.
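The stopping question at the heart of the second project can be made concrete with a much simpler rule than the thesis's Bayesian model: scan randomly ordered blocks, maintain a running mean of per-block gradients with a CLT-style confidence bound, and stop the epoch's OLA scan once the bound is small relative to the estimated gradient. A NumPy sketch under those assumptions:

```python
# Online aggregation for one gradient-descent epoch: stream over blocks,
# keep a running mean of per-block gradients plus a CLT-style error bound,
# and stop early once the bound is small relative to the gradient itself.
# A deliberately simple stand-in for the thesis's Bayesian stopping model;
# blocks are assumed to arrive in random order.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=100_000)
w = np.zeros(5)                                  # current iterate

def block_gradients(X, y, w, block_size=1_000):
    """Per-block average gradient of the squared loss at w."""
    for i in range(0, len(X), block_size):
        xb, yb = X[i:i + block_size], y[i:i + block_size]
        yield 2.0 * xb.T @ (xb @ w - yb) / len(xb)

def ola_gradient(X, y, w, rel_tol=0.05, min_blocks=10):
    grads = []
    for g in block_gradients(X, y, w):
        grads.append(g)
        if len(grads) >= min_blocks:
            mean = np.mean(grads, axis=0)
            sem = np.std(grads, axis=0, ddof=1) / np.sqrt(len(grads))
            if np.linalg.norm(1.96 * sem) < rel_tol * np.linalg.norm(mean):
                break                            # estimate is good enough
    return np.mean(grads, axis=0), len(grads)

grad, used = ola_gradient(X, y, w)
print(f"epoch gradient estimated from {used} of 100 blocks")
```

The point of the thesis's Bayesian treatment is precisely that this naive rule breaks down when block processing time correlates with block values; the sketch only shows where the stopping decision sits in the computation.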