Browsing by Author "Sikdar, Sourav"

Now showing 1 - 2 of 2
  • An Experimental Comparison of Complex Objects Implementations in Big Data Systems
    (2017-06-07) Sikdar, Sourav; Jermaine, Christopher
    Many data management and analytics systems support complex objects. Dataflow platforms such as Spark and Flink allow programmers to manipulate sets consisting of objects from a host programming language, often Java. Document databases such as MongoDB make use of hierarchical interchange formats (most popularly JSON), which embody a data model where individual records can themselves contain sets of records. Systems such as Dremel and AsterixDB allow complex nesting of data structures. The desire to support such complex objects forces a system designer to ask: how should complex objects be implemented in a modern data management system? In this thesis, over a suite of representative data management tasks, I experimentally evaluate the performance implications of a wide variety of complex object implementations. The choice of object implementation can have a profound effect on performance. For example, the same external sort used to perform duplicate removal can take anywhere from half an hour to fourteen and a half hours, depending on the complex object implementation. A corollary is that a bad object implementation can doom system performance. In addition, I reaffirm the value, within a modern big data system, of the classical database way of storing complex objects, where there is no distinction between the in-memory and over-the-wire data representations.
  • Applying Machine Learning to Query Optimization
    (2021-06-15) Sikdar, Sourav; Jermaine, Chris
    Recent progress in machine learning (ML) and artificial intelligence (AI) has the potential to impact the design and implementation of many aspects of modern database systems. ML and AI may have a significant impact on the design of the query optimizer, which a database uses to explore the large space of semantically equivalent plans for implementing a given query, with the goal of choosing the plan with the least cost. This thesis seeks to use ML and AI to improve the state of the art in multiple areas of database query optimization. In the first part of the thesis, I consider the problem of optimizing queries with user-defined functions (UDFs). Most modern SQL database systems and Big Data processing systems support UDFs, which make optimization difficult. The backbone of database query optimization is the collection of statistics describing the data to be processed, but when a database or Big Data computation is obscured by UDFs, good statistics are often unavailable. I propose a solution called "Multi-Step Optimization and Execution", or Monsoon. Monsoon models execution and statistics collection as a Markov decision process (MDP) and allows multiple, interleaved rounds of each. Monsoon may choose to collect statistics on the UDFs and then run a computation; or it may optimize and execute part of the plan, collecting statistics on the result of the partial plan, followed by a re-optimization step, with the process repeated as needed. Monsoon uses Monte Carlo tree search (MCTS), a common MDP solver, to find the best execution plan for a given query. In an experimental study, I demonstrate that Monsoon can match or outperform most alternative solutions for optimizing queries with UDFs. In the second part of the thesis, I address the problem of reducing cardinality estimation errors that stem from inaccuracies in analytical cost models. This is a problem that has long plagued query optimizers. Traditionally, query optimizers employ static cost models that do not support any mechanism to incorporate feedback regarding the quality of the resulting plans. To alleviate this problem, neural cost models that can learn from their mistakes have been proposed in recent literature. However, these neural solutions need large numbers of example queries that have already been executed over a given database to learn from, and they cannot work well "out of the box". In this thesis, I consider the creation of a neural cost model to be an instance of few-shot learning, where the goal is to work well with just a few training examples. Unlike other domains where little is known about the semantics of the problem, one of the key aspects of learning for query optimization that makes it amenable to few-shot learning is the availability of high-quality, analytic cost models that are already known to work in many cases. The idea I explore is to build a recurrent neural network designed to mimic the classical cost model, so that it performs as well as the classical model out of the box, without any training. However, since it is a neural network, it can learn. Subsequently, after the model is deployed and data are observed, the model is fine-tuned on the given database and installation. Because it is already of high quality before training, it is able to adapt to the new setting using very few training queries. In an empirical study, I demonstrate that this approach outperforms both classical and modern neural cost models.
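
The "start from the classical model, then fine-tune" idea in the second abstract can be illustrated with a very small sketch. This is not the recurrent neural network described in the thesis; it swaps in a much simpler technique, a linear correction applied in log space to a classical estimate, purely to show why a model that reproduces the classical estimate before training can adapt from a handful of observations. The class name `ResidualCostModel`, its parameters, and all numbers in the usage note are hypothetical.

```python
import math


class ResidualCostModel:
    """Hypothetical sketch: classical estimate plus a learned log-space correction.

    All parameters start at zero, so before any training the model returns
    exactly the classical estimate ("out of the box" behaviour).
    """

    def __init__(self, n_features):
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def predict_log(self, classical_estimate, features):
        # Correction is zero until fine_tune() has been called.
        correction = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return math.log(classical_estimate) + correction

    def predict(self, classical_estimate, features):
        return math.exp(self.predict_log(classical_estimate, features))

    def fine_tune(self, examples, lr=0.05, epochs=200):
        """Stochastic gradient descent on squared log error.

        `examples` is a list of (classical_estimate, features, observed) tuples,
        where `observed` is the value measured after actually running the query.
        """
        for _ in range(epochs):
            for classical_estimate, features, observed in examples:
                error = self.predict_log(classical_estimate, features) - math.log(observed)
                self.bias -= lr * error
                self.weights = [w - lr * error * x for w, x in zip(self.weights, features)]
```

As a usage illustration with made-up numbers: a fresh `ResidualCostModel(1)` asked for `predict(1000.0, [1.0])` returns the classical estimate unchanged. If three executed queries then show that the classical estimate is consistently ten times too low, for example `[(100.0, [1.0], 1000.0), (200.0, [1.0], 2000.0), (50.0, [1.0], 500.0)]`, calling `fine_tune` on those three examples moves a later prediction for a classical estimate of 400 close to 4000.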
