Relational Computation for Very Large-Scale Machine Learning
Jermaine, Christopher M.
Thesis, May 2025
https://hdl.handle.net/1911/118541
Keywords: Machine Learning

In mathematics, a tensor is an algebraic object that describes multilinear relationships among sets of algebraic entities associated with a vector space. From a computational perspective, tensors are commonly represented as multi-dimensional arrays, a format that plays a central role in machine learning. A widely used convention for expressing tensor operations is Einstein summation notation (EinSum), which compactly encodes summation over indexed terms. This notational framework not only streamlines the expression of complex tensor computations but also lends itself to an alternative interpretation: a multi-dimensional array can be viewed as a mapping from a vector of integers (i.e., a primary key) to a real number. This perspective aligns closely with the classical definition of a relation in a relational database. As a result, many numerical and machine learning computations in tensor calculus can be reformulated as sequences of joins and aggregations over relational data. Executing these computations within a relational database system offers several key advantages, including automatic parallelization, distribution, and scalability. Moreover, relational databases are particularly effective at handling sparsity, since they are designed to efficiently represent and process cases where only a small subset of the possible primary keys actually occur in the relation.

In this thesis, I propose an extension to Einstein notation called Upper-Case-Lower-Case Einstein Notation, a simple yet expressive framework for describing tensor programs that interleave operations over sparse (relational) data with efficient kernel calls over dense tensors. This notation enables the concise representation of computations optimized for complex sparsity patterns. To support this notation, I develop a compiler, SparseEinSum, which takes standard EinSum expressions as input, transforms them into extended Upper-Case-Lower-Case Einstein Notation as an intermediate representation, and compiles them into tensor-relational algebra. The compiler incorporates sparsity estimation and cost-based schema selection to guide the transformation. The resulting programs can be executed on virtually any relational database system, leveraging arrays to manage dense tensors within a relational execution model. Experiments across tensor computation benchmarks demonstrate that the generated tensor-relational computations offer significant performance improvements.

To support automatic differentiation of relational computations compiled from EinSum, I derive key rules that enable automatic differentiation for relational algebra. I introduce functional relational algebra to build functions in the relational domain and define relational analogs of partial derivatives, Jacobians, gradients, and a set of relation-Jacobian product rules for core relational operators, including table scan, selection, aggregation, and join. This functional framework lays the foundation for differentiation in relational algebra. I then propose a relational algebra automatic differentiation algorithm based on an efficient, correctness-preserving implementation of the relation-Jacobian product. Through extensive experiments, I show that executing machine learning computations on top of a relational engine, augmented with the relational algebra automatic differentiation algorithm, can scale efficiently to very large datasets.
The resulting system achieves performance competitive with specialized distributed machine learning systems, while retaining the advantages of relational query optimization.
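
The abstract's observation that a tensor can be treated as a relation keyed by its index vector, and that an EinSum contraction then becomes a join followed by an aggregation, can be made concrete with a small sketch. The code below is not from the thesis; it uses toy Python dictionaries as stand-ins for sparse relations and checks the result against NumPy's einsum for the contraction 'ik,kj->ij' (matrix multiplication).

```python
# A minimal sketch (not code from the thesis) of the relational view of EinSum:
# a sparse matrix is stored as a relation mapping a primary key (i, k) to a value,
# and the contraction "ik,kj->ij" becomes a join on k followed by a SUM aggregation
# grouped by (i, j).
import numpy as np
from collections import defaultdict

def einsum_ik_kj_as_join_aggregate(A_rel, B_rel):
    """A_rel: {(i, k): value}, B_rel: {(k, j): value} -> {(i, j): value}."""
    # Hash B's tuples by the shared index k (the join key).
    by_k = defaultdict(list)
    for (k, j), b in B_rel.items():
        by_k[k].append((j, b))
    # Join on k, multiply matched values, and aggregate (SUM ... GROUP BY i, j).
    out = defaultdict(float)
    for (i, k), a in A_rel.items():
        for j, b in by_k.get(k, []):
            out[(i, j)] += a * b
    return dict(out)

# Dense tensors and their sparse relational encodings (only nonzero keys stored).
A = np.array([[1.0, 0.0], [0.0, 2.0]])
B = np.array([[0.0, 3.0], [4.0, 0.0]])
A_rel = {(i, k): A[i, k] for i in range(2) for k in range(2) if A[i, k] != 0.0}
B_rel = {(k, j): B[k, j] for k in range(2) for j in range(2) if B[k, j] != 0.0}

dense = np.einsum('ik,kj->ij', A, B)              # standard EinSum contraction
relational = einsum_ik_kj_as_join_aggregate(A_rel, B_rel)
assert all(np.isclose(dense[i, j], v) for (i, j), v in relational.items())
```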
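The differentiation part of the thesis defines relation-Jacobian product rules for core relational operators. As a generic illustration of the underlying idea only, and not a reproduction of those rules, the sketch below shows that the backward pass of the join-plus-aggregation contraction above is itself another join plus aggregation. The loss L = sum(C) and the helper name grad_wrt_A are hypothetical choices made for this example.

```python
import numpy as np
from collections import defaultdict

# Sparse relational encoding of B (only nonzero keys stored), as in the sketch above.
B = np.array([[0.0, 3.0], [4.0, 0.0]])
B_rel = {(k, j): B[k, j] for k in range(2) for j in range(2) if B[k, j] != 0.0}

def grad_wrt_A(grad_C_rel, B_rel):
    """For C = A @ B and a scalar loss L, dL/dA[i, k] = sum_j dL/dC[i, j] * B[k, j]:
    the backward pass is again a join (on j) followed by SUM ... GROUP BY (i, k)."""
    by_j = defaultdict(list)
    for (k, j), b in B_rel.items():
        by_j[j].append((k, b))
    grad_A = defaultdict(float)
    for (i, j), g in grad_C_rel.items():
        for k, b in by_j.get(j, []):
            grad_A[(i, k)] += g * b
    return dict(grad_A)

# Take L = sum of all entries of C, so dL/dC is a relation of ones.
grad_C_rel = {(i, j): 1.0 for i in range(2) for j in range(2)}
grad_A_rel = grad_wrt_A(grad_C_rel, B_rel)

# Sanity check against the dense gradient dL/dA = dL/dC @ B^T.
dense_grad_A = np.ones((2, 2)) @ B.T
assert all(np.isclose(dense_grad_A[i, k], v) for (i, k), v in grad_A_rel.items())
```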