Lossy Computation for Large-Scale Machine Learning

Date
2024-08-05
Abstract

In recent years, machine learning (ML), particularly deep learning, has made significant strides in areas such as image recognition and language processing. It has been shown that scaling up parameters and training data can greatly boost model performance. However, the growth in model and data size is outpacing hardware capabilities, creating a gap between ML needs and hardware development. My research aims to create scalable ML algorithms and systems that meet current and future ML demands, exploring lossy methods such as randomized and low-precision computation to handle larger data and model sizes without changing the underlying hardware.

First, consider the challenge of large datasets. In domains such as molecular structure analysis and social networks, where data points are interconnected, graph neural networks (GNNs) have emerged as a de facto standard tool for analyzing graph data. Leveraging the message passing mechanism, GNNs learn the representation of each node by iteratively aggregating information from its neighbors to capture graph structure and relationships. However, a key challenge in graph representation learning is scalability: real-world graphs may contain billions of nodes, leading to significant memory and speed inefficiency when training GNNs at that scale. To address the memory and time inefficiency of large-scale graph learning, we introduce two lossy computation paradigms. First, we propose a memory-efficient framework for training GNNs with significantly compressed activations. Second, we present a time-efficient GNN training method with degree-based graph sparsification.
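To give a flavor of the degree-based sparsification idea, the following is a minimal illustrative sketch only: it caps the number of retained neighbors per node by randomly subsampling dense neighborhoods. The function name, the per-node budget `max_degree`, and the uniform-sampling rule are assumptions for illustration; the method developed in the thesis may score and drop edges differently.

```python
# Hypothetical sketch of degree-based graph sparsification (not the thesis code).
import random
from collections import defaultdict

def sparsify_by_degree(edges, max_degree=32, seed=0):
    """Keep at most `max_degree` incoming neighbors per node by
    randomly subsampling the neighbor lists of high-degree nodes."""
    rng = random.Random(seed)
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(src)
    kept = []
    for dst, srcs in incoming.items():
        if len(srcs) > max_degree:
            srcs = rng.sample(srcs, max_degree)  # drop edges into dense nodes
        kept.extend((src, dst) for src in srcs)
    return kept

# Example: a star graph where node 0 receives 100 edges; only 32 survive.
edges = [(i, 0) for i in range(1, 101)]
print(len(sparsify_by_degree(edges, max_degree=32)))  # -> 32
```

Capping per-node degree bounds the cost of each message passing step, since aggregation work is proportional to the number of retained edges.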

Second, regarding the challenge of handling large models, as the model size grows, large language models (LLMs) have exhibited human-like conversation ability. This advancement opens the door to a wave of new applications, such as custom AI agents. To achieve this, two essential steps are involved: fine-tuning and serving. Fine-tuning is the process of adapting the LLM to a specific task, such as understanding and responding to domain-specific inquiries. The second step, serving, is about generating outputs to the questions in real-time. However, both of these two steps are hard and expensive due to the large model scale, limiting their accessibility to most of the users. Similarly, to improve efficiency in fine-tuning and serving LLMs, we also employ lossy computation approaches. Our first method enhances memory efficiency in LLM fine-tuning through the use of randomized matrix multiplication. Our second approach introduces a prompt tuning framework that optimizes the accuracy-efficiency trade-off for compressed LLMs. Lastly, we implement an extreme low-bit quantization technique for the KV Cache to further enhance performance.
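As background on the randomized matrix multiplication mentioned above, the sketch below shows the standard sampling-based estimator: the product A @ B is approximated by a small number of sampled outer products, trading accuracy for memory and compute. This is a generic illustration under assumed details (norm-proportional sampling, the parameter k); how the thesis applies it inside LLM fine-tuning is not specified here.

```python
# Hypothetical sketch of sampling-based randomized matrix multiplication.
import numpy as np

def sampled_matmul(A, B, k, seed=0):
    """Unbiased estimate of A @ B built from k sampled column/row outer
    products, with columns of A (rows of B) sampled by their norms."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = probs / probs.sum()
    idx = rng.choice(n, size=k, p=probs)
    scale = 1.0 / (k * probs[idx])        # importance-sampling weights
    return (A[:, idx] * scale) @ B[idx, :]

# Example: compare the sketch against the exact product.
A = np.random.randn(64, 512)
B = np.random.randn(512, 64)
approx = sampled_matmul(A, B, k=128)
exact = A @ B
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # relative error
```

Only the k sampled columns and rows need to be materialized, which is the source of the memory savings when such an estimator replaces exact products in training.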

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Machine Learning, lossy compression