Fast and Expressive Sketch Structured Transform for Efficient Inference

Date
2024-12-06
Abstract

Linear transformations using learned weights are fundamental components of deep learning models. Prior research has shown that dense weight matrices can often be compressed by decomposition, quantization, sparsification, or random parameter sharing without losing accuracy, suggesting that more efficient transformations are possible. Among alternatives to dense weights, structured matrices are limited in expressivity and offer poor quality-efficiency tradeoffs, while unstructured sparse matrices map poorly onto modern hardware, slowing both training and inference. To address these challenges, we propose the Sketch Structured Transform (SS1), an expressive and hardware-efficient operator that reduces the number of multiplications in tensor operations and accelerates inference. SS1 applies random parameter sharing in a block-structured manner, reducing computation while preserving the expressiveness of parameter sharing. We show empirically that SS1 achieves better quality-efficiency tradeoffs than competing variants. Our theoretical analysis indicates that SS1 can be combined with quantization for further compression, and our experiments confirm this. Additionally, pre-trained models can be projected onto SS1 layers and finetuned for efficient deployment. Our experiments highlight several applications of SS1: (a) training GPT2 and DLRM models from scratch for faster inference; (b) finetuning projected BERT models for 1.31× faster inference while maintaining GLUE scores; and (c) a proof of concept with Llama-3-8b, showing 1.11× faster wall-clock inference from projected SS1 layers without finetuning.
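
To make the abstract's mention of block-structured random parameter sharing concrete, the sketch below builds a linear layer whose weight matrix is tiled from a small bank of shared blocks selected by a fixed random index map. This is a minimal illustration under assumed names and sizes (BlockSharedLinear, block, n_shared_blocks); it is not the thesis's SS1 operator, and for clarity it materializes the full weight, so it demonstrates the parameter sharing but not the reduced multiplication count that SS1 targets.

# Minimal, illustrative sketch of block-structured random parameter sharing
# (an assumed construction for illustration; NOT the thesis's exact SS1 operator).
import torch
import torch.nn as nn

class BlockSharedLinear(nn.Module):
    """Linear layer whose (out x in) weight is tiled from a small bank of
    shared blocks chosen by a fixed random index map, reducing unique parameters."""

    def __init__(self, in_features, out_features, block=64, n_shared_blocks=32):
        super().__init__()
        assert in_features % block == 0 and out_features % block == 0
        self.block = block
        self.rows, self.cols = out_features // block, in_features // block
        # Small bank of learnable blocks that the full weight matrix reuses.
        self.bank = nn.Parameter(torch.randn(n_shared_blocks, block, block) * 0.02)
        # Fixed random assignment of each weight-matrix block to a bank entry.
        self.register_buffer("idx", torch.randint(0, n_shared_blocks, (self.rows, self.cols)))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Materialize the full weight by gathering shared blocks, then apply it.
        w = self.bank[self.idx]                                   # (rows, cols, block, block)
        w = w.permute(0, 2, 1, 3).reshape(self.rows * self.block, self.cols * self.block)
        return x @ w.t() + self.bias

if __name__ == "__main__":
    layer = BlockSharedLinear(512, 512)
    y = layer(torch.randn(8, 512))
    print(y.shape)  # torch.Size([8, 512])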

Degree
Master of Science
Type
Thesis
Keywords
Efficiency, Acceleration, Compression, Parameter Sharing