Fast and Expressive Sketch Structured Transform for Efficient Inference
Shrivastava, Anshumali
Thesis, December 2024
https://hdl.handle.net/1911/118230
Keywords: Efficiency, Acceleration, Compression, Parameter Sharing

Linear transformations with learned weights are fundamental components of deep learning models. Prior research has shown that dense weight matrices can often be compressed by decomposition, quantization, sparsification, or random parameter sharing without losing accuracy, suggesting that more efficient transformations are possible. Among compressed weight-matrix variants, structured matrices are limited in expressivity and offer weaker quality-efficiency tradeoffs, while unstructured matrices map poorly to modern hardware, leading to slower training and inference. To address these challenges, we propose the Sketch Structured Transform (SS1), an expressive and hardware-efficient operator that reduces tensor multiplications and accelerates inference. SS1 applies random parameter sharing in a block-structured manner, reducing computation while preserving the expressiveness of parameter sharing. We empirically show that SS1 achieves better quality-efficiency tradeoffs than competing variants. Our theoretical analysis further indicates that SS1 can be combined with quantization for additional compression, and our experimental results confirm this. Moreover, pre-trained models can be projected onto SS1 and finetuned for efficient deployment. Our experiments highlight several applications of SS1: (a) training GPT2 and DLRM models from scratch for faster inference; (b) finetuning projected BERT models for 1.31× faster inference while maintaining GLUE scores; and (c) a proof of concept with Llama-3-8b, showing 1.11× faster wall-clock inference using projected SS1 layers without finetuning.
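
To make the block-structured random parameter sharing described above concrete, the following is a minimal PyTorch sketch of the general idea only; it is not the authors' SS1 operator, and the class name, block size, compression factor, and random index map are assumptions introduced for illustration. Each tile of the weight matrix draws its entries from a small trainable pool through a fixed random index map, so the trainable parameter count shrinks by roughly the compression factor.

import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockSharedLinear(nn.Module):
    # Hypothetical illustration of block-structured random parameter sharing:
    # every (block_size x block_size) tile of the weight matrix draws its
    # entries from a small trainable pool via a fixed random index map.
    def __init__(self, in_features, out_features, block_size=64, compression=4):
        super().__init__()
        assert in_features % block_size == 0 and out_features % block_size == 0
        self.in_features, self.out_features = in_features, out_features
        self.block_size = block_size
        n_blocks = (in_features // block_size) * (out_features // block_size)
        pool_size = max(1, (block_size * block_size) // compression)
        # Shared trainable parameters: one small pool per tile.
        self.pool = nn.Parameter(torch.randn(n_blocks, pool_size) * 0.02)
        # Fixed (untrained) random map from tile entries to pool slots.
        idx = torch.randint(0, pool_size, (n_blocks, block_size * block_size))
        self.register_buffer("idx", idx)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def materialized_weight(self):
        bs = self.block_size
        rows, cols = self.out_features // bs, self.in_features // bs
        # Gather shared entries into full tiles, then stitch the tiles together.
        tiles = torch.gather(self.pool, 1, self.idx).view(rows, cols, bs, bs)
        return tiles.permute(0, 2, 1, 3).reshape(self.out_features, self.in_features)

    def forward(self, x):
        return F.linear(x, self.materialized_weight(), self.bias)


# Example: a 512 -> 512 layer with roughly 4x fewer trainable weight parameters.
layer = BlockSharedLinear(512, 512)
y = layer(torch.randn(8, 512))

Note that this naive sketch materializes the full weight before the matrix multiply, so it only illustrates the parameter-sharing side; per the abstract, the actual SS1 operator is designed so that the block-structured sharing also reduces tensor multiplications and accelerates inference on modern hardware.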