Fast and Expressive Sketch Structured Transform for Efficient Inference

Saedi, Kimia

dc.contributor.advisor	Shrivastava, Anshumali	en_US
dc.creator	Saedi, Kimia	en_US
dc.date.accessioned	2025-01-17T17:21:58Z	en_US
dc.date.created	2024-12	en_US
dc.date.issued	2024-12-06	en_US
dc.date.submitted	December 2024	en_US
dc.date.updated	2025-01-17T17:21:58Z	en_US
dc.description.abstract	Linear transformations with learned weights are fundamental components of deep learning models. Prior research has shown that dense weight matrices can often be compressed through decomposition, quantization, sparsification, or random parameter sharing without losing accuracy, which suggests that more efficient transformations are possible. However, structured weight matrices are limited in expressivity and in the quality-efficiency tradeoffs they achieve, while unstructured matrices are poorly suited to modern hardware, leading to slower training and inference. To address these challenges, we propose Sketch Structured Transform (SS1), an expressive and hardware-efficient operator that reduces tensor multiplications and accelerates inference. SS1 applies random parameter sharing in a block-structured manner, reducing computation while preserving the expressiveness of parameter sharing. We empirically show that SS1 achieves better quality-efficiency tradeoffs than competing variants. Our theoretical analysis also indicates that SS1 can be combined with quantization for further compression, and our experimental results confirm this. Additionally, pre-trained models can be projected using SS1 and finetuned for efficient deployment. Our experiments highlight several applications of SS1: (a) training GPT2 and DLRM models from scratch for faster inference; (b) finetuning projected BERT models for 1.31× faster inference while maintaining GLUE scores; and (c) a proof of concept with Llama-3-8b, showing 1.11× faster wall-clock inference using projected SS1 layers without finetuning.	en_US
dc.embargo.lift	2025-06-01	en_US
dc.embargo.terms	2025-06-01	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.uri	https://hdl.handle.net/1911/118230	en_US
dc.language.iso	en	en_US
dc.subject	Efficiency, Acceleration, Compression, Parameter Sharing	en_US
dc.title	Fast and Expressive Sketch Structured Transform for Efficient Inference	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Computer Science	en_US
thesis.degree.discipline	Computer Science	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Masters	en_US
thesis.degree.name	Master of Science	en_US

Files

License bundle

Name:: PROQUEST_LICENSE.txt
Size:: 5.84 KB
Format:: Plain Text

Name:: LICENSE.txt
Size:: 2.98 KB
Format:: Plain Text

Collections

Rice University Theses and Dissertations