Fine-Grained Paging Mechanism for Offloading-Reloading Tensor for LLM

Advisor: Jermaine, Christopher M.
Date: May 2025
URI: https://hdl.handle.net/1911/118527

The rapid growth in the size and complexity of large language models has imposed severe challenges on memory management, particularly when these models are deployed on GPUs with limited memory. This thesis introduces a fine-grained paging mechanism that dynamically offloads and reloads tensors at the granularity of individual operations, thereby mitigating out-of-memory (OOM) issues during inference, and in particular the prefill phase, of transformer-based models. In contrast to traditional static, layer-based offloading methods, the proposed approach uses compile-time, simulation-based memory allocation to optimize GPU memory usage, making execution possible under severe memory constraints.

This work builds on the Einsummable system, a framework that represents tensor computations using Einstein summation notation. Einsummable transforms high-level mathematical specifications into an optimized execution pipeline through a series of intermediate representations, notably the taskgraph and the memgraph. The taskgraph captures the data dependencies and operational flow of tensor computations, while the memgraph extends this representation with detailed memory location information and explicit offload-reload operations. The transformation from taskgraph to memgraph is achieved through a simulated execution process, the core of this thesis, that relies on two key components: an allocation horizon, which pre-allocates memory for future operations, and an execution horizon, which tracks the simulated execution progress of the computation.

A key contribution of this thesis is the design and implementation of specialized memory allocation routines: simMalloc, simMallocForceReld, and simMallocOffld. These routines not only allocate memory for tensor outputs but also manage dependencies by inserting offload and reload nodes into the memgraph whenever GPU memory is depleted. By leveraging full knowledge of the simulated execution order, our offload-reload heuristic selects tensors for offloading based on their computed reuse distance, thereby deferring memory transfers until they are most convenient. This future-aware strategy reduces the frequency and impact of memory transfers compared to reactive approaches, enabling finer control over GPU memory usage.

Extensive experimental evaluations were conducted on two NVIDIA GPU configurations, Tesla P100 and V100, to benchmark the proposed system against state-of-the-art techniques such as ZeRO-Inference. The evaluation focused on the prefill stage of inference in LLaMA models with 7B and 65B parameters, a phase known to be particularly memory-bound. The results demonstrate that the fine-grained paging mechanism supports a broader range of configurations, successfully executing inference tasks across varying batch sizes and sequence lengths. While the finer granularity of tensor-level management introduces some communication overhead due to more frequent offloading and reloading, the overall improvements in memory utilization and reduction in OOM errors outweigh these costs.

In summary, this thesis contributes to the field of deep learning by addressing the critical challenge of GPU memory constraints through a fine-grained paging mechanism. Future work will explore further optimizations to reduce communication overhead, overall computation latency, and GPU RAM utilization.
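To make the simulated taskgraph-to-memgraph pass and the reuse-distance heuristic concrete, the following is a minimal Python sketch written for this record; it is not the Einsummable implementation. The routine names echo simMalloc and simMallocForceReld from the thesis, but everything else here is an illustrative assumption: memory is a single byte counter rather than real offsets, an operation is a plain tuple (op_id, input_ids, output_id, output_bytes), the "free" nodes are an invention of the sketch, and SimState, reuse_distance, and simulate are hypothetical helpers.

from dataclasses import dataclass, field

@dataclass
class SimState:
    capacity: int                                  # GPU memory budget, in bytes
    used: int = 0                                  # bytes currently resident
    resident: dict = field(default_factory=dict)   # tensor id -> size on GPU
    offloaded: dict = field(default_factory=dict)  # tensor id -> size on CPU
    memgraph: list = field(default_factory=list)   # emitted memgraph nodes

def reuse_distance(step, tensor_id, future_uses):
    """Steps until the tensor's next use; tensors never used again sort last."""
    upcoming = [s for s in future_uses.get(tensor_id, []) if s > step]
    return min(upcoming) - step if upcoming else float("inf")

def sim_malloc(state, tensor_id, size, step, future_uses, pinned=frozenset()):
    """Allocate size bytes for tensor_id; if the budget is exceeded, emit
    offload nodes for the resident tensors whose next use is furthest away."""
    while state.used + size > state.capacity:
        victims = [t for t in state.resident if t not in pinned]
        if not victims:
            raise MemoryError("offloading cannot free enough GPU memory")
        victim = max(victims, key=lambda t: reuse_distance(step, t, future_uses))
        state.used -= state.resident[victim]
        state.offloaded[victim] = state.resident.pop(victim)
        state.memgraph.append(("offload", victim))
    state.resident[tensor_id] = size
    state.used += size

def sim_malloc_force_reload(state, tensor_id, step, future_uses, pinned=frozenset()):
    """Bring a previously offloaded input back onto the GPU before it is consumed."""
    size = state.offloaded.pop(tensor_id)
    sim_malloc(state, tensor_id, size, step, future_uses, pinned)
    state.memgraph.append(("reload", tensor_id))

def simulate(ops, capacity):
    """Walk the taskgraph in simulated execution order and build a memgraph."""
    # Full knowledge of the execution order: record every step at which each
    # tensor is consumed, so exact reuse distances can be computed.
    future_uses = {}
    for step, (_, inputs, _, _) in enumerate(ops):
        for t in inputs:
            future_uses.setdefault(t, []).append(step)

    state = SimState(capacity=capacity)
    for step, (op_id, inputs, output, out_bytes) in enumerate(ops):
        pinned = frozenset(inputs) | {output}
        for t in inputs:                       # reload any offloaded inputs
            if t in state.offloaded:
                sim_malloc_force_reload(state, t, step, future_uses, pinned)
        sim_malloc(state, output, out_bytes, step, future_uses, pinned)
        state.memgraph.append(("apply", op_id))
        for t in inputs:                       # drop tensors never used again
            if t in state.resident and reuse_distance(step, t, future_uses) == float("inf"):
                state.used -= state.resident.pop(t)
                state.memgraph.append(("free", t))
    return state.memgraph

In this sketch the simulate loop plays the role of the execution horizon, walking the taskgraph in order, while the precomputed future_uses table is what lets sim_malloc defer transfers by always offloading the resident tensor whose next use lies furthest in the future. Calling simulate on a topologically ordered list of operations and a memory budget yields an ordered list of offload, reload, apply, and free nodes, which is the flavor of information the real memgraph encodes alongside concrete memory locations.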
Keywords: Machine Learning Systems; Memory Management; Large Language Models
Type: Thesis (application/pdf, English)