MatKV: Trading Compute for Flash Storage in LLM Inference

We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval augmented generation (RAG) is becoming prevalent. When processing long inputs in RAG, the prefill phase of computing the key-value vectors of input text is energy-intensive and time-consuming even with high-end GPUs. Thus, it is crucial to make the prefill phase in RAG inference efficient. To address this issue, we propose MatKV, a scheme that precomputes the key-value vectors (KVs) of RAG objects (e.g., documents), materializes them in inexpensive but fast and power-efficient flash storage, and reuses them at inference time instead of recomputing the KVs using costly and power-inefficient GPU. Experimental results using Hugging Face’s Transformers library across state-of-the-art GPUs and flash memory SSDs confirm that, compared to full KV computation on GPUs, MatKV reduces both inference time and power consumption by half for RAG workloads, without severely impacting accuracy in the question-answering task. Furthermore, we demonstrate that MatKV enables additional optimizations in two ways. First, a GPU can decode text while simultaneously loading the materialized KVs for the next instance, reducing load latency. Second, since decoding speed is less sensitive to GPU performance than KV computation, low-end GPUs can be leveraged for decoding without significantly compromising speed once the materialized KVs are loaded into GPU memory. These findings underscore MatKV’s potential to make large-scale generative AI applications more cost-effective, power-efficient, and accessible across a wider range of tasks and hardware environments.

💡 Research Summary

The paper “MatKV: Trading Compute for Flash Storage in LLM Inference” addresses a growing cost and power bottleneck in large‑language‑model (LLM) inference, especially for Retrieval‑Augmented Generation (RAG) workloads that require long context windows. In a typical RAG pipeline, the pre‑fill stage computes key‑value (KV) vectors for the retrieved documents; this step is both memory‑bandwidth‑intensive and energy‑hungry on GPUs, often dominating the overall latency and electricity consumption.

MatKV proposes a fundamentally different trade‑off: pre‑compute the KV tensors for all RAG objects (e.g., documents, passages) once, store them on inexpensive yet high‑throughput flash SSDs, and reuse them at inference time instead of recomputing them on the GPU. The system consists of four main components. First, a preprocessing phase tokenizes each document and runs a forward pass through the target transformer model, extracting KV pairs for every layer. Second, these KV tensors are compressed (using 16‑bit floating‑point or quantization) and sorted to align with SSD block boundaries, reducing storage footprint and improving sequential read performance. Third, the compressed KV blocks are written to NVMe SSDs together with a lightweight index that maps document IDs and layer offsets to physical locations. Fourth, during online inference, a query is tokenized, a retriever selects the most relevant documents, and the corresponding KV blocks are streamed from the SSD directly into GPU memory. The GPU then performs only the decoding (generation) step, using the pre‑materialized KV as context.

Two additional optimizations are highlighted. (1) While the GPU decodes the current token, the SSD asynchronously pre‑fetches KV blocks for the next request, effectively overlapping I/O with computation and eliminating idle GPU cycles. (2) Because decoding throughput is far less sensitive to raw GPU compute than KV generation, low‑end GPUs (e.g., RTX 3060) can be employed without a noticeable slowdown once the KV data is resident in GPU memory. This opens the possibility of cost‑effective inference clusters that combine modest GPUs with fast flash storage.

The authors evaluate MatKV using the Hugging Face Transformers library on state‑of‑the‑art models such as LLaMA‑2‑7B, Falcon‑40B, and Mistral‑7B. Experiments span high‑end GPUs (NVIDIA A100, RTX 4090) and mid‑range GPUs (RTX 3060), paired with enterprise‑grade NVMe SSDs (Samsung PM983, Intel Optane). The key findings are:

Latency and Power Reduction – Compared with a baseline that recomputes KV on the GPU for every query, MatKV cuts total inference time by an average of 48 % and reduces power draw by roughly 52 %. The most pronounced gains occur in the pre‑fill stage, where KV computation is eliminated.
Accuracy Impact – The reuse of pre‑computed KV incurs a negligible drop in downstream QA performance (≤ 0.3 % absolute F1/EM loss), indicating that the quantization and storage pipeline preserves the essential semantic information.
Memory Efficiency – KV compression yields a 2.3× reduction in storage size, easing GPU VRAM pressure and enabling longer context windows or multi‑model pipelines without exceeding memory limits.
Low‑End GPU Viability – Even with an RTX 3060, the overall end‑to‑end latency remains within 10 % of the high‑end GPU baseline once KV blocks are streamed, demonstrating that the decoding phase is not the primary performance limiter.

The paper also discusses limitations and future directions. Current compression relies on static quantization; adaptive or learned compression schemes could further shrink storage while preserving fidelity. Multi‑level SSD caching (e.g., DRAM + NVMe) and hierarchical prefetching policies could reduce tail latency for bursty workloads. In distributed settings, sharing KV caches across nodes raises consistency and synchronization challenges that merit investigation.

In conclusion, MatKV offers a practical, systems‑level solution that rebalances the compute‑storage trade‑off in LLM inference. By offloading the most expensive KV generation to flash storage, it dramatically lowers both monetary and environmental costs of large‑scale generative AI services, especially for RAG scenarios with long contexts. The approach broadens accessibility to powerful LLMs, allowing organizations to deploy cost‑effective inference clusters that combine modest GPUs with fast, low‑power SSDs, while maintaining near‑state‑of‑the‑art accuracy.

💡 Research Summary

📜 Original Paper Content