MatKV: Trading Compute for Flash Storage in LLM Inference

Reading time: 5 minutes

📝 Original Info

  • Title: MatKV: Trading Compute for Flash Storage in LLM Inference
  • ArXiv ID: 2512.22195
  • Date: 2025-12-20
  • Authors: Kun-Woo Shin, Jay H. Park, Moonwook Oh, Yohan Jo, Jaeyoung Do, Sang-Won Lee

📝 Abstract

We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval augmented generation (RAG) is becoming prevalent. When processing long inputs in RAG, the prefill phase of computing the key-value vectors of input text is energy-intensive and time-consuming even with high-end GPUs. Thus, it is crucial to make the prefill phase in RAG inference efficient. To address this issue, we propose MatKV, a scheme that precomputes the key-value vectors (KVs) of RAG objects (e.g., documents), materializes them in inexpensive but fast and power-efficient flash storage, and reuses them at inference time instead of recomputing the KVs using costly and power-inefficient GPU. Experimental results using Hugging Face's Transformers library across state-of-the-art GPUs and flash memory SSDs confirm that, compared to full KV computation on GPUs, MatKV reduces both inference time and power consumption by half for RAG workloads, without severely impacting accuracy in the question-answering task. Furthermore, we demonstrate that MatKV enables additional optimizations in two ways. First, a GPU can decode text while simultaneously loading the materialized KVs for the next instance, reducing load latency. Second, since decoding speed is less sensitive to GPU performance than KV computation, low-end GPUs can be leveraged for decoding without significantly compromising speed once the materialized KVs are loaded into GPU memory. These findings underscore MatKV's potential to make large-scale generative AI applications more cost-effective, power-efficient, and accessible across a wider range of tasks and hardware environments.
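
To make the scheme concrete, here is a minimal sketch of the precompute/materialize/reuse flow written against the public cache API of Hugging Face's Transformers library (the library the paper's experiments use). It is not the authors' implementation: the model name and file path are placeholders, it assumes a recent Transformers release whose models return a DynamicCache, and it simplifies RAG to a single materialized document prefix per question.

```python
# Minimal sketch of precompute -> materialize -> reuse, written against the
# public Hugging Face Transformers cache API. The model name, file paths, and
# the single-document simplification are assumptions for illustration only;
# this is not the authors' MatKV implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

MODEL = "meta-llama/Llama-2-7b-hf"  # assumed model; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")
model.eval()

def materialize_kv(doc_text: str, path: str) -> None:
    """Prefill a document once on the GPU and store its KV cache on flash."""
    ids = tok(doc_text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        cache = model(ids, use_cache=True).past_key_values  # DynamicCache
    # Store plain per-layer (key, value) tensors on CPU so the file does not
    # depend on the Cache object's internals.
    legacy = tuple((k.cpu(), v.cpu()) for k, v in cache.to_legacy_cache())
    torch.save({"input_ids": ids.cpu(), "kv": legacy}, path)

def answer_with_materialized_kv(path: str, question: str, max_new_tokens: int = 64) -> str:
    """Load materialized KVs from flash and decode; only the question is prefilled."""
    blob = torch.load(path, map_location="cpu")
    cache = DynamicCache.from_legacy_cache(
        tuple((k.to(model.device), v.to(model.device)) for k, v in blob["kv"])
    )
    doc_ids = blob["input_ids"].to(model.device)
    q_ids = tok(question, return_tensors="pt",
                add_special_tokens=False).input_ids.to(model.device)
    full_ids = torch.cat([doc_ids, q_ids], dim=-1)
    # generate() prefills only the tokens not already covered by past_key_values.
    out = model.generate(full_ids,
                         attention_mask=torch.ones_like(full_ids),
                         past_key_values=cache,
                         max_new_tokens=max_new_tokens)
    return tok.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)
```

In a full RAG setting there would be one KV file per retrievable object, and the paper evaluates the end-to-end effect of reusing such materialized KVs on latency, power consumption, and question-answering accuracy.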

📄 Full Content

Kun-Woo Shin†, Jay H. Park‡, Moonwook Oh‡, Yohan Jo†, Jaeyoung Do†, Sang-Won Lee†
†Seoul National University, ‡Samsung Electronics
kunwooshin@snu.ac.kr, {tino.park, mw.oh}@samsung.com, {yohan.jo, jaeyoung.do, swlee69}@snu.ac.kr
Code: https://github.com/kunwooshin/MatKV

Index Terms: Generative AI, Retrieval augmented generation, Solid state drives, LLM Inference

I. INTRODUCTION

LLM Inference. As generative AI (genAI) applications become increasingly ubiquitous, LLM (Large Language Model) inference speed has emerged as a critical factor in determining overall application latency. While much research has focused on optimizing LLM training, optimizing LLM inference is even more crucial due to its significantly larger market size and computational demand [1], [2]. LLM inference is inherently compute-intensive and power-hungry [3], making efficient inference execution a key challenge in deploying large-scale genAI applications.

Retrieval Augmented Generation. Among various genAI workloads, retrieval-augmented generation (RAG) has become a dominant paradigm because of its advantages, such as mitigating hallucinations, improving recency, and enhancing security [4]. In RAG, a user query is augmented with retrieved objects (e.g., documents), enabling the LLM to generate more knowledge-intensive and truthful answers compared to relying solely on its parametric knowledge. However, RAG poses unique computational challenges [5].

[Fig. 1: GPU and SSD Cost/Performance Trend (2017 to 2024). Cost per performance by year: GPU in $ per GFLOPS (FP16), SSD in $ per 10 MB/s.]

In a typical RAG workflow, each retrieved chunk is usually quite large (e.g., 1,000 tokens per chunk) [6]. Consequently, when LLM inference processes these chunks as part of the prompt, the prefill phase (i.e., computing the key-value vectors of the input chunks) becomes highly compute-intensive and power-demanding [3]. Moreover, many objects are repeatedly retrieved and prefilled across multiple queries, exacerbating the computational overhead. Considering the ever-growing popularity of RAG and the correspondingly exploding compute demand for LLM inference, optimizing inference efficiency in RAG environments is more critical than ever.

Flash Memory SSDs. To meet the ever-growing compute demands of LLM training and inference, the compute power of modern GPUs has improved in terms of FLOPS and power efficiency. Likewise, flash memory SSDs have relentlessly improved in terms of cost, read bandwidth, and power efficiency, achieving astonishing performance metrics (Figure 1). For instance, Samsung's 9100 Pro [7], a commodity SSD, costs only $0.1/GB yet provides a read bandwidth of 14 GB/s, and its active power consumption is only 7 watts.

MatKV (Materialized KVs). In this paper, we propose MatKV, a novel scheme designed to improve the efficiency of LLM inference in RAG workloads in terms of speed, power, and cost. MatKV precomputes the key-value vectors of RAG objects once on a GPU when the objects are inserted into the vector database, materializes them on inexpensive but fast flash storage, and, whenever objects are retrieved at inference time, loads and reuses the materialized KVs instead of recomputing them on the GPU.
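
The first follow-on optimization described in the abstract, decoding one request while loading the next request's materialized KVs from the SSD, can be illustrated with a simple single-thread prefetch loop. This is an illustrative sketch rather than the paper's pipeline; `load_kv_blob`, `answer_fn`, and `run_pipeline` are hypothetical stand-ins for the loading and decoding steps sketched earlier.

```python
# Rough sketch of the overlap optimization: while the GPU decodes the current
# request, a background thread reads the next request's materialized KVs from
# the SSD. `load_kv_blob` and `answer_fn` are hypothetical stand-ins for the
# loading and decoding steps sketched earlier; a production pipeline would
# also overlap the host-to-GPU copy on a separate CUDA stream.
from concurrent.futures import ThreadPoolExecutor
import torch

def load_kv_blob(path: str):
    # SSD read into host memory; runs in the background thread.
    return torch.load(path, map_location="cpu")

def run_pipeline(requests, answer_fn):
    """requests: list of (kv_path, question); answer_fn(blob, question) -> str."""
    if not requests:
        return []
    answers = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_kv_blob, requests[0][0])
        for i, (_, question) in enumerate(requests):
            blob = pending.result()                    # KVs for request i are ready
            if i + 1 < len(requests):                  # start prefetching request i+1
                pending = io_pool.submit(load_kv_blob, requests[i + 1][0])
            answers.append(answer_fn(blob, question))  # GPU decodes while the SSD reads
    return answers
```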

Reference

This content is AI-processed based on open access ArXiv data.
