📝 Original Info
- Title: MatKV: Trading Compute for Flash Storage in LLM Inference
- ArXiv ID: 2512.22195
- Date: 2025-12-20
- Authors: Kun-Woo Shin, Jay H. Park, Moonwook Oh, Yohan Jo, Jaeyoung Do, Sang-Won Lee
📝 Abstract
We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval augmented generation (RAG) is becoming prevalent. When processing long inputs in RAG, the prefill phase of computing the key-value vectors of input text is energy-intensive and time-consuming even with high-end GPUs. Thus, it is crucial to make the prefill phase in RAG inference efficient. To address this issue, we propose MatKV, a scheme that precomputes the key-value vectors (KVs) of RAG objects (e.g., documents), materializes them in inexpensive but fast and power-efficient flash storage, and reuses them at inference time instead of recomputing the KVs using costly and power-inefficient GPU. Experimental results using Hugging Face's Transformers library across state-of-the-art GPUs and flash memory SSDs confirm that, compared to full KV computation on GPUs, MatKV reduces both inference time and power consumption by half for RAG workloads, without severely impacting accuracy in the question-answering task. Furthermore, we demonstrate that MatKV enables additional optimizations in two ways. First, a GPU can decode text while simultaneously loading the materialized KVs for the next instance, reducing load latency. Second, since decoding speed is less sensitive to GPU performance than KV computation, low-end GPUs can be leveraged for decoding without significantly compromising speed once the materialized KVs are loaded into GPU memory. These findings underscore MatKV's potential to make large-scale generative AI applications more cost-effective, power-efficient, and accessible across a wider range of tasks and hardware environments.
💡 Deep Analysis
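The abstract's first follow-on optimization (overlapping decoding with loading the next instance's materialized KVs) can be illustrated with a small scheduling sketch. This is not the paper's implementation: `load_materialized_kv` and `generate_answer` are hypothetical helpers standing in for an SSD read and for decoding with already-loaded KVs.

```python
# Minimal sketch (not the paper's code): while the GPU decodes the current
# request, a background thread streams the next request's materialized KVs
# in from the SSD. Helper names below are illustrative.
from concurrent.futures import ThreadPoolExecutor

import torch


def load_materialized_kv(path: str):
    """Read one precomputed KV cache from flash into GPU memory."""
    # weights_only=False: the file holds a pickled KV-cache object, not a state dict.
    return torch.load(path, map_location="cuda", weights_only=False)


def generate_answer(kv, query: str) -> str:
    """Stand-in for decoding with the loaded KVs (e.g., model.generate(..., past_key_values=kv))."""
    return f"<answer to {query!r}>"


def serve(requests):
    """Answer (kv_path, query) pairs in order, prefetching each next KV cache."""
    if not requests:
        return
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_materialized_kv, requests[0][0])
        for i, (_, query) in enumerate(requests):
            kv = pending.result()                      # SSD read for this request is done
            if i + 1 < len(requests):                  # start prefetching the next instance
                pending = io_pool.submit(load_materialized_kv, requests[i + 1][0])
            yield generate_answer(kv, query)           # GPU decodes while the prefetch runs
```

The abstract's second optimization needs no scheduling change: once the materialized KVs are resident in GPU memory, the decode loop itself can run on a lower-end GPU.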
📄 Full Content
MatKV: Trading Compute for Flash Storage in LLM Inference
Kun-Woo Shin†, Jay H. Park‡, Moonwook Oh‡, Yohan Jo†, Jaeyoung Do†, Sang-Won Lee†
†Seoul National University, ‡Samsung Electronics
kunwooshin@snu.ac.kr, {tino.park, mw.oh}@samsung.com, {yohan.jo, jaeyoung.do, swlee69}@snu.ac.kr
Abstract—We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval augmented generation (RAG) is becoming prevalent. When processing long inputs in RAG, the prefill phase of computing the key-value vectors of input text is energy-intensive and time-consuming even with high-end GPUs. Thus, it is crucial to make the prefill phase in RAG inference efficient. To address this issue, we propose MatKV, a scheme that precomputes the key-value vectors (KVs) of RAG objects (e.g., documents), materializes them in inexpensive but fast and power-efficient flash storage, and reuses them at inference time instead of recomputing the KVs using costly and power-inefficient GPU.¹ Experimental results using Hugging Face's Transformers library across state-of-the-art GPUs and flash memory SSDs confirm that, compared to full KV computation on GPUs, MatKV reduces both inference time and power consumption by half for RAG workloads, without severely impacting accuracy in the question-answering task. Furthermore, we demonstrate that MatKV enables additional optimizations in two ways. First, a GPU can decode text while simultaneously loading the materialized KVs for the next instance, reducing load latency. Second, since decoding speed is less sensitive to GPU performance than KV computation, low-end GPUs can be leveraged for decoding without significantly compromising speed once the materialized KVs are loaded into GPU memory. These findings underscore MatKV's potential to make large-scale generative AI applications more cost-effective, power-efficient, and accessible across a wider range of tasks and hardware environments.
Index Terms—Generative AI, Retrieval augmented generation, Solid state drives, LLM Inference
I. INTRODUCTION
LLM Inference As generative AI (genAI) applications become increasingly ubiquitous, LLM (Large Language Model) inference speed has emerged as a critical factor in determining overall application latency. While much research has focused on optimizing LLM training, optimizing LLM inference is even more crucial due to its significantly larger market size and computational demand [1], [2]. LLM inference is inherently compute-intensive and power-hungry [3], making efficient inference execution a key challenge in deploying large-scale genAI applications.
Retrieval Augmented Generation Among various genAI workloads, retrieval-augmented generation (RAG) has become a dominant paradigm because of its advantages, such as mitigating hallucinations, improving recency, and enhancing security [4]. In RAG, a user query is augmented with retrieved objects (e.g., documents), enabling the LLM to generate more knowledge-intensive and truthful answers compared to relying solely on its parametric knowledge.
¹ https://github.com/kunwooshin/MatKV
[Fig. 1: GPU and SSD Cost/Performance Trend (2017 to 2024). Plotted series: SSD cost ($) per 10 MB/s and GPU cost ($) per GFLOPS (FP16), by year.]
However, RAG poses unique computational challenges [5]. In a typical RAG workflow, each retrieved chunk is usually quite large (e.g., 1,000 tokens per chunk) [6]. Consequently, when LLM inference processes these chunks as part of the prompt, the prefill phase (i.e., computing the key-value vectors of the input chunks) becomes highly compute-intensive and power-demanding [3]. Moreover, many objects are repeatedly retrieved and prefilled across multiple queries, exacerbating the computational overhead. Given the ever-growing popularity of RAG and the accompanying explosion in compute demand for LLM inference, optimizing inference efficiency in RAG environments is more critical than ever.
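To put "compute-intensive" in perspective, the prefill work per retrieved chunk can be estimated as below; the 7B parameter count and the roughly 2 FLOPs per parameter per token rule of thumb are assumptions for illustration, while the 1,000-token chunk size comes from the text above.

```python
# Back-of-envelope prefill compute for one retrieved chunk.
# Only the 1,000-token chunk size comes from the text; the rest is assumed.
params = 7e9                  # assumed 7B-parameter model
tokens_per_chunk = 1_000      # typical RAG chunk size cited above
flops_per_token = 2 * params  # ~2 FLOPs per parameter per token in a forward pass

prefill_flops = flops_per_token * tokens_per_chunk
print(f"~{prefill_flops / 1e12:.0f} TFLOPs to prefill one chunk")  # ~14 TFLOPs
# Every query that retrieves this chunk repeats this work unless the
# resulting key-value vectors are cached and reused.
```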
Flash Memory SSDs To meet the ever-growing compute demands in LLM training and inference, the compute power of modern GPUs has improved in terms of FLOPS and power efficiency. Likewise, flash memory SSDs have relentlessly improved in terms of cost, read bandwidth, and power efficiency, achieving astonishing performance metrics (Figure 1). For instance, Samsung's 9100 PRO [7], a commodity SSD, costs only $0.1/GB but provides a read bandwidth of 14 GB/s, and its active power consumption is only 7 watts.
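These figures suggest that loading precomputed KVs from such an SSD is cheap relative to recomputing them. A rough estimate follows; the model dimensions are assumptions (a Llama-2-7B-like configuration in FP16), and only the 1,000-token chunk size and the 14 GB/s read bandwidth come from the text.

```python
# Rough KV-cache footprint and SSD load time for one 1,000-token chunk.
# Model dimensions are assumed (Llama-2-7B-like, FP16); bandwidth from the text.
layers, kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2                       # FP16
tokens = 1_000
ssd_bandwidth = 14e9                      # bytes/second, 9100 PRO class

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens  # keys + values
print(f"KV cache per chunk: {kv_bytes / 2**20:.0f} MiB")                 # ~500 MiB
print(f"Load time at 14 GB/s: {kv_bytes / ssd_bandwidth * 1e3:.0f} ms")  # ~37 ms
```

Under these assumptions, reading a chunk's materialized KVs takes tens of milliseconds, which is the cost MatKV trades against re-running the prefill on the GPU.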
MatKV (Materialized KVs) In this paper, we propose MatKV, a novel scheme designed to improve the efficiency of LLM inference in RAG workloads in terms of speed, power, and cost. When RAG objects are inserted into the vector database, MatKV precomputes their key-value vectors once on a GPU and materializes them on inexpensive but fast flash storage; whenever those objects are later retrieved at inference time, it reuses the materialized KVs instead of recomputing them on the GPU.
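A minimal sketch of this precompute-materialize-reuse flow with Hugging Face Transformers (the library used in the paper's experiments) is shown below. The model name, file path, and texts are placeholders, and the exact type returned for `past_key_values` (a legacy tuple or a `DynamicCache`) depends on the Transformers version; this illustrates the idea rather than reproducing the paper's implementation.

```python
# Sketch of the precompute / materialize / reuse flow with Hugging Face
# Transformers. Model name, path, and texts are placeholders, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda").eval()

document = "...retrieved RAG chunk of roughly 1,000 tokens..."
query = "Question about the document?"

# --- Offline (when the object is inserted into the vector database) ---
chunk_ids = tok(document, return_tensors="pt").input_ids.to("cuda")
with torch.no_grad():
    kv = model(chunk_ids, use_cache=True).past_key_values  # prefill once on the GPU
torch.save(kv, "kv_store/chunk_0042.pt")  # materialize the KVs on the SSD
# (In practice the chunk's token ids would be stored alongside the KVs.)

# --- Online (when the object is retrieved for a query) ---
kv = torch.load("kv_store/chunk_0042.pt", map_location="cuda", weights_only=False)
query_ids = tok(query, return_tensors="pt", add_special_tokens=False).input_ids.to("cuda")
prompt_ids = torch.cat([chunk_ids, query_ids], dim=-1)      # prompt = chunk + query
out = model.generate(prompt_ids, past_key_values=kv, max_new_tokens=64)
print(tok.decode(out[0, prompt_ids.shape[-1]:], skip_special_tokens=True))
```

With the cached KVs supplied to `generate`, only the query tokens still need a prefill pass; the chunk's contribution comes from the loaded cache.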