LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Reading time: 5 minutes

📝 Abstract

Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1× speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
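The abstract mentions "adaptive eviction strategies to manage cache staleness" without specifying a policy. As a rough illustration only, a staleness-aware cache might combine a maximum entry age with least-recently-used ordering; every name, parameter, and policy choice below is a hypothetical sketch, not the paper's actual design:

```python
import time
from collections import OrderedDict

class StalenessAwareCache:
    """Illustrative eviction policy: drop entries that exceed a maximum
    age, and fall back to LRU eviction when the cache is full.
    (Hypothetical sketch; the paper does not describe this exact scheme.)"""

    def __init__(self, capacity=128, max_age_s=300.0):
        self.capacity = capacity
        self.max_age_s = max_age_s
        self.store = OrderedDict()  # key -> (value, inserted_at, hit_count)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self.store.get(key)
        if item is None:
            return None
        value, inserted_at, hits = item
        if now - inserted_at > self.max_age_s:
            del self.store[key]  # entry is stale: evict on access
            return None
        self.store[key] = (value, inserted_at, hits + 1)
        self.store.move_to_end(key)  # refresh LRU position
        return value

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if key in self.store:
            del self.store[key]
        elif len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict least-recently-used
        self.store[key] = (value, now, 0)
```

Passing `now` explicitly makes the staleness logic deterministic and testable; a real deployment would rely on the monotonic clock default.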

📄 Content

LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Harsh Vardhan Bansal, Analytics and AI/ML Specialist, Amazon Web Services, USA

Index Terms—Transformer, Inference Acceleration, Caching, Large Language Models, Layer-wise Optimization

I. INTRODUCTION

Transformer-based large language models (LLMs) such as GPT [1], BERT [2], and PaLM [3] have become foundational in modern AI systems. These models enable impressive results across a wide range of tasks, from machine translation and question answering to code generation and medical report summarization.
They have also begun to show promise in high-impact domains such as early detection and analysis of neurodegenerative diseases, including Alzheimer’s, where subtle linguistic patterns can provide critical diagnostic signals [4]. However, their computational demands during inference present a significant obstacle to real-time and large-scale deployment.

The core bottleneck stems from the sequential nature of transformer inference. Even when processing similar or repetitive input sequences—common in chat applications, document summarization pipelines, and retrieval-augmented generation—standard inference pipelines perform full forward passes through all transformer layers. This leads to unnecessary computation and latency, particularly in settings where many inputs share partial context or exhibit semantic overlap.

To address this, researchers have explored optimization techniques such as quantization [5], pruning [6], and early exit strategies [7], each introducing trade-offs in model fidelity, complexity, or hardware compatibility. Another promising line of work leverages caching mechanisms. Key-value (KV) caching, used widely in autoregressive generation [8], [9], avoids recomputation in self-attention layers, but is limited to decoder-only settings and primarily targets token-level reuse.

This paper introduces LLMCache, a layer-wise caching framework designed to accelerate inference by reusing intermediate activations across semantically similar inputs. Unlike traditional KV caching, our method is model-agnostic and supports encoder and encoder-decoder architectures. LLMCache operates at each transformer layer by fingerprinting the input and matching it against a cached bank of activations. If a match is found within a defined similarity threshold, the cached representation is reused, bypassing the layer computation.
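The per-layer fingerprint-match-reuse loop described above can be sketched in a few lines. The specific fingerprint (mean-pooling the layer input and truncating), the cosine-similarity threshold, and the FIFO insertion below are assumptions made for illustration; the paper's actual fingerprinting mechanism and cache management are not detailed in this excerpt:

```python
import numpy as np

def fingerprint(hidden_states: np.ndarray, k: int = 64) -> np.ndarray:
    """Reduce a (seq_len, d_model) layer input to a small unit vector.
    Mean-pool over tokens, truncate to k dims, L2-normalize."""
    v = hidden_states.mean(axis=0)[:k]
    return v / (np.linalg.norm(v) + 1e-8)

class LayerCache:
    """Illustrative per-layer activation cache keyed by fingerprints."""

    def __init__(self, threshold=0.95, capacity=128):
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.capacity = capacity
        self.entries = []  # list of (fingerprint, cached_activation)

    def lookup(self, fp):
        best, best_sim = None, -1.0
        for f, act in self.entries:
            sim = float(np.dot(fp, f))  # cosine (both are unit vectors)
            if sim > best_sim:
                best, best_sim = act, sim
        return best if best_sim >= self.threshold else None

    def insert(self, fp, act):
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # simple FIFO stand-in for eviction
        self.entries.append((fp, act))

def cached_layer_forward(layer_fn, x, cache: LayerCache):
    """Reuse a cached activation if the input's fingerprint matches;
    otherwise run the layer and cache its output."""
    fp = fingerprint(x)
    hit = cache.lookup(fp)
    if hit is not None:
        return hit  # bypass the layer computation entirely
    out = layer_fn(x)
    cache.insert(fp, out)
    return out
```

A second call with a sufficiently similar input then skips `layer_fn` altogether, which is the source of the reported latency savings; the threshold trades speed against the accuracy cost of reusing a near-match.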
Our motivation stems from the observation that intermediate representations in transformers are often stable across semantically related inputs. By exploiting this stability, LLMCache reduces redundant computation and enables significant latency reductions, particularly in real-world use cases where input drift is limited or controlled.

We implement and evaluate LLMCache on multiple transformer backbones and benchmark datasets, demonstrating up to 3.1× improvement in inference speed with negligible accuracy loss. Our analysis also explores cache hit rates, memory trade-offs, and sensitivity to semantic similarity thresholds.

The remainder of this paper is organized as follows. Section II reviews related work on transformer optimization and caching strategies. Section III presents the system architecture of LLMCache, detailing its core components and interactions. Section IV describes the proposed methodology, including fingerprint generation, cache matching, and refresh policies. Section V outlines the experimental setup and reports empirical results across multiple models and tasks. Section VI discusses practical trade-offs, limitations, and future directions. Finally, Section VII concludes the paper and summarizes our key contributions.

II. RELATED WORK

A. Transformer Inference Optimization

Transformer models [8] have been opt

This content is AI-processed based on ArXiv data.
