TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix
Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While the naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of the absorb implementations precludes the performance benefits of data-reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines the naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of attention calculations, while reducing the bandwidth requirements for non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3× and 3.24× on NPUs and GPUs, respectively, with only a 3% overhead in HBM size.
💡 Research Summary
TyphoonMLA addresses a critical inefficiency in Multi‑Head Latent Attention (MLA) kernels used by state‑of‑the‑art large language models such as DeepSeek‑v3 and Kimi K2. MLA admits two mathematically equivalent implementations: a “naive” version that keeps the KV‑cache uncompressed, performing fewer arithmetic operations but consuming high HBM bandwidth, and an “absorb” version that retains the KV‑cache in a compressed latent space, reducing memory traffic at the cost of additional matrix multiplications that make it compute‑bound. Existing decoding kernels (e.g., FlashMLA) rely exclusively on the absorb formulation because decoding is typically memory‑bound, but this choice prevents them from exploiting the data‑reuse opportunities that arise when multiple queries share a large prefix of the KV‑cache (system prompts, tree‑of‑thought reasoning, speculative decoding, etc.).
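The mathematical equivalence of the two formulations can be checked with a toy NumPy sketch (single head, single query, RoPE and normalization omitted; the dimensions and weight names `W_uk`/`W_uv` are illustrative, not taken from the paper). The naive path decompresses the latent cache into K and V before attending; the absorb path folds the key up‑projection into the query and the value up‑projection into the output, attending directly over the latent cache:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 32, 64, 128                # head dim, latent dim, cached tokens (toy sizes)
q    = rng.standard_normal((1, d))   # one decoding query
C    = rng.standard_normal((T, r))   # compressed latent KV-cache
W_uk = rng.standard_normal((r, d))   # key up-projection (illustrative name)
W_uv = rng.standard_normal((r, d))   # value up-projection (illustrative name)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Naive: decompress the cache into full K and V, then standard attention.
K, V = C @ W_uk, C @ W_uv
out_naive = softmax(q @ K.T / np.sqrt(d)) @ V

# Absorb: fold W_uk into the query and W_uv into the output;
# attention runs directly over the latent cache C.
q_abs = q @ W_uk.T                        # query mapped into latent space
p = softmax(q_abs @ C.T / np.sqrt(d))     # same scores: q_abs @ C.T == q @ K.T
out_absorb = (p @ C) @ W_uv               # same output: (p @ C) @ W_uv == p @ V

assert np.allclose(out_naive, out_absorb)
```

The equivalence follows from associativity of matrix multiplication; the two paths differ only in where the up‑projections are applied, which is exactly what shifts the compute/bandwidth balance between them.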
The authors observe that the shared‑prefix region of the workload is compute‑intensive, and there the naive formulation requires fewer floating‑point operations than absorb because it avoids the extra up‑projection of queries. Conversely, in the non‑shared region the absorb formulation remains preferable because it keeps memory traffic low. TyphoonMLA therefore partitions the attention computation into two components: (1) a naive path applied to the shared KV‑cache (uncompressed K and V tensors) and (2) an absorb path applied to the non‑shared KV‑cache (kept in latent form). The algorithm first processes queries through the common down‑projection, RMS‑norm, and RoPE layers, then splits the query into two sub‑vectors that feed the respective paths. The naive path performs standard softmax attention over the uncompressed cache, while the absorb path first up‑projects the query into the latent space, multiplies it with the compressed cache, applies softmax, and finally projects the result back. The two partial outputs are merged via log‑sum‑exp (LSE) rescaling to preserve numerical correctness. An automatic fallback switches to a pure absorb kernel when the batch size is too small to benefit from data reuse, ensuring no performance regression.
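The LSE merge described above is the standard trick for combining independently computed attention chunks: each chunk is attended locally, and the partial outputs are reweighted by each chunk's share of the global softmax mass. A minimal NumPy sketch (single head, single query; the split into a "shared" and "non‑shared" chunk here is illustrative, not the paper's kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T1, T2 = 64, 96, 32   # head dim; shared-prefix and non-shared lengths (toy)
q  = rng.standard_normal((1, d))
K1, V1 = rng.standard_normal((T1, d)), rng.standard_normal((T1, d))
K2, V2 = rng.standard_normal((T2, d)), rng.standard_normal((T2, d))

def partial_attn(q, K, V):
    """Attention over one chunk: locally normalized output plus its log-sum-exp."""
    s = (q @ K.T) / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    e = np.exp(s - m)
    lse = m + np.log(e.sum(axis=-1, keepdims=True))   # log-sum-exp of this chunk's scores
    out = (e / e.sum(axis=-1, keepdims=True)) @ V
    return out, lse

o1, lse1 = partial_attn(q, K1, V1)   # e.g. naive path over the shared prefix
o2, lse2 = partial_attn(q, K2, V2)   # e.g. absorb path over non-shared tokens

# Merge: exp(lse_i - lse) is chunk i's fraction of the global softmax mass.
lse = np.logaddexp(lse1, lse2)
merged = np.exp(lse1 - lse) * o1 + np.exp(lse2 - lse) * o2

# Reference: full attention over the concatenated cache gives the same result.
ref, _ = partial_attn(q, np.vstack([K1, K2]), np.vstack([V1, V2]))
assert np.allclose(merged, ref)
```

Because the rescaling is exact (not approximate), the merged output matches a single softmax over the full cache to floating‑point precision, which is why the hybrid kernel loses no accuracy.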
A detailed computational analysis (Table 1) quantifies MAC counts and HBM read/write volumes for each implementation, showing that TyphoonMLA reduces memory traffic relative to naive and arithmetic work relative to absorb. Experiments on both GPUs and NPUs with DeepSeek‑v3 and Kimi K2 demonstrate up to 3.24× higher attention‑layer throughput (tokens per second) and up to 1.48× faster end‑to‑end generation, while incurring only a ~3% increase in HBM footprint. Accuracy is unchanged because the method is mathematically equivalent to existing MLA kernels and requires no retraining. Moreover, TyphoonMLA is compatible with other attention optimizations (PagedAttention, RadixAttention) and with standard parallelism strategies (tensor and sequence parallelism), allowing seamless integration into inference frameworks such as vLLM and SGLang.
In summary, TyphoonMLA introduces a hybrid naive‑absorb kernel that intelligently exploits shared prefixes to balance compute and memory demands, delivering substantial speedups for MLA‑based models without sacrificing memory efficiency or model quality. This work opens a new direction for adaptive kernel design in next‑generation LLM inference.