Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving
Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology, where the key difficulty is that Attention-side work is nonstationary (token context grows, and requests are continuously replenished with random lengths) while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretically optimal A/F ratio matches the simulation-optimal ratio to within 10% and consistently reduces idle time.
💡 Research Summary
The paper addresses a fundamental design problem in the emerging Attention‑FFN Disaggregation (AFD) architecture for large language model (LLM) decoding: how many Attention instances should be paired with a single FFN instance (the ratio r) to maximize throughput while minimizing idle time. In a monolithic decoder, Attention and FFN share the same hardware, which forces small batch sizes because each decoding step must finish within a strict time‑per‑output (TPO) budget. Consequently, the stateless FFN layers are under‑utilized. AFD separates the state‑heavy, memory‑bound Attention (which reads the growing KV‑cache) from the compute‑intensive, stateless FFN, allowing each side to be scaled independently on heterogeneous resources.
The authors formalize an $r$A-$1$F topology, where each Attention worker processes a micro‑batch of B requests, and a single shared FFN processes the aggregated r · B activations each step. The per‑step pipeline consists of four phases: (1) parallel Attention computation, (2) A→F communication, (3) FFN computation, and (4) F→A communication. Because the KV‑cache length grows with every generated token, the Attention latency increases linearly over time, while the FFN and communication latencies remain constant for a given B and r. This mismatch creates “pipeline bubbles” that waste hardware cycles.
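The four-phase step structure can be sketched with a toy latency model. The coefficients below (`a0`, `a1`, `f0`, `f1`, `c`) are hypothetical placeholders, not the paper's calibrated constants; the sketch only illustrates why FFN idle time grows as the KV-cache lengthens while FFN latency stays fixed.

```python
def step_components(k, r, B, a0=0.5, a1=1e-3, f0=0.3, f1=2e-4, c=0.05):
    """Per-step latency components (ms) at decode step k for r Attention
    workers feeding one shared FFN. All coefficients are illustrative.

    - Attention time grows linearly with context length k (KV-cache reads).
    - FFN time depends only on the aggregated batch r * B.
    - A->F and F->A communication are each taken as a constant c.
    """
    t_attn = a0 + a1 * k
    t_ffn = f0 + f1 * r * B
    return t_attn, t_ffn, c

def ffn_idle_fraction(k, r, B):
    """Fraction of each step the FFN spends waiting (a 'pipeline bubble')."""
    t_attn, t_ffn, c = step_components(k, r, B)
    step = t_attn + t_ffn + 2 * c  # attention -> comm -> FFN -> comm
    return 1.0 - t_ffn / step
```

Because `t_attn` grows with `k` while `t_ffn` is constant, the idle fraction rises over a request's lifetime; choosing `r` trades extra aggregated FFN work against keeping the FFN busy.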
To capture the stochastic nature of real serving workloads, the paper introduces a probabilistic model. Each request has a pre‑fill length P and a decode length D. The pre‑fill length follows an arbitrary bounded distribution with mean μ_P. Decode lengths are modeled as a geometric distribution Geo(p), reflecting the memoryless probability of emitting an end‑of‑sequence token at each step. Continuous batching is assumed: when a request finishes, its slot is instantly refilled with a new request, keeping the micro‑batch size constant.
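The continuous-batching assumption can be illustrated with a small Monte Carlo sketch of one micro-batch slot. The uniform prefill distribution and all numeric parameters here are assumptions chosen for illustration; the paper only requires the prefill distribution to be bounded with mean μ_P.

```python
import random

def simulate_slot_context(steps, mu_p=512, p=0.01, seed=0):
    """Trace the KV context length of one micro-batch slot under continuous
    batching. Each step the request emits a token; with probability p it is
    EOS (so decode length ~ Geo(p)) and the slot is instantly refilled with
    a fresh request. Prefill lengths are drawn uniformly around mu_p (an
    illustrative bounded distribution with mean mu_p)."""
    rng = random.Random(seed)

    def new_prefill():
        return rng.randint(mu_p // 2, 3 * mu_p // 2)

    ctx = new_prefill()
    trace = []
    for _ in range(steps):
        trace.append(ctx)
        if rng.random() < p:   # EOS: slot refilled, context resets to a new prefill
            ctx = new_prefill()
        else:                  # one more generated token extends the KV-cache
            ctx += 1
    return trace
```

Averaging the tail of the trace shows the per-slot context length fluctuating around a steady-state mean near μ_P + (1 − p)/p rather than growing without bound, which is what makes a stationary analysis of the Attention-side load tractable.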
Using these assumptions, the authors derive closed‑form expressions for the expected token load at step k.
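Under the stated assumptions the expected load admits a short derivation; the following is a sketch consistent with the geometric-decode model above, not necessarily the paper's exact expression. Let $C_k$ be a slot's decode progress at step $k$: with probability $p$ the request finishes and the slot refills (progress resets to 0), otherwise progress advances by one token, so

$$
\mathbb{E}[C_k] = (1-p)\bigl(\mathbb{E}[C_{k-1}] + 1\bigr), \qquad \mathbb{E}[C_0] = 0
\;\;\Longrightarrow\;\;
\mathbb{E}[C_k] = \frac{1-p}{p}\Bigl(1 - (1-p)^k\Bigr),
$$

and the expected token load per slot is

$$
\mathbb{E}[L_k] = \mu_P + \frac{1-p}{p}\Bigl(1 - (1-p)^k\Bigr)
\;\xrightarrow{\,k \to \infty\,}\;
\mu_P + \frac{1-p}{p}.
$$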