GoodSpeed: Optimizing Fair Goodput with Adaptive Speculative Decoding in Distributed Edge Inference


Large language models (LLMs) have revolutionized natural language processing, yet their high computational demands pose significant challenges for real-time inference, especially in multi-user and resource-constrained environments. Speculative decoding has emerged as a promising technique to accelerate LLM inference by using lightweight draft models to generate candidate tokens, which are subsequently verified by a larger, more accurate model. However, ensuring both high goodput (the effective rate of accepted tokens) and fairness across multiple draft servers cooperating with a central verification server remains an open challenge. This paper introduces GOODSPEED, a novel distributed inference framework that optimizes goodput through adaptive speculative decoding. GOODSPEED employs a central verification server that coordinates a set of heterogeneous draft servers, each running a small language model to generate speculative tokens. To manage resource allocation effectively, GOODSPEED incorporates a gradient scheduling algorithm that dynamically assigns token verification tasks, maximizing a logarithmic utility function to ensure proportional fairness across servers. By processing speculative outputs from all draft servers in parallel, the framework enables efficient collaboration between the verification server and distributed draft generators, improving both latency and throughput. Through rigorous fluid sample path analysis, we show that GOODSPEED converges to the optimal goodput allocation in steady-state conditions and maintains near-optimal performance with provably bounded error under dynamic workloads. These results demonstrate that GOODSPEED provides a scalable, fair, and efficient solution for multi-server speculative decoding in distributed LLM inference systems.


💡 Research Summary

The paper addresses the latency and resource challenges of deploying large language models (LLMs) for real‑time inference, especially in multi‑user, edge‑centric scenarios. It introduces GoodSpeed, a distributed speculative decoding framework that combines lightweight small language models (SLMs) running on heterogeneous edge draft servers with a central verification server that hosts a full‑scale LLM. In each discrete time slot, every draft server locally generates a short sequence of speculative tokens using its SLM. These candidate tokens are sent in parallel to the verification server, which batches them and uses GPU acceleration to verify the tokens against the large target model. Accepted tokens are returned to the draft servers, which then update their prompts and continue generating the next speculative batch.
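The per-slot round trip described above can be sketched as a minimal simulation. The function names, the fake token strings, and the single per-token acceptance probability are illustrative stand-ins for the SLM drafter and LLM verifier, not the paper's implementation:

```python
import random

def draft_tokens(prompt, k):
    # Stand-in for the edge SLM: propose k candidate tokens
    # (hypothetical token names keyed to position in the prompt).
    return [f"tok{len(prompt) + i}" for i in range(k)]

def verify(tokens, accept_rate, rng):
    # Stand-in for LLM verification: accept a prefix of the
    # speculative batch, stopping at the first rejected token.
    accepted = []
    for t in tokens:
        if rng.random() < accept_rate:
            accepted.append(t)
        else:
            break
    return accepted

def decode_round(prompt, k, accept_rate, rng):
    # One time slot: draft k tokens, verify in batch, extend the prompt
    # with the accepted prefix; the draft server resumes from here.
    proposed = draft_tokens(prompt, k)
    accepted = verify(proposed, accept_rate, rng)
    return prompt + accepted

rng = random.Random(0)
prompt = ["<s>"]
for _ in range(5):
    prompt = decode_round(prompt, k=4, accept_rate=0.8, rng=rng)
```

In the full system the verifier batches such rounds from every draft server in parallel; each server's long-run ratio of accepted to proposed tokens is the acceptance rate α_i used by the scheduler.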

A central contribution is the gradient‑based scheduling algorithm. For each draft server i, the system maintains an estimated token acceptance rate α_i. The verification server solves a utility maximization problem: maximize the sum of logarithmic utilities U_i(x_i)=log(x_i) subject to the total verification capacity constraint Σ_i x_i ≤ C, where x_i denotes the allocated goodput (expected accepted tokens per unit time) for server i. By applying a Lagrangian formulation and performing gradient ascent on the dual variable, the algorithm dynamically adjusts x_i in response to changing α_i and workload, ensuring proportional fairness (log‑utility guarantees) while driving the system toward the optimal goodput distribution.
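A minimal sketch of this dual gradient ascent follows (the step size, iteration count, and variable names are our illustrative choices, not the paper's). Writing server i's goodput as x_i = α_i·v_i, where v_i is its share of verification capacity, the primal step solves argmax_x log(x) − λx/α_i = α_i/λ in closed form, and the dual price λ rises or falls with excess verification demand:

```python
def gradient_scheduler(alphas, capacity, eta=1e-4, iters=20000, lam=1.0):
    """Dual gradient ascent for: maximize sum_i log(x_i)
    subject to sum_i x_i / alpha_i <= capacity,
    where x_i = alpha_i * v_i is goodput and v_i is verification share.
    (Step size and iteration count are illustrative, not from the paper.)"""
    for _ in range(iters):
        # Primal step: x_i(lam) = argmax_x [log(x) - lam * x / alpha_i]
        x = [a / lam for a in alphas]
        # Dual step: raise the price lam when verification demand
        # exceeds capacity, lower it when capacity is underused.
        demand = sum(xi / a for xi, a in zip(x, alphas))
        lam = max(1e-9, lam + eta * (demand - capacity))
    return [a / lam for a in alphas]

# Heterogeneous acceptance rates, shared verification budget of 100 tokens/slot
x = gradient_scheduler([0.9, 0.7, 0.5, 0.3], capacity=100.0)
```

At the fixed point every server receives an equal verification share v_i = C/n, so goodput is proportional to α_i — the proportional-fairness outcome of the log utility. If α_i drifts, rerunning (or warm-starting) the ascent re-balances the allocation.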

The authors provide a rigorous theoretical analysis using fluid sample‑path techniques. They model the evolution of token allocations as a continuous‑time fluid system and construct a Lyapunov function to prove stability. The analysis shows that the gradient scheduling dynamics converge to the unique optimal point of the utility maximization problem, with the deviation bounded by O(1/√t) even under non‑stationary request patterns. This establishes asymptotic optimality and robustness of GoodSpeed.
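The flavor of such a stability argument can be sketched with a standard primal–dual Lyapunov construction (our notation and simplifications; the paper's exact fluid model and bound may differ). Suppose the fluid allocations and dual price evolve as gradient dynamics on the Lagrangian, and let $(x^\star,\lambda^\star)$ denote the optimal primal–dual pair with the capacity constraint tight, so $U_i'(x_i^\star)=\lambda^\star$ and $\sum_i x_i^\star = C$:

```latex
\dot{x}_i(t) = \kappa\bigl(U_i'(x_i(t)) - \lambda(t)\bigr), \qquad
\dot{\lambda}(t) = \kappa\Bigl(\sum_i x_i(t) - C\Bigr),
\qquad
V(x,\lambda) = \tfrac{1}{2}\sum_i (x_i - x_i^\star)^2 + \tfrac{1}{2}(\lambda - \lambda^\star)^2 .
\]
Along fluid trajectories the cross terms in the price cancel, leaving
\[
\dot{V} = \kappa \sum_i (x_i - x_i^\star)\bigl(U_i'(x_i) - U_i'(x_i^\star)\bigr) \le 0,
```

with strict inequality away from $x^\star$ by strict concavity of $U_i(x)=\log x$, which drives the fluid system to the unique optimum.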

Experimental evaluation employs state‑of‑the‑art LLMs, including Llama‑3‑70B and Qwen‑3‑14B, and compares GoodSpeed against several baselines: traditional single‑draft speculative decoding, multi‑draft approaches such as SpecInfer, and centralized batch serving systems. Results demonstrate that GoodSpeed reduces end‑to‑end latency by more than 30% and improves system‑wide goodput by up to 1.8×. The gradient scheduler maintains fairness across draft servers with heterogeneous acceptance rates, preventing any single server from becoming a bottleneck. Moreover, because only speculative tokens (not full prompts) traverse the network, communication overhead is significantly lowered, making the approach suitable for bandwidth‑constrained edge deployments.

In summary, GoodSpeed delivers a complete solution for distributed edge inference: (1) it offloads cheap token generation to edge devices, minimizing user‑perceived latency; (2) it leverages a powerful central verifier to maintain output quality; (3) it employs a log‑utility‑based gradient scheduler that guarantees proportional fairness while maximizing overall goodput; and (4) it provides provable convergence guarantees via fluid‑model analysis. The paper suggests future directions such as automated selection of draft model architectures, asynchronous verification pipelines, extensions to multimodal inputs, and enhanced privacy/security mechanisms for edge‑cloud communication.

