When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
Speculative decoding can significantly accelerate LLM serving, yet most deployments today decouple speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
💡 Research Summary
The paper tackles the practical shortcomings of speculative decoding—a technique that pairs a lightweight draft model (the “speculator”) with a heavyweight verifier (the target LLM) to accelerate inference. While speculative decoding can theoretically reduce the number of expensive verifier steps, real‑world deployments still suffer from three intertwined problems: (1) a high time‑to‑serve because the draft model must be trained offline for a long period before it can be deployed; (2) delayed utility feedback, since the true end‑to‑end speedup depends not only on the acceptance rate but also on system‑level factors such as kernel implementations, numeric precision, batching, and hardware, which cannot be reliably inferred from offline metrics; and (3) domain‑drift degradation, as target models evolve for quality, safety, cost or hardware reasons while the draft model lags behind, becoming stale.
To close this loop, the authors introduce Aurora, a unified training‑serving system that treats the speculator as a reinforcement‑learning policy. Accepted tokens are positive rewards; rejected proposals provide implicit negative feedback. Aurora’s architecture consists of two loosely coupled components: an SGLang‑based inference server that runs speculative decoding and streams both accepted and rejected token prefixes (along with hidden states) into a bounded memory buffer, and an asynchronous training server that continuously samples on‑policy data from this buffer, updates the speculator, and hot‑swaps the new weights back into the inference server without interrupting ongoing requests. This design enables day‑0 deployment—an untrained or cold‑start speculator can be launched immediately, start collecting real traffic, and quickly adapt in situ.
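The serve-to-train loop described above can be sketched in a few lines: the inference side streams decoding traces (including rejected proposals) into a bounded buffer, while the training side asynchronously samples mini-batches from it. This is an illustrative sketch, not Aurora's implementation; the `TraceBuffer` class, its trace schema, and all names here are assumptions introduced for clarity.

```python
import collections
import random
import threading

class TraceBuffer:
    """Bounded, thread-safe buffer of speculative-decoding traces.

    Each trace records the context, the speculator's proposed tokens, and
    which of them the verifier accepted (hypothetical schema for illustration).
    """

    def __init__(self, capacity):
        # deque(maxlen=...) evicts the oldest traces once capacity is reached,
        # keeping the buffer's memory footprint bounded.
        self._traces = collections.deque(maxlen=capacity)
        self._lock = threading.Lock()

    def push(self, context, proposed, accepted_mask):
        """Called by the inference server as requests are decoded."""
        with self._lock:
            self._traces.append((context, proposed, accepted_mask))

    def sample(self, batch_size):
        """Called by the asynchronous training server to draw a mini-batch."""
        with self._lock:
            k = min(batch_size, len(self._traces))
            return random.sample(list(self._traces), k)

    def __len__(self):
        with self._lock:
            return len(self._traces)

# Inference side pushes traces...
buf = TraceBuffer(capacity=4)
for step in range(6):
    buf.push(context=[1, 2, step], proposed=[7, 8, 9], accepted_mask=[1, 1, 0])

# ...while the training side samples from recent traffic.
batch = buf.sample(batch_size=2)
print(len(buf))  # only the newest traces survive the capacity bound
```

In a real deployment the two sides would run in separate processes or servers, with the trained weights hot-swapped back into the inference engine; the bounded buffer here only illustrates why on-policy data stays fresh without unbounded storage.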
Key technical contributions include: (i) framing speculative decoding as an asynchronous RL problem where the objective is serving efficiency (latency, tokens‑per‑second, cost per token) rather than rollout throughput; (ii) a lazy, carefully scheduled synchronization policy that avoids service jitter and cache invalidation; (iii) exploiting both positive (accepted) and negative (rejected) feedback to improve sample efficiency; and (iv) eliminating the massive activation‑collection pipeline traditionally required for offline distillation, thereby cutting storage and bandwidth costs.
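Contribution (iii) can be made concrete with a minimal REINFORCE-style objective: accepted tokens get reward +1 (their log-probability is pushed up), rejected proposals get reward -1 (pushed down). This is one plausible rendering of "accepted = positive, rejected = negative feedback", not Aurora's exact loss; the function name and reward scheme are assumptions.

```python
import numpy as np

def speculator_loss(logits, proposed, accepted_mask):
    """Illustrative REINFORCE-style loss over one run of speculator proposals.

    logits:        (T, V) speculator logits at each proposal step
    proposed:      (T,)   token ids the speculator proposed
    accepted_mask: (T,)   1 if the verifier accepted the token, else 0
    """
    # Numerically stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_logp = log_probs[np.arange(len(proposed)), proposed]
    # Map {0, 1} acceptance to {-1, +1} rewards.
    rewards = 2.0 * np.asarray(accepted_mask, dtype=float) - 1.0
    # Minimizing this raises log-prob of accepted tokens, lowers rejected ones.
    return -(rewards * token_logp).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
loss = speculator_loss(logits, proposed=[2, 0, 4], accepted_mask=[1, 1, 0])
print(float(loss))
```

Because rejected proposals would otherwise be discarded, reusing them as negative feedback roughly doubles the learning signal per verification step, which is the sample-efficiency gain the paper highlights.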
Empirically, Aurora achieves a 1.5× day‑0 speedup on frontier models such as MiniMax M2.1 229B and Qwen3‑Coder‑Next 80B, and an additional 1.25× improvement over a strong offline‑trained static speculator when faced with distribution shifts on widely used models like Qwen3 and Llama3. The system works across different speculative decoding variants (e.g., Eagle, MTP‑4), demonstrating that the approach is algorithm‑agnostic and scales to large GPU clusters.
In summary, Aurora redefines speculative decoding from a “train‑then‑serve” pipeline into a closed‑loop, serve‑to‑train flywheel. By integrating real‑time inference signals into the learning process, it delivers immediate utility, rapid adaptation to target‑model drift, and substantial infrastructure savings, representing a significant step forward for production‑grade LLM serving.