LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
In LLM inference, the same prompt may yield different outputs across different runs. At the system level, this non-determinism arises from floating-point non-associativity combined with dynamic batching and GPU kernels whose reduction orders vary with batch size. A straightforward way to eliminate non-determinism is to disable dynamic batching during inference, but doing so severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring new implementations. This coupling also imposes fixed runtime overheads, regardless of how much of the workload actually requires determinism. Inspired by ideas from speculative decoding, we present LLM-42, a scheduling-based approach to enable determinism in LLM inference. Our key observation is that if a sequence is in a consistent state, the next emitted token is likely to be consistent even with dynamic batching. Moreover, most GPU kernels use shape-consistent reductions. Leveraging these insights, LLM-42 decodes tokens using a non-deterministic fast path and enforces determinism via a lightweight verify-rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. LLM-42 mostly re-uses existing kernels unchanged and incurs overhead only in proportion to the traffic that requires determinism.
💡 Research Summary
The paper tackles the long‑standing problem of non‑deterministic outputs in large language model (LLM) inference, a phenomenon that arises not from stochastic sampling but from floating‑point non‑associativity combined with dynamic batching on GPUs. When a request is batched with different peers across runs, the batch size changes, causing GPU kernels to pick different reduction schedules (e.g., split‑K strategies in GEMM, different tree shapes in attention and normalization). These tiny numerical differences propagate through the autoregressive decoding loop and can eventually tip the sampler’s decision, producing different tokens for the same prompt.
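The mechanism described above is easy to reproduce in miniature. The sketch below (plain Python, no GPU required) first shows floating-point non-associativity directly, then models a "split-K"-style reduction: a hypothetical `split_sum` helper that partitions a sum into `k` partial sums, standing in for a kernel whose reduction schedule depends on batch size.

```python
import random

# Floating-point addition is not associative: (a + b) + c and a + (b + c)
# can differ in the last bits. A kernel that changes its reduction tree
# with batch size therefore changes its numerical result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False

# A "split-K"-style reduction (illustrative): partition the sum into k
# partial sums, then combine. Different k -- as a batch-size-adaptive
# kernel might pick -- can yield different totals for identical input.
def split_sum(xs, k):
    n = len(xs)
    partials = [sum(xs[i * n // k:(i + 1) * n // k]) for i in range(k)]
    return sum(partials)

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(4096)]
print(split_sum(xs, 1) == split_sum(xs, 8))  # typically False: same data, different schedule
```

Each individual discrepancy is on the order of one ulp, but as the summary notes, the autoregressive loop can amplify it until the sampler's argmax flips.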
Prior work (He et al.) proposed “batch‑invariant” kernels that enforce a single, universal reduction order regardless of batch size. While this guarantees determinism, it disables many performance‑critical optimizations such as split‑K, Tensor Memory Accelerator (TMA) usage, and kernel fusion. Empirically, batch‑invariant GEMM kernels achieve only ~194 TFLOPS versus ~527 TFLOPS for cuBLAS, a 63% slowdown; RMSNorm suffers a similar 50% slowdown. Moreover, because deterministic inference is forced on the entire batch, a single deterministic request can collapse the throughput of all concurrent requests by more than 50%, making the approach impractical for production serving.
LLM‑42 introduces a scheduling‑based solution inspired by speculative decoding. The key observations are: (O1) If a sequence is already in a “consistent” state, the next token is highly likely to be the same across runs even under dynamic batching; (O2) Most GPU kernels already use shape‑consistent reduction schedules (the same schedule for inputs of the same shape). Leveraging these, LLM‑42 decouples token generation from determinism enforcement via a decode‑verify‑rollback protocol:
- **Fast Path (Non‑Deterministic Decoding)** – Uses the existing high‑throughput, batch‑size‑adaptive kernels to generate candidate tokens. No changes to the kernels are required, preserving all performance optimizations.
- **Verifier** – Periodically re‑executes a fixed‑size window of recently generated tokens under a fixed‑shape reduction schedule. Because the input shape is constant during verification, the reduction order is deterministic, providing a reference execution.
- **Commit or Rollback** – If the verifier’s outputs match the fast‑path candidates, the tokens are committed and returned to the user. If a mismatch occurs, the system rolls back to the last matching token and resumes decoding from that point.
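The protocol can be sketched end to end with a toy model. Everything here is hypothetical scaffolding, not the paper's API: `det_token` stands in for the model's deterministic next-token choice, `fast_decode` models the noisy fast path (run-to-run numerical divergence is simulated by randomly perturbing tokens), and `verify` models the fixed-shape verifier pass that scores a whole window at once.

```python
import random

def det_token(ctx):
    # Toy stand-in for "the token the model emits under a deterministic
    # (fixed-shape) execution of context ctx".
    return (sum(ctx) * 31 + len(ctx)) % 100

def verify(committed, candidates):
    # Fixed-shape verifier pass: reference[i] is the deterministic token
    # given committed + candidates[:i], computed for the whole window.
    ref, ctx = [], list(committed)
    for c in candidates:
        ref.append(det_token(ctx))
        ctx.append(c)
    return ref

def fast_decode(committed, n, flip=0.3):
    # Fast path on batch-size-adaptive kernels; batching-induced numerical
    # divergence is modeled by occasionally flipping the emitted token.
    out, ctx = [], list(committed)
    for _ in range(n):
        t = det_token(ctx)
        if random.random() < flip:
            t = (t + 1) % 100  # simulated cross-run divergence
        out.append(t)
        ctx.append(t)
    return out

def deterministic_generate(prompt, window, max_new):
    committed = list(prompt)
    while len(committed) - len(prompt) < max_new:
        candidates = fast_decode(committed, window)   # speculate
        reference = verify(committed, candidates)     # replay, fixed shape
        # Commit the matching prefix plus the verifier's token at the first
        # mismatch (so we always advance by >= 1 verified token), and roll
        # back everything after it.
        n_match = 0
        while n_match < window and candidates[n_match] == reference[n_match]:
            n_match += 1
        committed += reference[:min(n_match + 1, window)]
    return committed[len(prompt):len(prompt) + max_new]
```

Because every committed token comes from the fixed-shape verifier, the output is identical across runs no matter how much noise the fast path injects; the noise only affects how many tokens survive each verification round.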
The size of the verification window trades off two costs. Smaller windows keep rollbacks cheap but make verification expensive: the verifier runs more often, and each small pass is memory‑bound, so its fixed cost is amortized over few tokens. Larger windows amortize the verification cost (the pass becomes compute‑bound) but increase the amount of work that must be redone after a mismatch. To obtain the best of both worlds, LLM‑42 introduces grouped verification: multiple requests each contribute a small window, and these windows are batched together into a single verification pass. This amortizes the verification cost across requests while keeping per‑request rollback granularity small.
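Scheduler-side, grouped verification amounts to collecting one small window per request and stacking them into a single batch. The sketch below is hypothetical pseudocode under assumed names (`candidates`, `verified`, `group_windows` are illustrative, not from the paper); the point it shows is that only full, equally sized windows are grouped, so the verifier batch keeps a fixed shape.

```python
def group_windows(requests, window):
    """Collect one fixed-size window of unverified candidate tokens per
    request, so a single fixed-shape verifier pass can check all of them.

    Each request is a dict with hypothetical fields:
      "id"         -- request identifier
      "candidates" -- tokens produced by the fast path so far
      "verified"   -- how many of those have already been verified
    """
    batch = []
    for req in requests:
        pending = req["candidates"][req["verified"]:]
        # Only full windows are grouped: equal window sizes keep the
        # verifier's input shape (and thus its reduction order) fixed.
        if len(pending) >= window:
            batch.append((req["id"], pending[:window]))
    return batch

# Example: two requests have a full window of 2 pending tokens; the third
# waits until it has accumulated enough candidates.
reqs = [
    {"id": 0, "candidates": [1, 2, 3, 4, 5], "verified": 1},
    {"id": 1, "candidates": [7, 8], "verified": 0},
    {"id": 2, "candidates": [9], "verified": 0},
]
print(group_windows(reqs, 2))  # [(0, [2, 3]), (1, [7, 8])]
```

Requests that do not need determinism simply never enter the group, which is consistent with the summary's claim that overhead scales with deterministic traffic only.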
Experiments on SGLang show that LLM‑42’s overhead is proportional to the fraction of traffic that actually requires determinism. When deterministic traffic is low, throughput remains near the non‑deterministic baseline (≈845 tokens/s). Even when all traffic is deterministic, LLM‑42 outperforms the batch‑invariant baseline, suffering only a ~56% slowdown versus the >100% slowdown observed with batch‑invariant kernels. Importantly, LLM‑42 reuses existing cuBLAS/CUDA kernels for the fast path, requiring no new kernel implementations, thus minimizing engineering effort and simplifying deployment across hardware generations.
In summary, LLM‑42 provides a practical, low‑overhead path to deterministic LLM inference by (1) exploiting the high probability of token‑level consistency, (2) leveraging shape‑consistent reduction behavior already present in GPU kernels, and (3) introducing a lightweight verify‑rollback mechanism with grouped verification. This approach decouples determinism from performance‑critical decoding, enabling selective deterministic execution without sacrificing the throughput gains of dynamic batching, and represents a significant advance over prior batch‑invariant methods.