PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.


💡 Research Summary

The paper introduces PASCAL, a phase‑aware scheduling algorithm designed for serving large language models (LLMs) that perform internal reasoning using Chain‑of‑Thought (CoT) techniques. Traditional serving frameworks treat LLM inference as two stages—prefill and decoding—and optimize them accordingly. However, reasoning‑based LLMs split the decoding stage into a reasoning phase (generating hidden intermediate steps) and an answering phase (producing user‑visible tokens). Because the reasoning tokens are not shown to the user but still contribute to the Time‑to‑First‑Token (TTFT), existing schedulers that ignore this distinction cause unnecessary latency, especially under GPU memory pressure where blocking or preemption is required.
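The reasoning/answering split can be made concrete with a small sketch. This is not the paper's implementation: it assumes a DeepSeek-R1-style model that delimits its chain of thought with `<think>…</think>` markers, and the class and method names are illustrative (the summary notes the Phase Detector may also use a lightweight classifier instead of markers).

```python
# Minimal marker-based phase detector (illustrative sketch).
# Assumption: the model emits its hidden reasoning inside
# <think>...</think>, as DeepSeek-R1-style models do.

REASONING_END_MARKER = "</think>"

class PhaseDetector:
    """Tracks, per request, whether decoding is still in the reasoning phase."""

    def __init__(self):
        self._answering = set()  # request ids that have crossed the boundary

    def observe(self, request_id: str, token_text: str) -> str:
        """Feed one decoded token; return the phase that token belongs to."""
        if request_id in self._answering:
            return "answering"
        if REASONING_END_MARKER in token_text:
            # Phase boundary: every subsequent token is user-visible output.
            self._answering.add(request_id)
        return "reasoning"
```

Under this scheme, TTFT is the time until the first token *after* the marker, which is why a scheduler that accelerates the reasoning phase directly reduces user-perceived TTFT.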

PASCAL addresses this mismatch by assigning different priorities to the two phases. The reasoning phase receives high priority to finish as quickly as possible, thereby reducing the overall TTFT. The answering phase is given lower priority and is managed with controlled preemption and a token‑pacing mechanism that smooths output bursts, preserving Quality‑of‑Experience (QoE). The scheduler is hierarchical: at the top level it balances load across multiple model instances, while within each instance it performs intra‑instance time‑sharing and phase‑aware token pacing. Crucially, at the boundary between reasoning and answering, PASCAL can migrate a request to another instance, redistributing KV‑cache memory and avoiding contention.
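The two scheduling decisions described above, phase-based priority within an instance and migration at the phase boundary, can be sketched as follows. This is a hedged illustration of the policy, not PASCAL's actual code: the `Request` fields, FIFO tie-breaking, and use of KV-cache utilization as the load signal are assumptions.

```python
# Illustrative phase-aware ordering and boundary migration.
# Only the priority rule (reasoning before answering) and the idea of
# migrating at the phase boundary come from the summary; everything
# else is a simplifying assumption.

PHASE_PRIORITY = {"reasoning": 0, "answering": 1}  # lower value runs first

class Request:
    def __init__(self, rid, phase, arrival):
        self.rid = rid          # request identifier
        self.phase = phase      # "reasoning" or "answering"
        self.arrival = arrival  # arrival timestamp, used as tie-breaker

def schedule_order(requests):
    """Execution order within one instance: all reasoning-phase requests
    first, FIFO within each phase."""
    return sorted(requests, key=lambda r: (PHASE_PRIORITY[r.phase], r.arrival))

def migrate_target(instance_loads):
    """At a reasoning->answering boundary, pick the least-loaded instance
    (here: lowest KV-cache utilization) to host the answering phase."""
    return min(instance_loads, key=instance_loads.get)
```

For example, an answering-phase request that arrived early still yields to a later reasoning-phase request, which is exactly the trade PASCAL makes to keep TTFT low while the Token Pacer protects the answering request's QoE.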

Key technical components include:

  • Phase Detector – identifies whether a generated token belongs to the reasoning or answering sub‑stage, using predefined markers or a lightweight classifier.
  • Scheduler Core – tracks per‑request phase, remaining token count, and dynamically adjusts priorities and instance placement.
  • Token Pacer – buffers tokens generated during bursts and releases them at a rate aligned with the user’s expected reading speed (5‑10 tokens per second), mitigating starvation caused by preemption.
  • Controlled Preemption – preempts only during the answering phase; reasoning phase execution is kept uninterrupted whenever possible.
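Of these components, the Token Pacer is the easiest to sketch. The 5-10 tokens-per-second target band comes from the summary above; the class shape, the fixed release interval, and the injectable clock are assumptions made for this illustration.

```python
import time
from collections import deque

class TokenPacer:
    """Buffers bursty decoder output and releases it at a steady target
    rate, so preemption-induced bursts look like a smooth stream.
    Illustrative sketch; not the paper's implementation."""

    def __init__(self, rate_tps=8.0, clock=time.monotonic):
        self.interval = 1.0 / rate_tps  # seconds between released tokens
        self.buffer = deque()
        self.clock = clock              # injectable for testing
        self.next_release = None        # scheduled time of next emission

    def push(self, tokens):
        """Enqueue a burst of freshly generated tokens."""
        self.buffer.extend(tokens)

    def pop_ready(self):
        """Return only the tokens whose scheduled emission time has passed."""
        out = []
        now = self.clock()
        if self.next_release is None:
            self.next_release = now
        while self.buffer and self.next_release <= now:
            out.append(self.buffer.popleft())
            self.next_release += self.interval
        return out
```

In a serving loop, `push` is called whenever the model emits tokens (possibly many at once after a preemption ends) and `pop_ready` is polled when streaming to the client, so the user sees a steady cadence rather than stalls followed by bursts.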

The authors evaluate PASCAL on DeepSeek‑R1‑Distill‑Qwen‑32B (a 32‑billion‑parameter model) across a range of request arrival rates (50–200 req/s) and a realistic GPU memory limit (24 GB per A100). Compared against FCFS, round‑robin, and recent SLO‑aware schedulers, PASCAL achieves:

  • Up to 72% reduction in tail (99th‑percentile) TTFT.
  • Average TTFT reduced from 1.8 s to 0.5 s.
  • Answering‑phase throughput held at 5‑10 tokens/s, meeting TPOT SLOs for >98% of requests.
  • QoE scores improved from ~0.85 to 0.96 (normalized 0‑1 scale).
  • Lower KV‑cache memory pressure (average utilization reduced from 85% to 70%) and fewer preemptions (<30% of requests).

The study demonstrates that recognizing and separately optimizing the reasoning and answering phases is essential for high‑performance serving of CoT‑enabled LLMs. By minimizing reasoning latency and smoothing answer delivery, PASCAL delivers a markedly better user experience without sacrificing overall system throughput. The authors note that the Phase Detector’s accuracy and migration overhead add complexity, and they suggest future work on extending the approach to tree‑structured reasoning, multi‑GPU clusters, and other emerging LLM architectures. In conclusion, PASCAL provides a practical, scalable solution for phase‑aware scheduling, establishing a new direction for serving systems as reasoning‑capable LLMs become mainstream.
