inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, a…

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

Huamin Chen (1), Xunzhuo Liu (1), Yuhan Liu (2), Junchen Jiang (3), Bowei He (4)*, Xue Liu (4)
(1) vLLM Semantic Router Project  (2) University of Chicago  (3) Tensormesh Inc / UChicago  (4) MBZUAI / McGill University

2026

Abstract

Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools [Agrawal et al., 2024, Xu et al., 2025] optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them. inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven fleet-planning scenarios drawn from two public workload traces (LMSYS, Azure) and one synthetic agent-heavy trace. Each one surfaces a result that simple analysis gets wrong -- the right split threshold, the cheapest GPU type, whether an apparently idle fleet is actually broken -- and shows why joint simulation of queueing, routing, and hardware is necessary to find it.

1 Introduction

GPU infrastructure for LLM inference is expensive.
A100 and H100 GPUs rent for roughly $1.50–$7.00 per GPU-hour depending on cloud provider and contract type [Laurent, 2026] -- AWS and Google Cloud on-demand H100 rates settled around $3.00–$3.93 after a 44% AWS price cut in June 2025, while specialty clouds and marketplaces undercut hyperscalers with H100 rates as low as $1.49–$1.99/hr.[1] A fleet of 24 nodes therefore runs $315K–$1.47M per year depending on provider, before software, networking, and operations costs. Despite this, the basic sizing question -- how many GPUs to serve λ requests per second with P99 TTFT ≤ T ms? -- has no clean analytical answer. It depends on the joint distribution of prompt and completion lengths, the routing policy, the GPU hardware, and nonlinear queueing dynamics that get especially bad under heavy tails.

* Corresponding author: Bowei.He@mbzuai.ac.ae
[1] The simulator's pre-built profiles (fleet_sim/gpu_profiles/profiles.py) use cost_per_hr defaults of $2.21 (A100 80GB) and $4.02 (H100 80GB), calibrated against 2024 Lambda Labs on-demand rates. These defaults can be overridden with current pricing via ManualProfile. All cost-efficiency rankings in this paper are robust to moderate (±30%) price changes because the ratios between configurations matter more than the absolute dollar figures.

Available tools are built for related but different problems. Vidur [Agrawal et al., 2024] and AIConfigurator [Xu et al., 2025] tune per-engine configuration (tensor parallelism, batch size, chunk size, KV-cache fraction) for a given GPU cluster; they presuppose that the fleet size is already decided. Mélange [Griggs et al., 2024] picks the optimal mix of GPU types for a given workload and SLO, but it does not model pool routing or multi-pool queue dynamics. SageServe [Jia et al., 2025] and TokenScale [Dong et al., 2024] autoscale a live fleet in response to traffic; they need production traces and real hardware, and they answer a runtime question, not a procurement one. DistServe [Zhong et al., 2024] and Splitwise [Patel et al., 2024] study prefill/decode disaggregation within a single cluster, not fleet-level pool routing. None of these tools answer the provisioning question: given a token-length CDF, an arrival rate λ, an SLO, and a catalog of GPU types, what is the minimum-cost fleet -- number of pools, split boundary B_short, GPU type per pool, routing policy -- that actually meets the SLO?

inference-fleet-sim answers that question. Its contributions are:

1. A two-phase optimizer that uses the M/G/c Kimura approximation for a fast analytical sweep to identify candidate configurations, then runs DES to verify the top candidates against actual queueing dynamics.

2. A physics-informed GPU performance model parameterized by (W, H, n_max) -- baseline compute, memory-bandwidth cost per concurrent sequence, and maximum KV-slot count -- that computes expected service times without any hardware access. Constants are calibrated from published hardware data [Xu et al., 2025] for A10G, A100-80GB, and H100-80GB.

3. Seven capacity-planning case studies (Section 4) where simulation produces a different answer than analytical intuition: the optimal split threshold is not readable off the CDF; a 30%-utilized fleet fails its SLO; a slow GPU beats a fast one on cost; GPU scaling is sub-linear; the sizing router should not be the production router; mixed GPU types can fail even as they save money; and in disaggregated serving, the cheaper GPU should handle prefill, not decode.

4. Reliability-aware sizing via a node_avail parameter derived from published GPU failure-rate and MTTR data [Kokolis et al., 2024, Cui et al., 2025], so that production fleet counts account for nodes under repair.
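The Phase-1 analytical sweep in contribution 1 can be sketched in a few lines. This is a simplified illustration, not the tool's actual code: the CDF breakpoints and the per-GPU service rates `mu_s`/`mu_l` below are hypothetical, and the real optimizer integrates E[S] and C²_s over the workload CDF before handing the top candidates to DES.

```python
from itertools import product

def sweep(lam, F, mu_s, mu_l, cost_s, cost_l, rho_cap=0.85):
    """Rank (B_short, n_s, n_l) candidates by annual cost under a utilization cap.
    F maps a candidate B_short to the fraction of traffic at or below it."""
    out = []
    for b, n_s, n_l in product(F, range(1, 33), range(1, 33)):
        lam_s, lam_l = lam * F[b], lam * (1 - F[b])  # split arrivals by the CDF
        # keep only configurations where both pools stay under the rho cap
        if lam_s / (n_s * mu_s) <= rho_cap and lam_l / (n_l * mu_l) <= rho_cap:
            out.append((n_s * cost_s + n_l * cost_l, b, n_s, n_l))
    return sorted(out)  # cheapest first; the top-k would go to DES verification

# Hypothetical LMSYS-like breakpoints and service rates (req/s per GPU, $K/yr)
best = sweep(100, {1024: 0.831, 4096: 0.984}, mu_s=16.0, mu_l=2.5,
             cost_s=19.4, cost_l=19.4)[0]
```

With these toy numbers the cheapest feasible candidate puts the split at 4096 tokens with an 8+1 GPU layout; the point of the sketch is the shape of the search, not the specific answer.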
2 Background

2.1 KV-Cache Slots and the Cost Cliff

LLM serving systems [Kwon et al., 2023] allocate GPU memory in PagedAttention blocks. An A100-80GB holds 65,536 blocks of 16 tokens each. A sequence needing up to B tokens requires ⌈B/16⌉ blocks; the maximum number of concurrent sequences per GPU is n_max(B) = ⌊65536 / ⌈B/16⌉⌋. At B = 8,192 this is 128; at B = 65,536 it drops to 16. That 8× ratio is the main lever on fleet cost.

In a two-pool fleet, requests with total token budget L_in + L_out ≤ B_short go to the short pool P_s (many concurrent sequences, high throughput); longer requests go to the long pool P_l (few concurrent sequences, large KV cache). The cost saving over a homogeneous fleet is not monotone in B_short: it depends on the fraction of traffic below B_short and the resulting queue imbalance. That interaction is what the simulator resolves (Section 4.1).

A request at B_short + 1 tokens lands in the long pool and consumes a slot provisioned for the full context -- 8× more capacity than its neighbor just below the threshold. Requests in the borderline band (B_short, γ·B_short] are not genuinely long; Compress-and-Route [Chen et al., 2026] addresses this by squeezing such prompts back below B_short at the gateway.

2.2 The M/G/c Queue and Kimura's Approximation

Each GPU pool is modeled as an M/G/c queue: Poisson arrivals at rate λ, general service time with mean E[S] and squared coefficient of variation C²_s = Var[S]/(E[S])², and c parallel servers (GPUs). The Erlang-C formula gives the probability that an arriving request has to wait:

    C(c, \varrho) = \frac{(c\varrho)^c / (c!\,(1-\varrho))}{\sum_{k=0}^{c-1} (c\varrho)^k / k! + (c\varrho)^c / (c!\,(1-\varrho))}    (1)

where ϱ = λ/(cµ) is per-server utilization. The standard two-moment M/G/c approximation [Kimura, 1994] gives the P99 queue wait:

    W_{99} \approx \frac{C(c, \varrho)}{c\mu (1 - \varrho)} \cdot \frac{1 + C_s^2}{2} \cdot \ln 100    (2)

For high-C²_s workloads -- agent traffic where service times range from milliseconds to minutes -- M/M/c badly underestimates tail latency. The (1 + C²_s)/2 term corrects for this, and the DES validates whether the correction is sufficient (Section 4.2).

GPU iteration latency under continuous batching scales with the number of concurrent sequences n:

    t_{\text{iter}}(n) = W + H \cdot n    (3)

where W (ms) is baseline compute and H (ms/slot) is the memory-bandwidth cost per concurrent sequence. For Llama-3-70B on A100-80GB: W = 8 ms, H = 0.65 ms/slot. The expected service time for a request with L_in input and L_out output tokens is:

    E[S] = \frac{\lceil L_{\text{in}} / C_{\text{chunk}} \rceil + L_{\text{out}}}{n_{\max}} \cdot t_{\text{iter}}(n_{\max})    (4)

where C_chunk is the prefill chunk size. TTFT decomposes as:

    \text{TTFT} = W_{\text{queue}} + \underbrace{\lceil L_{\text{in}} / C_{\text{chunk}} \rceil \cdot t_{\text{iter}}(n_{\max})}_{T_{\text{prefill}}} + t_{\text{iter}}(n_{\max})    (5)

For large requests near B_short, T_prefill alone can eat most of the SLO budget even when there is no queue wait at all. This is invisible in (2) and only shows up in a full simulation.

3 Simulator Design

3.1 Two-Phase Optimization

The joint space of (n_s, n_l, B_short, GPU type) is large enough that exhaustive DES simulation is impractical. The two-phase design (Figure 1) sidesteps this.

Phase 1 -- Analytical sweep. For each candidate (B_short, n_s, n_l), the model:

1. Splits λ into λ_s = λ · F(B_short) and λ_l = λ · (1 − F(B_short)) using the workload CDF F.
2. Computes E[S] and C²_s for each pool by integrating over F restricted to that pool's length range.
3. Evaluates (2) and the full TTFT (5) to check whether both pools meet the SLO under the utilization cap ϱ ≤ 0.85.
4. Records total GPU cost c_s · cost_s + c_l · cost_l.

The sweep runs in milliseconds and produces a ranked list of candidates.

Phase 2 -- DES verification. The top-k candidates are verified by simulation:

1. A Poisson arrival stream at rate λ is generated; each request's lengths are drawn i.i.d. from the empirical CDF.
2. Requests are routed to pools; each pool runs n GPU instances, each simulating continuous batching with a min-heap event queue.
3. The simulation collects per-request queue wait, TTFT, and end-to-end latency. The SLO check is P99 TTFT ≤ T.

The DES is request-level, not token-level: each request fires exactly two events (arrival and completion), so simulating 10^4 requests takes under one second.

[Figure 1: Two-phase fleet optimizer. The workload (CDF / trace), GPU profiles (A10G/A100/H100), and fleet config (PoolConfig) feed the FleetOptimizer's analytical sweep; the DES verifier replays the top-k (n_s, n_l, B_short) candidates and emits a FleetSimResult with P99, cost, utilization, and the Pareto frontier. The analytical sweep finds the lowest-cost candidate pool configurations; DES verifies the top candidates under actual queueing dynamics.]

3.2 GPU Performance Model

Each GPU type is characterized by (W, H, n_max, C_max):

    GPU         W (ms)   H (ms/slot)   n_max at 8K ctx   VRAM (GB)
    A10G 24GB   12.0     0.90          64                24
    A100 80GB   8.0      0.65          128               80
    H100 80GB   4.0      0.32          256               80

These are the hand-calibrated constants in fleet_sim/gpu_profiles/profiles.py (ManualProfile), targeting Llama-3-70B with single-node TP serving. ProfileBuilder can derive equivalent constants from first principles using the roofline decomposition from AIConfigurator [Xu et al., 2025]. Users can substitute measured constants from a Vidur [Agrawal et al., 2024] profiling run via ManualProfile for higher accuracy.

Model fidelity. For chatbot workloads (low C²_s), the Kimura model is conservative by 8–14% vs. DES: it over-predicts P99 TTFT, so the analytically selected GPU count passes DES comfortably. For agent workloads (high C²_s, service times from milliseconds to minutes), Erlang-C assumes bounded variance and under-estimates tail latency; DES is authoritative in that regime.
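Equations (1)–(5), together with a row of the (W, H, n_max) table, are enough to compute the analytical per-pool check. A minimal Python sketch (simplified, not the tool's own code; the A100 constants come from the table above):

```python
import math

def erlang_c(c: int, rho: float) -> float:
    """Eq. (1): probability an arriving request must wait (requires rho < 1)."""
    a = c * rho  # offered load in Erlangs
    tail = a**c / (math.factorial(c) * (1.0 - rho))
    return tail / (sum(a**k / math.factorial(k) for k in range(c)) + tail)

def kimura_p99_wait(c: int, lam: float, e_s: float, cs2: float) -> float:
    """Eq. (2): two-moment approximation of the P99 queue wait."""
    mu = 1.0 / e_s
    rho = lam / (c * mu)
    return erlang_c(c, rho) / (c * mu * (1.0 - rho)) * (1.0 + cs2) / 2.0 * math.log(100)

def t_iter(W: float, H: float, n: int) -> float:
    """Eq. (3): iteration latency (ms) at n concurrent sequences."""
    return W + H * n

def e_service(L_in: int, L_out: int, chunk: int,
              W: float, H: float, n_max: int) -> float:
    """Eq. (4): expected service time (ms) for one request."""
    return (math.ceil(L_in / chunk) + L_out) / n_max * t_iter(W, H, n_max)

# A100-80GB row of the table above; chunk size 512 as quoted in Sec. 4.6
A100 = dict(W=8.0, H=0.65, n_max=128)
```

Two quick sanity checks: for c = 1 the Erlang-C expression collapses to ϱ, and for the A100 row t_iter at n_max = 128 is 8 + 0.65·128 = 91.2 ms.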
Puzzle 2 (Section 4.2) demonstrates this directly.

3.3 Workload Model

The simulator accepts two workload formats.

Empirical CDF. A JSON file mapping cumulative probability to token-budget breakpoints. Three CDFs ship with the tool:

• LMSYS [Zheng et al., 2024]: chat conversations; long-tailed to 65K tokens; F(4,096) ≈ 0.984.
• Azure LLM Trace [Microsoft Azure, 2023]: enterprise chat; 78% of requests below 2K tokens; max context 8K.
• Agent-heavy (synthetic): a bimodal CDF modeling coding-agent sessions in the style of SWE-bench, with 46% of requests above 4K tokens and a heavy tail to 300K tokens. This is not from a public production trace; results for it should be read accordingly.

Poisson with synthetic lengths. For sensitivity analysis, the workload generator synthesizes Poisson arrivals with token lengths drawn from a Pareto or log-normal distribution.

Sub-stream Poisson note. Routing by token length splits the Poisson stream with a deterministic rule, not independent random thinning. The Poisson thinning theorem only guarantees Poisson sub-streams under independent random thinning, so these sub-streams are not strictly Poisson -- treating them as such is a standard engineering approximation [Harchol-Balter, 2013]. When prompt length and arrival time are correlated, queue-length estimates may be off. The DES checks whether the approximation holds in each case.

3.4 Routing Algorithms

The simulator includes four routing policies:

LengthRouter      Send to P_s if total token budget ≤ B_short, else to P_l. Default production policy.
CompressAndRoute  Compress borderline requests (B_short < L_total ≤ γ·B_short) down to B_short before sending to P_s; intended for fleet sizing, not production [Chen et al., 2026].
RandomRouter      Route uniformly at random across pools; baseline.
ModelRouter       Route to one of N model-specific pools via a semantic classifier; supports multi-model fleets.

3.5 Reliability-Aware Sizing

GPU hardware fails at measurable rates.
The node_avail parameter A ∈ (0, 1] represents the fraction of nodes in steady-state operation; a pool analytically sized to n GPUs gets rounded up to ⌈n/A⌉ in production. The simulator computes A as:

    A = \frac{1}{1 + r_f \cdot \text{MTTR}}    (6)

where r_f is failures per node-day and MTTR is mean time to repair in days. Pre-computed constants from the literature [Kokolis et al., 2024, Cui et al., 2025]:

    Constant                Value    Scenario
    A100_AVAIL_RSC1_FAST    0.9989   Soft failure (driver reset, ~4 h MTTR)
    A100_AVAIL_RSC1_SLOW    0.9871   Hard failure (GPU/NVLink swap, ~48 h MTTR)
    H100_AVAIL_5PCT         0.9500   5% overprovisioning rule [Cui et al., 2025]

The utilization cap ϱ_max = 0.85 covers queueing stability; node_avail covers hardware reliability. These are independent concerns, and both apply in production.

4 Case Studies

Eight scenarios follow, each a question that comes up in practice and cannot be resolved from the workload CDF alone. All runs use Llama-3-70B as the served model unless noted. GPU costs are illustrative 2026 spot-instance rates: A10G $8.85K/yr, A100 $19.4K/yr, H100 $35.2K/yr; the qualitative ordering of results holds under moderate price variation.

4.1 Puzzle 1: Where Exactly Should I Split?

The workload CDF says most requests are short -- but short compared to what? The split threshold B_short controls which requests land in the high-efficiency short pool. Getting it right is not obvious: too low and the long pool handles too much traffic; too high and the short pool's slot advantage disappears. We sweep B_short ∈ {512, 1024, 2048, 4096, 8192, 12288} for three workloads at representative arrival rates.

LMSYS, λ = 100 req/s, A100, SLO = 500 ms.

Table 1: Pareto frontier for B_short selection on LMSYS. Homogeneous baseline: 14 A100s at $271K/yr.

    B_short   α_s     n_s   n_l   GPUs   $/yr    Saving   SLO
    512       63.8%   2     13    15     $290K   −7.1%    ✓
    1,024     83.1%   2     10    12     $232K   +14.3%   ✓
    2,048     94.8%   2     7     9      $174K   +35.7%   ✓
    4,096     98.4%   3     5     8      $155K   +42.9%   ✓  ← optimal
    8,192     99.7%   4     4     8      $155K   +42.9%   ✓
    12,288    99.9%   5     3     8      $155K   +42.9%   ✓

B_short = 4096 routes 98.4% of LMSYS traffic short. At that boundary the short pool runs 256 concurrent sequences vs. 16 for the long pool -- a 16× slot advantage that cuts the fleet from 14 GPUs to 8 (−43% cost). By contrast, B_short = 512 costs 7% more than going homogeneous: it leaves only 64% of traffic in the short pool, too little to offset the Erlang fragmentation in the long pool.

Azure (λ = 200, A100). The entire Azure CDF fits within 8K tokens (context ratio = 2×). The best Pareto point (B_short = 3072) saves only 4% cost, but cuts short-pool P99 from 26 ms to 19 ms. Here, splitting is about latency isolation, not cost.

Agent (λ = 200, A100, SLO = 500 ms). B_short = 16384 (64 KV slots vs. 16 for the homogeneous fleet) saves 64 GPUs (−13.3%). At B_short = 32768 the SLO breaks -- not from queue wait, but because long-pool requests carry 300–600 ms prefill times that use up the entire SLO budget. Adding more GPUs does not help; the only fix is a lower B_short. This failure is invisible in (2).

Insight 1. The optimal B_short cannot be read off the CDF. It depends on the interaction between slot efficiency, traffic fraction, and Erlang fragmentation across both pools -- and a factor-of-2 mistake can cost more than not splitting at all.

4.2 Puzzle 2: Why Is My Agent Fleet Failing SLO?

A 24-GPU H100 fleet at λ = 20 req/s sits at ~30% utilization. The Erlang-C model says the fleet is fine. Users are seeing latency violations.

Table 2: Agent fleet SLO analysis (λ = 20, H100, SLO = 1000 ms).

    Config            GPUs   Cost/yr   P99 TTFT         SLO
    Homo 65K ctx      24     $845K     1,052 ms         × FAIL
    Two-pool 4K/65K   25     $880K     17 ms / 147 ms   ✓

The M/G/c model assumes i.i.d. service times drawn from a distribution with bounded variance. Agent requests span 10–300 seconds of service (SWE-bench-style coding tasks); C_s ≫ 1.
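It is easy to underestimate how large C²_s gets. A toy two-point service mix (the 95/5 split and the 0.5 s / 300 s values are illustrative choices echoing the 10–300 s agent range, not measured data) already pushes the Kimura correction factor (1 + C²_s)/2 close to an order of magnitude:

```python
# Hypothetical service-time mix: 95% quick chat-like turns, 5% long agent tasks.
p_short, s_short = 0.95, 0.5    # seconds
p_long,  s_long  = 0.05, 300.0

mean = p_short * s_short + p_long * s_long
second_moment = p_short * s_short**2 + p_long * s_long**2
var = second_moment - mean**2
cs2 = var / mean**2              # squared coefficient of variation, Sec. 2.2

kimura_factor = (1 + cs2) / 2    # multiplier over the implicit M/M/c tail estimate
```

Here cs2 comes out near 17.8, so an M/M/c estimate (which implicitly assumes C²_s = 1) undershoots the Kimura tail by roughly 9×. Even that correction assumes the queue reaches steady state, which is exactly what the DES shows failing in this puzzle.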
A single 300-second request locks a KV slot for five minutes, blocking subsequent requests even when GPU utilization reads low. Erlang-C, which uses only the mean service time, misses this entirely.

The DES replays the actual arrival sequence with realistic service-time draws. At ϱ ≈ 0.30, it measures P99 TTFT = 1,052 ms because long-tail service events create cascading back-pressure that persists for hundreds of seconds.

Moving to a two-pool design routes the 46% of long requests (> 4K tokens) to a dedicated 23-GPU pool, where their slow service cannot block short requests. Short-request P99 drops to 17 ms. The cost increase is +4%.

Insight 2. For agent or heavy-tail workloads, the analytical model does not err on the safe side -- it approves fleets that are broken. DES is the only way to catch head-of-line blocking when C²_s ≫ 1.

4.3 Puzzle 3: Which GPU Type Is Actually Cheapest?

An operator is picking between A10G ($8.85K/yr), A100 ($19.4K/yr), and H100 ($35.2K/yr) for an Azure-workload fleet at λ = 100 req/s. The instinct is: faster GPU, fewer GPUs, lower cost.

Table 3: GPU type vs. layout (Azure, λ = 100, SLO = 500 ms).

    GPU    Layout     GPUs   Cost/yr   P99 TTFT
    A10G   Two-pool   19     $168K     155 ms / 335 ms
    H100   Homo       6      $211K     26 ms
    A100   Two-pool   12     $232K     52 ms / 112 ms
    H100   Two-pool   7      $247K     13 ms / 30 ms

The instinct is wrong. A10G in a two-pool layout is cheapest at $168K -- $43K less than 6 H100s. A10G's low per-card cost ($8.85K/yr vs. $35.2K/yr) means 19 cards still total less than 6 H100s. The two-pool layout compensates for A10G's lower throughput: at B_short = 4096, each A10G gets n_max = 128 concurrent sequences in the short pool vs. 64 at max ctx = 8K -- a 2× slot bonus.
Different constraints call for different choices:

    Priority                        Choice
    Minimum annual cost             A10G two-pool ($168K)
    Minimum rack space / power      H100 homo (6 GPUs)
    Best short-request latency      H100 two-pool (13 ms P99)
    Long-context / agent workload   H100 or A100 (A10G VRAM limits KV cache)

Insight 3. GPU cost depends on pool topology, not just price and throughput. The slot multiplier from a well-chosen B_short can make a slower, cheaper GPU the minimum-cost option.

4.4 Puzzle 4: When Do I Need to Add GPUs?

Traffic is growing. At what arrival rate does the current fleet run out of headroom, and how much warning does the operator have?

GPU scaling is sub-linear: traffic grows 16× (25 → 400 req/s) but GPU count grows only 5.75× (4 → 23). This is a consequence of Erlang-C convexity -- each additional GPU pushes down utilization ϱ, which reduces tail latency at an accelerating rate, so each marginal GPU is increasingly effective. The table gives the exact λ at which each fleet size runs out of headroom. Waiting until the SLO is already broken means at least one traffic bracket with degraded P99 before new capacity comes online.

Table 4: GPU step thresholds, H100 two-pool fleet (Azure, SLO = 500 ms).

    λ (req/s)   GPUs   Cost/yr   Provision more before λ =
    25          4      $141K     65
    50          5      $176K     90
    100         7      $247K     130
    150         10     $352K     185
    200         12     $423K     270
    300         18     $634K     370
    400         23     $810K     —

Insight 4. GPU provisioning does not scale linearly with traffic. The whatif sweep produces exact step thresholds, so capacity planning can stay ahead of demand rather than react to violations.

4.5 Puzzle 5: Which Router Causes SLO Violations?

The fleet is correctly sized. Does the choice of routing policy still matter? We compare three routers on the agent fleet (λ = 20, n_s = 2, n_l = 23 H100s, SLO = 1000 ms).

Table 5: Router comparison on the agent fleet.
    Router             P99 TTFT   SLO 1000 ms
    LengthRouter       495 ms     ✓ 99.98%
    CompressAndRoute   534 ms     × 99.94%
    RandomRouter       292 ms     ✓ 100%

Two results are unexpected. First, CompressAndRoute fails the SLO even though it was designed to reduce the GPU count. It compresses borderline requests and routes them to the 2-GPU short pool; when several arrive together, they overwhelm that pool and spike P99. CompressAndRoute is a sizing tool -- it finds the minimum GPU count -- but LengthRouter should be what runs in production.

Second, RandomRouter actually passes with the lowest P99 (292 ms) by spreading all 25 GPUs' KV slots uniformly and diluting heavy-tail service events. But this is brittle: short requests share slots with long ones, so any shift in the traffic mix can cause unpredictable latency. For standard chatbot workloads at low utilization, all three routers pass comfortably.

Insight 5. The router used to size the fleet and the router deployed in production should be different. CompressAndRoute finds the floor on GPU count; LengthRouter operates that fleet safely. Conflating them produces SLO violations the sizing simulation never predicted.

4.6 Puzzle 6: Does Mixing GPU Types Save Money?

Short requests are memory-bandwidth-bound and inexpensive to serve; long requests need large KV caches and fast prefill. Can an operator save money by putting cheap GPUs in the short pool and premium GPUs only in the long pool?

Table 6: Mixed GPU types, Azure workload (λ = 100, SLO = 500 ms).

    Config                GPUs   Cost/yr   P99-short   P99-long
    All-A100              12     $232K     52 ms       112 ms
    A10G P_s + H100 P_l   12     $212K     155 ms      30 ms
    A10G P_s + A100 P_l   15     $206K     155 ms      112 ms

Table 7: Mixed GPU types, LMSYS workload (λ = 100, max ctx = 65K, SLO = 500 ms).

    Config                GPUs   Cost/yr   P99-short   P99-long     SLO
    All-A100              8      $155K     43 ms       2,822 ms     ×
    A10G P_s + H100 P_l   7      $141K     129 ms      181 ms       ✓
    A10G P_s + A100 P_l   9      $132K     129 ms      2,822 ms     ×

On Azure, A10G+H100 saves 9% vs. all-A100 with the same 12 GPUs. Cheap A10Gs handle 98% of the traffic; the H100s go where the long context warrants them.

On LMSYS at 65K context, the picture changes sharply. A10G+A100 is 11% cheaper on paper, but it fails the SLO: prefill time for a 65K-token request on A100 reaches 700–2800 ms, blowing past the 500 ms budget before any queue wait is counted. H100's larger chunk size (1024 vs. 512) and lower W halve that time. A10G+H100 saves 9% vs. all-A100 and fixes the SLO that all-A100 couldn't meet.

Insight 6. Mixing GPU types is not just a cost optimization; the wrong long-pool GPU makes the SLO infeasible regardless of how many you add. Joint optimization over pool assignment and GPU type is required, and some pairings are simply invalid.

4.7 Puzzle 7: When Should I Switch to Disaggregated Serving?

Disaggregated prefill/decode (P/D) serving [Zhong et al., 2024, Patel et al., 2024] separates compute-bound prefill from memory-bandwidth-bound decode onto different GPU pools. Which GPU should handle prefill, which decode, and does the cost saving justify the higher TTFT from KV-transfer overhead?

Table 8: Disaggregated P/D configurations (Azure λ = 100, TTFT SLO = 500 ms, TPOT SLO = 100 ms). KV-transfer adds 1.8× raw prefill time (BETA_TTFT=1.80 in fleet_sim/optimizer/disagg.py).

    Config                GPUs        Cost/yr   TTFT     TPOT
    All-A100 aggregated   12          $232K     26 ms    —
    All-H100 aggregated   6           $211K     8 ms     —
    H100P + A100D         7 (1P+6D)   $151K     162 ms   91 ms
    H100P + H100D         4 (1P+3D)   $141K     162 ms   45 ms
    A100P + H100D         4 (1P+3D)   $125K     492 ms   45 ms

Disaggregation cuts cost by 35–46% vs. aggregated serving, at the price of higher TTFT from KV-transfer overhead. The optimal assignment -- A100 prefill, H100 decode -- is counter-intuitive.
H100 decode workers process 2.5× more requests per second than A100 (lower W, faster per-token iteration), so only 3 are needed vs. 6. One A100 handles all prefill at λ = 100 req/s. Despite H100 costing 1.82× more per card, the decode pool is where the premium pays off.

Two practical thresholds emerge. For TTFT SLO ≤ 200 ms, use H100P+H100D ($141K, TTFT = 162 ms). For TTFT SLO ≤ 100 ms, disaggregated serving is not viable and aggregated H100 ($211K) is the only option. At arrival rates below ~50 req/s, the operational complexity of disaggregation -- separate scaling policies, KV-transfer networking -- is not justified by the savings.

Insight 7. In disaggregated serving, the premium GPU should handle decode, not prefill. The counter-intuitive assignment emerges from joint optimization over decode throughput, GPU cost, and KV-transfer TTFT overhead -- not something readily derived by hand.

4.8 Puzzle 8: How Much Grid Power Can I Shed Without an SLO Breach?

Data centers increasingly participate in grid demand-response (DR) programs, where the grid operator requests a temporary power reduction (typically 10–30% for 15–60 minutes) in exchange for reduced electricity tariffs or ancillary-service revenue. The GPU-to-Grid (G2G) framework [Hassan et al., 2025] demonstrates that capping the serving engine's maximum in-flight batch size (max_num_seqs in vLLM) is the most effective software knob for modulating GPU power: fewer concurrent requests reduce memory-bandwidth pressure, lowering power draw without touching clock frequency or voltage. The trade-off is higher queueing delay, which may breach the TTFT SLO. inference-fleet-sim quantifies the trade-off via grid_flex_analysis().
The function sweeps target power-reduction percentages, inverts the GPU power model to find the implied batch cap (n_max), and recomputes P99 TTFT with the reduced KV-slot count using the same M/G/c approximation used in optimization.

Power model. Each GPU profile implements the logistic power curve from the G2G paper [Hassan et al., 2025] (Eq. 2):

    P(b) = \frac{P_{\text{range}}}{1 + e^{-k(\log_2 b - x_0)}} + P_{\text{idle}}

where b ≈ n_max (concurrent requests), P_range = P_nom − P_idle, and the shape parameters (k = 1.0, x_0 = 4.2) are fitted to ML.ENERGY Benchmark v3.0 data [Chung et al., 2025] for H100-SXM5 running vLLM. The logistic fit gives P(1) ≈ 304 W and P(128) ≈ 583 W (measured: ≈ 600 W, error < 3%). The M/G/c service rate is recalibrated at each batch cap so the analytical model reflects the faster per-iteration throughput at lower concurrency. A DES run with N = 15,000 requests independently verifies the analytical P99 estimates.

Table 9: Grid flexibility curve for 40 H100 GPUs, λ = 200 req/s, SLO = 500 ms (Azure workload). Logistic power model, DES-verified. For short-burst DR events (≲75 s) the fleet safely commits up to 40% power curtailment; steady-state stability holds to 30%.

    Flex   n_max   W/GPU   Fleet kW   P99 anal.   P99 DES   SLO
    0%     128     583 W   23.3 kW    7.9 ms      35 ms     ✓
    10%    48      540 W   21.6 kW    7.9 ms      38 ms     ✓
    20%    24      479 W   19.1 kW    7.9 ms      41 ms     ✓
    30%    13      413 W   16.5 kW    7.9 ms      51 ms     ✓
    40%    6       350 W   14.0 kW    ∞           190 ms    ✓ †
    50%    1       304 W   12.2 kW    ∞           ≫ SLO     ×

    † Safe for short DR events (< 2 min); analytically unstable at steady state.

The logistic model reveals a key G2G insight: at full production load (n_max = 128), H100 power is already at ≈ 97% of nominal. Halving the batch from 128 to 64 saves only ≈ 13 W (~2%).
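A sketch of the logistic curve using the k and x_0 quoted in the text. P_idle and P_range are not stated directly, so here they are back-solved from the two quoted endpoints P(1) ≈ 304 W and P(128) ≈ 583 W -- an illustrative fit, not the shipped profile constants:

```python
import math

K, X0 = 1.0, 4.2  # shape constants quoted in the text (H100-SXM5, vLLM)

def _logistic(b: float) -> float:
    return 1.0 / (1.0 + math.exp(-K * (math.log2(b) - X0)))

# Back-solve P_range and P_idle from the two quoted endpoints
# P(1) ~= 304 W and P(128) ~= 583 W (assumed calibration points).
P_RANGE = (583.0 - 304.0) / (_logistic(128) - _logistic(1))
P_IDLE = 304.0 - P_RANGE * _logistic(1)

def power_watts(b: float) -> float:
    """Per-GPU power draw (W) at a batch cap of b concurrent sequences."""
    return P_RANGE * _logistic(b) + P_IDLE
```

With this fit, power_watts reproduces the per-GPU column of Table 9 to within about 1 W (540 W at n_max = 48, 479 W at 24, 413 W at 13, 350 W at 6) and makes the saturation visible: the curve is nearly flat above a batch of roughly 64, which is why shallow caps shed so little power.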
Meaningful savings require deep batch reduction (e.g., 40% flex at n_max = 6, saving 9.3 kW fleet-wide), which approaches but does not breach the Erlang-C saturation threshold for time-limited DR events.

The DES verification confirms the recalibrated analytical estimates for 0–30% flex (both predict P99 ≈ 8–50 ms). At 40% flex, the analytical model flags steady-state instability (∞) while DES shows P99 = 190 ms during a 75-second event window -- still within the 500 ms SLO. At 50%, both DES and analysis agree the queue collapses.

Insight 8. The safe DR commitment depth depends on event duration. For sustained reduction (hours), the recalibrated M/G/c model shows 30% is the stability limit. For short events (minutes), DES verification shows 40% is feasible (saving 9.3 kW from a 23.3 kW baseline) before the queue collapses catastrophically at 50%. inference-fleet-sim is the only tool that provides both bounds together from a single workload CDF and GPU profile.

5 Limitations

Poisson sub-stream approximation. Splitting a Poisson stream by token length is a deterministic rule, not random thinning, so the sub-streams are not strictly Poisson [Harchol-Balter, 2013]. When prompt length correlates with arrival time (e.g., long requests arrive in bursts), queue-length estimates from the analytical model are approximations. The DES checks the approximation in each case.

Request-level service model. The DES fires one event per request, not one per decode iteration. It does not simulate preemption, batching reorder, or speculative decoding. For detailed scheduler comparison, Vidur [Agrawal et al., 2024] provides higher fidelity at the cost of slower simulation.

Linear roofline performance model. The GPU model uses a linear (W, H) roofline. Non-linear effects such as FlashAttention speedup, quantization, and NCCL overlap are absorbed into calibrated constants but not explicitly modeled.
Users can replace an y constant with measured data for higher accuracy . Single-no de replica mo del. Eac h GPU instance is assumed to b e a single-no de replica. In ter-no de communication ov erhead for tensor-parallel configurations (NVLink vs. InfiniBand) is not mo deled; users should adjust W and H to reflect collectiv e latency when running m ulti- no de TP . Logistic p o wer mo del accuracy . Grid flex analysis (§ 4.8 ) uses a logistic p o wer curve fitted to ML.ENERGY Benchmark v3.0 H100-SXM5 data [ Chung et al. , 2025 ]. The fit is accurate to within ∼ 3% at batc h ≥ 16 ; at batc h < 8 (deep curtailment) it is less well-constrained b ecause few ML.ENER GY data points fall in that regime. The analytical M/G/c mo del is recalibrated at each batch-cap level, correcting for the faster per-iteration throughput at lo wer concurrency; DES verification pro vides an indep enden t c heck. Users should re-fit power_logistic_k and power_logistic_x0 from measured vllm serve profiling runs for their sp ecific model and GPU generation. 6 Related W ork Single-instance sim ulators. Vidur [ Agraw al et al. , 2024 ] simulates one engine replica at op erator level (GEMM, atten tion, KV-cac he managemen t, preemption) with < 9% latency pre- diction error. It optimizes engine configuration for a fixe d GPU cluster and do es not address fleet-lev el p o ol routing or cost. APEX [ Lin et al. , 2024 ] is an extensible, dynamism-aw are sim- ulator for selecting parallel execution plans (TP/PP/data parallelism) across dense and MoE 11 LLMs; it finds optimal plans 71 × faster than GPU-based search but targets intra-cluster paral- lelism, not inter-po ol fleet top ology . AIConfigurator [ Xu et al. , 2025 ] searches TP/EP/batch-size configurations against a calibrated k ernel p erformance database in under 30 seconds; its output (W/H constants) feeds inference-fleet-sim ’s ProfileBuilder. GPU type selection. Mélange [ Griggs et al. 
, 2024] formulates heterogeneous GPU allocation as cost-aware bin packing and achieves up to 77% cost reduction vs. single-GPU-type deployments. It chooses GPU types but does not model pool routing or multi-pool queue dynamics. inference-fleet-sim takes the GPU type as input (chosen optionally with Mélange) and sizes the fleet.

Disaggregated serving. DistServe [Zhong et al., 2024] introduces P/D disaggregation and models each phase as an M/D/1 queue. Splitwise [Patel et al., 2024] co-designs heterogeneous hardware for the two phases. inference-fleet-sim's DisaggFleetOptimizer builds on these ideas by sizing fleets of disaggregated pool pairs using M/G/c with the full token-length CDF.

Runtime autoscaling. SageServe [Jia et al., 2025] uses ARIMA forecasting and ILP scaling to manage an existing O365 fleet, saving 25% GPU-hours. TokenScale [Dong et al., 2024] uses Token Velocity as a leading indicator for burst handling in disaggregated fleets. Both operate at run time; inference-fleet-sim operates at the provisioning layer and provides the peak-hour sizing that SageServe and TokenScale scale around.

Queueing theory in systems. Two-moment M/G/c approximations [Kimura, 1994] have been applied to DNN serving in Clockwork [Gujarati et al., 2020] and AlpaServe [Li et al., 2023]. inference-fleet-sim extends this to multi-pool, multi-GPU-type LLM fleet planning with DES validation.

Tool                  Core question answered
Vidur                 Best batching/scheduling config for one GPU?
APEX                  Best TP/PP/data-parallel plan for one cluster?
AIConfigurator        Best TP/EP/engine flags for one cluster?
Mélange               Which GPU types to mix for minimum cost?
Splitwise             Which GPU generation for prefill vs. decode?
DistServe             Prefill-to-decode GPU ratio per cluster replica?
TokenScale            Scale P/D pools in real time under bursts?
SageServe             VM count through a 24-hour demand cycle?
inference-fleet-sim   Pool topology, routing policy, total GPU count, fleet cost?

7 Conclusion

inference-fleet-sim is a two-phase LLM GPU fleet capacity planner that combines analytical M/G/c optimization with discrete-event simulation to find minimum-cost fleet configurations that meet a P99 TTFT SLO. Running it on eight scenarios across two public workload traces, one synthetic agent trace, and one demand-response study produced findings that resist simple analysis: the optimal split threshold is not readable off the CDF; a 30%-utilized fleet can fail its SLO; a slow, cheap GPU can be cheaper than a fast, expensive one; GPU scaling is sub-linear; sizing-time and production routers should differ; mixed GPU pools have invalid pairings the simulator flags before purchase; in disaggregated serving, the premium GPU earns back its cost in decode, not prefill; and a 40-GPU H100 fleet can commit to 30% sustained power curtailment or 40% short-event curtailment (saving 9.3 kW) while meeting its SLO, with the G2G logistic power curve and DES verification confirming both bounds. These results emerge from the joint space of pool topology, routing policy, GPU performance, queueing dynamics, and now power management. No single tool in the existing ecosystem explores that space; inference-fleet-sim does.

inference-fleet-sim is open-source and part of the vLLM Semantic Router project [vLLM Project Contributors, 2026]. All case-study results are reproducible via the run_sim.py CLI and the CDF data files in the repository.

References

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for LLM inference, 2024.

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, and Xue Liu. Compress-and-route: Routing-layer prompt compression against the long-context cost cliff in LLM inference fleets.
arXiv preprint, 2026. Manuscript under review.

Jae-Won Chung, Woosuk Kwon, and Ion Stoica. LLM-Pilot: Characterize and optimize inference of LLMs on dedicated GPU servers. In Proc. NeurIPS Datasets and Benchmarks Track, 2025. URL https://ml.energy/leaderboard. ML.ENERGY Benchmark v3.0: measured H100-SXM5 GPU power vs. batch size (1–256) running vLLM on Llama-3.1 variants; idle (batch ≈ 1) ≈ 300 W, saturated (batch = 128) ≈ 600 W for a 70B-class model.

Shengkun Cui, Archit Patke, Hung Nguyen, Aditya Ranjan, Ziheng Chen, Phuong Cao, Gregory Bauer, Brett Bode, Catello Di Martino, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. Story of two GPUs: Characterizing the resilience of Hopper H100 and Ampere A100 GPUs. 2025. 11.7M GPU-hours on Delta (1,056 A100+H100 GPUs); recommends ∼5% H100 overprovisioning.

Chenhe Dong et al. TokenScale: Timely and accurate autoscaling for disaggregated LLM serving with token velocity. 2024.

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity. 2024.

Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In Proc. OSDI, 2020.

Mor Harchol-Balter. Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.

M. Hassan, J. Lin, D. Kim, R. Bhatt, Y. Wang, N. Li, and S. Grijalva. GPU-to-Grid: Coupling LLM inference with power system control. 2025. URL https://arxiv.org/abs/2602.05116. Shows vLLM max_num_seqs is the primary GPU power knob; power follows a logistic curve vs. log2(batch size); data from ML.ENERGY Benchmark on H100-SXM5.

Ningxin Jia et al.
SageServe: Optimizing LLM serving on cloud data centers with forecast-aware auto-scaling. 2025.

Toshikazu Kimura. Two-moment approximations for the mean waiting time in the M/G/c queue. J. Oper. Res. Soc. Japan, 37(3):238–256, 1994.

Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, Shubho Sengupta, Kalyan Saladi, and Carole-Jean Wu. Revisiting reliability in large-scale machine learning research clusters. 2024. URL https://arxiv.org/abs/2410.21680. RSC-1 failure rate: 6.50 per 1000 node-days; MTTF at 1024 GPUs ≈ 7.9 h.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proc. SOSP, 2023.

Adrien Laurent. H100 rental prices compared: $1.49–$6.98/hr across 15+ cloud providers (2026). https://intuitionlabs.ai/articles/h100-rental-prices-cloud-comparison, March 2026. Comprehensive 2026 survey of on-demand GPU cloud prices: H100 ranges $1.49–$6.98/GPU-hr (marketplace to Azure); AWS P5 on-demand approx. $3.93/GPU-hr after the June 2025 44% price cut; A100 now below $2.06/GPU-hr on major providers and sub-$1/hr on open marketplaces.

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In Proc. OSDI, 2023.

Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, and Fanny Nina Paravecino. APEX: An extensible and dynamism-aware simulator for automated parallel execution in LLM serving. arXiv:2411.17651, 2024. Intra-cluster parallelism plan search (TP/PP/data); 71× faster than GPU-based exploration; validated against vLLM and SGLang.

Microsoft Azure. Azure LLM inference trace 2023.
https://github.com/Azure/AzurePublicDataset, 2023.

Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proc. ISCA, 2024.

vLLM Project Contributors. vLLM semantic router. https://github.com/vllm-project/semantic-router, 2026. Open-source LLM request routing framework; includes the inference-fleet-sim fleet capacity planner.

Tianhao Xu, Yiming Liu, Xianglong Lu, Yijia Zhao, Xuting Zhou, Aichen Feng, Yiyi Chen, et al. AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving. 2025.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. In Proc. ICLR, 2024.

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Proc. OSDI, 2024.
