Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints


We study the problem of routing queries to large language models (LLMs) under cost, GPU-resource, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model-capacity limits. We further introduce a robust variant that accounts for uncertainty in predicted LLM performance, along with an offline instance-allocation procedure that balances quality and throughput across multiple models. Experiments on two multi-task LLM benchmarks show that robustness improves accuracy by 1-14% over non-robust counterparts (depending on the performance estimator), that batch-level routing outperforms per-query methods by up to 24% under adversarial batching, and that optimized instance allocation yields additional gains of up to 3% over a non-optimized allocation, all while strictly satisfying cost and GPU-resource constraints.


💡 Research Summary

The paper tackles the practical problem of routing user queries to large language models (LLMs) while respecting monetary cost budgets, GPU capacity limits, and concurrency constraints. Existing routing approaches operate on a per‑query basis, estimating a quality score l(q, m) and a cost c(q, m) for each model m, and then selecting the model that maximizes l – λ·c. This formulation, however, cannot guarantee that a batch of queries will stay within a prescribed cost budget, nor can it prevent oversubscription of limited‑capacity models when many hard queries arrive together. Moreover, it ignores the heterogeneity between locally hosted models (with strict GPU limits) and cloud‑based models (with higher monetary cost but virtually unlimited scaling).
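The per-query baseline described above can be sketched in a few lines. The quality scores, costs, and the value of λ below are illustrative placeholders, not numbers from the paper:

```python
# Per-query routing baseline: pick argmax over models of l(q, m) - lam * c(q, m).
# Model names, scores, and costs are hypothetical examples.
quality = {"small": 0.62, "medium": 0.78, "large": 0.91}    # l(q, m)
cost    = {"small": 0.0005, "medium": 0.002, "large": 0.01} # c(q, m), dollars/query
lam = 20.0  # cost-quality trade-off weight

def route_per_query(quality, cost, lam):
    """Return the model maximizing quality minus lam-weighted cost."""
    return max(quality, key=lambda m: quality[m] - lam * cost[m])

print(route_per_query(quality, cost, lam))  # -> "medium"
```

Because each query is scored in isolation, nothing in this rule bounds the aggregate spend of a batch or the number of queries simultaneously routed to a capacity-limited model, which is exactly the gap the batch-level formulation closes.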

To overcome these limitations, the authors propose a batch‑level routing framework formulated as an integer linear program (ILP). For a batch of N queries and M available models, each model j is characterized by a per‑query cost c_j (assumed query‑independent), a fixed number of instances I_j (determined by the number of GPUs allocated to that model), and a per‑instance concurrency limit l_j (maximum simultaneous queries per instance). Let a_ij be the estimated performance of model j on query i, and let x_ij∈{0,1} indicate whether query i is assigned to model j. The ILP maximizes the average predicted quality

 max (1/N) ∑_{i=1}^N ∑_{j=1}^M a_ij x_ij

subject to:

  1. Total cost constraint: (1/N) ∑_{i,j} c_j x_ij ≤ C (where C is the per‑query budget).
  2. Capacity constraints: for each model j, ∑_{i=1}^N x_ij ≤ l_j I_j.
  3. Assignment constraint: for each query i, ∑_{j=1}^M x_ij = 1.

Because the problem size (N up to a few hundred, M typically ≤ 10) is modest, off‑the‑shelf solvers such as SCIP can solve each batch in milliseconds, making the approach viable for real‑time systems.
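A minimal sketch of this ILP, using SciPy's generic MILP solver as a stand-in for SCIP (the data below is synthetic, and the variable layout is my own choice, not the paper's):

```python
# Batch-level routing ILP: maximize average predicted quality subject to
# a per-query cost budget, per-model capacity, and one model per query.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

N, M = 6, 3                                   # queries, models (toy sizes)
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=(N, M))        # predicted quality a_ij
c = np.array([0.0, 0.002, 0.01])              # per-query cost c_j
cap = np.array([2, 4, 6])                     # capacity l_j * I_j per model
C = 0.005                                     # per-query cost budget

# Binary variables x_ij, flattened row-major; maximize => negate objective.
obj = -(a / N).ravel()

# (1) cost: (1/N) * sum_ij c_j x_ij <= C
A_cost = np.tile(c / N, N)[None, :]
# (2) capacity: sum_i x_ij <= l_j I_j for each model j
A_cap = np.tile(np.eye(M), N)
# (3) assignment: sum_j x_ij = 1 for each query i
A_assign = np.kron(np.eye(N), np.ones(M))

res = milp(
    obj,
    constraints=[
        LinearConstraint(A_cost, -np.inf, C),
        LinearConstraint(A_cap, -np.inf, cap),
        LinearConstraint(A_assign, 1, 1),
    ],
    integrality=np.ones(N * M),
    bounds=Bounds(0, 1),
)
x = res.x.reshape(N, M).round().astype(int)
print("model per query:", x.argmax(axis=1))
print("avg quality:", (a * x).sum() / N)
print("avg cost:", (c * x.sum(axis=0)).sum() / N)
```

At these sizes the solve is effectively instantaneous; the three constraint matrices mirror constraints (1)-(3) above term by term.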

A key contribution is the introduction of a robust variant that accounts for uncertainty in the performance estimates a_ij. The authors assume that for each pair (i, j) a prediction interval

