Analysis of Load Balancing in Large Heterogeneous Processor Sharing Systems
We analyze randomized dynamic load balancing schemes for multi-server processor sharing systems when the number of servers in the system is large and the servers have heterogeneous service rates. In particular, we focus on the classical power-of-two load balancing scheme and a variant of it in which a newly arrived job is assigned to the server having the least instantaneous Lagrange shadow cost among two randomly chosen servers. The instantaneous Lagrange shadow cost at a server is given by the ratio of the number of unfinished jobs at the server to the capacity of the server. Two different analytical approaches are presented for each scheme. For exponential job length distributions, the analysis uses the mean field approach; for more general job length distributions, it is carried out assuming an asymptotic independence property. Analytical expressions for the mean sojourn time of jobs are found for both schemes. Asymptotic insensitivity of the schemes to the type of job length distribution is established. Numerical results are presented to validate the theoretical results and to show that, unlike in the homogeneous scenario, the power-of-two type schemes considered in this paper may not always result in better behaviour in terms of the mean sojourn time of jobs.
💡 Research Summary
The paper investigates randomized dynamic load‑balancing policies for large‑scale parallel server farms in which each server operates under a processor‑sharing (PS) discipline but may have a different service capacity. Two main policies are studied: the classical “power‑of‑two” (SQ(2)) scheme, where an arriving job samples two servers uniformly at random and joins the one with the fewest unfinished jobs, and a variant that uses the instantaneous Lagrange shadow cost (unfinished jobs divided by server capacity) as the selection metric.
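The two selection rules differ only in the metric they compare. The following Python sketch illustrates this; the function names, the uniform tie-breaking, and sampling with replacement are assumptions made for illustration and are not details taken from the paper.

```python
import random

def sq2_metric(jobs, capacity):
    """Classical power-of-two metric: number of unfinished jobs."""
    return jobs

def shadow_cost_metric(jobs, capacity):
    """Instantaneous Lagrange shadow cost: unfinished jobs / capacity."""
    return jobs / capacity

def choose(a, b, jobs, capacity, metric, rng=random):
    """Return whichever of servers a and b has the smaller metric value."""
    ma = metric(jobs[a], capacity[a])
    mb = metric(jobs[b], capacity[b])
    if ma == mb:
        return rng.choice([a, b])  # assumed tie-break: uniform at random
    return a if ma < mb else b

def assign(jobs, capacity, metric, rng=random):
    """Sample two servers uniformly (with replacement, an assumption),
    route the arriving job to the chosen one, and return its index."""
    a = rng.randrange(len(jobs))
    b = rng.randrange(len(jobs))
    target = choose(a, b, jobs, capacity, metric, rng)
    jobs[target] += 1
    return target

jobs = [3, 4]           # unfinished jobs at servers 0 and 1
capacity = [1.0, 4.0]   # server 1 is four times faster
print(choose(0, 1, jobs, capacity, sq2_metric))          # 0: fewer jobs
print(choose(0, 1, jobs, capacity, shadow_cost_metric))  # 1: 4/4 < 3/1
```

With these example numbers the two rules disagree: the job-count rule sends the job to the slow server, while the shadow-cost rule prefers the fast server despite its longer queue.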
For exponential job‑size distributions the authors employ a mean‑field analysis. They define the state variable u_j^n(t) as the fraction of servers of class j (capacity C_j) that have exactly n jobs at time t. By taking the limit as the number of servers N → ∞, they derive a deterministic system of ordinary differential equations (ODEs) that governs the evolution of u_j^n(t). They prove that, under a mild load condition, this ODE system possesses a unique globally asymptotically stable equilibrium P_j^n. Moreover, the equilibrium tail decays doubly exponentially, which implies that the mean sojourn time is finite and the system is stable.
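As a rough illustration of what such a mean-field system looks like for the SQ(2) rule, the sketch below integrates a set of ODEs derived here under stated assumptions rather than copied from the paper. It assumes that λ is the arrival rate per server, that the two servers are sampled with replacement, and that ties are broken uniformly. It works with tail fractions x_j^n (the fraction of class-j servers with at least n jobs), so that u_j^n = x_j^n − x_j^{n+1}. The truncation level `n_max`, the Euler step `dt`, and all function names are hypothetical choices.

```python
def mean_field_equilibrium(gamma, cap, lam, mu=1.0,
                           n_max=20, dt=0.05, t_end=400.0):
    """Forward-Euler integration of an assumed SQ(2) mean-field ODE:

        dx_j^n/dt = lam * (x_j^{n-1} - x_j^n) * (q^{n-1} + q^n)
                    - mu * C_j * (x_j^n - x_j^{n+1}),

    where q^n = sum_i gamma_i * x_i^n is the overall fraction of
    servers holding at least n jobs. Starts from an empty system."""
    J = len(gamma)
    # x[j][0] = 1 always; x[j][n_max + 1] = 0 is the truncation boundary.
    x = [[1.0] + [0.0] * (n_max + 1) for _ in range(J)]
    for _ in range(int(t_end / dt)):
        q = [sum(gamma[j] * x[j][n] for j in range(J))
             for n in range(n_max + 2)]
        new = [row[:] for row in x]
        for j in range(J):
            for n in range(1, n_max + 1):
                arrivals = lam * (x[j][n - 1] - x[j][n]) * (q[n - 1] + q[n])
                departures = mu * cap[j] * (x[j][n] - x[j][n + 1])
                new[j][n] = x[j][n] + dt * (arrivals - departures)
        x = new
    return x

def mean_sojourn_time(x, gamma, lam):
    """Little's law: mean jobs per server divided by per-server arrival rate."""
    mean_jobs = sum(g * sum(row[1:]) for g, row in zip(gamma, x))
    return mean_jobs / lam

# Sanity check on the homogeneous case (one class, C = 1, mu = 1), where
# the fixed point of these equations is x^n = lam ** (2 ** n - 1).
x = mean_field_equilibrium(gamma=[1.0], cap=[1.0], lam=0.5)
print(x[0][1:4])  # should be close to 0.5, 0.125, 0.0078125
print(mean_sojourn_time(x, [1.0], 0.5))
```

The homogeneous check makes the doubly exponential decay visible: the exponent of λ doubles (plus one) at each level instead of growing linearly as it would under purely random routing.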
When the job‑size distribution is general (not necessarily exponential), the analysis relies on the propagation‑of‑chaos (asymptotic independence) property: in the infinite‑server limit any finite collection of servers behaves independently. Under this assumption the same mean‑field ODEs describe the limiting dynamics, and the equilibrium remains insensitive to the higher moments of the job‑size distribution; only the mean service time 1/µ enters the fixed‑point equations. This establishes insensitivity of both the SQ(2) and the shadow‑cost variant.
A central contribution is the characterization of the stability region. For a static, state‑independent routing policy (the optimal benchmark), stability holds whenever
λ < µ · ∑_{j∈J} γ_j C_j,
where γ_j is the proportion of servers of class j. For the SQ(2) scheme the stability condition is stricter:
λ < µ · min_{I⊆J} (∑_{j∈I} γ_j C_j) / (∑_{j∈I} γ_j)^2.
Thus, in heterogeneous settings the power‑of‑two rule can actually shrink the admissible load compared with the optimal static policy, a counter‑intuitive finding that does not occur in homogeneous systems.
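The two stability bounds can be compared numerically. The sketch below evaluates both expressions by enumerating every non-empty subset I of server classes; the function names and the example values of γ_j and C_j are assumptions chosen for illustration and do not come from the paper.

```python
from itertools import combinations

def static_bound(gamma, cap, mu=1.0):
    """Upper limit on lambda for the static benchmark: mu * sum_j gamma_j C_j."""
    return mu * sum(g * c for g, c in zip(gamma, cap))

def sq2_bound(gamma, cap, mu=1.0):
    """Upper limit on lambda for SQ(2): minimum over non-empty subsets I of
    (sum_{j in I} gamma_j C_j) / (sum_{j in I} gamma_j)^2, times mu."""
    classes = range(len(gamma))
    best = float("inf")
    for r in range(1, len(gamma) + 1):
        for subset in combinations(classes, r):
            num = sum(gamma[j] * cap[j] for j in subset)
            den = sum(gamma[j] for j in subset) ** 2
            best = min(best, num / den)
    return mu * best

# Hypothetical system: 75% slow servers (C = 0.5), 25% fast servers (C = 4).
gamma = [0.75, 0.25]
cap = [0.5, 4.0]
print(static_bound(gamma, cap))  # 1.375
print(sq2_bound(gamma, cap))     # about 0.667, set by the slow class alone
```

One way to read the subset term: an arrival whose two sampled servers both fall in I must be served inside I, which happens with probability (∑_{j∈I} γ_j)^2, while the capacity available there is µ ∑_{j∈I} γ_j C_j. When all capacities are equal, the minimum is attained at I = J and the two bounds coincide.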
To overcome this limitation, the authors propose a hybrid scheme. Upon each arrival, a server class j is first selected with probability p_j, where the p_j are chosen to minimize the overall mean sojourn time. Two servers of that class are then sampled uniformly, and the job joins the one with the lower shadow cost. By appropriately tuning the probabilities p_j, the hybrid policy restores the maximal stability region (the same as that of the static benchmark) while retaining the performance gains of the SQ(2) rule within each class. The mean‑field analysis shows that the hybrid system also admits a unique equilibrium whose tail decays doubly exponentially.
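The routing step of the hybrid scheme described above can be sketched as follows. The probabilities p_j are treated as given inputs, since the summary does not state how they are computed; the function name, the data layout, sampling with replacement, and the deterministic tie-break are all assumptions.

```python
import random

def hybrid_assign(jobs_by_class, p, rng=random):
    """Route one arrival under the hybrid rule.

    jobs_by_class[j] is the list of job counts at the class-j servers and
    p[j] is the probability of selecting class j. Returns (class, server)."""
    # Step 1: pick a server class according to the probabilities p_j.
    j = rng.choices(range(len(p)), weights=p, k=1)[0]
    servers = jobs_by_class[j]
    # Step 2: sample two servers of that class (with replacement, assumed).
    a = rng.randrange(len(servers))
    b = rng.randrange(len(servers))
    # Servers in one class share a capacity, so the lower shadow cost is
    # simply the smaller job count. Ties go to the first sample (assumed).
    target = a if servers[a] <= servers[b] else b
    servers[target] += 1
    return j, target

jobs_by_class = [[2, 0, 1], [4, 3]]   # hypothetical two-class system
cls, srv = hybrid_assign(jobs_by_class, p=[0.3, 0.7], rng=random.Random(42))
print(cls, srv, jobs_by_class)
```

Because the comparison happens within a single class, the selection inside each class reduces to the classical power-of-two rule, and the split of traffic across classes is controlled entirely by the p_j.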
Numerical experiments validate the theoretical predictions. Simulations confirm that, for heterogeneous capacities, the pure SQ(2) policy can yield higher mean response times than static routing, whereas the hybrid policy consistently outperforms both. Experiments with non‑exponential job‑size distributions demonstrate the claimed insensitivity: the measured mean sojourn times match the analytical values, which depend only on the mean service time.
In summary, the paper makes three major contributions: (1) it precisely quantifies the stability loss of power‑of‑two load balancing in heterogeneous PS systems; (2) it extends mean‑field and chaos‑based techniques to obtain explicit equilibrium distributions and prove doubly‑exponential tail decay and insensitivity; (3) it introduces a practical hybrid routing rule that recovers full stability and improves delay performance. These results are relevant for the design of large web‑service farms, cloud data centers, and any distributed processing platform where server speeds differ and low latency is critical.