GPU-to-Grid: Voltage Regulation via GPU Utilization Control
While the rapid expansion of data centers poses challenges for power grids, it also creates new opportunities: data centers can act as flexible loads. Existing power system research often abstracts data centers as aggregate resources, while computer system research primarily focuses on optimizing GPU energy efficiency and largely ignores the grid impacts of the resulting GPU power consumption. To bridge this gap, we develop a GPU-to-Grid framework that couples device-level GPU control with power system objectives. We study distribution-level voltage regulation enabled by flexibility in LLM inference, using batch size as a control knob that trades off the voltage impacts of GPU power consumption against inference latency and token throughput. We first cast this trade-off as an optimization problem and then realize it as an online feedback optimization controller that leverages measurements from both the power grid and the GPU systems. Our key insight is that reducing GPU power consumption alleviates violations of lower voltage limits, while increasing GPU power mitigates violations near upper voltage limits in distribution systems; this runs counter to the common belief that minimizing GPU power consumption is always beneficial to power grids.
💡 Research Summary
The paper addresses the growing challenge of integrating large‑scale AI workloads, particularly large language model (LLM) inference, into power distribution networks. While power‑system research often treats data centers as aggregated, controllable loads, computer‑systems research focuses on GPU energy efficiency without considering grid impacts. To bridge this gap, the authors propose a “GPU‑to‑Grid” (G2G) framework that uses the batch size of LLM inference as a real‑time control knob, linking GPU power consumption directly to distribution‑level voltage regulation.
First, the authors collect extensive measurement data from NVIDIA H100 GPUs running a variety of LLMs (Llama‑3.1 family and Qwen‑3 models) under different batch sizes. They observe that increasing batch size monotonically raises GPU power draw and inter‑token latency, while token throughput initially rises sharply and then saturates. These relationships are captured with logistic functions of the logarithmic batch size x = log₂(b), yielding smooth, differentiable models for power p(x), latency l(x), and throughput r(x).
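The paper does not report the fitted parameter values, but the modeling step described above can be sketched as follows: fit a four-parameter logistic curve in x = log₂(b) to measured power samples. The data below is synthetic and illustrative (not the paper's H100 measurements), and the parameter names are assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, k, x0, c):
    """Logistic curve in x = log2(batch size): c + a / (1 + exp(-k*(x - x0)))."""
    return c + a / (1.0 + np.exp(-k * (x - x0)))

# Synthetic, illustrative measurements: power (W) saturating with batch size.
batch = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256], dtype=float)
x = np.log2(batch)
p = np.array([210, 240, 290, 360, 450, 540, 600, 630, 645], dtype=float)

# Fit the smooth, differentiable power model p(x); p0 is an initial guess.
popt, _ = curve_fit(logistic, x, p, p0=[450.0, 1.0, 4.0, 200.0], maxfev=10000)

def p_model(x):
    """Fitted power model p(x), usable inside a gradient-based controller."""
    return logistic(x, *popt)
```

The same functional form would be reused for latency l(x) and throughput r(x), each fit to its own measurements.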
The power‑system model consists of a single data‑center node connected to an M‑bus three‑phase distribution network. The data‑center load is represented by active and reactive power (assuming a constant power factor), which map to bus voltages via standard power‑flow equations (implemented in OpenDSS). For N distinct LLM services, each with w_i replicas, the total power and throughput scale approximately linearly with w_i, while latency does not. All GPUs serving the same model share a common batch size b_i, which is relaxed to a continuous variable x_i for optimization.
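Under the constant-power-factor assumption, reactive power is a fixed multiple of active power, and the aggregate data-center load scales roughly linearly in the replica counts w_i. A minimal sketch of that mapping (function names and numbers are illustrative, not from the paper):

```python
import math

def reactive_power(p_active_kw: float, power_factor: float = 0.95) -> float:
    """Q = P * tan(arccos(pf)) under a constant power factor."""
    return p_active_kw * math.tan(math.acos(power_factor))

def datacenter_load(per_gpu_power_w, replicas, power_factor=0.95):
    """Aggregate (P, Q) in kW/kvar over N services.

    Power scales ~linearly with the replica count w_i of each service;
    the resulting (P, Q) pair is what a power-flow solver such as
    OpenDSS would take as the data-center bus injection.
    """
    p_kw = sum(p * w for p, w in zip(per_gpu_power_w, replicas)) / 1000.0
    return p_kw, reactive_power(p_kw, power_factor)

# Hypothetical example: two LLM services with 512 and 256 GPU replicas.
p, q = datacenter_load([650.0, 600.0], [512, 256], power_factor=0.95)
```

The actual voltage response would then come from solving the three-phase power-flow equations with (P, Q) as the load at the data-center bus.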
The core optimization problem (Eq. 4) maximizes aggregate token throughput Σ r_i(x_i) while penalizing large changes from the previous control step (γ‖x − x_t‖²). Constraints enforce (i) three‑phase voltage limits at all buses, (ii) per‑model latency limits l_i(x_i) ≤ L_th,i, and (iii) feasible bounds on batch size. This formulation explicitly couples three stakeholders: the grid (voltage constraints), the data‑center operator (throughput objective), and end‑users (latency QoS).
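Written out, the formulation described above looks roughly as follows (the bound notation and phase indices are assumptions made for presentation; the summary gives only the ingredients):

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i=1}^{N} r_i(x_i) \;-\; \gamma \lVert x - x_t \rVert^2 \\
\text{s.t.} \quad & \underline{v} \le v_{m,\phi}(x) \le \overline{v},
    && m = 1,\dots,M,\ \phi \in \{a,b,c\}, \\
& l_i(x_i) \le L_{\mathrm{th},i}, && i = 1,\dots,N, \\
& \underline{x}_i \le x_i \le \overline{x}_i, && i = 1,\dots,N,
\end{aligned}
```

where x_t is the previous control step's decision, v_{m,φ}(x) is the phase-φ voltage at bus m obtained from the power-flow map, and the last constraint is the continuous relaxation of the admissible batch sizes.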
Because the problem is solved repeatedly in a stochastic, time‑varying environment, the authors adopt an Online Feedback Optimization (OFO) scheme. At each control interval, real‑time measurements of bus voltages (v̂) and GPU latency (l̂) are fed into a gradient‑based update of the continuous variables x. After the gradient step, the relaxed values are rounded to the nearest admissible integer batch size (b = 2^{round(x)}), which is then applied to the GPUs. This feedback loop compensates for model inaccuracies, actuation delays, and workload fluctuations, ensuring robustness.
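One OFO iteration can be sketched as a projected gradient step that trades throughput against soft penalties on the *measured* voltage and latency violations, followed by the power-of-two rounding. This is a minimal sketch, not the paper's controller; the sensitivity matrices, penalty weights, and the toy numbers in the usage example are all assumptions:

```python
import numpy as np

def ofo_step(x, v_hat, l_hat, grad_r, sens_v, sens_l, l_th,
             v_lim=(0.95, 1.05), alpha=0.1, mu=10.0, x_bounds=(0.0, 8.0)):
    """One online-feedback-optimization update on relaxed log2 batch sizes x.

    Uses measured voltages v_hat and latencies l_hat in place of model
    predictions; sens_v (dV/dx) and sens_l (dl/dx) are assumed sensitivity
    estimates standing in for the power-flow and latency Jacobians.
    """
    # Soft-penalty gradient for violated voltage limits (over- and under-voltage).
    viol = np.maximum(v_hat - v_lim[1], 0.0) - np.maximum(v_lim[0] - v_hat, 0.0)
    g_v = sens_v.T @ viol
    # Latency QoS penalty gradient, active only when l_hat exceeds its threshold.
    g_l = sens_l * np.maximum(l_hat - l_th, 0.0)
    # Ascend the throughput gradient, descend the penalties; project to the box.
    x_new = np.clip(x + alpha * (grad_r(x) - mu * (g_v + g_l)), *x_bounds)
    # Round the relaxed variable to the nearest admissible batch size b = 2^round(x).
    b = 2 ** np.round(x_new).astype(int)
    return x_new, b

# Toy usage: two services, undervoltage at bus 1 (all values hypothetical).
x = np.array([5.0, 4.0])
v_hat = np.array([0.93, 0.96])
l_hat = np.array([0.04, 0.05])
l_th = np.array([0.10, 0.10])
grad_r = lambda x: 1.0 / (1.0 + x)  # toy saturating-throughput gradient
sens_v = np.array([[-0.01, -0.005], [-0.004, -0.008]])  # more load lowers voltage
sens_l = np.array([0.02, 0.02])
x_new, b = ofo_step(x, v_hat, l_hat, grad_r, sens_v, sens_l, l_th)
```

Because dV/dx is negative, an undervoltage measurement pushes x downward (smaller batches, less power), matching the paper's key insight; an overvoltage measurement would push it the other way.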
Simulation studies integrate the OFO controller with a realistic distribution network model and the measured GPU performance curves. Two representative scenarios are examined: (1) voltage‑low conditions where the grid experiences undervoltage, and (2) voltage‑high conditions caused by excess renewable generation. In the undervoltage case, the controller reduces batch size, lowering GPU power by roughly 8–12 % and cutting voltage‑violation events by over 30 %. In the overvoltage case, the controller increases batch size, raising GPU power by 5–9 % and pulling voltages back into acceptable limits. Across both scenarios, average inter‑token latency stays within user‑specified thresholds, and total token throughput changes by less than 2 %, demonstrating that grid support can be provided without sacrificing service quality.
Key contributions of the work are: (1) a data‑driven, nonlinear model linking GPU batch size to power, latency, and throughput; (2) the novel insight that increasing GPU power can be beneficial for grid voltage regulation, overturning the common assumption that “lower power is always better” for the grid; (3) an OFO‑based real‑time controller that directly uses physical grid measurements, making the approach robust to uncertainties and scalable to practical deployments.
The paper suggests several avenues for future research, including extending the framework to multiple data‑center sites, incorporating additional grid services such as frequency regulation or carbon‑emission incentives, integrating other GPU control levers (frequency scaling, voltage scaling) alongside batch size, and developing coordinated scheduling algorithms that jointly optimize workload placement, model selection, and grid support. Overall, the study demonstrates a viable pathway for turning AI‑driven GPU workloads into flexible, grid‑friendly resources while preserving the performance expectations of end‑users.