LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers

Liquid cooling is critical for thermal management in high-density data centers as AI workloads rise, and machine learning-based controllers are essential to unlock greater energy efficiency and reliability, promoting sustainability. We present LC-Opt, a Sustainable Liquid Cooling (LC) benchmark environment for reinforcement learning (RL) control strategies in energy-efficient liquid cooling of high-performance computing (HPC) systems. Built on a high-fidelity digital twin of Oak Ridge National Lab’s Frontier supercomputer cooling system, LC-Opt provides detailed Modelica-based end-to-end models spanning site-level cooling towers down to data center cabinets and server blade groups. Through a Gymnasium interface, RL agents optimize critical thermal controls such as liquid supply temperature, flow rate, granular valve actuation at the IT cabinet level, and cooling tower (CT) setpoints, under dynamically changing workloads. This environment creates a multi-objective real-time optimization challenge balancing local thermal regulation and global energy efficiency, and also supports additional components like a heat recovery unit (HRU). We benchmark centralized and decentralized multi-agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM-based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC-Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.


💡 Research Summary

The paper introduces LC‑Opt, a comprehensive benchmark environment designed to evaluate reinforcement‑learning (RL) and agentic AI approaches for end‑to‑end liquid cooling optimization in high‑performance data centers. LC‑Opt is built on a high‑fidelity digital twin of Oak Ridge National Laboratory’s Frontier supercomputer cooling system, implemented in Modelica. The twin models the entire cooling chain, from site‑level cooling towers, heat exchangers, and distribution piping down to individual IT cabinets, server blades, and optional heat‑recovery units (HRUs). By exposing this detailed physics‑based simulation through a Gymnasium interface, the authors enable RL agents to control continuous variables such as liquid supply temperature, flow rate, and granular valve positions at the cabinet level, as well as setpoints for the cooling tower.
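The control loop described above follows the standard Gymnasium `reset()`/`step()` convention. The sketch below is a minimal, hypothetical stand-in (not LC-Opt's actual environment) written in plain Python so it runs without the Gymnasium library; the class name, action layout, and the first-order thermal response are illustrative assumptions.

```python
class LiquidCoolingEnvSketch:
    """Hypothetical stand-in following the Gymnasium reset()/step() convention.
    Action: [supply temperature (degC), flow rate (kg/s), one valve opening
    in [0, 1] per cabinet]. Observation: per-cabinet blade temperatures."""

    def __init__(self, n_cabinets=4):
        self.n_cabinets = n_cabinets
        self.temps = [45.0] * n_cabinets

    def reset(self, seed=None):
        # seed is accepted for API compatibility; this toy model is deterministic
        self.temps = [45.0] * self.n_cabinets
        return list(self.temps), {}

    def step(self, action):
        supply_t, flow = action[0], action[1]
        valves = action[2:]
        # crude first-order thermal response: more flow and wider valves cool harder
        self.temps = [
            0.9 * t + 0.1 * (supply_t + 30.0 - 2.0 * flow * v)
            for t, v in zip(self.temps, valves)
        ]
        pump_energy = 0.05 * flow ** 2
        overheat = sum(max(t - 65.0, 0.0) for t in self.temps)
        # reward trades off pump energy against thermal-limit violations
        reward = -pump_energy - 10.0 * overheat
        terminated, truncated = False, False
        return list(self.temps), reward, terminated, truncated, {}

env = LiquidCoolingEnvSketch()
obs, info = env.reset(seed=0)
action = [28.0, 5.0, 0.8, 0.8, 0.8, 0.8]  # supply temp, flow, four valve openings
obs, reward, terminated, truncated, info = env.step(action)
```

An RL agent would repeatedly call `step()` with its chosen setpoints and learn from the returned reward; LC-Opt replaces the toy thermal update with the Modelica digital twin.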

The benchmark poses a multi‑objective, real‑time optimization problem: maintain safe temperatures for all compute nodes, minimize total energy consumption (including pump and chiller power), and, when present, maximize recovered heat utilization. Workloads are varied over time to simulate realistic AI‑driven compute spikes, forcing agents to adapt dynamically.
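One common way to pose such a multi-objective problem to an RL agent is to scalarize the objectives into a single reward. The weights and function name below are illustrative assumptions, not the paper's actual reward formulation:

```python
def multi_objective_reward(temps, energy_kw, heat_recovered_kw=0.0,
                           t_limit=65.0, w_energy=1.0, w_thermal=10.0, w_heat=0.5):
    """Hypothetical scalarization of LC-Opt's three objectives:
    penalize energy use and thermal-limit violations, reward recovered heat."""
    violation = sum(max(t - t_limit, 0.0) for t in temps)  # degC above the limit
    return -w_energy * energy_kw - w_thermal * violation + w_heat * heat_recovered_kw

# a cool, efficient state with heat recovery scores better than a hot, wasteful one
good = multi_objective_reward([55.0, 60.0], energy_kw=3.0, heat_recovered_kw=2.0)
bad = multi_objective_reward([70.0, 72.0], energy_kw=6.0)
assert good > bad
```

Making the thermal-violation weight much larger than the energy weight encodes the priority ordering the benchmark implies: safety first, efficiency second.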

Two families of control architectures are evaluated. A centralized agent treats the whole data center as a single Markov decision process (MDP) and learns a global policy. While this approach can theoretically achieve the best global optimum, the authors observe severe scalability issues: the state‑action space grows explosively, learning becomes unstable, and communication latency in the simulation hampers timely decisions. In contrast, a decentralized multi‑agent RL (MARL) framework assigns separate agents to logical subsystems (cooling tower, cabinet groups, rack clusters). Agents receive local observations and a limited set of shared signals (e.g., global energy cost, temperature violation flags). Cooperative mechanisms such as shared reward components and message‑passing are explored. Empirically, the MARL setup converges faster, exhibits higher robustness to sudden workload changes, and achieves comparable or better overall energy savings than the centralized baseline.
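The decentralized setup described above hinges on two mechanics: splitting the global observation into per-agent views augmented with a few shared signals, and blending local rewards with a shared global term so agents cooperate. A minimal sketch of both, with hypothetical agent names and mixing weight:

```python
def decompose_observation(global_obs, agent_cabinets, shared_signals):
    """Each agent sees only its own cabinets' temperatures plus a small set of
    shared global signals (e.g. total energy cost, violation flags)."""
    return {
        agent: [global_obs[i] for i in cabinets] + shared_signals
        for agent, cabinets in agent_cabinets.items()
    }

def cooperative_rewards(local_rewards, global_energy_cost, mix=0.5):
    """Blend each agent's local reward with a shared global term so agents
    cooperate rather than optimizing their subsystem in isolation."""
    shared = -global_energy_cost
    return {a: (1 - mix) * r + mix * shared for a, r in local_rewards.items()}

obs = decompose_observation(
    global_obs=[45.0, 48.0, 52.0, 50.0],          # per-cabinet temperatures
    agent_cabinets={"ct_agent": [0, 1], "cabinet_agent": [2, 3]},
    shared_signals=[12.5],                         # e.g. total energy cost (kW)
)
rewards = cooperative_rewards({"ct_agent": -1.0, "cabinet_agent": -2.0},
                              global_energy_cost=12.5)
```

Because each agent's observation and action space stays small regardless of data center size, this decomposition avoids the state-action explosion that destabilizes the centralized learner.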

To address the “black‑box” nature of deep RL policies, the authors apply policy distillation. Trained neural policies are approximated by decision trees for discrete actions and regression trees for continuous setpoints. The resulting tree‑based controllers retain most of the performance while offering human‑readable rules that can be audited, modified to satisfy safety regulations, or integrated into existing rule‑based management systems. The trade‑off between tree depth (interpretability) and fidelity to the original policy is quantified, highlighting a practical sweet spot for operational deployment.
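The distillation idea can be illustrated with the smallest possible regression tree, a depth-1 "stump": collect state-action pairs from the trained teacher policy, then fit a threshold rule that best reproduces its continuous setpoints. This hand-rolled sketch stands in for the paper's tree-learning procedure (which would use a full decision/regression tree library and deeper trees):

```python
def distill_stump(states, actions):
    """Distill a teacher policy into a depth-1 regression tree: choose the
    threshold on a single scalar feature that minimizes the squared error
    against the teacher's actions. A sketch of the idea, not the paper's method."""
    best = None
    for thr in sorted(set(states)):
        left = [a for s, a in zip(states, actions) if s <= thr]
        right = [a for s, a in zip(states, actions) if s > thr]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((a - lmean) ** 2 for a in left)
               + sum((a - rmean) ** 2 for a in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    # the distilled rule is human-readable:
    # "if temperature <= thr: flow = lmean, else flow = rmean"
    return lambda s: lmean if s <= thr else rmean

# teacher policy (illustrative data): raise flow rate as blade temperature rises
temps = [40.0, 45.0, 50.0, 60.0, 65.0, 70.0]
flows = [2.0, 2.1, 2.2, 6.0, 6.2, 6.1]
policy = distill_stump(temps, flows)
```

Increasing the tree depth recovers more of the teacher's fidelity at the cost of longer, harder-to-audit rule lists, which is exactly the trade-off the authors quantify.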

A novel contribution is the integration of large language models (LLMs) into an “agentic mesh” architecture. After an RL agent selects an action, an LLM generates a natural‑language explanation that links the current workload pattern, temperature readings, and the chosen control adjustment (e.g., “Because the workload on rack 12 increased by 30 %, we raised the supply temperature by 2 °C to reduce pump power while staying within thermal limits”). This explanatory layer is intended to build operator trust, simplify troubleshooting, and provide a transparent interface for non‑expert users. The authors discuss potential hallucination risks and propose a verification pipeline that cross‑checks LLM statements against the underlying physics model.
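The verification pipeline can be sketched as a consistency check between the quantitative claims in an explanation and the digital twin's telemetry. The function and key names below are hypothetical; a real system would first extract the claims from the LLM's free-form text rather than receive them pre-structured:

```python
def verify_explanation(claims, telemetry, tolerance=0.05):
    """Cross-check each quantitative claim the LLM makes (given here as
    structured key/value pairs) against the physics model's telemetry.
    Returns a dict of mismatched claims; empty means consistent."""
    mismatches = {}
    for key, claimed in claims.items():
        actual = telemetry.get(key)
        # flag claims about unknown quantities, or values off by more than
        # the relative tolerance
        if actual is None or abs(claimed - actual) > tolerance * max(abs(actual), 1.0):
            mismatches[key] = (claimed, actual)
    return mismatches

# claims extracted from: "the workload on rack 12 increased by 30 %,
# so we raised the supply temperature by 2 degC"
claims = {"rack12_load_increase_pct": 30.0, "supply_temp_delta_c": 2.0}
telemetry = {"rack12_load_increase_pct": 29.5, "supply_temp_delta_c": 2.0}
assert verify_explanation(claims, telemetry) == {}
```

Explanations with non-empty mismatch sets would be regenerated or flagged to the operator, limiting the hallucination risk the authors discuss.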

The benchmark also includes optional heat‑recovery modules, allowing experiments that evaluate the trade‑off between cooling efficiency and useful heat extraction for district heating or on‑site power generation. This adds a sustainability dimension beyond pure energy consumption.

Limitations are acknowledged. The high‑fidelity digital twin demands substantial computational resources, making large‑scale hyperparameter sweeps expensive. Workload dynamics are modeled using predefined profiles, which may not capture the full stochasticity of real AI workloads. Policy distillation can lead to performance degradation when the distilled tree becomes too deep, reducing interpretability. Finally, LLM explanations, while useful, must be rigorously validated to avoid misleading operators.

In summary, LC‑Opt democratizes access to a realistic, customizable liquid‑cooling environment, enabling the research community, data‑center operators, and equipment vendors to develop, benchmark, and interpret advanced RL and LLM‑augmented control strategies. The paper’s contributions span benchmark design, comparative analysis of centralized versus decentralized RL, interpretable policy extraction, and human‑centric AI explanations. Future work suggested includes lightweight surrogate twins for faster iteration, more sophisticated stochastic workload generators, safety‑constrained RL formulations, and robust verification mechanisms for LLM‑generated narratives. Together, these advances aim to accelerate the deployment of AI‑driven, energy‑efficient cooling solutions essential for the sustainability of next‑generation data centers.

