What Can Artificial Intelligence Do for High-Performance Computing Systems?
High-performance computing (HPC) centers consume substantial power, incurring environmental and operational costs. This review assesses how artificial intelligence (AI), including machine learning (ML) and optimization, improves the efficiency of operational HPC systems. Approximately 1,800 publications from 2019 to 2025 were manually screened using predefined inclusion/exclusion criteria; 74 “AI for HPC” papers were retained and grouped into six application areas: performance estimation, performance optimization, scheduling, surrogate modeling, fault detection, and language-model-based automation. Scheduling is the most active area, spanning research-oriented reinforcement-learning schedulers to production-friendly hybrids that combine ML with heuristics. Supervised performance estimation is foundational for both scheduling and optimization. Graph neural networks and time-series models strengthen anomaly detection by capturing spatio-temporal dependencies in production telemetry. Domain-specialized language models for HPC can outperform general-purpose LLMs on targeted coding and automation tasks. Together, these findings highlight integration opportunities such as LLM-based operating-system concepts and underscore the need for advances in MLOps, standardization of AI components, and benchmarking methodology.
💡 Research Summary
This review paper investigates how artificial intelligence (AI) can improve the efficiency, power consumption, and operational costs of high-performance computing (HPC) systems. The authors performed a systematic literature search covering the period 2019–2025, retrieving roughly 1,800 publications from IEEE Xplore, Scopus, Google Scholar, and other databases. After applying predefined inclusion and exclusion criteria—such as focusing on "AI for HPC" rather than "HPC for AI," discarding works published before 2019, and retaining only peer-reviewed articles—74 papers were selected for detailed analysis.
The retained studies were classified into six application domains: (1) performance estimation, (2) performance optimization, (3) scheduling, (4) surrogate modeling, (5) fault detection, and (6) language-model-based automation. The authors argue that these domains are highly interdependent; for instance, accurate performance estimators feed both schedulers and optimizers, while graph neural networks (GNNs) and time-series deep models enhance anomaly detection by capturing spatio-temporal correlations in telemetry data.
In the performance‑estimation category, supervised learning models (e.g., regression forests, deep neural networks, GNNs) predict job runtimes, memory footprints, and power draw with sufficient accuracy to guide downstream decisions. The paper highlights that most scheduling approaches rely on these estimators as inputs, underscoring their foundational role.
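As a concrete (if drastically simplified) illustration of this category, the sketch below predicts a job's runtime with a k-nearest-neighbor estimator over past jobs. The feature set (node count, requested memory) and the toy job history are hypothetical stand-ins for the regression forests and deep models surveyed in the paper:

```python
from math import sqrt

def predict_runtime(history, query, k=3):
    """Predict a job's runtime as the mean runtime of its k nearest
    neighbors in feature space (node count, requested memory in GB).

    history: list of ((nodes, mem_gb), runtime_s) tuples from past jobs.
    query:   (nodes, mem_gb) features of the incoming job.
    """
    def dist(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(history, key=lambda rec: dist(rec[0], query))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)

# Invented history: (nodes, mem_gb) -> observed runtime in seconds.
history = [
    ((1, 8), 120.0), ((2, 16), 150.0), ((4, 32), 300.0),
    ((8, 64), 610.0), ((8, 32), 580.0), ((16, 128), 1250.0),
]

# Averages the two nearest (8-node) jobs -> 595.0 seconds.
print(predict_runtime(history, (8, 48), k=2))
```

A production estimator would use far richer features (user history, executable name, input size), but the interface is the same: features in, predicted runtime/memory/power out, which is exactly what downstream schedulers consume.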
Scheduling emerges as the most active research area. Pure reinforcement-learning (RL) schedulers (e.g., GARL) demonstrate adaptability to dynamic workloads and heterogeneous resources, but their deployment in production environments is limited by stability concerns and training overhead. Consequently, hybrid designs that combine RL policies with classic heuristics (First-Come-First-Served, Shortest-Job-First, priority-based rules) are presented as a pragmatic compromise. Empirical results from large-scale clusters show reductions in average queue wait time and improvements in power-aware utilization ranging from 10% to 20%.
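The wait-time gains available from predictor-driven scheduling can be illustrated with a minimal single-resource queue simulation. The hybrid policy below is a sketch of the general idea, not the GARL design: it applies Shortest-Job-First to predicted runtimes when the estimator is trusted and falls back to FCFS order otherwise; all runtimes are invented:

```python
def avg_wait(runtimes, order):
    """Simulate a single-resource queue: jobs run back-to-back in the
    given order; a job's wait is the time until it starts running."""
    t, total_wait = 0.0, 0.0
    for i in order:
        total_wait += t
        t += runtimes[i]
    return total_wait / len(runtimes)

def schedule(predicted, confident):
    """Hybrid policy sketch: Shortest-Job-First on predicted runtimes
    when the estimator is trusted, otherwise plain FCFS order."""
    indices = list(range(len(predicted)))
    return sorted(indices, key=lambda i: predicted[i]) if confident else indices

runtimes  = [600.0, 60.0, 300.0, 30.0]   # true runtimes (seconds)
predicted = [580.0, 70.0, 310.0, 25.0]   # imperfect estimator output

fcfs = avg_wait(runtimes, schedule(predicted, confident=False))  # 555.0
sjf  = avg_wait(runtimes, schedule(predicted, confident=True))   # 127.5
print(fcfs, sjf)
```

Even with noisy predictions, ordering by estimated runtime sharply cuts average wait in this toy queue; real schedulers must additionally handle multiple nodes, backfilling, fairness, and power caps, which is where the RL component earns its keep.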
Performance optimization is often tackled through hybrid AI/non-AI pipelines. Tensor-parallelism frameworks such as Alpa and Colossal-AI use integer linear programming (ILP) together with empirical profiling to select optimal parallelization strategies. In contrast, Colossal-Auto replaces the ILP search with a machine-learning surrogate that predicts performance, dramatically shrinking the search space and cutting optimization time.
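A rough sketch of surrogate-guided strategy selection: enumerate candidate parallelization plans and score each with a cheap cost model instead of profiling every plan on hardware. The hand-written analytic cost function below stands in for Colossal-Auto's learned predictor, and the (data-parallel, tensor-parallel) search space and coefficients are illustrative assumptions:

```python
def candidate_plans(n_devices):
    """All (data_parallel, tensor_parallel) factorizations of the device count."""
    return [(dp, n_devices // dp) for dp in range(1, n_devices + 1)
            if n_devices % dp == 0]

def surrogate_cost(dp, tp, compute=100.0, comm_per_tp=8.0, sync_per_dp=3.0):
    """Stand-in for a learned cost model: per-device compute shrinks with
    dp*tp, tensor-parallel all-reduce cost grows with tp, and gradient
    synchronization cost grows with dp. Coefficients are invented."""
    return compute / (dp * tp) + comm_per_tp * (tp - 1) + sync_per_dp * (dp - 1)

def pick_plan(n_devices):
    """Score every candidate with the surrogate and keep the cheapest."""
    return min(candidate_plans(n_devices), key=lambda plan: surrogate_cost(*plan))

print(pick_plan(8))  # balances compute split against communication cost
```

The point of the ML surrogate is exactly this substitution: each call to `surrogate_cost` is microseconds, whereas profiling one real plan can take minutes, so the search over large strategy spaces becomes tractable.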
Surrogate modeling (or “ML‑around‑HPC”) focuses on replacing expensive simulations with learned meta‑models. The review notes that transfer learning and meta‑learning techniques enable these surrogates to generalize across diverse workloads using limited labeled data, thereby accelerating design‑space exploration and parameter tuning.
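A minimal sketch of the surrogate idea: fit a cheap model to a handful of expensive simulation runs, then explore the design space against the surrogate alone. The quadratic objective and sample points below are invented for illustration:

```python
def expensive_sim(x):
    """Stand-in for a costly simulation run (imagine one full HPC job
    per evaluation). Objective invented for this sketch."""
    return (x - 3.2) ** 2 + 5.0

def fit_quadratic(xs, ys):
    """Exact quadratic through three sample points (Newton's form),
    serving as the learned meta-model."""
    d1 = (ys[1] - ys[0]) / (xs[1] - xs[0])
    d2 = ((ys[2] - ys[1]) / (xs[2] - xs[1]) - d1) / (xs[2] - xs[0])
    return lambda x: ys[0] + d1 * (x - xs[0]) + d2 * (x - xs[0]) * (x - xs[1])

# Only three expensive runs are needed to train this toy surrogate.
xs = [0.0, 2.0, 6.0]
surrogate = fit_quadratic(xs, [expensive_sim(x) for x in xs])

# Design-space exploration: hundreds of cheap surrogate calls replace
# hundreds of expensive simulation runs.
grid = [i / 100 for i in range(601)]
best = min(grid, key=surrogate)
print(best)  # lands on the true optimum at x = 3.2
```

Real surrogates are Gaussian processes or neural networks over many dimensions, and the transfer/meta-learning techniques the review mentions address exactly the "only three runs to train on" problem across heterogeneous workloads.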
Fault detection leverages GNNs and temporal convolutional networks to model the inter‑node and inter‑link relationships in HPC telemetry. Compared with traditional statistical anomaly detectors, these deep models achieve lower false‑positive rates and faster detection of hardware failures, network congestion, or software bugs.
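For contrast with the deep detectors described above, the classical statistical baseline they are compared against can be sketched in a few lines as a trailing-window z-score test on node telemetry; the window size, threshold, and temperature trace are illustrative:

```python
from statistics import mean, stdev

def zscore_alerts(series, window=5, threshold=3.0):
    """Flag indices where a reading deviates from its trailing window
    by more than `threshold` standard deviations -- the per-sensor
    statistical baseline that GNN/TCN detectors improve upon by also
    modeling cross-node and temporal structure."""
    alerts = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts

# Node temperature telemetry (deg C) with one injected spike at index 8.
temps = [61.0, 62.0, 61.5, 62.5, 61.8, 62.1, 61.7, 62.3, 95.0, 62.0]
print(zscore_alerts(temps))  # -> [8]
```

This baseline treats each sensor in isolation; the review's argument is that failures often manifest as correlated drift across nodes and links, which is precisely the structure graph-based models capture and threshold tests miss.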
The language-model-based automation section examines domain-specific large language models (LLMs) fine-tuned on HPC codebases, job scripts, and system documentation. Compared with generic LLMs, the specialized models improve code-generation accuracy and scripting productivity by roughly 15% on benchmark tasks such as automatic job-script creation, firewall rule generation, and performance-tuning suggestions. The authors also discuss early work on LLM-assisted cybersecurity, noting promising but still preliminary results.
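Since reproducing a fine-tuned LLM is out of scope here, the sketch below uses a deterministic template as a stand-in for model output on the job-script-creation task. The `#SBATCH` directives are standard Slurm options, while the function name and job parameters are hypothetical:

```python
def sbatch_script(job_name, nodes, hours, command):
    """Deterministic stand-in for an LLM-generated Slurm batch script:
    a fine-tuned model would produce (and adapt) text like this from a
    natural-language request such as "run my simulation on 4 nodes for
    12 hours". Only standard sbatch directives are used."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={hours:02d}:00:00",
        "",
        f"srun {command}",
    ])

print(sbatch_script("lattice-qcd", 4, 12, "./run_simulation"))
```

The benchmarks the review describes essentially score how often the generated script is syntactically valid and semantically matches the request, which is where HPC-specialized models gain their edge over generic ones.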
A recurring theme throughout the paper is the lack of standardized MLOps pipelines, shared APIs, and benchmark suites for AI‑enhanced HPC workflows. The authors cite calls for FAIR (Findable, Accessible, Interoperable, Reusable) principles and stress that without common data formats and evaluation metrics, reproducibility and cross‑system integration remain challenging.
In conclusion, the review confirms that AI techniques—ranging from supervised estimators to reinforcement‑learning schedulers, surrogate models, graph‑based anomaly detectors, and domain‑specific LLMs—offer substantial opportunities to make HPC systems more efficient, power‑aware, and easier to operate. However, realizing this potential at production scale will require concerted efforts in MLOps engineering, standardization, and the development of comprehensive benchmark frameworks that capture both system‑level performance and AI model robustness.