Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards
Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs using runtime performance as a reward in the HPC domain. We propose an online reinforcement learning approach that executes LLM-generated code on a supercomputer and directly feeds back the measured runtime performance (GFLOPS) as a reward. We further introduce a Staged Quality-Diversity (SQD) algorithm that progressively varies the permitted optimization techniques on a per-problem basis, enabling the model to learn code optimization from diverse perspectives. We build a distributed system connecting a GPU training cluster with a CPU benchmarking cluster, and train Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO). Through two experiments, we show that reinforcement learning combining runtime performance feedback with staged optimization can improve the HPC code generation capability of LLMs.
💡 Research Summary
This paper presents a novel online reinforcement‑learning (RL) framework that directly optimizes the runtime performance of high‑performance‑computing (HPC) code generated by large language models (LLMs). Instead of using conventional correctness‑only rewards (e.g., passing unit tests), the authors execute each generated code snippet on a real supercomputer, measure its double‑precision matrix‑multiplication throughput in gigaflops (GFLOPS), and feed that measurement back as the reward signal. The approach eliminates the need for hand‑crafted reference implementations, which are scarce in the HPC domain, and enables the model to learn purely from hardware‑grounded feedback.
The RL algorithm employed is Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization that removes the critic network. For each prompt, the model samples a group of G = 64 code candidates; the mean reward of the group serves as a baseline, and each candidate’s advantage is computed relative to this baseline. This design drastically reduces memory consumption while still encouraging the policy to produce “better than its peers” outputs, aligning naturally with the goal of maximizing performance. A KL‑divergence penalty β controls how far the updated policy may drift from the pre‑trained Qwen2.5 Coder 14B baseline.
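The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and it shows only the baseline subtraction (the mean reward of the G sampled candidates), not the full GRPO loss or the KL penalty term.

```python
def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one prompt's group of G candidates.

    The mean reward of the group serves as the baseline; each candidate's
    advantage is its reward relative to that baseline, so no learned critic
    network is needed."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

Because the advantages are centered on the group mean, candidates that merely match their peers contribute no gradient signal; only "better than its peers" (or worse) outputs move the policy.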
A second key contribution is the Staged Quality-Diversity (SQD) algorithm, which structures the optimization process into six progressive stages. Each stage permits a specific subset of optimization techniques (register blocking, cache blocking, AVX-512 SIMD, OpenMP thread parallelism, and memory prefetching), detected via regular-expression filters applied to the generated code. At each training iteration, the pool of generated codes is split three ways: the top 40 % by GFLOPS are promoted to the next stage (deepening), the middle 30 % spawn new variants at the same stage (improving), and the remaining bottom 30 %, which includes codes that caused errors, are sent back for repair (repairing). This three-branch selection balances exploration of diverse optimization paths with exploitation of high-performing solutions, and it exposes the model to both optimization logic and error-recovery strategies without any human-curated examples.
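The three-branch selection and the regex-based technique detection can be sketched as below. This is an illustrative reconstruction under stated assumptions: the regex patterns are hypothetical (the paper's actual filters are not given here), and failed codes are modeled as entries with a `None` throughput.

```python
import re

# Hypothetical regex filters for a few of the permitted techniques;
# the paper's actual patterns are not specified in this summary.
TECHNIQUE_PATTERNS = {
    "avx512":   re.compile(r"_mm512_|__m512d"),
    "openmp":   re.compile(r"#pragma\s+omp"),
    "prefetch": re.compile(r"_mm_prefetch|__builtin_prefetch"),
}

def detect_techniques(code):
    """Return the set of optimization techniques detected in a code string."""
    return {name for name, pat in TECHNIQUE_PATTERNS.items() if pat.search(code)}

def split_pool(pool):
    """Three-branch SQD selection over one iteration's pool.

    pool: list of (code, gflops) tuples; gflops is None for codes that
    failed to compile, run, or verify.  Returns the deepening, improving,
    and repairing branches (40% / 30% / remainder)."""
    ok = sorted((p for p in pool if p[1] is not None),
                key=lambda p: p[1], reverse=True)
    failed = [p for p in pool if p[1] is None]
    n = len(pool)
    k_deep = int(0.4 * n)              # top 40%: promote to the next stage
    k_impr = int(0.3 * n)              # middle 30%: new variants, same stage
    deepen  = ok[:k_deep]
    improve = ok[k_deep:k_deep + k_impr]
    repair  = failed + ok[k_deep + k_impr:]   # bottom ~30% incl. errors
    return deepen, improve, repair
```

In a real loop, the `repair` branch would be re-prompted with the error message, while `deepen` candidates are re-prompted with the next stage's expanded technique set.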
The system architecture decouples the GPU‑based LLM training cluster from the CPU‑based benchmarking cluster. A coordinator node orchestrates data transfer and command execution via SSH tunneling, allowing the two clusters to be physically separate yet tightly coupled in a generate‑evaluate‑learn loop. The training pipeline runs on the “Genkai” supercomputer (8 × NVIDIA H100 GPUs, tensor‑parallelism 8) while benchmarks run on the “Flow” cluster (AVX‑512‑enabled Xeon Gold CPUs). Matrix size is fixed at 256 × 256 × 256 to keep individual runs short; each measurement averages three post‑warm‑up runs.
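The measurement protocol (warm-up runs followed by an average over three timed runs) can be sketched as follows. This is a simplified stand-in for the benchmarking-cluster side, assuming the standard 2·N³ floating-point-operation count for an N×N×N matrix multiplication; the `run_kernel` callable and function name are placeholders, not the paper's harness.

```python
import time

N = 256                      # fixed problem size: 256 x 256 x 256
FLOPS_PER_CALL = 2 * N ** 3  # one multiply + one add per inner-loop step

def measure_gflops(run_kernel, warmup=1, reps=3):
    """Average GFLOPS over `reps` timed runs after `warmup` untimed runs,
    mirroring the 'three post-warm-up runs' protocol described above.

    run_kernel: zero-argument callable that performs one matmul."""
    for _ in range(warmup):
        run_kernel()
    total = 0.0
    for _ in range(reps):
        t0 = time.perf_counter()
        run_kernel()
        total += time.perf_counter() - t0
    avg_seconds = total / reps
    return FLOPS_PER_CALL / avg_seconds / 1e9
```

Keeping N small (256) bounds each run's wall time, which matters when every policy update waits on thousands of such measurements over the SSH-coupled generate-evaluate-learn loop.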
Two experiments validate the approach. Experiment 1 uses eight fixed prompts that request all optimization techniques and explores 12 hyper‑parameter configurations (compiler level O0/O3, learning rates 2e‑7/5e‑7, KL β = 0/0.001/0.01). Results show that with O3 compilation the mean reward quickly rises to 70–78, and a configuration with β = 0 and LR = 5e‑7 achieves a dramatic jump from 54 GFLOPS to 225 GFLOPS (≈4×) near the end of training. In contrast, O0 compilation yields only 7–15 mean reward, confirming that compiler optimizations are essential for the model to translate code quality into performance.
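The 12-configuration sweep is just the Cartesian product of the three reported axes; a minimal sketch (variable names are mine, not the paper's):

```python
from itertools import product

# The swept axes as reported: compiler level, learning rate, KL penalty.
opt_levels = ["O0", "O3"]
learning_rates = [2e-7, 5e-7]
kl_betas = [0.0, 0.001, 0.01]

# 2 x 2 x 3 = 12 hyper-parameter configurations
configs = [
    {"opt": o, "lr": lr, "beta": b}
    for o, lr, b in product(opt_levels, learning_rates, kl_betas)
]
```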
Experiment 2 applies SQD. The staged constraints force the model to start with simple register‑blocking code (≈7–30 GFLOPS) and gradually unlock SIMD, OpenMP, and prefetching, culminating in a stage where all techniques are allowed. Rewards are normalized within each stage (GFLOPS / stage‑max × 100) to preserve a strong learning signal even when absolute performance differences are small. By the final stage the model reaches 549 GFLOPS, comparable to hand‑tuned OpenBLAS on the same hardware (≈500 GFLOPS). The model also learns to repair runtime and verification errors, demonstrating an ability to acquire both optimization and debugging skills autonomously.
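The stage-wise normalization above is a one-liner; this sketch follows the stated formula (GFLOPS / stage-max × 100), with the caveat that how `stage_max` is chosen per stage is not specified in this summary.

```python
def stage_reward(gflops, stage_max):
    """Normalize raw throughput within a stage to a 0-100 reward scale,
    so early low-GFLOPS stages still yield a strong learning signal even
    when absolute performance differences between candidates are small."""
    return gflops / stage_max * 100.0
```

For example, a 15-GFLOPS register-blocking kernel in a stage whose best attainable throughput is 30 GFLOPS earns the same reward (50) as a 250-GFLOPS kernel in a 500-GFLOPS stage.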
The study’s contributions are threefold: (1) a practical RL framework that uses real‑machine performance as a reward, (2) a systematic analysis of how compiler optimization level, learning rate, and KL penalty affect learning dynamics, and (3) the SQD algorithm that enables diverse, staged exploration of optimization techniques without curated training data. Limitations include the focus on a single kernel (dense matrix multiplication) and the high computational cost of online benchmarking, which may hinder scaling to larger codebases or more complex kernels. Nonetheless, the work establishes a compelling blueprint for integrating hardware‑in‑the‑loop RL with LLMs, opening avenues for automated HPC code synthesis, performance‑aware coding assistants, and future research on cost‑effective online RL for software optimization.