Fine-Tuning GPT-5 for GPU Kernel Generation
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora’s environment and tools for reinforcement learning fine-tuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it solves up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
💡 Research Summary
The paper presents a novel approach to generating high‑performance GPU kernels by fine‑tuning the large language model GPT‑5 with reinforcement learning (RL). Traditional supervised fine‑tuning (SFT) is hampered in this domain because high‑quality, labeled kernel data are scarce, and synthetic data generated by compilers suffers from performance ceilings, boilerplate code, and reliance on internal libraries. Consequently, SFT cannot reliably teach a model to discover novel optimizations or to generalize across GPU architectures.
To overcome these limitations, the authors reformulate kernel generation as an RL problem with verifiable rewards (RL‑VR). A policy model πθ receives a natural‑language prompt and a reference PyTorch implementation, then generates a Triton kernel k. A deterministic verifier V compiles k, runs it, and returns a scalar reward composed of: (1) a binary term for successful compilation and functional correctness, and (2) a continuous term proportional to the speed‑up over a baseline (e.g., TorchInductor). The reward is passed through a logistic function σ with a shift parameter δ=1.8, which balances the incentive between merely correct kernels and those that achieve substantial performance gains.
The authors built the Makora environment to support this RL loop at scale. Makora provides multi‑turn interaction (generation → compilation → benchmarking → reward), caching of execution results, canonicalization of kernels, and a distributed backend that can benchmark thousands of kernels in parallel on an H100 GPU cluster. To prevent reward hacking, static reachability analysis and an LLM‑based hack detector are integrated, ensuring that models cannot obtain high rewards by exploiting verification loopholes.
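The caching and canonicalization the environment relies on can be illustrated with a minimal sketch. Everything here is hypothetical (Makora's real canonicalization and cache are not described in detail): the idea is only that textually different but equivalent kernel sources should map to one cache entry so repeated benchmarks are not re-executed.

```python
import hashlib

_cache: dict[str, float] = {}  # canonical key -> benchmark result


def canonical_key(src: str) -> str:
    """Hypothetical canonicalization: drop comments and blank lines,
    normalize whitespace, then hash. Triton kernels are Python source,
    so '#' starts a comment."""
    lines = []
    for line in src.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


def benchmark_cached(src: str, run) -> float:
    """Benchmark a kernel source once per canonical form; `run` is the
    (expensive) compile-and-time callback."""
    key = canonical_key(src)
    if key not in _cache:
        _cache[key] = run(src)
    return _cache[key]
```

In an RL loop that samples thousands of candidate kernels, many generations differ only in comments or formatting, so a cache keyed on canonical form can skip a large fraction of compile-and-benchmark work.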
For training data, a curated set of 264 kernel problems covering matrix multiplication, convolution, reduction, element‑wise ops, and common AI primitives is used. GPT‑5 is chosen as the base model because it already knows Triton syntax and GPU parallel patterns, allowing a substantial fraction of initial samples to compile and receive non‑zero rewards—a prerequisite for effective RL. Experiments with smaller open‑source models (Qwen‑4B/8B/32B) showed rapid reward saturation, confirming the need for a strong base model.
Results are reported in two regimes. In the single‑attempt setting (one generation per problem), fine‑tuned GPT‑5 raises functional correctness from 43.7% to 77.0% (a 33.3 percentage‑point gain) and increases the proportion of kernels that outperform TorchInductor from 14.8% to 21.8%. With three attempts per problem, 221 out of 264 kernels are functionally correct. When the fine‑tuned model is embedded in the full MakoraGenerate agent, it solves up to 97.4% of an expanded KernelBench suite and outperforms TorchInductor on 72.9% of tasks, achieving a geometric‑mean speed‑up of 2.12×. These figures surpass prior state‑of‑the‑art models and demonstrate that RL can unlock both correctness and performance improvements without massive labeled datasets.
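The headline 2.12× figure is a geometric mean, the standard way to aggregate per-problem speedup ratios (an arithmetic mean would let one outlier kernel dominate). A small helper showing the computation, with illustrative inputs only:

```python
import math


def geomean_speedup(speedups: list[float]) -> float:
    """Geometric mean of per-problem speedups over the baseline:
    exp of the mean of the logs, which is robust to ratio data."""
    assert speedups, "need at least one measurement"
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

For example, one problem at 1.0× and another at 4.0× average to 2.0× geometrically, whereas the arithmetic mean would report 2.5×.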
The paper’s contributions are threefold: (1) a methodology for curating RL‑friendly kernel datasets, (2) a robust, scalable evaluation infrastructure that includes multi‑turn debugging, reward‑hacking mitigation, and distributed benchmarking, and (3) empirical evidence that targeted RL fine‑tuning of a strong base model yields state‑of‑the‑art GPU kernel generation. The authors discuss remaining challenges, such as extending the approach to other kernel languages (e.g., CUDA C++) and refining reward functions for diverse hardware targets, but overall the work establishes a viable pathway for AI‑assisted accelerator programming in data‑limited, highly specialized domains.