DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, hindered not only by the highly specialized nature of CUDA programming but also by the severe scarcity of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. Building on this dataset, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales (1.7B, 4B, and 8B). Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.


💡 Research Summary

The paper introduces DICE, a family of diffusion‑based large language models (dLLMs) specifically engineered for the generation of high‑performance CUDA kernels. The authors begin by highlighting the limitations of traditional autoregressive (AR) LLMs for code generation tasks that require non‑linear refinement and global structural planning. AR models generate tokens strictly left‑to‑right, leading to high latency for long sequences and a mismatch with the iterative, back‑and‑forth nature of real programming workflows, especially for low‑level GPU code where correctness and performance are paramount.

Diffusion LLMs, by contrast, mask the entire output sequence and iteratively denoise it, allowing the model to incorporate global context in each refinement step. This property aligns well with the needs of kernel generation, where a programmer often designs the overall kernel skeleton first and then refines individual computational blocks. The paper leverages this advantage through three core contributions.
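To make this decoding style concrete, here is a minimal toy sketch of confidence-based parallel denoising. This is not the paper's actual sampler: the "model" is a stub that scores masked positions higher when they sit next to already-committed tokens, and each step commits the k most confident positions in parallel until no masks remain.

```python
# Toy illustration of masked-diffusion decoding (not the paper's sampler).
# A stub "model" returns a (token, confidence) guess per masked position;
# each step commits the k most confident positions until nothing is masked.

MASK = "<m>"

def stub_model(seq, target):
    # Hypothetical stand-in for a dLLM: guesses the target token at every
    # masked position, with higher confidence near already-filled tokens.
    guesses = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            neighbors = sum(1 for j in (i - 1, i + 1)
                            if 0 <= j < len(seq) and seq[j] != MASK)
            guesses[i] = (target[i], 1.0 + neighbors)
    return guesses

def denoise(target, k=2):
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        guesses = stub_model(seq, target)
        # Commit the k highest-confidence positions in parallel.
        for i in sorted(guesses, key=lambda i: -guesses[i][1])[:k]:
            seq[i] = guesses[i][0]
        steps += 1
    return seq, steps

tokens = ["__global__", "void", "add", "(", "float*", "x", ")"]
out, steps = denoise(tokens, k=2)
# 7 tokens resolve in 4 parallel steps rather than 7 sequential ones.
```

The point of the toy is the schedule, not the model: every refinement step sees the whole sequence, so early commitments (a kernel's skeleton) can guide later ones (individual computational blocks).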

First, the authors construct CuKe, a curated supervised‑fine‑tuning (SFT) dataset that substantially improves upon the existing ConCuR collection. They apply a strict 2× speed‑up filter to ensure that every PyTorch‑CUDA pair in the dataset exhibits a verifiable performance gain, yielding 1,425 high‑quality samples. To increase structural diversity, they also add 291 complex kernel fragments derived from common AI model components such as attention heads and MLP blocks, each validated with Mercury Coder and multiple execution runs. After augmentation, the final dataset contains 6,303 samples, covering both elementary operations and full‑model kernels for training.
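The 2× filter can be sketched as a simple predicate over measured timings. This is illustrative only: the record layout and field names are assumptions, and a real pipeline would rerun each kernel several times before trusting the timing.

```python
# Illustrative sketch of the 2x speed-up filter (field names are assumptions).
# A candidate PyTorch->CUDA pair is kept only if the kernel is functionally
# correct and at least twice as fast as the PyTorch reference.

def keep_pair(record, min_speedup=2.0):
    if not record["correct"]:          # must match the reference output
        return False
    speedup = record["torch_ms"] / record["kernel_ms"]
    return speedup >= min_speedup

candidates = [
    {"name": "softmax",   "correct": True,  "torch_ms": 4.0, "kernel_ms": 1.5},
    {"name": "layernorm", "correct": True,  "torch_ms": 3.0, "kernel_ms": 2.0},
    {"name": "gemm",      "correct": False, "torch_ms": 9.0, "kernel_ms": 1.0},
]
kept = [c["name"] for c in candidates if keep_pair(c)]
# softmax (~2.67x) passes; layernorm (1.5x) is too slow; gemm is incorrect.
```

Gating on measured speed-up rather than mere correctness is what makes the resulting dataset performance-oriented: every surviving pair demonstrates a real win over the PyTorch baseline.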

Second, the paper proposes BiC‑RL (Bi‑Phase Curated Reinforcement Learning), a two‑stage RL framework that mitigates “deceptive behavior” often observed in code generation (e.g., copying prompt examples, omitting invocation logic, or falling back to high‑level PyTorch functions). In the first phase, the “kernel‑infill” stage, each kernel is decomposed into three parts: a prefix (imports and environment setup), a core implementation (the actual CUDA computation), and a suffix (inline compilation and nn.Module wrapper). The model is trained to fill only the core while the surrounding scaffolding is provided, and rewards are granted only when the generated core compiles and produces correct results. This forces the model to learn the essential low‑level logic rather than exploiting shortcuts.
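The prefix/core/suffix decomposition can be sketched as follows. The marker comments and function names here are hypothetical (the paper does not specify how boundaries are annotated); the essential idea is that the model sees the scaffolding and must produce only the core.

```python
# Sketch of the kernel-infill decomposition (marker comments are hypothetical).
# A kernel file is split into prefix / core / suffix; the infill prompt keeps
# the scaffolding and masks out the core CUDA computation.

KERNEL = """import torch
from torch.utils.cpp_extension import load_inline
# <core>
__global__ void add_kernel(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] + y[i];
}
# </core>
# suffix: inline compilation and nn.Module wrapper would follow here
"""

def split_kernel(src):
    prefix, rest = src.split("# <core>\n")
    core, suffix = rest.split("# </core>\n")
    return prefix, core, suffix

def infill_prompt(prefix, suffix, mask="<FILL_CORE>"):
    # The model receives imports and wrapper code; only the core is masked.
    return prefix + mask + "\n" + suffix

prefix, core, suffix = split_kernel(KERNEL)
prompt = infill_prompt(prefix, suffix)
```

Because the scaffolding is given, shortcut strategies such as copying prompt examples or falling back to high-level PyTorch calls no longer earn reward: only a correct core implementation does.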

The second phase, “kernel‑generation,” asks the model to produce the complete kernel given a high‑level PyTorch reference. The reward function again requires successful compilation and functional correctness, but now the model must also generate the surrounding scaffolding, ensuring that knowledge acquired in the in‑fill stage is transferred to full‑kernel synthesis.
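Both phases share the same verifier-gated reward structure; a schematic version is shown below. The exact reward shaping is an assumption, as is the fallback-detection heuristic, but the gating (credit only when the kernel compiles and matches the reference) is the property the paper relies on.

```python
# Schematic of the verifier-gated reward used in both RL phases
# (exact shaping is an assumption; the key property is that reward is
# granted only when the kernel compiles AND matches the reference output).

def kernel_reward(compiled, outputs_match):
    if not compiled:
        return 0.0          # compilation failure: no credit
    if not outputs_match:
        return 0.0          # runs but produces wrong results: no credit
    return 1.0              # compiles and is functionally correct

# A deceptive shortcut (e.g. quietly calling a high-level PyTorch op) may
# still compile and match outputs; catching it would need an extra check,
# sketched here as a crude, illustrative heuristic:
def has_cuda_core(src):
    return "__global__" in src
```

In the infill phase this reward only exercises the core; in the generation phase the same gate applies to the full kernel, so scaffolding errors (missing invocation logic, broken compilation setup) are penalized as well.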

Third, DICE adopts a block‑diffusion architecture. The output sequence is partitioned into fixed‑size blocks; intra‑block token generation proceeds in parallel (non‑autoregressive), while inter‑block ordering follows an autoregressive schedule. This hybrid approach enables reuse of KV‑cache across blocks, reduces memory overhead, and yields faster inference compared with pure AR decoding.
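A toy schedule makes the hybrid decoding order explicit. This is an illustrative sketch, not DICE's actual scheduler: blocks are visited left-to-right (autoregressive across blocks), and every position inside the current block is refined in parallel for a fixed number of denoising steps.

```python
# Toy schedule for block diffusion (illustrative, not the paper's scheduler).
# Blocks are generated left-to-right; tokens inside a block are denoised
# over a few parallel steps rather than one token at a time.

def block_schedule(seq_len, block_size, steps_per_block):
    schedule = []
    for start in range(0, seq_len, block_size):
        block = list(range(start, min(start + block_size, seq_len)))
        for step in range(steps_per_block):
            # All positions in the block are refined in parallel each step;
            # earlier blocks are frozen, so their KV-cache can be reused.
            schedule.append(("block", start // block_size, "step", step, block))
        # Block is finalized here; decoding moves on autoregressively.
    return schedule

sched = block_schedule(seq_len=8, block_size=4, steps_per_block=2)
# 2 blocks x 2 denoising steps = 4 parallel passes instead of 8 AR steps.
```

Freezing completed blocks is what makes KV-cache reuse possible: once a block's tokens stop changing, their attention keys and values never need recomputation.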

Training proceeds from a pretrained SD‑AR backbone, followed by SFT on CuKe, and finally BiC‑RL with TraceRL‑style sampling (64 problems, 16 samples each). For the in‑fill stage, 992 programs are used; for full generation, 4,000 programs are sampled.

Experimental evaluation on the KernelBench benchmark compares DICE (1.7B, 4B, 8B) against a range of AR and diffusion models, including Gemini‑3‑Pro, Claude‑Sonnet‑4, DeepSeek‑Coder, Qwen2.5‑Coder, Seed‑Coder‑Reasoning, and commercial diffusion models. Metrics reported are execution correctness (Exec) and two speed‑up ratios (fast_1 and fast_2). Across all scales, DICE achieves the highest speed‑up while maintaining 100% compilation success. The 8B model reaches up to 14× speed‑up on the hardest benchmark, outperforming the best AR baseline by a large margin. Moreover, the RL‑driven reward structure eliminates the majority of deceptive outputs, leading to a dramatic increase in functional correctness compared with prior code‑generation LLMs.
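As I understand KernelBench's fast_p family of metrics, fast_p is the fraction of problems whose generated kernel is both functionally correct and faster than the PyTorch reference by more than a factor of p; a sketch under that assumption:

```python
# Sketch of a KernelBench-style fast_p metric (assumed definition): the
# fraction of problems whose generated kernel is functionally correct AND
# achieves a speedup over the PyTorch reference greater than threshold p.

def fast_p(results, p):
    hits = sum(1 for r in results if r["correct"] and r["speedup"] > p)
    return hits / len(results)

results = [
    {"correct": True,  "speedup": 3.0},
    {"correct": True,  "speedup": 1.2},
    {"correct": False, "speedup": 5.0},   # fast but wrong: never counts
    {"correct": True,  "speedup": 0.8},
]
# fast_1 counts kernels that beat the baseline; fast_2 requires a 2x win.
```

Under this reading, fast_1 rewards any correct kernel faster than the baseline, while fast_2 is the stricter target that CuKe's 2× filter was designed to match.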

In summary, DICE demonstrates that diffusion‑based LLMs, when equipped with a high‑quality, performance‑oriented dataset and a carefully staged reinforcement‑learning curriculum, can surpass traditional autoregressive models in both efficiency and correctness for low‑level GPU kernel synthesis. The work opens avenues for scaling to larger models, extending to other low‑level languages (e.g., OpenCL, SYCL), and integrating multi‑GPU distributed diffusion for even faster code generation.

