KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Making deep learning recommendation model (DLRM) training and inference fast and efficient is important, but it presents three key system challenges: model-architecture diversity, kernel-primitive diversity, and hardware generation and architecture heterogeneity. The combination of these three diversity dimensions leads to a complex optimization space. This paper presents KernelEvolve, an agentic kernel-coding framework that tackles heterogeneity at scale for DLRM training and inference. KernelEvolve takes kernel specifications as input and automates kernel generation and optimization for recommendation models across heterogeneous hardware architectures through multiple programming abstractions, including Triton, CuTe DSL, and low-level hardware diagnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is formulated as a graph-based search with a selection policy, universal operator, fitness function, and termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. The system integrates a persistent knowledge base encoding hardware-specific constraints for heterogeneous AI accelerators, enabling effective kernel generation even for proprietary architectures absent from LLM training corpora. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's latest-generation AI accelerators (MTIA v3). We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness over all 480 operator-platform configurations.
KernelEvolve reduces development time from weeks to hours and achieves performance improvements of up to 17× over PyTorch baselines across diverse production use cases and heterogeneous AI systems at scale. Beyond performance efficiency, KernelEvolve significantly lowers the programmability barrier for new AI hardware by enabling automated kernel generation for proprietary accelerators. We hope the insights and deployment experience presented in this paper will shed new light on the design and optimization of AI systems at scale.


💡 Research Summary

KernelEvolve is an agentic kernel‑coding framework introduced by Meta to address the threefold diversity that modern deep‑learning recommendation models (DLRMs) encounter: heterogeneous model architectures, a wide variety of kernel primitives, and an ever‑changing landscape of AI accelerators. The system takes a high‑level kernel specification—describing input and output tensors, the target operation, and any hardware constraints—and automatically produces optimized kernel implementations for any supported accelerator, ranging from NVIDIA and AMD GPUs to Meta’s proprietary MTIA v3 chips.
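To make the input concrete, a kernel specification of this shape might be modeled as below. This is a hypothetical sketch only: the class and field names are illustrative assumptions, not KernelEvolve's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a kernel specification. Field names are
# illustrative assumptions, not the framework's real interface.
@dataclass
class KernelSpec:
    op: str                  # target operation, e.g. an embedding-bag fusion
    input_shapes: list       # shapes of the input tensors
    output_shape: tuple      # shape of the output tensor
    dtype: str = "bf16"      # numeric precision
    target: str = "cuda"     # "cuda", "rocm", or "mtia"
    constraints: dict = field(default_factory=dict)  # hardware limits

# Example request targeting a proprietary accelerator.
spec = KernelSpec(
    op="fused_embedding_bag",
    input_shapes=[(4096, 128), (4096,)],
    output_shape=(4096, 128),
    target="mtia",
    constraints={"max_shared_mem_kb": 192},
)
```

A declarative record like this is all the user supplies; the framework handles backend selection, code generation, and tuning from here.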

The core of KernelEvolve is a multi‑abstraction pipeline. Depending on the target hardware, the framework selects the most appropriate programming model among Triton (high‑level DSL), CuTe (intermediate DSL), or low‑level hardware diagnostic languages such as PTX, AMD GCN assembly, or MTIA‑specific IR. This abstraction layer decouples the user from the intricacies of each language and enables a single specification to be compiled into many different code forms.
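The backend selection described above can be sketched as a simple lookup from target hardware to an ordered list of abstraction levels. The mapping and policy below are assumptions for illustration, not the framework's actual logic.

```python
# Illustrative mapping from hardware target to available programming
# abstractions, ordered from highest-level DSL to lowest-level form.
# The table and the selection rule are assumptions, not the real policy.
ABSTRACTIONS = {
    "cuda": ["triton", "cute", "ptx"],
    "rocm": ["triton", "gcn_asm"],
    "mtia": ["triton", "mtia_ir"],
}

def pick_abstraction(target: str, need_low_level: bool = False) -> str:
    """Return the highest-level DSL for a target, or drop to the
    lowest level when fine-grained hardware control is requested."""
    levels = ABSTRACTIONS[target]
    return levels[-1] if need_low_level else levels[0]
```

For example, `pick_abstraction("cuda")` would start at Triton, while `pick_abstraction("rocm", need_low_level=True)` would fall through to GCN assembly.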

Kernel optimization is framed as a graph‑based search problem. Each node in the search graph represents a concrete code transformation—e.g., changing thread‑block dimensions, tiling strategy, register allocation, or memory‑coalescing pattern—while edges encode the order in which transformations can be applied. A hybrid policy that blends reinforcement‑learning‑driven selection with heuristic pruning explores this space efficiently, even when the number of possible configurations reaches hundreds of thousands. The fitness function is multi‑objective: it measures runtime latency, memory footprint, power consumption, and compliance with hardware limits (register count, shared‑memory capacity, instruction‑set constraints). A termination rule fires when a predefined performance target is met, the search budget is exhausted, or the improvement curve plateaus.
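The search loop above can be sketched as follows. Everything here is a toy stand-in under stated assumptions: the mutation, cost model, and weights are fabricated for illustration, and a simple best-first heap replaces the paper's hybrid RL-plus-heuristic policy.

```python
import heapq
import random

def fitness(cand):
    # Multi-objective score (lower is better): latency plus a weighted
    # memory term, with a hard penalty for violating a hardware limit
    # (here, a toy 255-register cap).
    penalty = float("inf") if cand["regs"] > 255 else 0.0
    return cand["latency_us"] + 0.1 * cand["mem_kb"] + penalty

def mutate(cand, rng):
    # Toy transformation edge: perturb the tile size, standing in for
    # re-tiling, re-blocking, or changing the coalescing pattern.
    new = dict(cand)
    new["tile"] = max(16, cand["tile"] + rng.choice([-16, 16]))
    new["latency_us"] = 100.0 / (new["tile"] / 16)  # fake cost model
    return new

def search(seed, budget=50, target_us=30.0):
    # Best-first expansion: pop the most promising node, apply one
    # transformation, keep the best candidate seen so far, and stop on
    # the termination rule (target met or budget exhausted).
    rng = random.Random(0)
    frontier = [(fitness(seed), 0, seed)]
    best = seed
    for step in range(budget):
        _, _, cand = heapq.heappop(frontier)
        child = mutate(cand, rng)
        if fitness(child) < fitness(best):
            best = child
        if fitness(best) <= target_us:  # termination rule
            break
        heapq.heappush(frontier, (fitness(child), step + 1, child))
    return best

seed = {"tile": 16, "latency_us": 100.0, "mem_kb": 64, "regs": 128}
best = search(seed)
```

The integer `step` in each heap entry is a tiebreaker so candidates with equal fitness never force a dict comparison.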

A distinctive feature of KernelEvolve is its runtime‑adaptive prompt synthesis. After each candidate kernel is compiled and profiled, the system extracts execution statistics (cache‑miss rates, warp‑occupancy, divergence, etc.) and feeds them into a retrieval‑augmented prompt that is sent to a large language model (LLM). The prompt combines retrieved knowledge‑base entries (similar kernels, hardware quirks) with the fresh profiling data, guiding the LLM to generate new code snippets that respect the latest hardware characteristics. This mechanism allows the framework to handle accelerators that are not represented in the LLM’s pre‑training corpus, such as the proprietary MTIA v3, by leveraging the persistent knowledge base that stores hardware‑specific constraints and past optimization outcomes.
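The prompt-assembly step described above might look like the sketch below. The template, field names, and hardware hints are illustrative assumptions, not the system's actual prompt format.

```python
# Sketch of retrieval-augmented prompt synthesis: fresh profiling
# statistics are combined with retrieved knowledge-base snippets before
# the next LLM call. Template and fields are assumptions.
def build_prompt(op, profile, kb_snippets):
    hints = "\n".join(f"- {s}" for s in kb_snippets)
    return (
        f"Optimize the `{op}` kernel.\n"
        f"Last run: occupancy={profile['occupancy']:.0%}, "
        f"L2 miss rate={profile['l2_miss']:.0%}.\n"
        f"Known hardware constraints:\n{hints}\n"
        "Propose a revised kernel that improves occupancy."
    )

prompt = build_prompt(
    "fused_embedding_bag",
    {"occupancy": 0.42, "l2_miss": 0.31},
    ["MTIA v3: 192 KB local memory per PE",
     "prefer 128-wide vector loads"],
)
```

Because the hardware facts arrive through retrieval rather than the model's weights, the same loop works for accelerators the LLM has never seen.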

The knowledge base (KB) is a graph database that encodes hardware specifications, previously discovered optimal configurations, and constraint rules. When a new kernel request arrives, the KB is queried for analogous cases; the retrieved information seeds the initial search direction and reduces the number of unnecessary explorations. As the system discovers new constraints or performance models, it automatically updates the KB, creating a virtuous cycle of continual improvement.
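A minimal sketch of that lookup, assuming a flat list of entries and a crude similarity score in place of a real graph-database query (both are fabrications for illustration):

```python
# Toy knowledge-base lookup that seeds a new search from the most
# similar prior case. The entry schema and scoring are assumptions.
KB = [
    {"op": "embedding_bag", "target": "mtia", "best_tile": 64},
    {"op": "layer_norm",    "target": "cuda", "best_tile": 32},
]

def seed_from_kb(op, target):
    """Return the stored config of the most similar prior case,
    or None when nothing in the KB is related at all."""
    def score(entry):
        # One point for a matching target, one for an op-name overlap.
        same_target = entry["target"] == target
        related_op = entry["op"] in op or op in entry["op"]
        return int(same_target) + int(related_op)
    best = max(KB, key=score)
    return best if score(best) > 0 else None
```

A hit here (e.g. a prior `embedding_bag` tuning on the same target) seeds the initial search direction; a miss simply falls back to an unseeded search.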

Experimental validation covers three axes. First, KernelEvolve achieved a 100% pass rate on the public KernelBench suite (250 problems across three difficulty levels), demonstrating that the automatically generated kernels are functionally correct. Second, the authors evaluated 160 PyTorch ATen operators on three heterogeneous platforms (NVIDIA A100, AMD MI250, and MTIA v3), yielding 480 operator-platform configurations, all verified to be 100% correct. Third, on production-grade recommendation workloads, KernelEvolve delivered speed-ups averaging 4.2× and peaking at 17× over hand-written PyTorch baselines, while cutting development time from weeks to a few hours. Notably, the framework succeeded in generating performant kernels for MTIA v3 despite the accelerator's closed ISA, thanks to the KB-augmented prompt mechanism.

Beyond raw performance, KernelEvolve dramatically lowers the programmability barrier for emerging AI hardware. Engineers no longer need deep expertise in each accelerator’s assembly language; instead, they provide a concise specification and let the system synthesize, compile, and tune the kernel. This capability is especially valuable for proprietary or nascent architectures that lack extensive community tooling.

The paper concludes with several future directions. Extending the specification language and KB to cover domains beyond recommendation (e.g., NLP, vision) will broaden the framework’s applicability. Training hardware‑specific LLMs or fine‑tuning existing models on accelerator‑level code could further improve generation quality. Security and privacy safeguards are needed when handling closed‑source ISAs to prevent inadvertent leakage of proprietary details. Finally, the authors envision a community‑driven KB sharing platform that aggregates optimization knowledge across organizations, narrowing the gap between open‑source and custom solutions.

In summary, KernelEvolve integrates multi‑level abstraction, graph‑based exploration, runtime‑adaptive LLM prompting, and a self‑updating knowledge base to automate kernel generation and optimization across heterogeneous AI accelerators. It achieves correctness, substantial performance gains, and dramatic reductions in development effort, offering a compelling blueprint for scalable AI system engineering in the era of rapidly diversifying hardware.

