KForge: Program Synthesis for Diverse AI Hardware Accelerators
📝 Abstract
GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms. We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.
📄 Content
Writing high-performance compute kernels requires mastering domain-specific languages such as CUDA (NVIDIA), OpenCL (Khronos Group), Metal (Apple Inc., 2014), or Triton (Tillet et al., 2019). Porting kernels across accelerators is extremely challenging and often requires fundamental algorithmic restructuring: a kernel optimized for NVIDIA’s H100 cannot be easily adapted for AMD’s MI300X or Apple’s M-series chips, as each platform demands architecture-specific optimizations.
Most models have optimized implementations for NVIDIA’s hardware because training is usually conducted on NVIDIA accelerators. However, a growing number of neural network workloads run on other accelerators, both in the cloud and on edge devices such as smartphones or laptops, where users increasingly run inference tasks such as on-device language models, computer vision, and speech synthesis or recognition systems.

1 Gimlet Labs, San Francisco, California, USA; 2 Department of Electrical Engineering, Stanford University, Stanford, California, USA; 3 Department of Computer Science, Stanford University, Stanford, California, USA. Correspondence to: Taras Sereda (taras@gimletlabs.ai).
Compilers such as torch.compile (Ansel et al., 2024) and TensorRT-LLM (NVIDIA, 2023) greatly speed up neural computation graphs by leveraging automatic kernel fusion, dynamic shape specialization, and graph optimizations.
Nevertheless, building high-performance kernels, as demonstrated by FlashAttention (Dao et al., 2022; Dao, 2023), requires combining clever algorithmic techniques with careful hardware utilization. Specifically, integrating online softmax (Milakov & Gimelshein, 2018) with tiled attention computation, while leveraging hardware-specific instructions, enables superior performance. Together, these optimizations reduce kernel scheduling overhead and optimize memory access patterns, maximizing arithmetic intensity while minimizing memory pipeline bubbles.

This work explores whether large language models (LLMs) can generate kernel programs for multiple hardware accelerators, leveraging both algorithmic and hardware-specific optimizations. We target two distinct ecosystems: NVIDIA’s CUDA, with its mature tooling and comprehensive PyTorch (Paszke et al., 2019) support, and Apple’s Metal for Apple Silicon GPUs, which has limited programmatic profiling capabilities.
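To make the online-softmax idea concrete, here is a minimal single-pass sketch in plain Python. It illustrates the rescaled running max/sum trick from Milakov & Gimelshein (2018); real FlashAttention-style kernels apply this per tile of the attention score matrix in on-chip memory, which is what avoids materializing the full softmax input.

```python
import math

def online_softmax(scores):
    """Single-pass (online) softmax: keep a running maximum and a running
    sum of exponentials rescaled to that maximum, so the input is read
    only once and large values never overflow."""
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        # rescale the old sum to the new maximum before adding the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in scores]
```

The output matches the conventional two-pass softmax; the benefit is that `m` and `d` can be updated tile by tile, fusing softmax into the attention loop.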
Figure 1. Iterative program synthesis and optimization loop using LLMs. The workflow consists of two main phases: (1) a functional pass that iteratively refines synthesized programs until the code compiles, executes without errors, and produces correct output, and (2) an optimization pass that provides performance feedback to the LLM for iterative performance improvement.
We propose an agentic program synthesis framework, shown in Figure 1, that mirrors the real-world workflow of kernel engineers, who first ensure that a kernel implementation is functionally correct. The kernel is then iteratively optimized using hardware utilization metrics such as memory bandwidth utilization, warp occupancy, or arithmetic intensity. This setup lets the model first arrive at a functionally correct program that is then incrementally improved based on previous attempts, simulating a practical development loop.
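The two-phase loop can be sketched as follows. This is an illustrative skeleton, not KForge's actual implementation: `generate`, `compile_and_check`, `profile`, and `advise` are caller-supplied stand-ins for the generation agent, the build/correctness harness, the platform profiler, and the performance-analysis agent, and their names and signatures are assumptions.

```python
def synthesize_kernel(task, generate, compile_and_check, profile, advise,
                      max_functional_iters=5, max_opt_iters=5):
    """Two-phase refinement loop in the style of Figure 1."""
    # Functional pass: refine until the program compiles, runs, and is correct.
    program = generate(task, feedback=None)
    for _ in range(max_functional_iters):
        ok, feedback = compile_and_check(program)
        if ok:
            break
        program = generate(task, feedback=feedback)
    else:
        return None  # no functionally correct program within the budget

    # Optimization pass: profile, get recommendations, keep the fastest
    # candidate that remains correct.
    best, best_time = program, profile(program)
    for _ in range(max_opt_iters):
        recommendation = advise(best, best_time)  # e.g. occupancy or bandwidth hints
        candidate = generate(task, feedback=recommendation)
        ok, _ = compile_and_check(candidate)
        if ok:
            t = profile(candidate)
            if t < best_time:
                best, best_time = candidate, t
    return best
```

The key design point carried over from the paper is the separation of concerns: correctness feedback drives the first loop, and profiler-derived recommendations drive the second.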
Recent work has explored using LLMs to automate GPU kernel generation and optimization, addressing the challenge of writing efficient kernels for machine learning workloads.
KernelBench (Ouyang et al., 2025) introduced a benchmark framework with 250 PyTorch workloads to evaluate LLMs’ ability to generate efficient GPU kernels. The benchmark uses a fast_p metric measuring both correctness and speedup over baseline implementations. Results show that even frontier reasoning models match PyTorch baseline performance in fewer than 20% of cases, revealing a critical trade-off between optimization complexity and correctness. Sakana AI’s AI CUDA Engineer (Lange et al., 2025) presented an agentic framework for automatic CUDA kernel discovery using evolutionary optimization. Initially claiming 10-100x speedups over PyTorch operations, the system was later found to exploit evaluation framework vulnerabilities (“reward hacking”), leading to inflated performance claims. The project released a dataset of 30,000+ generated kernels, but the episode highlighted the challenges of robust evaluation in automated optimization.
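A hedged sketch of the fast_p idea: the fraction of tasks whose generated kernel is both functionally correct and more than p times faster than the baseline. The tuple layout below is illustrative, not KernelBench's exact schema.

```python
def fast_p(results, p=1.0):
    """Fraction of tasks that are correct AND achieve speedup > p over the
    baseline. `results` is a list of (correct: bool, speedup: float) pairs;
    speedup is baseline_time / generated_time."""
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)
```

With p = 1.0 the metric counts kernels that beat the baseline at all; raising p (e.g. p = 2.0) demands larger speedups, which is how the benchmark exposes the correctness-versus-optimization trade-off.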
Liger Kernel (Hsu et al., 2025) provides production-ready Triton kernels for LLM training, achieving 20% throughput increase and 60% memory reduction through kernel fusion and optimization. Unlike automated approaches, it offers curated, hand-optimized kernels for common operations like RMSNorm and SwiGLU.
KernelLLM (Fisches et al., 2025) fine-tuned an 8B parameter model based on Llama 3.1 Instruct specifically for translating PyTorch modules into Triton kernels, achieving competitive performance on KernelBench-Triton despite its smaller size compared to general-purpose models.
FlashInfer (