Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88× speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.

INTRODUCTION

One of the most crucial areas of performance optimization today is that of AI/ML workloads. As the landscape of GPUs rapidly evolves, AI/ML software must constantly adapt to capitalize on the performance benefits of new hardware. Unfortunately, the optimization of GPU kernels remains one of the most technically demanding frontiers in performance engineering, requiring mastery of GPU parallelism, memory hierarchies, and scheduling behavior. Numerous model-level compilers enable fully automated AI/ML model optimization (Ansel et al., 2024; NVIDIA Corporation, 2024; Microsoft Corporation, 2024; Sabne, 2020), but they require continual updates to target new GPUs and often fall short of hand-tuned performance. For instance, the original custom FlashAttention kernel (Dao et al., 2022) achieved a 7.6× speedup over a generic PyTorch implementation before being integrated as a built-in attention operator.
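The exploit-heavy strategy with an error-fixing agent mentioned above can be sketched as a simple search loop. The sketch below is illustrative only: the agent functions are hypothetical stubs standing in for LLM calls and GPU benchmarking, not the paper's actual implementation.

```python
import random
from typing import Optional, Tuple

def propose_optimization(code: str) -> str:
    # Hypothetical stand-in for an LLM "optimizer" agent that
    # proposes a rewritten candidate of the PyTorch module.
    return code + f"  # candidate {random.randint(0, 999)}"

def fix_errors(code: str) -> str:
    # Hypothetical stand-in for an LLM "error-fixing" agent that
    # repairs candidates that fail to compile or mismatch outputs.
    return code.replace("BUG", "")

def benchmark(code: str) -> Optional[float]:
    # Stand-in for compiling the candidate and timing it on the GPU;
    # returns speedup over the eager baseline, or None on failure.
    if "BUG" in code:
        return None
    return 1.0 + random.random()

def optimize(baseline: str, steps: int = 20) -> Tuple[str, float]:
    """Exploit-heavy loop: always mutate the current best candidate."""
    best_code, best_speedup = baseline, 1.0
    for _ in range(steps):
        candidate = propose_optimization(best_code)
        speedup = benchmark(candidate)
        if speedup is None:                    # broken candidate:
            candidate = fix_errors(candidate)  # route it to the fixer agent
            speedup = benchmark(candidate)
        if speedup is not None and speedup > best_speedup:
            best_code, best_speedup = candidate, speedup
    return best_code, best_speedup

best, speedup = optimize("def forward(x): return x @ w")
```

An explore-heavy variant would instead mutate many independent candidates in parallel; the finding quoted above suggests that concentrating mutations on the current best, with a fixer to recover from broken rewrites, is the more effective division of labor.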
Domain-specific languages and libraries simplify manual GPU development and tuning (Kerr et al., 2017; Tillet et al., 2019; Spector et al., 2024), yet achieving peak performance with tools like Triton (Tillet et al., 2019) still demands substantial time and expertise, including mastery of GPU memory hierarchies, tiling, and block sizing.
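To make the tiling and block-sizing expertise concrete, here is a minimal pure-Python sketch of a blocked matrix multiply. It runs on the CPU and only illustrates the blocking structure; in a real Triton kernel the `tile` parameter corresponds to block sizes that must be tuned per GPU, which is exactly the knob an autotuner or expert searches over.

```python
def tiled_matmul(a, b, n, tile=4):
    # a, b: n x n matrices as nested lists of floats.
    # `tile` is the block size; choosing it well keeps each block's
    # working set in fast memory (cache here, shared memory on a GPU).
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Process one (tile x tile) block of the output at a time.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        s = c[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c
```

The block size trades data reuse against parallel occupancy, and the best value differs across GPU generations; picking it by hand for each target is part of the cost that the compilers and multi-agent systems discussed here aim to remove.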

