CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Matrix multiplication (matmul) is one of the most fundamental operations in LLMs. However, manually optimizing matmul kernels is challenging: different matrix dimensions (M, N, K) require different optimization strategies, and optimizations rarely transfer across GPU architectures, which makes comprehensive manual tuning hard at scale. In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. These configurations represent all 10³ (M, N, K) combinations.
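A minimal sketch of the kind of speedup-based reward such a system might use; the function name, timing inputs, and the zero-reward convention for failed kernels are illustrative assumptions, not the paper's actual implementation:

```python
def speedup_reward(candidate_time_ms: float, baseline_time_ms: float) -> float:
    """Hypothetical RL reward: speedup of a generated HGEMM kernel over the
    cuBLAS baseline for the same (M, N, K) configuration.

    Values > 1.0 mean the candidate kernel beat cuBLAS. A kernel that
    crashed or produced incorrect results would be assigned 0.0 upstream
    (here signaled by a non-positive candidate time).
    """
    if candidate_time_ms <= 0:
        return 0.0
    return baseline_time_ms / candidate_time_ms

# Illustrative numbers only: a candidate running in 0.8 ms vs a 1.0 ms baseline.
print(speedup_reward(0.8, 1.0))
```

Tying the reward directly to measured execution time lets the RL loop optimize for the hardware it actually runs on, rather than for proxy heuristics.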