Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs
Efficient GPU programming is crucial for achieving high performance in deep learning (DL) applications. The performance of GPU programs depends on how data is parallelized across threads and arranged within memory subsystems. The mapping functions describing tensors on GPUs are known as \emph{tensor layouts}. Low-level programming frameworks, such as CUTLASS and Hidet, provide expressive layout abstractions but often require \emph{considerable programming effort} to manually specify optimal layouts. High-level GPU programming languages, such as Triton, rely on compiler heuristics to generate dataflow, layouts, and pipelining strategies in GPU programs. However, the heuristics for dataflow and pipelining strategies are not generalizable to complex operators. To balance expressiveness and programmability, we propose Hexcute, a compiler framework that automates layout synthesis while providing explicit control over dataflow and pipelining. Hexcute formalizes layout synthesis as a constraint programming problem and solves it with a type-inference-based algorithm. This approach enables systematic exploration of optimal layouts and instructions. Our evaluation shows that Hexcute matches the performance of libraries like cuBLAS and FlashAttention on GEMM, Attention, and their variants, while reducing the amount of code by 1.27$\times$-7.94$\times$ compared to CUTLASS. For mixed-type mixture-of-experts (MoE) operators, Hexcute achieves an average speedup of 6.46$\times$ over Triton. In the end-to-end evaluations of vLLM, Hexcute delivers up to 2.60$\times$ speedup on DeepSeek-R1-AWQ and 2.04$\times$ on a Mamba-based model.
💡 Research Summary
Paper Overview
The authors present Hexcute, a compiler framework that automates tensor layout synthesis for GPU kernels while preserving explicit control over dataflow and pipelining. The motivation stems from a dichotomy in existing GPU programming ecosystems: low‑level libraries such as CUTLASS and Hidet expose powerful layout abstractions but demand substantial manual effort, whereas high‑level languages like Triton generate layouts heuristically, which often fails for complex operators (e.g., mixed‑precision Mixture‑of‑Experts, scan‑based Mamba). Hexcute bridges this gap by formalizing layout synthesis as a constraint‑programming problem and solving it with a type‑inference‑based algorithm.
Key Technical Contributions
- Layout as Functions – Tensor layouts are modeled as functions mapping logical coordinates to memory addresses. Shared-memory tensors use CuTe-style "shape:stride" tuples; register tensors use Thread-Value (TV) layouts that combine a thread mode and a value mode. This functional view enables composition and inversion, giving layouts a monoid structure.
- Embedding Layouts in Types – Layouts are embedded directly into tensor types. Consequently, any operation that consumes a tensor carries its layout constraints, allowing the compiler to reason about layout compatibility across the whole program.
- Constraint Generation – For each hardware instruction (e.g., Tensor-Core GEMM, ldmatrix), the authors extract concrete layout requirements (e.g., a specific thread-to-element mapping, bank-conflict-free strides). These become algebraic constraints over the layout functions.
- Type-Inference-Based Solver – Building on Hindley-Milner-style inference, the solver treats layout variables as unknowns and propagates constraints through the program's type graph. Because layouts compose, the solver can systematically construct complex mappings from simpler primitives, guaranteeing correctness by construction.
- Tile-Level DSL – Hexcute is implemented as a Python-embedded DSL extending Hidet Script. It exposes shared memory and registers explicitly, allowing developers to write concise tile-level kernels while the framework automatically infers optimal layouts and selects appropriate instructions.
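The "layout as function" idea from the contributions above can be illustrated with a toy sketch. This is not Hexcute's or CuTe's actual API; the `Layout` class and its methods are illustrative names for the shape:stride view, where a layout maps a logical coordinate to a linear offset via a dot product with the strides.

```python
# Toy sketch of a shape:stride layout as a coordinate-to-offset function.
# Names here (Layout, all_offsets) are illustrative, not the Hexcute API.
from itertools import product


class Layout:
    def __init__(self, shape, stride):
        assert len(shape) == len(stride)
        self.shape = tuple(shape)
        self.stride = tuple(stride)

    def __call__(self, coord):
        """Map a logical coordinate to a linear offset: sum(c_i * s_i)."""
        assert all(0 <= c < n for c, n in zip(coord, self.shape))
        return sum(c * s for c, s in zip(coord, self.stride))

    def all_offsets(self):
        """Enumerate the offset of every coordinate in the tile."""
        return [self(c) for c in product(*(range(n) for n in self.shape))]


# The same 2x4 logical tile under two physical arrangements:
row_major = Layout((2, 4), (4, 1))  # stride (4, 1): rows are contiguous
col_major = Layout((2, 4), (1, 2))  # stride (1, 2): columns are contiguous

print(row_major((1, 2)))  # 1*4 + 2*1 = 6
print(col_major((1, 2)))  # 1*1 + 2*2 = 5
```

Because each layout is just a function, two layouts can be composed or inverted like any other functions, which is what gives the layout algebra its monoid structure.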
Evaluation Highlights
- Operator‑Level Benchmarks: On GEMM and FlashAttention, Hexcute matches the performance of cuBLAS and FlashAttention libraries, while reducing source lines of code by a factor of 1.27×–7.94× compared with CUTLASS.
- Mixed‑Precision MoE: For INT4‑weight / FP16‑activation Mixture‑of‑Experts kernels, Hexcute achieves an average 6.46× speedup over Triton, which suffers from excessive copies and sub‑optimal memory placement.
- End‑to‑End vLLM: Integrating Hexcute kernels into the vLLM inference engine yields up to 2.60× speedup on DeepSeek‑R1‑AWQ and 2.04× on a Mamba‑based model, demonstrating real‑world impact.
Insights and Implications
Hexcute shows that treating layouts as first‑class functional objects and solving their synthesis via type inference can systematically explore a vast design space that traditional integer‑linear programming cannot. The monoid property of layout composition enables modular reasoning, making the framework extensible to new hardware instructions without hand‑crafted heuristics. By automating layout selection while still allowing explicit dataflow and pipelining specifications, Hexcute offers a sweet spot for experienced kernel engineers: high performance with dramatically reduced boilerplate.
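The type-inference flavor of the solver can be sketched with a toy unification pass. This is an assumed illustration, not Hexcute's internals: layout variables are unknowns, each hardware instruction contributes equality constraints on its operands' layouts, and unification propagates concrete layouts (here, plain shape:stride tuples) through the program.

```python
# Toy unification over layout variables (assumed names, not Hexcute code).
# Instructions pin some layouts; equality constraints propagate the rest.

class Var:
    """An unknown layout to be solved for."""
    def __init__(self, name):
        self.name = name


def unify(constraints):
    """Solve (lhs, rhs) equalities where either side is a Var or a
    concrete layout, modeled as a (shape, stride) tuple."""
    subst = {}

    def resolve(t):
        # Follow the substitution chain until a Var is unbound or concrete.
        while isinstance(t, Var) and t.name in subst:
            t = subst[t.name]
        return t

    for lhs, rhs in constraints:
        a, b = resolve(lhs), resolve(rhs)
        if isinstance(a, Var):
            subst[a.name] = b
        elif isinstance(b, Var):
            subst[b.name] = a
        elif a != b:
            raise TypeError(f"layout conflict: {a} vs {b}")
    return subst


# A Tensor-Core MMA pins its operand layout; a copy op requires its
# destination register tile to match whatever layout feeds the MMA.
mma_operand = ((2, 4), (4, 1))   # concrete layout demanded by the instruction
reg_tile = Var("reg_tile")
copy_dst = Var("copy_dst")

solution = unify([(reg_tile, mma_operand), (copy_dst, reg_tile)])
print(solution["copy_dst"])  # ((2, 4), (4, 1))
```

A conflicting pair of concrete layouts raises an error instead of silently producing a wrong mapping, mirroring the "correct by construction" property claimed for the real solver.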
Future Directions
The authors suggest augmenting the constraint solver with a cost model (e.g., accounting for memory bandwidth, register pressure) to prioritize more efficient layouts during synthesis. Extending the approach to non‑NVIDIA GPUs (AMD CDNA, Intel Xe) and exploring integration with emerging compiler infrastructures (MLIR, TVM) are also identified as promising avenues.
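One concrete cost-model signal the solver could score, as a hedged sketch of this direction: shared-memory bank conflicts, which the paper's constraints already guard against. Assuming NVIDIA's 32-bank, 4-byte-word shared memory, the classic effect is that a naive power-of-two row stride serializes a warp's column access, while a padded stride is conflict-free.

```python
# Hedged sketch of a cost-model signal: score a candidate shared-memory
# row stride (in 4-byte words) by the worst-case bank conflict when a
# 32-thread warp reads one column of a tile (thread t touches t * stride).

BANKS = 32  # NVIDIA shared memory has 32 banks of 4-byte words


def max_bank_conflict(stride, threads=32):
    """Largest number of threads in the warp mapped to the same bank."""
    hits = {}
    for t in range(threads):
        bank = (t * stride) % BANKS
        hits[bank] = hits.get(bank, 0) + 1
    return max(hits.values())


# Naive stride 32 puts every thread in bank 0 (32-way conflict);
# padding the row to stride 33 spreads the warp over all 32 banks.
for stride in (32, 33):
    print(stride, max_bank_conflict(stride))
```

A cost-aware solver could rank candidate layouts by metrics like this alongside register pressure and bandwidth estimates, rather than accepting the first layout that satisfies the constraints.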
In summary, Hexcute advances GPU compiler technology by delivering automated, provably correct layout synthesis that scales to complex, mixed‑precision deep‑learning operators, achieving library‑level performance with far less developer effort.