Pushing Tensor Accelerators Beyond MatMul in a User-Schedulable Language


Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. They are hard to program, however, so developers typically rely on vendor-provided kernel libraries. As a result, use of tensor accelerators is limited to the interfaces those libraries provide, which are designed mainly for traditional ML and scientific computing workloads. In this paper, we show that tensor accelerators can improve the performance of applications beyond simple variants of MatMul. For example, many image processing pipelines are linear transformations over matrices in disguise and can therefore exploit such specialized hardware. Doing so is nonetheless hindered by the difficulty of programming tensor accelerators. We tackle this problem with compiler-based techniques: operations are expressed succinctly as algorithms in the Halide user-schedulable language, and we implement a flexible tensor instruction selector based on equality saturation. The tensor instruction selector supports both CPU- and GPU-attached tensor accelerators and composes with existing scheduling operations (e.g., producer-consumer fusion). Together, this enables developers to write diverse accelerator-leveraging applications in a few dozen lines. Using our system, we demonstrate the potential of tensor accelerators beyond their traditional domains. We implement several image processing pipelines (e.g., filtering, resampling, and denoising) in our system and evaluate them against non-accelerator-leveraging baselines, showing that these pipelines can achieve significant speedups. For example, a downsampling routine is sped up by $6.1\times$ by utilizing Tensor Cores on an Nvidia RTX 4070 GPU.


💡 Research Summary

This paper presents a novel compiler-based approach to unlock the potential of tensor accelerators (such as NVIDIA Tensor Cores and Intel AMX) for applications beyond the traditional domain of matrix multiplication (MatMul). The core challenge addressed is the difficulty in programming these specialized hardware units, which forces developers to rely on vendor-provided libraries with narrow, rigid interfaces, limiting optimization opportunities like fusion and tiling.

The authors’ solution is built upon the user-schedulable language Halide, which separates algorithm definition from performance-oriented scheduling. They introduce “HARD BOILED,” a flexible tensor instruction selector integrated into the Halide compiler. The key innovation of HARD BOILED is its use of equality saturation (EqSat), a program optimization technique that efficiently explores a space of program variants defined by rewrite rules. Instead of relying on brittle pattern matching hardcoded against specific syntactic forms, HARD BOILED expresses tensor computation patterns using Halide’s existing vectorized intermediate representation (IR) nodes (Ramp, Broadcast, vector_reduce_add). It then uses EqSat (via the egglog system) to flexibly identify matches between these IR patterns and the semantics of target hardware tensor instructions.
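To build intuition for why equality saturation makes instruction selection robust where syntactic matching is brittle, here is a toy sketch (not the paper's implementation, and far simpler than a real e-graph). Terms are nested tuples; the operator names `reduce_add` and `mul` are illustrative stand-ins, not Halide's actual IR nodes. A single commutativity rewrite is enough to expose a match that a fixed syntactic pattern would miss:

```python
# Toy illustration of equality saturation for instruction selection.
# Terms are nested tuples like ("mul", "A", "B"); strings are leaves.

def rewrites(term):
    """Yield terms equal to `term` under one rewrite step."""
    if isinstance(term, tuple):
        op, *args = term
        # Rewrite inside each child.
        for i, a in enumerate(args):
            for a2 in rewrites(a):
                yield (op, *args[:i], a2, *args[i + 1:])
        # Commutativity rule: a * b == b * a.
        if op == "mul":
            yield ("mul", args[1], args[0])

def saturate(term, max_iters=10):
    """Collect all forms reachable from `term` via the rewrite rules."""
    seen = {term}
    frontier = {term}
    for _ in range(max_iters):
        new = set()
        for t in frontier:
            new |= {t2 for t2 in rewrites(t) if t2 not in seen}
        if not new:
            break
        seen |= new
        frontier = new
    return seen

# Hypothetical selection pattern: reduce_add(mul(A, B)) maps to a tensor
# intrinsic. The source program wrote the operands in the other order,
# so a purely syntactic matcher would fail to fire.
program = ("reduce_add", ("mul", "B", "A"))
pattern = ("reduce_add", ("mul", "A", "B"))

forms = saturate(program)
print(pattern in forms)  # True: saturation exposes the match
```

A real system like egglog works over e-graphs (compact, shared equivalence classes) rather than an explicit set of terms, but the effect shown here is the same: all rewrite-equivalent forms are available when matching against instruction patterns.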

The workflow for a developer is straightforward: write a Halide algorithm and a schedule. Using scheduling primitives like store_in(MemoryType::AMXTile), the developer indicates that certain computations should be stored in, and computed from, tensor accelerator memory. The Halide compiler lowers this to a vectorized IR. HARD BOILED then processes this IR, annotates data movement between host and accelerator memory, and formulates an EqSat problem. After applying a set of predefined tensor-operation rewrite rules, it extracts an optimized program in which suitable computations are mapped to accelerator intrinsics. This process works seamlessly with existing Halide scheduling operations, such as producer-consumer fusion.
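The extraction step at the end of this pipeline can be pictured as choosing the cheapest member of each equivalence class that saturation produced. The sketch below is an assumption-laden simplification: the intrinsic name `wmma_dot` and the cost numbers are made up for illustration, and real extraction works over a shared e-graph with a principled cost model, not a flat list of terms:

```python
# Toy cost-based extraction: after saturation, pick the cheapest
# equivalent form. The hypothetical tensor intrinsic "wmma_dot" is
# modeled as much cheaper than the scalar reduce/multiply it replaces.

COSTS = {"wmma_dot": 1, "reduce_add": 8, "mul": 4, "load": 1}

def cost(term):
    """Sum per-operator costs over a term tree; leaves are free."""
    if isinstance(term, tuple):
        op, *args = term
        return COSTS.get(op, 1) + sum(cost(a) for a in args)
    return 0

# One equivalence class discovered by saturation: the scalar form and
# a form rewritten to use the accelerator intrinsic are equal.
equiv_class = [
    ("reduce_add", ("mul", ("load", "A"), ("load", "B"))),
    ("wmma_dot", ("load", "A"), ("load", "B")),
]

best = min(equiv_class, key=cost)
print(best[0])  # wmma_dot
```

The point of the sketch: once the rewrite rules have put the intrinsic-bearing form into the equivalence class, selecting it is a matter of cost minimization rather than fragile pattern order.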

The paper demonstrates the effectiveness of this system by implementing and evaluating several image and signal processing pipelines that are linear transformations in disguise, including 1D/2D convolution, integer and non-integer factor resampling, recursive filtering, and denoising via discrete cosine transform (DCT). Evaluations on an NVIDIA RTX 4070 GPU show significant speedups by leveraging Tensor Cores: for instance, a downsampling routine achieves a 6.1x speedup. End-to-end applications see speedups ranging from 1.1x to 1.4x. The authors also validate their system on classical ML workloads, showing performance comparable to library implementations.
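To see concretely why a pipeline like DCT denoising is a "linear transformation in disguise", note that the 2-D DCT of an image block is just two matrix multiplications, which is exactly the shape of work tensor cores accelerate. A minimal pure-Python check (using the standard orthonormal DCT-II matrix; not code from the paper):

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k, column i."""
    return [[math.sqrt((1 if k == 0 else 2) / n)
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(r) for r in zip(*m)]

n = 8
D = dct_matrix(n)
X = [[float((i * n + j) % 7) for j in range(n)] for i in range(n)]  # toy 8x8 block

# 2-D DCT of the block X expressed as two matrix products: D @ X @ D^T.
Y = matmul(matmul(D, X), transpose(D))

# The transform is orthonormal, so D^T @ Y @ D recovers X up to
# float rounding -- the whole round trip is nothing but MatMul.
Xr = matmul(matmul(transpose(D), Y), D)
err = max(abs(Xr[i][j] - X[i][j]) for i in range(n) for j in range(n))
print(err < 1e-9)  # True
```

The same observation applies to the other pipelines the paper evaluates: convolution and resampling can likewise be phrased as (structured) matrix products, which is what lets HARD BOILED map them onto tensor instructions.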

In summary, this work makes two primary contributions: 1) the design and implementation of HARD BOILED, a flexible, EqSat-based tensor instruction selector for Halide that supports both CPU- and GPU-attached accelerators, and 2) a concrete demonstration that tensor accelerators can provide substantial performance benefits for a broader class of applications beyond MatMul when programmed through a high-level, schedulable abstraction. This approach lowers the programming barrier for tensor accelerators and opens new avenues for leveraging their power in diverse computational domains.

