AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The performance of deep learning models critically depends on efficient kernel implementations, yet developing high-performance kernels for specialized accelerators remains time-consuming and expertise-intensive. While recent work demonstrates that large language models (LLMs) can generate correct and performant GPU kernels, kernel generation for neural processing units (NPUs) remains largely underexplored due to domain-specific programming models, limited public examples, and sparse documentation. Consequently, directly generating AscendC kernels with LLMs yields extremely low correctness, highlighting a substantial gap between GPU and NPU kernel generation. We present AscendCraft, a DSL-guided approach for automatic AscendC kernel generation. AscendCraft introduces a lightweight DSL that abstracts non-essential complexity while explicitly modeling Ascend-specific execution semantics. Kernels are first generated in the DSL using category-specific expert examples and then transcompiled into AscendC through structured, constraint-driven LLM lowering passes. Evaluated on MultiKernelBench across seven operator categories, AscendCraft achieves 98.1% compilation success and 90.4% functional correctness. Moreover, 46.2% of generated kernels match or exceed PyTorch eager execution performance, demonstrating that DSL-guided transcompilation can enable LLMs to generate both correct and competitive NPU kernels. Beyond benchmarks, AscendCraft further demonstrates its generality by successfully generating two correct kernels for the newly proposed mHC architecture, achieving performance that substantially surpasses PyTorch eager execution.


💡 Research Summary

AscendCraft tackles the long‑standing difficulty of automatically generating high‑performance kernels for Huawei’s Ascend neural processing unit (NPU). While large language models (LLMs) have shown impressive results for GPU kernel synthesis, direct generation of AscendC code suffers from extremely low correctness (below 5%) due to the intricate, hardware‑specific programming model, scarce public examples, and strict constraints such as memory alignment, buffer placement, and pipeline synchronization.

The authors propose a two‑stage pipeline that bridges high‑level algorithmic intent and low‑level AscendC implementation through a lightweight domain‑specific language (DSL). The DSL is deliberately concise, abstracting away verbose details (e.g., DataCopyPad arguments) while exposing essential concepts: tiling strategy, data flow, on‑chip buffer allocation, and the three‑stage pipeline (CopyIn‑Compute‑CopyOut). It adopts a Triton‑like syntax to keep programs short and structurally regular, which makes them easier for LLMs to generate without syntax errors. Category‑specific expert DSL examples are supplied for each of the seven operator families (vector, scalar, cube, etc.), allowing the model to learn reusable optimization patterns within a category and generalize to unseen configurations.
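The paper does not reproduce the DSL's concrete syntax, but the core concepts it exposes — tiling and the explicit CopyIn‑Compute‑CopyOut pipeline — can be sketched in plain Python. The kernel name, tile size, and buffer handling below are illustrative assumptions, not the actual DSL:

```python
import numpy as np

TILE = 128  # hypothetical tile size; the real DSL's tiling parameters are not given in the summary

def vector_add_kernel(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Illustrative DSL-style vector add: tile the inputs, then run the
    explicit CopyIn -> Compute -> CopyOut stages per tile."""
    out = np.empty_like(x)
    n = x.shape[0]
    for start in range(0, n, TILE):
        end = min(start + TILE, n)
        # CopyIn: stage a tile from "global memory" into "on-chip" buffers
        x_buf = x[start:end].copy()
        y_buf = y[start:end].copy()
        # Compute: elementwise work on the staged buffers only
        z_buf = x_buf + y_buf
        # CopyOut: write the result tile back to "global memory"
        out[start:end] = z_buf
    return out
```

The point of the structural regularity is that every operator in a category follows this same three-stage skeleton, which is what lets an LLM reuse an expert example for unseen configurations.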

In the second stage, the DSL program is transcompiled into valid AscendC code via a sequence of structured LLM‑driven lowering passes. Each pass handles a distinct DSL facet—buffer declaration, memory copy, compute mapping, queue management—while enforcing hard constraints (e.g., alignment, buffer ordering, explicit copy‑in/compute/copy‑out mapping to AI‑Core functions). By decomposing the translation into well‑defined sub‑tasks, the system reduces hallucination and ensures that the generated AscendC respects the hardware’s execution semantics.
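The decomposition into constraint-checked sub-tasks can be sketched as a chain of lowering passes over a small intermediate representation. The pass names, IR layout, and 32-byte alignment figure below are illustrative assumptions; the real passes are LLM-driven rather than hand-written functions:

```python
ALIGN = 32  # assumed on-chip buffer alignment in bytes (illustrative)

def declare_buffers(ir: dict) -> dict:
    # Buffer-declaration pass: emit declarations and enforce alignment as a hard constraint.
    for buf in ir["buffers"]:
        if buf["bytes"] % ALIGN != 0:
            raise ValueError(f"buffer {buf['name']} violates {ALIGN}-byte alignment")
    ir["decls"] = [f"TBuf {b['name']}[{b['bytes']}]" for b in ir["buffers"]]
    return ir

def lower_copies(ir: dict) -> dict:
    # Memory-copy pass: map DSL data flow onto explicit copy-in/copy-out operations.
    ir["copy_in"] = [f"DataCopy({b['name']})" for b in ir["buffers"] if b["role"] == "in"]
    ir["copy_out"] = [f"DataCopy({b['name']})" for b in ir["buffers"] if b["role"] == "out"]
    return ir

def map_compute(ir: dict) -> dict:
    # Compute-mapping pass: bind the DSL compute expression to the Compute stage.
    ir["compute"] = [ir["op"]]
    return ir

PASSES = [declare_buffers, lower_copies, map_compute]

def lower(ir: dict) -> dict:
    # Run each well-defined sub-task in sequence; a violated constraint fails fast
    # instead of silently producing invalid code.
    for p in PASSES:
        ir = p(ir)
    return ir
```

Because each pass sees only one facet of the program and rejects constraint violations immediately, an error surfaces at the offending pass rather than as a miscompiled kernel.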

Evaluation on the MultiKernelBench suite (seven operator categories) shows dramatic improvements: 98.1% compilation success, 90.4% functional correctness, and competitive performance—46.2% of generated kernels match or exceed PyTorch eager execution, 57.7% reach at least 80% of the baseline, and 82.7% achieve at least 20%. This contrasts sharply with prior direct LLM generation, which rarely produced compilable code.

Beyond benchmarks, AscendCraft is applied to two newly proposed kernels from the mHC architecture. The system automatically produces correct implementations that achieve 6.6× and 3.0× speed‑ups over PyTorch eager execution. Subsequent human‑in‑the‑loop optimization, assisted by LLMs, pushes the final performance to 15.9× and 7.2×, demonstrating that the DSL also serves as a solid foundation for iterative tuning.

Key insights include: (1) a high‑level, hardware‑aware DSL can steer LLMs toward generating semantically meaningful code; (2) structured, constraint‑driven lowering passes dramatically improve both compile‑time success and runtime correctness; (3) providing category‑level expert examples enables LLMs to capture and reuse optimization heuristics across similar operators. The authors argue that this paradigm—DSL‑guided transcompilation—can be generalized to other specialized accelerators (e.g., TPU, Habana) where low‑level programming is similarly opaque.

In summary, AscendCraft demonstrates that coupling a carefully crafted DSL with staged LLM generation yields reliable, high‑performance Ascend NPU kernels, closing the gap between GPU and NPU kernel synthesis and opening avenues for broader automated accelerator programming.

