REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models continues to be a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads due to the exponentially large and highly interdependent space of possible transformations. Although existing stochastic search techniques can be effective, they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We investigate whether reasoning with large language models (LLMs), without any retraining, can leverage the context-aware decision space of compiler optimizations to significantly improve sample efficiency. To that end, we introduce a novel compilation framework (dubbed REASONING COMPILER) that formulates optimization as a sequential, context-aware decision process guided by a large language model and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-informed transformations that reflect the current program state and accumulated performance feedback. MCTS incorporates the LLM-generated proposals to balance exploration and exploitation, facilitating a structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.


💡 Research Summary

The paper tackles the pressing problem of high inference cost in serving large neural models by introducing a novel compilation framework called REASONING COMPILER. The authors observe that traditional compilers and rule‑based heuristics struggle with neural workloads because the space of valid program transformations (tiling, fusion, layout changes, etc.) grows exponentially and exhibits strong inter‑dependencies. To address this, they cast the optimization task as a finite‑horizon Markov Decision Process (MDP) where each state represents the current program after a sequence of transformations, each action selects a new transformation, and the reward is a hardware‑specific performance metric such as latency or energy.
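The MDP framing above can be sketched in a few lines. This is an illustrative reconstruction from the summary, not the paper's actual API: the names `Transformation`, `State`, and the `measure_latency` callback are assumptions, and a real implementation would carry the full program IR rather than just a transformation history.

```python
from dataclasses import dataclass

# Hypothetical sketch of the finite-horizon MDP described above.
# A state is the program after a sequence of transformations; an
# action appends one more transformation; reward is a hardware
# metric (here, negated latency so that lower latency is better).

@dataclass(frozen=True)
class Transformation:
    kind: str     # e.g. "tile", "fuse", "reorder"
    params: tuple # e.g. ("i", 32) for tiling loop i by factor 32

@dataclass(frozen=True)
class State:
    history: tuple = ()  # transformations applied so far

    def apply(self, action: Transformation) -> "State":
        return State(self.history + (action,))

def reward(state: State, measure_latency) -> float:
    # measure_latency is an assumed hardware-specific callback.
    return -measure_latency(state)
```

Because states are immutable, each node in the search tree can share its prefix with its parent, which keeps the tree cheap to expand.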

The core innovation is to use a pre‑trained large language model (LLM) as a context‑aware proposal engine without any fine‑tuning. At each decision step the LLM receives a prompt containing the current schedule, hardware description, and recent performance feedback, and it generates a short list of plausible next transformations that respect the hardware constraints and the transformation history. Because the LLM has been trained on massive code and system documentation, it can reason about non‑local interactions (e.g., “apply tiling factor 32 now that loops have been fused”) that are difficult for static heuristics.
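A prompt of the kind described might be assembled as follows. The exact wording is an assumption on our part; the summary only states that the prompt contains the current schedule, a hardware description, and recent performance feedback.

```python
def build_prompt(schedule: str, hardware: str, feedback: list) -> str:
    """Assemble the context the LLM sees at each decision step.

    Illustrative only: the paper's real prompt template is not
    given in this summary.
    """
    lines = [
        "You are optimizing a tensor program.",
        f"Hardware: {hardware}",
        "Current schedule:",
        schedule,
        "Recent measurements:",
        *[f"- {f}" for f in feedback],
        "Propose up to 3 next transformations, one per line.",
    ]
    return "\n".join(lines)
```

Feeding accumulated measurements back into the prompt is what lets the model condition each proposal on what has already been tried.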

These LLM‑generated proposals are fed into a structured Monte Carlo Tree Search (MCTS). MCTS expands the most promising nodes according to an Upper Confidence Bound (UCT) policy, runs fast cost‑model rollouts to estimate the value of a candidate sequence, and back‑propagates the results to update node statistics. This combination gives the system the best of both worlds: the LLM provides high‑quality, context‑sensitive suggestions (a chain‑of‑thought style reasoning), while MCTS supplies a principled exploration‑exploitation mechanism that systematically evaluates and refines those suggestions. The authors prove that, under standard assumptions, MCTS converges to the optimal sequence as the number of simulations grows, and they demonstrate empirically that even with a modest simulation budget the system finds near‑optimal solutions.
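The select-expand-rollout-backpropagate loop can be sketched minimally as below. This is a generic UCT skeleton under stated assumptions: `propose` stands in for the LLM proposal engine and `rollout` for the fast cost-model estimate; the paper's actual tree policy and integration are richer than this.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        # Unvisited nodes are explored first.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def search(root, propose, rollout, budget=100):
    for _ in range(budget):
        node = root
        # Selection: descend by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=Node.uct)
        # Expansion: proposals (from the LLM, here a callback) become children.
        for action in propose(node.state):
            node.children.append(Node(node.state + [action], node))
        leaf = random.choice(node.children) if node.children else node
        # Simulation: a fast cost-model rollout estimates the value.
        value = rollout(leaf.state)
        # Backpropagation: update statistics along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += value
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).state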

Experiments cover five representative neural layers (Llama‑3‑8B attention, DeepSeek‑R1 MoE, FLUX attention, FLUX convolution, Llama‑4‑Scout MLP) across five hardware platforms (Amazon Graviton2, AMD EPYC 7R13, Apple M2 Pro, Intel Core i9, Intel Xeon E3). The baseline is TVM, a state‑of‑the‑art neural compiler that relies on evolutionary search. Results show that REASONING COMPILER achieves an average 5.0× speedup while using 5.8× fewer samples. On an end‑to‑end Llama‑3‑8B benchmark, it delivers a 4.0× speedup with only 3.9× the number of samples, corresponding to a 5.6× improvement in sample efficiency. These gains are consistent across all layers and hardware targets, confirming that the LLM‑guided approach generalizes well.

The paper also discusses limitations. Prompt engineering remains manual and may affect performance on unseen domains; the cost model used in rollouts is approximate and can mislead the search if inaccurate; and translating LLM text outputs into concrete compiler actions introduces parsing overhead. Future work is outlined: integrating real‑time profiling to tighten the cost model, employing ensembles of LLMs to increase proposal diversity, and automating prompt optimization to reduce human effort.
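The parsing step mentioned above might look like the sketch below. The proposal grammar is entirely an assumption here (the summary does not specify one); the point is only that free-text LLM output must be matched against a known action vocabulary, with unparseable lines discarded.

```python
import re

# Hypothetical patterns for two transformation kinds; a real system
# would cover the compiler's full action vocabulary.
PATTERNS = {
    "tile": re.compile(r"tile\s+(\w+)\s+(\d+)", re.I),
    "fuse": re.compile(r"fuse\s+(\w+)\s+(\w+)", re.I),
}

def parse_action(line: str):
    """Map one line of LLM output to a structured action, or None."""
    for kind, pat in PATTERNS.items():
        m = pat.search(line)
        if m:
            return (kind, *m.groups())
    return None  # unparseable proposals are dropped from the search
```

Discarding unmatched lines keeps the search sound at the cost of wasting some LLM output, which is one source of the parsing overhead the authors note.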

In summary, REASONING COMPILER demonstrates that large language models, even without fine‑tuning, can serve as powerful, context‑aware reasoning agents for compiler optimization. By coupling LLM proposals with Monte Carlo Tree Search, the framework dramatically improves sample efficiency and achieves superior performance over existing neural compilers, opening a promising new direction for cost‑effective model serving.

