GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2


This paper presents GRAPHMEND, a high-level compiler technique that eliminates FX graph breaks in PyTorch 2 programs. Although PyTorch 2 introduced TorchDynamo and TorchInductor to enable just-in-time graph compilation, unresolved dynamic control flow and unsupported Python constructs often fragment models into multiple FX graphs. These fragments force frequent fallbacks to eager mode, introduce costly CPU-to-GPU synchronizations, and reduce optimization opportunities. GRAPHMEND addresses this limitation by analyzing and transforming source code before execution. Built on the Jaseci compilation framework, GRAPHMEND introduces two code transformations that remove graph breaks due to dynamic control flow and Python side effects. This design allows PyTorch’s compilation pipeline to capture larger, uninterrupted FX graphs without requiring manual refactoring by developers. Evaluation across eight Hugging Face models shows that GRAPHMEND removes graph breaks due to dynamic control flow and Python side effects, reducing the break count to 0 in 6 models and reducing it from 5 to 2 in another model. On NVIDIA RTX 3090 and A40 GPUs, GRAPHMEND achieves up to 75% latency reductions and up to 8% higher end-to-end throughput. These results demonstrate that high-level code transformation is an effective complement to PyTorch’s dynamic JIT compilation pipeline, substantially improving both usability and performance.


💡 Research Summary

GraphMend tackles a fundamental limitation of PyTorch 2’s just‑in‑time compilation pipeline: the fragmentation of models into multiple FX graphs caused by dynamic control flow and Python side effects. While TorchDynamo and TorchInductor can trace many tensor operations, they fall back to eager execution whenever they encounter constructs that cannot be symbolically evaluated at the bytecode level, such as data‑dependent conditionals (if x.sum() > 10:) or I/O calls like print. Each such construct introduces a “graph break”: tracing stops, execution falls back to eager mode, a CPU‑GPU synchronization is forced, latency increases, and cross‑graph optimizations such as kernel fusion become impossible.
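A minimal illustration of both break triggers in one function (a hypothetical example, not taken from the paper’s benchmarks):

```python
import torch

def forward(x):
    # Data-dependent branch: TorchDynamo cannot resolve the tensor
    # comparison at trace time, so the FX graph breaks here.
    if x.sum() > 10:
        y = x * 2
    else:
        y = x - 1
    # Python side effect: I/O cannot be captured in an FX graph,
    # causing a second break.
    print("large input:", bool(x.sum() > 10))
    return y

out = forward(torch.arange(6.0))  # sum = 15, so the x * 2 branch runs
```

Under torch.compile, each of these two lines would split the traced region, with an eager-mode fallback in between.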

The authors propose GraphMend, a high‑level compiler augmentation built on the Jaseci framework. Jaseci parses Python (extended as the Jac language) into an abstract syntax tree (AST) and a control‑flow graph (CFG), merging them into a unified intermediate representation (UniIR). On top of this representation, GraphMend adds two transformation passes:
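The paper does not show Jaseci’s UniIR, but the general idea of locating candidate data-dependent branches at the source level can be sketched with only the standard library. The BranchFinder class and its call-in-condition heuristic below are illustrative assumptions, not GraphMend’s actual analysis:

```python
import ast

SRC = """
def forward(x):
    if x.sum() > 10:
        return x * 2
    return x - 1
"""

class BranchFinder(ast.NodeVisitor):
    """Flag `if` statements whose test contains a call expression,
    a crude proxy for a data-dependent, tensor-valued condition."""
    def __init__(self):
        self.dynamic_ifs = []

    def visit_If(self, node):
        for sub in ast.walk(node.test):
            if isinstance(sub, ast.Call):
                self.dynamic_ifs.append(node.lineno)
                break
        self.generic_visit(node)

finder = BranchFinder()
finder.visit(ast.parse(SRC))
print(finder.dynamic_ifs)  # → [3], the line of the data-dependent `if`
```

A real pass would combine this kind of AST information with the CFG to decide whether a branch can be safely rewritten as a tensor-level select.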

  1. Predicated Dynamic Control Flow – Data‑dependent branches are rewritten using tensor‑level primitives (torch.where, torch.cond). The conditional logic is thus expressed as a pure tensor operation that can be captured inside a single FX graph. For example, an if‑else that depends on a runtime tensor value is transformed into a torch.where call that selects between the two branches without leaving the tracing context.

  2. Graph Epilogue Deferred Side Effects – Operations that produce side‑effects (printing, logging, file writes) are deferred until after the FX graph has executed. The transformation replaces the side‑effecting call with a buffer write during tracing and inserts a flush at the end of the function. This preserves program semantics while keeping the traced region side‑effect‑free.

These passes are applied before TorchDynamo’s bytecode‑level tracing, ensuring that the high‑level semantic information lost during bytecode generation is still available for optimization. The authors verify semantic equivalence through extensive testing, confirming that outputs and side‑effects remain unchanged after transformation.

Evaluation is performed on eight Hugging Face models that naturally trigger graph breaks (e.g., Phi‑4‑mini, Llama‑2, BLOOM). GraphMend eliminates all fixable breaks, reducing the break count to zero in six models and from five to two in another; one model retains a single break due to a more complex pattern not yet supported. Performance measurements on NVIDIA RTX 3090 and A40 GPUs show substantial gains: cold‑start forward‑pass latency drops by 30%–75%, steady‑state latency improves by 2.5%–25%, and end‑to‑end throughput increases by 5%–8%. Profiling traces illustrate that eliminating graph breaks removes intermediate device‑to‑host transfers and synchronization stalls, allowing a single continuous CUDA launch and better GPU utilization.

The paper’s contributions are threefold: (1) a realistic benchmark suite of graph‑break‑prone models, (2) the design and implementation of the GraphMend compilation technique that automatically restructures high‑level code, and (3) a comprehensive empirical evaluation demonstrating both correctness and performance benefits. The authors also discuss limitations—such as handling nested loops with side‑effects or dynamic code generation—and outline future work to extend the transformation framework.

In summary, GraphMend shows that high‑level source‑code transformations, when integrated with existing JIT pipelines, can dramatically reduce the overhead of graph breaks in PyTorch 2, delivering faster inference without requiring developers to manually refactor their models. This approach bridges the gap between PyTorch’s dynamic flexibility and the performance of statically compiled GPU kernels.

