Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Agentic large language model (LLM) training often involves multi-turn interaction trajectories that branch into multiple execution paths due to concurrent tool use, think-mode, sub-agents, context management, and other runtime designs. As a result, the tokens produced by a single task naturally form a tree-structured token trajectory with shared prefixes, rather than a linear sequence. Existing training pipelines linearize such trajectories and treat each branch independently, leading to substantial redundant computation in both the forward and backward passes. To eliminate this redundancy, we introduce Tree Training, an efficient training framework for tree-structured trajectories. Its core component, Gradient Restoration, enables correct gradient aggregation across shared prefixes, allowing each prefix to be computed exactly once while remaining mathematically equivalent to independent training on all branches. To support large trajectory trees in practice, we redesign the training engine to natively ingest tree-structured data and propose Tree Packing, a memory-efficient partitioning strategy that preserves high prefix reuse. Experiments on dense and MoE models with real-world agentic trajectories show a 6.2x training speedup for both supervised fine-tuning and the model-update phase of reinforcement learning.


💡 Research Summary

The paper addresses a fundamental inefficiency in training agentic large language models (LLMs) that interact with environments over multiple turns. In such settings, a single task often generates a tree‑structured token trajectory: the initial context (prefix) is shared across many divergent execution paths caused by tool calls, sub‑agents, “think‑mode” token discarding, and other runtime mechanisms. Existing training pipelines flatten these trees into independent linear sequences, which forces the model to recompute the same prefix for each branch during both the forward and backward passes, leading to substantial wasted computation and memory.
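The redundancy from flattening can be made concrete with a toy trajectory tree. The sketch below uses hypothetical token counts (all numbers are illustrative, not from the paper) to compare the tokens processed when every root-to-leaf path is trained as an independent sequence versus when each tree node is computed once:

```python
# Hypothetical toy trajectory tree: each node is (token_count, children).
tree = (40, [            # shared prompt/context prefix
    (25, []),            # branch 1: e.g. a tool-call path
    (25, [               # branch 2 continues, then splits again
        (15, []),
        (15, []),
    ]),
])

def flattened_tokens(node, prefix=0):
    """Tokens processed when every root-to-leaf path is trained as an
    independent linear sequence (prefixes recomputed per branch)."""
    length, children = node
    total = prefix + length
    if not children:
        return total
    return sum(flattened_tokens(c, total) for c in children)

def unique_tokens(node):
    """Tokens processed when each tree node is computed exactly once."""
    length, children = node
    return length + sum(unique_tokens(c) for c in children)

print(flattened_tokens(tree))  # 225 = 65 + 80 + 80
print(unique_tokens(tree))     # 120
```

Even in this tiny tree, flattening nearly doubles the work; real agentic trajectories with many branches and long shared contexts amplify the gap.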

To eliminate this redundancy, the authors propose Tree Training, a framework that processes the tree directly. Its cornerstone is Gradient Restoration, a mathematically exact method that aggregates gradient contributions from all descendant branches so that each shared prefix is evaluated only once. The key insight is that while forward propagation benefits from causal masking (allowing prefix caching), backward propagation is anti‑causal: gradients for a prefix depend on all suffix tokens. Gradient Restoration adds a per‑token compensation term that sums the gradients from each branch, reproducing exactly the same parameter updates as if each branch had been processed independently. The authors derive the required conditions for linear layers and extend the proof to the full Transformer architecture (attention and feed‑forward layers), showing that the overhead is negligible.
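The equivalence at the heart of Gradient Restoration can be illustrated for a single linear layer. In this minimal NumPy sketch (a toy stand-in, not the paper's implementation), two branches read the same prefix activation, and summing their upstream gradients before the prefix's backward pass reproduces the gradient obtained by processing each branch independently:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)          # input feeding the shared prefix layer
W = rng.normal(size=(d, d))     # shared linear layer: h = W @ x
h = W @ x

# Two branches read the shared prefix activation; each branch's loss is a
# linear readout L_i = a_i . h (a toy stand-in for the suffix computation).
a1 = rng.normal(size=d)
a2 = rng.normal(size=d)

# Baseline: process each branch independently, recomputing the prefix.
# dL_i/dW = outer(a_i, x), so the total gradient is the sum over branches.
grad_independent = np.outer(a1, x) + np.outer(a2, x)

# Gradient Restoration (sketch): compute the prefix once and aggregate the
# upstream gradients from all branches before the prefix's backward pass.
upstream = a1 + a2              # restored gradient w.r.t. h
grad_restored = np.outer(upstream, x)

print(np.allclose(grad_independent, grad_restored))  # True
```

Because backpropagation is linear in the upstream gradient, the same argument extends layer by layer, which is the structure of the paper's proof for full Transformer blocks.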

Processing large trees still poses a memory challenge because a whole trajectory may exceed GPU capacity. The paper therefore introduces Tree Packing, a heuristic depth‑first‑search (DFS) partitioning algorithm that splits a large tree into sub‑trees that fit in memory while preserving as much prefix sharing as possible. The algorithm merges as many branches as memory permits, then recursively creates additional packs. Empirically, Tree Packing reduces the total token count that must be stored from 164 k (naïve flattening) to 102 k for an 83 k‑token tree under a 60 k‑token memory limit, a 38 % reduction.
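A greedy DFS packing pass might be sketched as follows. This is a simplified illustration of the idea under assumed semantics (the paper's actual algorithm and cost model may differ): leaves are visited in DFS order, and each root-to-leaf path joins the current pack if the pack's unique-token cost stays within the budget, otherwise a new pack is opened.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tokens: int                               # token count for this segment
    children: list = field(default_factory=list)

def pack_tree(root, budget):
    """Greedy DFS packing (simplified sketch of Tree Packing): group
    root-to-leaf paths into packs whose unique-token cost fits the budget,
    so shared prefixes inside a pack are counted only once."""
    packs, current, current_cost = [], [], 0

    def path_cost(path, seen):
        # Cost of adding this path: only nodes not already in the pack.
        return sum(n.tokens for n in path if id(n) not in seen)

    def dfs(node, path):
        nonlocal current, current_cost
        path = path + [node]
        if not node.children:                 # leaf = one complete branch
            seen = {id(n) for br in current for n in br}
            extra = path_cost(path, seen)
            if current and current_cost + extra > budget:
                packs.append(current)         # close the full pack
                current, current_cost = [], 0
                extra = path_cost(path, set())  # no sharing in a fresh pack
            current.append(path)
            current_cost += extra
            return
        for child in node.children:
            dfs(child, path)

    dfs(root, [])
    if current:
        packs.append(current)
    return packs

# Hypothetical example: a 10-token root, two 20-token tool-call segments,
# each followed by two 30-token leaf branches.
root = Node(10, [Node(20, [Node(30), Node(30)]),
                 Node(20, [Node(30), Node(30)])])
packs = pack_tree(root, budget=100)
print(len(packs))   # 2 packs of 90 unique tokens each vs. 240 if flattened
```

Visiting leaves in DFS order keeps siblings together, which is what preserves prefix reuse inside each pack.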

The authors evaluate the approach on both dense and Mixture‑of‑Experts (MoE) LLMs across realistic agentic datasets that involve tool usage, sub‑agent coordination, and think‑mode token resets. They test two training regimes:

  1. Supervised Fine‑Tuning (SFT) – where the model learns from human‑annotated trajectories.
  2. Reinforcement Learning (RL) – where the model is updated using policy‑gradient methods on generated rollouts.

In both regimes, Tree Training achieves an average 6.2× speedup in end‑to‑end training time without any measurable degradation in downstream performance (e.g., task success rate, answer correctness). Gradient Restoration guarantees that the learned model is identical to that obtained with baseline linearized training, which the authors confirm by reporting identical evaluation metrics and loss curves.

Key contributions highlighted by the paper are:

  1. Recognition of Tree‑Structured Prefix Reuse – a pervasive but previously overlooked property of agentic LLM training.
  2. Gradient Restoration Algorithm – a low‑overhead, provably correct method for aggregating gradients across shared prefixes.
  3. Tree Packing Strategy – a practical solution for fitting large trajectory trees into limited GPU memory while maintaining high reuse.
  4. Comprehensive Empirical Validation – demonstrating consistent speedups across model families (dense, MoE) and training paradigms (SFT, RL) with no loss in model fidelity.

Overall, the work provides a solid theoretical foundation and an engineering‑ready system for training agentic LLMs more efficiently. By eliminating redundant computation in both forward and backward passes, it paves the way for scaling up agentic training to larger models and longer, more complex interaction histories, which are essential for next‑generation autonomous AI systems.

