Token Pruning for In-Context Generation in Diffusion Transformers
In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.
💡 Research Summary
The paper tackles the computational bottleneck that arises when Diffusion Transformers (DiTs) are used for in‑context generation, a setting where reference images (or videos) are concatenated with noisy latent tokens to guide image‑to‑image or image‑to‑video synthesis. This concatenation dramatically inflates the sequence length, causing the quadratic self‑attention cost to explode. Existing token‑reduction methods, originally designed for text‑to‑image diffusion or discriminative vision transformers, treat reference and target tokens uniformly and therefore either keep too many redundant tokens or prune essential semantic anchors, degrading generation quality.
To address this, the authors first conduct a systematic analysis of token redundancy along three axes: spatial, temporal, and functional. Spatially, they discover that only a small subset of attention layers (the “pivotal layers”) actually depend heavily on the reference context, while the majority of layers contribute negligible context information. Temporally, the importance of reference tokens is high in early, high‑noise denoising steps and decays as the diffusion process proceeds, indicating that the model gradually shifts from context‑driven guidance to self‑refinement. Functionally, the reliance on reference tokens varies with task difficulty: fine‑grained edits (e.g., resizing, detail preservation) retain higher attention to references, whereas high‑freedom tasks (e.g., style transfer) quickly diminish that reliance.
Based on these observations, the paper proposes ToPi (Training‑free, Offline‑calibration, Periodic‑scoring, Incremental‑pruning), a token‑pruning framework that is both training‑free and dynamic. The pipeline consists of three mechanisms:
- Offline Calibration & Representative Layer Selection – Using a small calibration dataset, the method computes a Context Sensitivity Score for each attention layer by averaging the attention mass from noisy tokens to reference tokens. The top‑M layers with the highest scores are designated as representative layers, serving as a proxy for the whole network's context processing while keeping the overhead low.
- Attention‑Weighted Importance Scoring – During inference, only the representative layers are consulted. For each reference token j, an influence metric I_j(t) is calculated as the sum of attention weights from all noisy tokens across heads and selected layers at timestep t. This metric captures the token's contribution to the current denoising state. The scores are normalized and a binary mask M(t) is generated, keeping the most influential tokens and discarding the rest. The mask is refreshed every ΔT steps, allowing the system to adapt to the evolving diffusion trajectory.
- Step‑wise Pruning & Context Realignment – At each denoising step, the current mask is applied to the reference token set, producing a reduced context C′. The concatenated input becomes Z_t′ = [C′; X_t], where X_t denotes the noisy target latents at step t, so subsequent attention operates only on the pruned sequence.
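The three mechanisms above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names, the `keep_ratio` parameter, and the attention-tensor layout (reference tokens occupying the first `n_ref` key positions) are assumptions for the sake of a self-contained example.

```python
import numpy as np

def context_sensitivity(attn, n_ref):
    """Context Sensitivity Score for one layer: average attention mass
    flowing from noisy (target) queries to reference keys.
    attn: (heads, seq_len, seq_len), softmax-normalized over the last axis;
    the first n_ref key positions are the reference tokens."""
    noisy_to_ref = attn[:, n_ref:, :n_ref]          # (heads, n_noisy, n_ref)
    return noisy_to_ref.sum(axis=-1).mean()         # scalar in (0, 1)

def influence_scores(rep_layer_attns, n_ref):
    """Influence metric I_j(t): for each reference token j, sum the attention
    it receives from all noisy queries, accumulated over heads and over the
    representative layers only, then normalize."""
    scores = np.zeros(n_ref)
    for attn in rep_layer_attns:
        scores += attn[:, n_ref:, :n_ref].sum(axis=(0, 1))
    return scores / scores.sum()

def prune_mask(scores, keep_ratio):
    """Binary mask M(t): keep the top keep_ratio fraction of reference
    tokens by influence, discard the rest."""
    k = max(1, int(round(keep_ratio * len(scores))))
    keep = np.argsort(scores)[-k:]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask
```

At inference time, the mask would be recomputed every ΔT denoising steps and applied to the reference token set before concatenation with the noisy latents, so the quadratic attention cost shrinks with the pruned context length.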