SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Recursive (looped) Transformers decouple computational depth from parameter depth by repeatedly applying shared layers, providing an explicit architectural primitive for iterative refinement and latent reasoning. However, early looped Transformers often underperform non-recursive baselines of equal compute. While recent literature has introduced more effective recursion mechanisms to mitigate this gap, existing architectures still operate at a fixed, full-token resolution, neglecting the potential efficiency of computing over compressed latent representations. In this paper, we propose SpiralFormer, a looped Transformer that executes recurrence under a multi-resolution recursion schedule. We provide probing evidence that multi-resolution recursion enables the model to learn hierarchical dependencies by inducing iteration-wise functional specialization across different scales. Empirically, SpiralFormer achieves better parameter and compute efficiency than both looped and non-looped baselines across model scales from 160M to 1.4B, establishing sequence resolution as a potential axis for scaling recursive architectures.


💡 Research Summary

SpiralFormer addresses a key inefficiency in existing looped (recursive) Transformers: while they reuse a shared set of layers across many iterations, they still process the full‑length token sequence at every step, leading to unnecessary compute, especially when many reasoning steps could be performed on compressed representations. The authors introduce “multi‑resolution recursion,” a novel architectural axis that varies the sequence length used at each loop iteration. Early iterations operate on heavily down‑sampled “chunk‑level” latents, capturing global context cheaply; later iterations progressively increase the effective resolution, refining token‑level details. This schedule is defined by a resolution factor rₜ∈(0,1] for each iteration t, yielding an effective length Lₜ=⌊rₜ·L⌋.
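The schedule above is easy to make concrete. The sketch below maps a hypothetical coarse‑to‑fine schedule of resolution factors to effective sequence lengths Lₜ=⌊rₜ·L⌋; the specific schedule values are illustrative and not taken from the paper.

```python
import math

def effective_lengths(L, schedule):
    """Map per-iteration resolution factors r_t in (0, 1] to effective
    sequence lengths L_t = floor(r_t * L)."""
    return [math.floor(r * L) for r in schedule]

# Hypothetical coarse-to-fine schedule: 1/8 -> 1/4 -> 1/2 -> full resolution.
schedule = [0.125, 0.25, 0.5, 1.0]
print(effective_lengths(1024, schedule))  # [128, 256, 512, 1024]
```

Early iterations thus run the shared core on an 8× shorter sequence, which is where the compute savings come from.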

The core of the method consists of three operations per iteration: (1) a causal down‑sampling operator Sₜ↓ that aggregates token embeddings into Lₜ chunk vectors zₜ, either by simple mean pooling or by a learned scorer Aₜ that produces attention‑weighted sums within each chunk; (2) a shared Transformer core f_loop applied to the compressed sequence, producing updated chunk representations ẑₜ; (3) an up‑sampling operator Sₜ↑ that distributes each chunk output back to the original token positions using either uniform broadcasting or a learned router Bₜ that predicts allocation weights βₜ for each token inside a chunk. Because a chunk contains future tokens relative to earlier positions, strict autoregressive causality is enforced by a right‑shift of size sₜ (default sₜ=gₜ−1, where gₜ=⌊1/rₜ⌋ is the chunk size). This shift creates a one‑token overlap between the chunk that generates an update and the chunk that receives it, guaranteeing that no token’s update depends on information from tokens to its right. An additional offset ωₜ (default ⌊gₜ/2⌋) shifts chunk boundaries each iteration, providing a form of “sliding window” that lets each token be processed by multiple overlapping chunks across the loop.
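The causality argument can be checked numerically. The following is a minimal sketch of the simplest variant described above (mean‑pool down‑sampling, uniform broadcast up‑sampling, right‑shift sₜ=gₜ−1); it uses an identity map in place of the shared core f_loop, omits the boundary offset ωₜ, and assumes L is divisible by the chunk size, so it is an illustration of the shift mechanism rather than the full method.

```python
import numpy as np

def down_up_sketch(h, r):
    """Mean-pool tokens of h (shape [L, d]) into chunks of size g = floor(1/r),
    then broadcast the chunk vectors back to token positions with a causal
    right-shift of s = g - 1, so no token's update uses tokens to its right."""
    L, d = h.shape
    g = int(1 / r)                                  # chunk size g_t
    Lt = L // g                                     # number of chunks L_t
    z = h.reshape(Lt, g, d).mean(axis=1)            # S_t-down: mean pooling
    # The shared core f_loop would transform z here; identity stands in for it.
    up = np.repeat(z, g, axis=0)                    # S_t-up: uniform broadcast
    s = g - 1                                       # causal right-shift
    shifted = np.zeros_like(h)
    shifted[s:] = up[:L - s]
    return shifted
```

Perturbing a token at position j only changes outputs at positions ≥ j, which is the autoregressive guarantee the shift is meant to provide.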

SpiralFormer can be combined with two state‑update topologies. The Anchor topology treats the pre‑loop output h(0) as a fixed anchor and adds each iteration’s causal update to this anchor. The MeSH topology introduces a multi‑slot memory buffer M(t) and learned read/write routers, allowing richer cross‑iteration information flow. Both topologies benefit from the multi‑resolution schedule: early low‑resolution passes quickly propagate global information, while later high‑resolution passes refine local details without increasing parameter count.
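The Anchor topology admits a very small sketch. Assuming (as a simplification, with scalar states standing in for hidden vectors) that the updates accumulate additively on top of the fixed pre‑loop anchor h(0), the per‑iteration state is h(0) plus the running sum of updates; the MeSH topology's memory buffer and routers are not modeled here.

```python
def anchor_states(h0, updates):
    """Sketch of the Anchor topology: the pre-loop output h0 stays fixed,
    and each iteration's causal update is accumulated on top of it, so the
    state after t iterations is h0 + u_1 + ... + u_t."""
    states, acc = [], 0.0
    for u in updates:
        acc += u          # accumulate this iteration's update
        states.append(h0 + acc)
    return states
```

Keeping h(0) as a fixed anchor means each iteration contributes a residual correction rather than overwriting the state, which is what lets early low‑resolution passes and later high‑resolution passes compose.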

The authors evaluate SpiralFormer on the Pythia suite (models ranging from 160M to 1.4B parameters). They compare three configurations: (i) the non‑recursive baseline (standard Pythia), (ii) a full‑resolution looped Transformer that repeats the shared core at the original token length, and (iii) SpiralFormer with either the Anchor (SpiralFormer‑B) or MeSH (SpiralFormer‑L) topology. FLOPs are measured for a 4096‑token context. Across all scales, SpiralFormer reduces FLOPs by roughly 10–30% relative to the baseline while achieving lower validation perplexity (average reduction ≈0.4–0.5). In zero‑shot and five‑shot evaluations on downstream tasks such as WikiText, LAMBADA, and PIQA, SpiralFormer matches or exceeds the best scores, with the MeSH‑based large variant (SpiralFormer‑L) often attaining the top performance.

Probing analyses reveal that attention heads specialize according to resolution: early low‑resolution iterations show heads attending broadly across chunks, whereas later high‑resolution iterations exhibit more localized patterns. This functional specialization suggests that the model learns hierarchical dependencies in a “coarse‑to‑fine” manner, mirroring human reasoning that first forms a global sketch before refining details.

Limitations include the sensitivity of performance to the manually chosen schedule parameters (rₜ, sₜ, ωₜ) and the potential information loss inherent in chunk aggregation, which may become more pronounced for very long sequences (e.g., >32k tokens). The current implementation uses fixed schedules; future work could explore adaptive resolution schedules learned end‑to‑end, dynamic chunk sizes, or more expressive up‑sampling decoders (e.g., a small Transformer decoder) to mitigate compression loss.

In summary, SpiralFormer demonstrates that varying the sequence resolution across recursion steps is an effective and under‑explored axis for scaling recursive architectures. By reusing a single shared core across multiple scales, it achieves superior parameter and compute efficiency while learning hierarchical, scale‑dependent representations, opening new avenues for efficient latent reasoning in large language models.

