점진적 깊이 성장으로 트랜스포머 효율과 추론 성능 향상

Reading time: 6 minutes

📝 Abstract

Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS [40]. Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, also known as the Curse of Depth [9, 44]. Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.


📄 Content

Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

Ferdinand Kapl1,2∗‡, Emmanouil Angelis1,2∗‡, Tobias Höppe1,2∗‡, Kaitlin Maile3†, Johannes von Oswald3†, Nino Scherrer3†, Stefan Bauer1,2†

1 Technical University of Munich  2 Helmholtz AI, Munich  3 Google, Paradigms of Intelligence Team

Abstract

Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS [40]. Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, also known as the Curse of Depth [9, 44]. Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.

1 Introduction

The remarkable success of large language models (LLMs) has been accompanied by immense computational and energy demands. This trend of training larger and larger networks is correlated with the increasing depth of model architectures [22, 25]. As Transformers [47] lack recurrence, their computational capacity is directly linked to their depth. Greater depth enables more complex computations and improves capabilities like reasoning, compositional generalization, and goal reaching [28, 38, 49].
However, this pursuit of greater scale uncovers a critical inefficiency, as training such models is extremely resource-intensive [46]. A core issue of the current paradigm is the observation that not all layers contribute equally to the final model's performance [18, 29, 31, 54]. Csordás et al. [9] and Sun et al. [44] demonstrate that deeper layers of modern pre-layernorm Transformers tend to be less effective than their earlier counterparts, with many layers in the second half of the model contributing minimally to the final output, also known as the Curse of Depth [44]. This observation, which highlights a kind of over-parametrization, is supported by findings that various architectures are remarkably robust to perturbations like skipping layers without significant performance loss [28, 54]. The Curse of Depth represents a major resource inefficiency in today's paradigm. As highlighted by Csordás et al. [9], addressing these limitations is a pressing need for the community to avoid wasting valuable resources and to develop more efficient architectures that can leverage deep layers effectively.

A promising solution lies in gradually grown architectures, which dynamically expand a model's depth or width during training. These novel training strategies, such as gradual stacking [17, 39], enable efficient training by using layers from a smaller model to initialize the next stage. Of particular interest is the MIDAS method [40], which gradually increases depth by inserting new layers into the middle of the model. MIDAS has been shown not only to speed up training but also to improve performance on reasoning-heavy benchmarks, suggesting that this growth procedure introduces a favourable inductive bias. However, a clear mechanistic understanding

Figure 1: Depth-grown models use their depth more (1.7B). (A) Depth score [9] on MATH [20] and MQuAKE [56]; grown models (MIDAS, LIDAS) have consistently higher depth scores. (B) Top-5 overlap between each layer's early-exit vocabulary and the model's final vocabulary on 20 prompts from GSM8K [7]; both grown models studied in this work exhibit lower overlap at later layers, indicating that these later layers still contribute additional features necessary for the final prediction. (C) Early-exit relative accuracy versus layer on the Variable Assignment Math reasoning primitive; the baseline reaches near its final performance early, whereas accuracy for MIDAS and LIDAS continues to rise up to the last layer. By these metrics, however, LN-Scaling shows no discernible benefit over the baseline in depth utilization.

∗Equal contribution. †Provided equal in-depth feedback and guidance. ‡Correspondence: {ferdinand.kapl,emmanouil.angelis,tobias.hoeppe}@tum.de

arXiv:2512.08819v1 [cs.CL] 9 Dec 2025
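As a rough illustration of gradual middle stacking, the sketch below grows a toy layer stack by copying the block just below the midpoint into the middle of the model. The list-of-layers representation, the 4 → 6 → 8 growth schedule, and the copy-based initialization are assumptions chosen for illustration, not the authors' MIDAS implementation.

```python
import copy

def grow_middle(layers, n_new):
    """Return a deeper stack with n_new layers inserted at the middle,
    each initialized as a copy of a layer just below the midpoint, so the
    grown model starts close to the smaller trained one."""
    mid = len(layers) // 2
    source = layers[mid - n_new:mid]          # block to duplicate
    return layers[:mid] + [copy.deepcopy(l) for l in source] + layers[mid:]

# Toy growth schedule: train a stage, then grow 4 -> 6 -> 8 layers.
stack = ["L0", "L1", "L2", "L3"]
stack = grow_middle(stack, 2)   # 6 layers
stack = grow_middle(stack, 2)   # 8 layers
print(len(stack))
```

In an actual training setup each stage would train the current stack for some number of steps before the next growth event, so the inserted copies start from already-useful weights rather than from random initialization.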

This content is AI-processed based on ArXiv data.
