Geometric and Dynamic Scaling in Deep Transformers

Reading time: 1 minute

๐Ÿ“ Original Info

  • Title: Geometric and Dynamic Scaling in Deep Transformers
  • ArXiv ID: 2601.01014
  • Date: 2026-01-03
  • Authors: Haoran Su, Chenyu You

๐Ÿ“ Abstract

Scaling Transformer architectures to extreme depth often leads to rank collapse: representations become redundant and degenerate despite modern normalization schemes. We argue this is fundamentally a geometric problem. Standard residual connections implicitly assume monotonic feature accumulation is beneficial, but provide no mechanism to constrain update directions or erase outdated information. As depth increases, this causes uncontrolled drift from the semantic manifold and representational collapse. We propose the Manifold-Geometric Transformer (MGT), a unified framework addressing these failures through two orthogonal principles. First, manifold-constrained hyper-connections (mHC) restrict residual updates to valid tangent space directions, preventing manifold drift. Second, deep delta learning (DDL) enables data-dependent, non-monotonic updates that support feature erasure rather than unconditional accumulation. Together, mHC controls update direction while DDL controls magnitude and sign, yielding stable geometric evolution across depth. Our theoretical analysis predicts that coupling geometric constraints with dynamic erasure is essential for scaling beyond current depth limits. We design a rigorous evaluation protocol for ultra-deep networks (100+ layers) to test whether geometry, not depth itself, is the fundamental bottleneck in Transformer scalability.
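Since the full text is omitted below, the following is only a minimal sketch of how the two ideas described in the abstract could fit together in a residual block: an mHC-like step that projects each residual update onto a data-dependent set of "tangent" directions, and a DDL-like step that applies a signed, data-dependent gate so the block can erase as well as accumulate features. All names (GeometricResidualBlock, tangent_rank, basis_proj, gate_proj) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricResidualBlock(nn.Module):
    """Hypothetical sketch of one MGT-style block (not the paper's code).

    - mHC step (assumed form): project the sublayer update onto a low-rank,
      token-dependent basis, standing in for "valid tangent directions".
    - DDL step (assumed form): a signed gate in [-1, 1] scales the update,
      so it can subtract (erase) as well as add (accumulate).
    """

    def __init__(self, d_model: int, tangent_rank: int = 16):
        super().__init__()
        self.sublayer = nn.Sequential(            # stand-in for attention/MLP
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        # Predicts a per-token tangent basis of shape (tangent_rank, d_model).
        self.basis_proj = nn.Linear(d_model, tangent_rank * d_model)
        # Predicts a per-token signed gate (the "delta" magnitude and sign).
        self.gate_proj = nn.Linear(d_model, 1)
        self.rank = tangent_rank
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        update = self.sublayer(self.norm(x))

        # Constrain the update direction: keep only its components along the
        # predicted unit directions (a crude proxy for a tangent-space projection).
        basis = self.basis_proj(x).view(*x.shape[:-1], self.rank, self.d_model)
        basis = F.normalize(basis, dim=-1)
        coeffs = torch.einsum("...rd,...d->...r", basis, update)
        constrained = torch.einsum("...r,...rd->...d", coeffs, basis)

        # Control magnitude and sign: a tanh gate allows non-monotonic,
        # data-dependent updates, including feature erasure when negative.
        gate = torch.tanh(self.gate_proj(x))              # (B, T, 1)
        return x + gate * constrained
```

In this reading, the projection controls the direction of each residual step while the gate controls its magnitude and sign, which is the division of labor the abstract attributes to mHC and DDL respectively.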

Full Content

...(๋ณธ๋ฌธ ๋‚ด์šฉ์ด ๊ธธ์–ด ์ƒ๋žต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์‚ฌ์ดํŠธ์—์„œ ์ „๋ฌธ์„ ํ™•์ธํ•ด ์ฃผ์„ธ์š”.)
