Symmetry Breaking in Transformers for Efficient and Interpretable Training
The attention mechanism in its standard implementation contains extraneous rotational degrees of freedom that are carried through computation but do not affect model activations or outputs. We introduce a simple symmetry-breaking protocol that inserts a preferred direction into this rotational space through batchwise-sampled, unlearned query and value biases. This modification has two theoretically motivated and empirically validated consequences. First, it can substantially improve the performance of simple, memory-efficient optimizers, narrowing, and in some cases closing, the gap to successful but more complex memory-intensive adaptive methods. We demonstrate this by pretraining 124M-parameter transformer models with four optimization algorithms (AdamW, SOAP, SGDM, and Energy-Conserving Descent (ECD)) and evaluating both validation loss and downstream logical reasoning. Second, it enables an interpretable use of otherwise redundant rotational degrees of freedom, selectively amplifying semantically meaningful token classes within individual attention heads. Overall, our results show that minimal, principled architectural changes can simultaneously improve performance and interpretability.
💡 Research Summary
The paper investigates a subtle but important source of inefficiency in transformer training: the continuous rotational symmetries inherent in the attention mechanism. Because the attention scores depend only on inner products between queries and keys, any orthogonal transformation applied jointly to the query and key weight matrices (and similarly to the value‑output matrices) leaves the forward computation unchanged. This symmetry creates a high‑dimensional manifold of equivalent parameter configurations. By Noether’s theorem, each continuous symmetry gives rise to a conserved quantity—in this case, components of the conjugate momentum associated with the rotated directions.
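The invariance described above can be checked in a few lines of NumPy (an illustrative sketch, not the authors' code): applying the same orthogonal matrix R to both the query and key projection matrices leaves the attention scores unchanged, because the rotations cancel inside the inner product.

```python
import numpy as np

# Sketch: attention scores (x W_Q)(x W_K)^T are invariant under the joint
# rotation W_Q -> W_Q R, W_K -> W_K R for any orthogonal R, since
# (x W_Q R)(x W_K R)^T = x W_Q (R R^T) W_K^T x^T = x W_Q W_K^T x^T.
rng = np.random.default_rng(0)
d_model, d_head = 16, 8

W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
x = rng.normal(size=(4, d_model))  # a few token embeddings

# Random orthogonal matrix via QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))

scores = (x @ W_Q) @ (x @ W_K).T
scores_rotated = (x @ W_Q @ R) @ (x @ W_K @ R).T

print(np.allclose(scores, scores_rotated))  # True: the rotation is unobservable
```

The same cancellation applies to the value-output pair, which is why the paper speaks of a whole manifold of equivalent parameter configurations.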
Energy‑Conserving Descent (ECD) is a physics‑inspired optimizer that removes friction and conserves the total Hamiltonian energy of the system. While this leads to elegant theoretical properties, the conserved angular momenta generated by the attention symmetries restrict the chaotic exploration that ECD relies on to find low‑loss regions. Adaptive optimizers such as AdamW and SOAP implicitly break these symmetries through per‑parameter statistics and preconditioning, but they require roughly three auxiliary variables per model parameter, inflating memory usage.
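The Noether argument can be seen in a toy setting (a two-dimensional sketch, not ECD itself): for a rotationally symmetric loss, frictionless Hamiltonian dynamics exactly conserves angular momentum, so the trajectory can never redistribute momentum out of the symmetry directions — the analogue of the conserved momenta that, per the paper, restrict ECD's exploration.

```python
import numpy as np

# Toy illustration: loss L(w) = (|w| - 1)^2 depends only on the radius, so it
# is invariant under rotations of w. Under frictionless (leapfrog) dynamics
# the gradient is purely radial, and the angular momentum w x p is conserved.
def grad(w):
    r = np.linalg.norm(w)
    return 2.0 * (r - 1.0) * w / r

w = np.array([1.5, 0.0])   # position (parameters)
p = np.array([0.0, 0.3])   # conjugate momentum

dt = 1e-3
ang_mom_0 = w[0] * p[1] - w[1] * p[0]
for _ in range(10_000):            # leapfrog (symplectic) integration
    p -= 0.5 * dt * grad(w)        # radial kick: does not change w x p
    w += dt * p                    # drift along p: does not change w x p
    p -= 0.5 * dt * grad(w)
ang_mom_T = w[0] * p[1] - w[1] * p[0]

print(abs(ang_mom_T - ang_mom_0))  # ~0: angular momentum is conserved
```

Breaking the rotational symmetry of the loss (as the biases below do for attention) removes this conservation law and frees the dynamics to explore those directions.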
To address both the performance gap and the interpretability deficit, the authors introduce a minimal architectural change: batch‑wise, untrained query and value biases (denoted b_Q and b_V). For each training batch, a freshly sampled random vector, which is never trained, is added to the query and value projections before the softmax. This bias explicitly selects a preferred direction in the otherwise degenerate rotational subspace, thereby breaking the symmetry and eliminating the associated conserved momenta. The bias also rescales the unnormalised attention weights multiplicatively by a factor e^{k·b_Q}, where k is a key vector. Consequently, token classes whose average key vectors align with b_Q receive amplified attention, while anti‑aligned classes are suppressed. This provides a transparent, class‑level control over attention that was previously unavailable.
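The multiplicative reweighting can be verified directly (an illustrative NumPy sketch, omitting the usual 1/sqrt(d) logit scaling): adding b_Q to a query shifts every pre-softmax logit by k·b_Q, so the unnormalised weight on each key is rescaled by exp(k·b_Q).

```python
import numpy as np

# Sketch: softmax(K(q + b_Q)) equals softmax(Kq) reweighted by exp(K b_Q)
# and renormalised, because the exponentials factorise:
#   exp(k·q + k·b_Q) = exp(k·q) * exp(k·b_Q).
rng = np.random.default_rng(1)
d_head, n_keys = 8, 5

q = rng.normal(size=d_head)
K = rng.normal(size=(n_keys, d_head))
b_Q = rng.normal(size=d_head)

def softmax(z):
    z = z - z.max()            # numerical stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

attn_plain = softmax(K @ q)
attn_biased = softmax(K @ (q + b_Q))

# Reweight the plain attention by exp(k·b_Q) per key, then renormalise.
reweighted = attn_plain * np.exp(K @ b_Q)
reweighted /= reweighted.sum()

print(np.allclose(attn_biased, reweighted))  # True
```

Keys aligned with b_Q (positive k·b_Q) get their factor boosted above 1, anti-aligned keys get a factor below 1, matching the amplification/suppression behaviour described above.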
Empirically, the authors pre‑train 124‑million‑parameter GPT‑2 models using four optimizers: AdamW, SOAP, SGDM, and ECD. They evaluate both validation loss and downstream logical reasoning benchmarks (e.g., logical deduction and proof‑writing tasks). Without symmetry breaking, ECD lags behind adaptive methods, showing higher validation loss and lower reasoning accuracy. After inserting b_Q and b_V, ECD’s validation loss drops dramatically and its reasoning accuracy matches or exceeds that of SOAP, despite using only 2N auxiliary variables (half the memory of AdamW/SOAP). SGDM also benefits, though to a lesser extent, confirming that the symmetry‑breaking effect is not limited to ECD.
Beyond optimization, the authors analyze the learned alignment between b_Q and various token categories. They find a strong positive correlation (≈0.7) between the magnitude of alignment and downstream logical performance across all optimizers. Visualizations of attention maps reveal that heads with strong b_Q alignment focus on semantically meaningful tokens such as numbers, logical operators, and connective words, effectively “highlighting” the logical structure of the input. This demonstrates that the bias serves as an interpretable knob for shaping model behavior, offering a new avenue for model introspection and controlled generation.
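The alignment measurement itself is simple to sketch (hedged: the class names, key vectors, and construction below are synthetic placeholders, not the paper's data): compute the cosine similarity between b_Q and the mean key vector of each token class.

```python
import numpy as np

# Synthetic sketch of the alignment analysis. One class of keys is
# constructed to align with b_Q; the other is random.
rng = np.random.default_rng(2)
d_head = 8
b_Q = rng.normal(size=d_head)

# Hypothetical per-class key vectors (e.g. keys of number tokens vs. stopwords).
class_keys = {
    "numbers": rng.normal(size=(20, d_head)) + b_Q,  # aligned class
    "stopwords": rng.normal(size=(20, d_head)),      # neutral class
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

alignment = {name: cosine(keys.mean(axis=0), b_Q)
             for name, keys in class_keys.items()}
print(alignment)  # the constructed "numbers" class shows the larger cosine
```

In the paper, the magnitude of such per-class alignments is what correlates (≈0.7) with downstream logical performance.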
In summary, the paper makes three key contributions: (1) a Hamiltonian‑based explanation of why ECD struggles with transformer training due to conserved angular momenta from attention symmetries; (2) a simple, memory‑efficient symmetry‑breaking protocol that restores ECD’s performance while preserving its low‑memory footprint; and (3) evidence that the same protocol yields an interpretable mechanism for token‑class‑specific attention modulation, which correlates with improved logical reasoning. The work suggests that careful analysis of architectural symmetries can reveal low‑cost modifications that simultaneously boost training efficiency and model interpretability, and it opens the door for future studies on larger scales and other physics‑inspired optimizers.