Representation Collapse in Machine Translation Through the Lens of Angular Dispersion

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. The standard next-token prediction training strategy, while widely adopted in practice, can lead to overlooked artifacts such as representation collapse. Previous work has shown that this problem is especially pronounced in the representations of the deeper Transformer layers, which often fail to utilize the geometric space efficiently. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution is to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different layers of discrete and continuous NMT Transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.


💡 Research Summary

The paper investigates a subtle but important failure mode in modern Transformer‑based neural machine translation (NMT) systems: representation collapse. When training with the standard next‑token prediction objective, the hidden states and token embeddings tend to occupy only a small region of the high‑dimensional space. Two variants are distinguished: complete collapse, where all vectors converge to a single point (especially problematic for continuous‑output NMT, CoNMT), and dimensional collapse, where the representations lie in a low‑dimensional subspace despite the model’s nominal dimensionality. Both phenomena degrade translation quality and hinder downstream use of the learned representations (e.g., retrieval‑augmented generation, k‑NN MT).

Previous remedies such as contrastive learning rely on data augmentation and negative sampling, which are difficult to define for discrete text and often require large batch sizes. The authors therefore turn to a geometric perspective: treating normalized vectors as points on a unit sphere Sⁿ and encouraging them to be uniformly spread. This “angular dispersion” is a natural proxy for avoiding collapse because a uniformly dispersed set of directions maximizes spherical coverage.

The core contribution is the introduction of a sliced-dispersion regularizer. For any pair of orthogonal directions (P, Q) defining a great circle on the sphere, the method projects all representation vectors onto that 2-D circle, computes the angular positions, and measures the distance δ to an ideal equally-spaced configuration. The regularizer R_sliced(Z) is the expectation of δ over uniformly sampled great-circle pairs, which can be estimated efficiently with Monte-Carlo sampling. This yields an O(N) computation (linear in the number of vectors) rather than the O(N²) cost of pairwise kernels.
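The procedure above can be sketched in a few lines of numpy. The sampling of orthonormal pairs and the projection follow the description directly; the exact distance δ used in the paper is not specified here, so this sketch uses the mean absolute deviation of consecutive angular gaps from the uniform spacing 2π/N as one plausible choice:

```python
import numpy as np

def sliced_dispersion(Z, num_slices=32, rng=None):
    """Monte-Carlo estimate of a sliced angular-dispersion penalty.

    Z: (N, d) array of representation vectors.
    Returns the mean, over random great circles, of delta: the deviation
    of the projected angular positions from an equally spaced layout.
    NOTE: the paper's exact delta may differ; this uses mean absolute
    gap deviation as an illustrative choice.
    """
    rng = np.random.default_rng(rng)
    N, d = Z.shape
    target_gap = 2 * np.pi / N           # ideal equal spacing on the circle
    deltas = []
    for _ in range(num_slices):
        # Sample an orthonormal pair (P, Q) defining a random great circle.
        P = rng.standard_normal(d)
        P /= np.linalg.norm(P)
        Q = rng.standard_normal(d)
        Q -= (Q @ P) * P                 # Gram-Schmidt against P
        Q /= np.linalg.norm(Q)
        # Project every vector onto the plane and read off its angle.
        theta = np.sort(np.arctan2(Z @ Q, Z @ P))
        # Circular gaps between consecutive angles (wrap around once).
        gaps = np.diff(np.concatenate([theta, theta[:1] + 2 * np.pi]))
        deltas.append(np.mean(np.abs(gaps - target_gap)))
    return float(np.mean(deltas))        # cost O(num_slices * N * d), linear in N
```

Collapsed representations (all vectors pointing the same way) produce one huge angular gap and many near-zero ones, so their penalty is markedly higher than that of well-dispersed vectors.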

The regularizer is added to the standard NMT loss: L_RNMT = L_NMT + γ·R(Z), where γ controls strength. The authors apply it to three latent spaces: the decoder output H, the decoder embedding matrix E, and the encoder output F. To keep the cost low for E, they randomly subsample embeddings of rare tokens (those in the lower half of the vocabulary frequency rank).
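In code, the combined objective and the rare-token subsampling for E might look like the following sketch. The function names and the exact sampling scheme are illustrative, not taken from the paper:

```python
import numpy as np

def regularized_loss(nmt_loss, Z, gamma, reg_fn):
    """L_RNMT = L_NMT + gamma * R(Z); reg_fn is the dispersion penalty."""
    return nmt_loss + gamma * reg_fn(Z)

def subsample_rare_embeddings(E, freq_rank, k, rng=None):
    """Pick k random embeddings from the rarer half of the vocabulary.

    E: (V, d) embedding matrix; freq_rank[i] is token i's frequency rank
    (0 = most frequent). Illustrative sketch, not the paper's exact scheme.
    """
    rng = np.random.default_rng(rng)
    V = E.shape[0]
    rare = np.where(freq_rank >= V // 2)[0]   # lower half by frequency
    pick = rng.choice(rare, size=min(k, rare.size), replace=False)
    return E[pick]
```

Applying R only to a subsample of E keeps the per-step cost independent of the full vocabulary size.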

To quantify collapse, three complementary metrics are employed: (1) average cosine similarity across token representations (high values indicate clustering), (2) Rényi entropy of the eigenvalue distribution of the covariance (low entropy signals variance concentrated in few dimensions), and (3) spherical variance, defined as 1 minus the norm of the mean of normalized vectors (values near zero mean directions are tightly clustered).
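The three diagnostics can be computed directly from a batch of representations. The sketch below assumes the α = 2 Rényi order, which the summary does not specify:

```python
import numpy as np

def collapse_metrics(Z, alpha=2):
    """Compute the three collapse diagnostics for Z of shape (N, d).

    Returns (avg_cos, renyi_entropy, spherical_variance).
    alpha=2 is an assumed Renyi order; the paper's choice may differ.
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # (1) Average pairwise cosine similarity (high => clustering).
    G = Zn @ Zn.T
    n = len(Z)
    avg_cos = (G.sum() - n) / (n * (n - 1))   # exclude the diagonal
    # (2) Renyi entropy of the normalized covariance eigenvalues
    #     (low => variance concentrated in few dimensions).
    lam = np.clip(np.linalg.eigvalsh(np.cov(Z, rowvar=False)), 0, None)
    p = lam / lam.sum()
    renyi = np.log((p ** alpha).sum()) / (1 - alpha)
    # (3) Spherical variance: 1 - ||mean of unit vectors||
    #     (near zero => directions tightly clustered).
    sph_var = 1.0 - np.linalg.norm(Zn.mean(axis=0))
    return float(avg_cos), float(renyi), float(sph_var)
```

Note that (1) and (3) detect complete collapse (all directions alike), while (2) detects dimensional collapse even when the mean direction looks unremarkable.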

Experiments use the WMT19 English‑German pair (34 M sentence pairs) and evaluate with sacreBLEU and COMET on newstest2018/2019. Both Transformer‑base (≈65 M parameters) and Transformer‑big (≈210 M) are trained, as well as a CoNMT model with the same architecture. Training runs for 50 k steps on a single H100 GPU with Adam (lr = 5e‑4, 10 k warm‑up). γ is tuned on the dev set over {10⁻², 10⁻¹, 10⁰, 10¹, 10²}.

Results show that adding the sliced-dispersion term consistently reduces average cosine similarity by 0.1–0.2 and raises Rényi entropy and spherical variance by roughly 10–20 %, indicating richer, more uniform representations. Translation quality improves across the board: BLEU gains of 0.5–1.2 points and COMET gains of 0.02–0.05 over the baseline. The effect is strongest for the CoNMT model: without regularization the loss quickly collapses to the trivial solution (all embeddings identical), whereas with regularization training remains stable and yields competitive scores.

A further set of experiments quantizes the trained models to int8 and float16. The regularized models retain their high entropy and low cosine similarity after quantization, and the BLEU drop is less than half of that observed for unregularized models. This demonstrates that angular dispersion regularization not only combats collapse in full‑precision training but also preserves representation diversity when the model is compressed for edge deployment.
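A generic symmetric per-tensor int8 scheme (an assumption; the paper's exact quantization recipe is not given in this summary) is enough to reproduce this kind of before/after comparison, by quantizing the weights, dequantizing, and recomputing the collapse metrics:

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor int8 quantization (illustrative sketch)."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to float for metric evaluation."""
    return q.astype(np.float32) * scale
```

Because dispersion-regularized representations are spread out rather than crowded into a narrow cone, rounding each coordinate to one of 255 levels perturbs their relative angles less, which is consistent with the smaller post-quantization BLEU drop reported above.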

In summary, the paper makes three key contributions: (1) a clear empirical characterization of representation collapse in both discrete and continuous NMT settings, (2) a computationally cheap spherical‑geometry‑based regularizer (sliced dispersion) that directly targets angular uniformity, and (3) evidence that the regularizer improves translation quality and remains effective after aggressive quantization. The work opens avenues for applying spherical dispersion to larger multilingual models, integrating it with contrastive objectives, and exploring its impact on downstream tasks that rely on rich contextual embeddings.

