Understanding the Reversal Curse Mitigation in Masked Diffusion Models through Attention and Training Dynamics

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

Autoregressive language models (ARMs) suffer from the reversal curse: after learning that “$A$ is $B$”, they often fail on the reverse query “$B$ is $A$”. Masked diffusion-based language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to the any-order training objective. However, observing “[MASK] is $B$” during training does not necessarily teach the model to handle the reverse prompt “$B$ is [MASK]”. We show that the mitigation arises from architectural structure and its interaction with training. In a one-layer Transformer encoder, weight sharing couples the two directions by making forward and reverse attention scores positively correlated. In the same setting, we further show that the corresponding gradients are aligned, so minimizing the forward loss also reduces the reverse loss. Experiments on both controlled toy tasks and large-scale diffusion language models support these mechanisms, explaining why MDMs partially overcome a failure mode that persists in strong ARMs.


💡 Research Summary

The paper investigates why masked diffusion language models (MDMs) mitigate the “reversal curse”—the failure of autoregressive models (ARMs) to answer reversed queries such as “B is A?” after learning “A is B”—and provides a rigorous analysis that attributes this mitigation to architectural coupling and training dynamics rather than to the any‑order training objective alone.

First, the authors formalize the reversal curse in ARMs as a consequence of the unidirectional conditional probability p(y = B | x = A) that is optimized during left‑to‑right next‑token prediction. The reverse conditional p(y = A | x = B) receives no learning signal, which explains why ARMs collapse on reversed queries despite strong forward performance.
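This forward-only supervision can be made concrete with a minimal sketch (the helper name and toy "corpus" below are hypothetical illustrations, not from the paper). Under teacher-forced next-token prediction, every loss term conditions on a left-to-right prefix, so training on "A is B" produces loss terms for the forward conditionals only:

```python
# Which conditionals receive a loss term under teacher-forced,
# left-to-right next-token prediction? (Toy illustration; names are
# hypothetical, not from the paper.)

def supervised_conditionals(sequence):
    """Return the (prefix context -> next token) pairs that receive a
    cross-entropy loss term during autoregressive training."""
    return [(tuple(sequence[:i]), sequence[i]) for i in range(1, len(sequence))]

corpus = [["A", "is", "B"]]  # the model only ever sees the forward fact

pairs = [p for seq in corpus for p in supervised_conditionals(seq)]

# The forward conditional p(B | "A is") is supervised...
assert (("A", "is"), "B") in pairs
# ...but nothing constrains the reverse conditional p(A | "B is"):
assert (("B", "is"), "A") not in pairs
```

Since no loss term ever conditions on a prefix beginning with B, gradient descent leaves the reverse conditional entirely unconstrained, which is the formal content of the reversal curse in ARMs.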

MDMs, trained with random masking, appear to provide supervision for both forward and reverse directions, but the authors show that the training objective only directly supervises p(x = A | y = B) from inputs of the form "[MASK] is B", which does not match the reverse prompt "B is [MASK]", where B occupies a different position. Any-order training alone therefore cannot explain the mitigation. Instead, the paper attributes it to architectural coupling and training dynamics: in a one-layer Transformer encoder, weight sharing makes forward and reverse attention scores positively correlated, and the corresponding gradients are aligned, so minimizing the forward loss also reduces the reverse loss.
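The positional mismatch above can be illustrated with a small enumeration sketch (hypothetical helper names, assuming a three-token fact; not from the paper). Masking subsets of the forward sequence "A is B" yields training targets keyed by (visible tokens at positions, masked position), and the reverse query's (context, position) pair never appears among them:

```python
import itertools

def mdm_training_targets(tokens):
    """Enumerate (visible-context, masked-position, target) triples an MDM
    sees when masking non-empty subsets of positions (any-order objective).
    Contexts record both token identity and position."""
    targets = set()
    n = len(tokens)
    for r in range(1, n + 1):
        for masked in itertools.combinations(range(n), r):
            visible = tuple((i, tokens[i]) for i in range(n) if i not in masked)
            for pos in masked:
                targets.add((visible, pos, tokens[pos]))
    return targets

forward = mdm_training_targets(["A", "is", "B"])

# Masking position 0 of the forward fact gives "[MASK] is B" -> predict A at pos 0:
assert (((1, "is"), (2, "B")), 0, "A") in forward

# The reverse prompt "B is [MASK]" asks for A at position 2 with B at position 0;
# that (context, position) pair never occurs in the forward training targets:
assert (((0, "B"), (1, "is")), 2, "A") not in forward
```

Because position 0 is always occupied by A in the forward training data, the reverse prompt's context is out of distribution at the level of the objective, which is why the paper must look to weight sharing and gradient alignment for the observed mitigation.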

