Remasking Discrete Diffusion Models with Inference-Time Scaling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: https://guanghanwang.com/remdm


💡 Research Summary

The paper tackles a fundamental limitation of modern masked discrete diffusion models—once a token is unmasked during generation it cannot be changed again, preventing iterative refinement and limiting controllability, speed, and sample quality. To overcome this, the authors introduce the Remasking Diffusion Model (ReMDM) sampler, a principled method that can be applied on top of pretrained masked diffusion models without retraining.

The core technical contribution is a new backward (denoising) transition that, for an already unmasked token, allows it to be remasked with probability σₜ. When σₜ = 0 the process reduces to the standard masked diffusion (MDLM); when σₜ > 0 the model can “undo” a previously generated token and try again. The authors define the posterior qσ(zₛ|zₜ, x) in equations (4)–(6) and prove (Theorem 3.1) that the marginal qσ(zₜ|x) remains identical to the original masked diffusion marginal, which justifies re‑using a model trained with the MDLM objective.
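The remasking transition described here can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the names (`remdm_reverse_step`, `MASK`) are hypothetical, and the unmasking probability for masked tokens is derived by requiring the masked fraction at time s to equal 1 − αₛ, consistent with the marginal-preservation result above.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token

def remdm_reverse_step(z_t, model_probs, alpha_s, alpha_t, sigma_t, rng):
    """One sketched ReMDM reverse step t -> s (with s < t).

    z_t         : token ids, shape (L,), equal to MASK where still masked
    model_probs : model's predicted token distributions, shape (L, V)
    alpha_s/t   : masking-schedule "survival" probabilities at times s and t
    sigma_t     : remasking probability for already-unmasked tokens;
                  keeping p_unmask below in [0, 1] requires
                  sigma_t <= min(1, (1 - alpha_s) / alpha_t)
    """
    z_s = z_t.copy()
    for i in range(len(z_t)):
        if z_t[i] != MASK:
            # Already-unmasked token: remask it with probability sigma_t.
            if rng.random() < sigma_t:
                z_s[i] = MASK
        else:
            # Masked token: unmask at a rate adjusted so the time-s marginal
            # still has masked fraction 1 - alpha_s (cf. Theorem 3.1).
            p_unmask = (alpha_s - (1.0 - sigma_t) * alpha_t) / (1.0 - alpha_t)
            if rng.random() < p_unmask:
                z_s[i] = rng.choice(len(model_probs[i]), p=model_probs[i])
    return z_s
```

Setting sigma_t = 0 freezes unmasked tokens and reduces p_unmask to (αₛ − αₜ)/(1 − αₜ), i.e., the standard MDLM step.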

From an ELBO perspective, the negative evidence lower bound (NELBO) for ReMDM (eq. 9) is a re‑weighted version of the MDLM loss, with each term increasing monotonically with σₜ. Consequently, training can be performed with σₜ = 0 (i.e., the usual MDLM loss) and the same parameters can be used at inference with any σₜ schedule, similar to using a different noise schedule in continuous diffusion.

The paper proposes several practical σₜ schedules:

  1. Max‑capped – caps σₜ at a user‑defined value η_cap.
  2. Rescale – multiplies the maximal admissible value σ_max by a scalar η_rescale.
  3. Confidence‑based – scales σₜ per token according to a confidence score derived from the model's probability for that token when it was last unmasked; less confident tokens are more likely to be remasked.
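As a rough sketch, the three schedules could be implemented as below. The helper names and the exact confidence-based form are assumptions; `sigma_max` stands for the largest remasking probability that keeps the reverse step valid at the current timestep.

```python
def sigma_max_cap(sigma_max, eta_cap):
    # "Max-capped": clip the admissible maximum at a user-chosen cap.
    return min(sigma_max, eta_cap)

def sigma_rescale(sigma_max, eta_rescale):
    # "Rescale": shrink the admissible maximum by a scalar in (0, 1].
    return eta_rescale * sigma_max

def sigma_confidence(sigma_base, token_confidence):
    # "Confidence-based": per-token sigma_t; tokens the model was less
    # confident about when last unmasked are remasked more aggressively.
    # The linear form below is one simple choice, not the paper's exact one.
    return sigma_base * (1.0 - token_confidence)
```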

In addition, the authors introduce “turn‑on/off” mechanisms to apply remasking only during certain phases of generation:

  • Switch – activates remasking only after a fixed timestep t_switch.
  • Loop – divides sampling into three phases: (i) an initial MDLM decoding phase (σₜ = 0), (ii) a constant‑α loop in which σₜ > 0 and the model repeatedly remasks and re‑predicts tokens, and (iii) a final MDLM clean‑up phase. This loop implements iterative error correction while preserving good tokens.
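A minimal sketch of the turn-on/off logic above, with hypothetical parameter names (`t_switch` and the loop window are expressed here as fractions of completed sampling steps):

```python
def remasking_active(step, n_steps, mode, *, t_switch=0.5, loop=(0.3, 0.7)):
    """Return True when remasking (sigma_t > 0) should be active.

    "switch" turns remasking on after a fixed point in sampling;
    "loop" keeps it on only during a middle phase, leaving plain MDLM
    decoding at the start and a clean-up phase at the end.
    """
    frac = step / n_steps  # fraction of sampling completed so far
    if mode == "switch":
        return frac >= t_switch
    if mode == "loop":
        lo, hi = loop
        return lo <= frac < hi
    return False  # default: plain MDLM, never remask
```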

Theoretical analysis shows that ReMDM subsumes prior discrete predictor‑corrector samplers. Theorem 4.1 demonstrates that a ReMDM step can be decomposed into an MDLM predictor followed by a corrector whose form matches the FB and DFM correctors for specific choices of σₜ (Propositions 4.2 and 4.3). Thus ReMDM provides a unified, more flexible framework.

Empirical evaluation spans three domains:

  • Natural language generation (OpenWebText). Increasing the number of sampling steps dramatically improves MAUVE scores, approaching those of autoregressive (AR) models with the same architecture. When the step budget is reduced, ReMDM degrades more gracefully than vanilla MDLM, confirming its inference‑time scaling property.

  • Discrete image generation (CIFAR‑10, ImageNet‑32). ReMDM‑loop yields lower FID and higher sample diversity than MaskGIT and other masked diffusion baselines, especially when the number of sampling steps is limited.

  • Molecule design (SMILES strings). When combined with diffusion guidance, ReMDM pushes the novelty‑property Pareto frontier beyond both masking‑based and uniform‑noise diffusion methods, demonstrating superior controllability in scientific generation tasks.

Overall, ReMDM offers a simple yet powerful augmentation to existing masked discrete diffusion models: by introducing a controllable remasking probability, it enables iterative refinement, inference‑time compute scaling, and better controlled generation across language, vision, and scientific domains—all without requiring retraining of the underlying model. The work bridges the gap between diffusion’s iterative refinement strengths and the efficiency of parallel masked decoding, opening new avenues for high‑quality, controllable discrete generation.

