Discrete Diffusion Trajectory Alignment via Stepwise Decomposition


Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving these models by aligning them with a given reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient through the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and, importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains, including DNA sequence design, protein inverse folding, and language modeling, consistently demonstrate the superiority of our approach. Notably, it achieves up to a 12% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.


💡 Research Summary

This paper tackles the problem of aligning pretrained discrete diffusion models with arbitrary reward functions, a task that has been largely unexplored compared to the extensive work on reinforcement‑learning‑with‑human‑feedback (RLHF) for autoregressive language models. Discrete diffusion models generate sequences by iteratively denoising a masked latent variable through a Markov chain of length T, which makes direct reward‑based fine‑tuning difficult for two reasons. First, the reward is usually defined only on the final clean sequence x₀ (e.g., human preference, predicted biological activity), while the diffusion process involves many intermediate latent states x₁,…,x_T. Second, back‑propagating a reward signal through the entire stochastic chain is computationally expensive and often unstable, especially when the latent space is discrete.
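The masked (absorbing-state) denoising process described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `MASK` token id, the `denoiser` callable, and the `1/t` unmasking schedule are all assumptions chosen so that, on average, every position is revealed by the final step.

```python
import random

MASK = -1  # hypothetical mask token id

def forward_mask(x0, t, T, rng):
    """Forward (noising) step of absorbing-state discrete diffusion:
    each token of the clean sequence x0 is independently replaced by
    MASK with probability t/T."""
    return [MASK if rng.random() < t / T else tok for tok in x0]

def reverse_denoise(x_T, T, denoiser, rng):
    """Reverse chain x_T -> ... -> x_0: at each step a denoiser proposes
    a clean token for every position, and a fraction of the masked
    positions are committed to the proposed tokens."""
    x = list(x_T)
    for t in range(T, 0, -1):
        proposal = denoiser(x)  # predicts a clean token per position
        for i, tok in enumerate(x):
            # unmask each masked position with prob 1/t, so that in
            # expectation all positions are revealed by t = 0
            if tok == MASK and rng.random() < 1.0 / t:
                x[i] = proposal[i]
    # commit any positions still masked at the final step
    return [proposal[i] if tok == MASK else tok
            for i, tok in enumerate(x)]
```

Because the reward is only defined on the fully denoised output of `reverse_denoise`, a naive reward gradient would have to flow through every stochastic unmasking decision in this loop, which is exactly the difficulty the paper targets.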

To overcome these challenges, the authors propose Stepwise Decomposition Preference Optimization (SDPO), an offline preference‑optimization framework that decomposes the global trajectory‑level alignment objective into a set of per‑step posterior‑matching problems. The original trajectory objective can be written as

$$\max_{p_\theta}\; \mathbb{E}_{x_{0:T}\sim p_\theta}\big[r(x_0)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(p_\theta(x_{0:T})\,\|\,p_{\mathrm{ref}}(x_{0:T})\big),$$

where r(x₀) is the reward on the final clean sequence, p_ref is the pretrained (reference) diffusion model, and β controls the strength of the KL regularization.

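The stepwise decomposition can be illustrated with a small numeric sketch. This is a hypothetical DPO-style loss in the spirit of the per-step posterior matching idea, not SDPO's exact objective: the function names, the β default, and the uniform averaging over steps are assumptions. Each argument is a list of per-step log-probabilities log p(x_{t-1} | x_t) for a preferred (w) or rejected (l) sample, under the policy (θ) or the frozen reference model.

```python
import math

def logsigmoid(z):
    # numerically stable log(sigmoid(z))
    return -math.log1p(math.exp(-abs(z))) + min(z, 0.0)

def stepwise_dpo_loss(logp_theta_w, logp_ref_w,
                      logp_theta_l, logp_ref_l, beta=0.1):
    """Hypothetical per-step preference loss: average a DPO-style
    log-sigmoid term over the T denoising steps, rather than applying
    a single preference term to the whole trajectory."""
    T = len(logp_theta_w)
    total = 0.0
    for t in range(T):
        # per-step log-ratio margin between preferred and rejected samples
        margin = beta * ((logp_theta_w[t] - logp_ref_w[t])
                         - (logp_theta_l[t] - logp_ref_l[t]))
        total += -logsigmoid(margin)
    return total / T
```

When the policy matches the reference everywhere, every margin is zero and the loss sits at log 2; increasing the policy's per-step log-probability on the preferred sample lowers it, which is the stepwise analogue of the trajectory-level preference objective.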
