The SMILES (Weininger, 1988) representation for molecules is based on heuristic rules such as depth-first search and does not use a unified fixed ordering (Lee et al., 2025).
In many applications in chemistry and biology, there are well-defined notions of quality for the data; for example, drug-likeness (Bickerton et al., 2012) for molecular structures or enhancer activity (Taskiran et al., 2024; Wang et al., 2025) for DNA sequences. Thus, generative models must not only produce natural samples that resemble the training data but also achieve high-quality scores according to such domain-specific criteria.
In the emerging regime of test-time scaling, quality considerations are incorporated by defining a reward function and optimizing it during inference through reward-guided sampling. The simplest strategy is Best-of-N sampling, where N samples are generated and the one with the highest reward is selected. However, this brute-force method is inefficient since it does not perform any structured guidance. Recent works (Kim et al., 2025b; Wu et al., 2023; Yu et al., 2023; Bansal et al., 2023; Chung et al., 2023) have instead proposed guidance using process rewards or intermediate rewards, which are computable at intermediate steps of generation.
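As a generic illustration of this kind of intermediate-reward guidance (a simplified sketch, not the exact algorithm of any cited work), one can reweight candidate partial samples at each denoising step by an assumed intermediate reward and resample among them:

```python
import numpy as np

def reward_guided_step(candidates, intermediate_reward, temperature=1.0, rng=None):
    """Generic sketch: given candidate partial states for the next denoising
    step, reweight them by an (assumed) intermediate reward and resample one."""
    rng = rng or np.random.default_rng()
    scores = np.array([intermediate_reward(c) for c in candidates]) / temperature
    probs = np.exp(scores - scores.max())   # softmax over candidate scores
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```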
While methods involving intermediate rewards are technically applicable to discrete diffusion models, they pose particular challenges for many types of chemistry and biology data as reward functions in these domains are often non-smooth, meaning that small perturbations in the data can cause large changes in the reward. For example, in molecular structures, modifying even a single element in the string representation can render the entire molecule invalid, collapsing the reward to zero, as shown in Fig. 1 (left). As a result, relying on intermediate rewards does not provide an effective local search strategy in these cases.
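To make this concrete, the following illustrative reward (assuming RDKit is available; this particular definition is for illustration and is not necessarily the paper's) returns the QED drug-likeness of a molecule and collapses to zero when the SMILES string fails to parse, so a single-character edit can zero out the reward:

```python
from rdkit import Chem, RDLogger        # cheminformatics toolkit (assumed available)
from rdkit.Chem import QED

RDLogger.DisableLog("rdApp.*")          # silence parse warnings for invalid strings

def smiles_reward(smiles: str) -> float:
    """Illustrative non-smooth reward: QED drug-likeness if the SMILES parses,
    and 0.0 if it does not."""
    mol = Chem.MolFromSmiles(smiles)
    return QED.qed(mol) if mol is not None else 0.0

print(smiles_reward("c1ccccc1O"))       # valid phenol -> positive QED score
print(smiles_reward("c1ccccc1O("))      # one extra character -> invalid -> 0.0
```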
Table 1. Comparison of discrete diffusion inference-time scaling methods for reward-guided sampling. Our CSMC applies to all discrete diffusion frameworks and leverages the clean reward while guiding the sampling trajectory.
The key question that arises here is how to enable local search in reward-guided generation without relying on intermediate rewards. We introduce the Clean-Sample Markov Chain (CSMC) Sampler, which performs an iterative search over clean data samples using the Metropolis-Hastings (MH) algorithm. To propose a new clean sample from an existing one, we use forward-backward combinations: applying the forward process to corrupt the clean data, followed by running the reverse process to obtain a new sample. Although the MH acceptance probability involves clean-sample probabilities that are themselves intractable, we show that the forward-backward proposal distribution makes the acceptance probability tractable, enabling efficient sampling.
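The following is a minimal sketch of such a clean-sample Metropolis-Hastings loop; `forward_corrupt`, `reverse_denoise`, `reward`, and `log_accept_ratio` are assumed callables, and the acceptance computation is left abstract since its tractable form depends on the specific forward-backward proposal derived in the paper.

```python
import numpy as np

def clean_sample_mh(x0, reward, forward_corrupt, reverse_denoise,
                    log_accept_ratio, n_steps, rng=None):
    """Sketch of a Metropolis-Hastings search over clean samples.

    forward_corrupt(x): apply the diffusion forward process to a clean sample.
    reverse_denoise(xt): run the learned reverse process back to a clean sample.
    log_accept_ratio(x, x_new): assumed to implement the tractable acceptance
    ratio for the forward-backward proposal.
    """
    rng = rng or np.random.default_rng()
    x, r = x0, reward(x0)
    best_x, best_r = x, r
    for _ in range(n_steps):
        xt = forward_corrupt(x)           # corrupt the current clean sample
        x_new = reverse_denoise(xt)       # denoise to propose a new clean sample
        if np.log(rng.uniform()) < log_accept_ratio(x, x_new):
            x, r = x_new, reward(x_new)   # accept the proposal
        if r > best_r:
            best_x, best_r = x, r
    return best_x, best_r
```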
We validate CSMC on molecule and biological sequence generation across four different reward functions. CSMC achieves the highest reward in all settings, even for SMILES string generation where other methods fail due to inaccurate intermediate rewards.
In continuous diffusion models, reward-guided sampling is conventionally performed using gradient-based methods (Dhariwal & Nichol, 2021; Ho & Salimans, 2022; Chung et al., 2023; Song et al., 2023; Rozet et al., 2024; Bansal et al., 2023; Yoon et al., 2025; Kim et al., 2025b; Wu et al., 2023), which offer strong guidance towards high-reward regions. However, gradient-based approaches cannot be applied to discrete diffusion models, as gradients are ill-defined in discrete spaces and it is not theoretically valid to add a continuous gradient to a discrete objective.
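For contrast, the sketch below (hypothetical names, PyTorch assumed) shows the kind of gradient-based guidance step used in continuous diffusion, where a reward gradient is added to the learned score; this step has no direct analogue when the state is a sequence of discrete tokens.

```python
import torch

def guided_score(x_t, t, score_model, reward_model, guidance_scale):
    """Sketch of gradient-based reward guidance for *continuous* diffusion:
    add the reward gradient w.r.t. the noisy sample to the learned score so
    that sampling drifts toward high-reward regions."""
    x_t = x_t.detach().requires_grad_(True)
    r = reward_model(x_t, t).sum()                 # differentiable reward estimate
    grad_r = torch.autograd.grad(r, x_t)[0]        # reward gradient
    return score_model(x_t, t) + guidance_scale * grad_r
```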
Training-Free Reward-Guided Sampling for Discrete Diffusion. Recently, inference-time scaling methods for discrete diffusion models have been proposed to tackle reward-guided sampling. A comparison of these methods and our method is shown in Tab. 1.
The simplest method is Best-of-N (BoN) sampling (Stiennon et al., 2020), which generates N samples independently and selects the one with the highest reward. Due to its simplicity, BoN is applicable to all types of discrete diffusion models. However, BoN does not guide the denoising trajectory using the reward, resulting in inefficient search, especially when high-reward samples lie in low-density regions unlikely to be sampled.
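For reference, a minimal BoN sketch (function names assumed) looks as follows:

```python
import numpy as np

def best_of_n(sample_fn, reward_fn, n):
    """Best-of-N sampling: draw n independent samples from the generative
    model (sample_fn) and keep the one with the highest reward."""
    samples = [sample_fn() for _ in range(n)]
    rewards = np.array([reward_fn(s) for s in samples])
    return samples[int(rewards.argmax())]
```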