The SMILES (Weininger, 1988) representation for molecules is based on heuristic rules such as depth-first search and does not use a unified fixed ordering (Lee et al., 2025).
In many applications in chemistry and biology, there are well-defined notions of quality for the data; for example, drug-likeness (Bickerton et al., 2012) for molecular structures or enhancer activity (Taskiran et al., 2024; Wang et al., 2025) for DNA sequences. Thus, generative models must not only produce natural samples that resemble the training data but also achieve high-quality scores according to such domain-specific criteria.
In the emerging regime of test-time scaling, quality considerations are incorporated by defining a reward function and optimizing it during inference through reward-guided sampling. The simplest strategy is Best-of-N sampling, where N samples are generated and the one with the highest reward is selected. However, this brute-force method is inefficient since it does not perform any structured guidance. Recent works (Kim et al., 2025b; Wu et al., 2023; Yu et al., 2023; Bansal et al., 2023; Chung et al., 2023) have instead proposed guidance using process rewards or intermediate rewards, which are computable at intermediate steps of generation.
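As a generic illustration of this kind of intermediate-reward guidance (a simplified sketch, not the exact algorithm of any cited work), one can reweight candidate partial samples at each denoising step by an assumed intermediate reward and resample among them:

```python
import numpy as np

def reward_guided_step(candidates, intermediate_reward, temperature=1.0, rng=None):
    """Generic sketch: given candidate partial states for the next denoising
    step, reweight them by an (assumed) intermediate reward and resample one."""
    rng = rng or np.random.default_rng()
    scores = np.array([intermediate_reward(c) for c in candidates]) / temperature
    probs = np.exp(scores - scores.max())   # softmax over candidate scores
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```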
While methods involving intermediate rewards are technically applicable to discrete diffusion models, they pose particular challenges for many types of chemistry and biology data as reward functions in these domains are often non-smooth, meaning that small perturbations in the data can cause large changes in the reward. For example, in molecular structures, modifying even a single element in the string representation can render the entire molecule invalid, collapsing the reward to zero, as shown in Fig. 1 (left). As a result, relying on intermediate rewards does not provide an effective local search strategy in these cases.
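To make this concrete, the following illustrative reward (assuming RDKit is available; this particular definition is for illustration and is not necessarily the paper's) returns the QED drug-likeness of a molecule and collapses to zero when the SMILES string fails to parse, so a single-character edit can zero out the reward:

```python
from rdkit import Chem, RDLogger        # cheminformatics toolkit (assumed available)
from rdkit.Chem import QED

RDLogger.DisableLog("rdApp.*")          # silence parse warnings for invalid strings

def smiles_reward(smiles: str) -> float:
    """Illustrative non-smooth reward: QED drug-likeness if the SMILES parses,
    and 0.0 if it does not."""
    mol = Chem.MolFromSmiles(smiles)
    return QED.qed(mol) if mol is not None else 0.0

print(smiles_reward("c1ccccc1O"))       # valid phenol -> positive QED score
print(smiles_reward("c1ccccc1O("))      # one extra character -> invalid -> 0.0
```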
Table 1. Comparison of discrete diffusion inference-time scaling methods for reward-guided sampling. Our CSMC applies to all discrete diffusion frameworks and leverages the clean reward while guiding the sampling trajectory.
The key question that arises here is how to enable local search in reward-guided generation without relying on intermediate rewards. We introduce the Clean-Sample Markov Chain (CSMC) Sampler, which performs an iterative search over clean data samples using the Metropolis-Hastings (MH) algorithm. To propose a new clean sample from an existing one, we use forward-backward combinations: applying the forward process to corrupt the clean data, followed by running the reverse process to obtain a new sample. Although the MH acceptance probability involves clean-sample probabilities that are themselves intractable, we show that the forward-backward proposal distribution makes the acceptance probability tractable, enabling efficient sampling.
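The following is a minimal sketch of such a clean-sample Metropolis-Hastings loop; `forward_corrupt`, `reverse_denoise`, `reward`, and `log_accept_ratio` are assumed callables, and the acceptance computation is left abstract since its tractable form depends on the specific forward-backward proposal derived in the paper.

```python
import numpy as np

def clean_sample_mh(x0, reward, forward_corrupt, reverse_denoise,
                    log_accept_ratio, n_steps, rng=None):
    """Sketch of a Metropolis-Hastings search over clean samples.

    forward_corrupt(x): apply the diffusion forward process to a clean sample.
    reverse_denoise(xt): run the learned reverse process back to a clean sample.
    log_accept_ratio(x, x_new): assumed to implement the tractable acceptance
    ratio for the forward-backward proposal.
    """
    rng = rng or np.random.default_rng()
    x, r = x0, reward(x0)
    best_x, best_r = x, r
    for _ in range(n_steps):
        xt = forward_corrupt(x)           # corrupt the current clean sample
        x_new = reverse_denoise(xt)       # denoise to propose a new clean sample
        if np.log(rng.uniform()) < log_accept_ratio(x, x_new):
            x, r = x_new, reward(x_new)   # accept the proposal
        if r > best_r:
            best_x, best_r = x, r
    return best_x, best_r
```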
We validate CSMC on molecule and biological sequence generation across four different reward functions. CSMC achieves the highest reward in all settings, even for SMILES string generation where other methods fail due to inaccurate intermediate rewards.
In continuous diffusion models, reward-guided sampling is conventionally performed using gradient-based methods (Dhariwal & Nichol, 2021; Ho & Salimans, 2022; Chung et al., 2023; Song et al., 2023; Rozet et al., 2024; Bansal et al., 2023; Yoon et al., 2025; Kim et al., 2025b; Wu et al., 2023), which offer strong guidance towards high-reward regions. However, gradient-based approaches cannot be applied to discrete diffusion models, as gradients are ill-defined in discrete spaces and it is not theoretically valid to add a continuous gradient to a discrete objective.
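For contrast, the sketch below (hypothetical names, PyTorch assumed) shows the kind of gradient-based guidance step used in continuous diffusion, where a reward gradient is added to the learned score; this step has no direct analogue when the state is a sequence of discrete tokens.

```python
import torch

def guided_score(x_t, t, score_model, reward_model, guidance_scale):
    """Sketch of gradient-based reward guidance for *continuous* diffusion:
    add the reward gradient w.r.t. the noisy sample to the learned score so
    that sampling drifts toward high-reward regions."""
    x_t = x_t.detach().requires_grad_(True)
    r = reward_model(x_t, t).sum()                 # differentiable reward estimate
    grad_r = torch.autograd.grad(r, x_t)[0]        # reward gradient
    return score_model(x_t, t) + guidance_scale * grad_r
```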
Training-Free Reward-Guided Sampling for Discrete Diffusion. Recently, inference-time scaling methods for discrete diffusion models have been proposed to tackle reward-guided sampling. A comparison of these methods and our method is shown in Tab. 1.
The simplest method is Best-of-N (BoN) sampling (Stiennon et al., 2020), which generates N samples independently and selects the one with the highest reward. Due to its simplicity, BoN is applicable to all types of discrete diffusion models. However, BoN does not guide the denoising trajectory using the reward, resulting in inefficient search, especially when high-reward samples lie in low-density regions unlikely to be sampled.
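For reference, a minimal BoN sketch (function names assumed) looks as follows:

```python
import numpy as np

def best_of_n(sample_fn, reward_fn, n):
    """Best-of-N sampling: draw n independent samples from the generative
    model (sample_fn) and keep the one with the highest reward."""
    samples = [sample_fn() for _ in range(n)]
    rewards = np.array([reward_fn(s) for s in samples])
    return samples[int(rewards.argmax())]
```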