Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To provide a principled search mechanism for DLM inference, we introduce MEDAL, an inference-time scaling framework built on Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This design enables efficient inference-time scaling, allowing generation quality to improve as the search budget increases, without additional training. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in DLMs.
In recent years, diffusion language models (DLMs) have emerged as a powerful alternative for generative modeling over discrete sequences (Zhu et al., 2025; Ye et al., 2025). Unlike autoregressive (AR) models (Achiam et al., 2023; Minaee et al., 2024), which rely on a strictly left-to-right factorization, DLMs learn to invert a stochastic corruption process that independently masks tokens (Lou et al., 2023; Shi et al., 2024; Nie et al., 2025). This formulation enables parallel refinement, improves global coherence, and offers flexible quality-latency trade-offs, thus challenging the dominance of AR paradigms (Li et al., 2025b).
The inference process of DLMs can be naturally formulated as a search problem: starting from a corrupted sequence, the model iteratively decides which positions to unmask and which tokens to assign, navigating an exponential space of possible trajectories. Existing inference-time methods approximate this search using confidence-driven heuristics, such as greedily unmasking the highest-confidence tokens (Kim et al., 2025; Ben-Hamu et al., 2025; Luxembourg et al., 2025). Although effective in reducing short-term uncertainty, these strategies are inherently myopic: once high-confidence tokens are fixed, subsequent steps are forced to adapt around them, often leading to suboptimal trajectories. Another line of work adjusts masking schedules dynamically (Peng et al., 2025; Zhao et al., 2024), but such approaches typically require training auxiliary samplers to determine token updates at each step, introducing additional complexity and limiting general applicability.
These limitations highlight the need for a principled search mechanism that can explore alternative unmasking trajectories without additional training overhead. To this end, we propose MEDAL, a framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. Unlike heuristic or schedule-based methods, which either commit to tokens greedily or depend on auxiliary samplers, MEDAL adopts a principled search-based approach, using Monte Carlo Tree Search (MCTS) (Yoon et al., 2025; Browne et al., 2012) to balance exploitation of high-confidence tokens with exploration of alternative unmasking trajectories, relying on signals obtained from the model's own distribution. We propose two key innovations that enable MCTS to be applied effectively to DLM inference. First, we introduce a confidence-guided filtering mechanism that focuses inference on the most promising tokens and positions. Second, we design an information-gain reward that guides MCTS by favoring token choices that not only resolve the current position but also increase the model's confidence in predicting the remaining tokens. Together, these innovations enable the four stages of MCTS (Selection, Expansion, Simulation, and Backpropagation) within DLMs. Rather than applying MCTS exhaustively, we employ it strategically in the early stages of inference to construct a robust initialization, after which the process continues with efficient heuristics. To further address complex prompts that induce high uncertainty, we incorporate a task-decomposition module that automatically splits the input into smaller subtasks, thereby reducing ambiguity and providing structured guidance for subsequent unmasking decisions. Our contributions are summarized as follows:
• Novel Formulation: We frame DLM inference as a search problem and introduce MEDAL, the first framework to integrate MCTS into DLM inference, enabling principled exploration beyond greedy heuristics or schedule-based methods.
• Novel Design: We design a new DLM inference approach that combines an MCTS-guided initialization module with a task-decomposition module, enabling both efficient search and improved handling of complex tasks.
• Extensive Experiments: We conduct evaluations on various benchmarks, demonstrating that our method outperforms existing inference strategies for DLMs by up to 22.0% when restricting MCTS to initialization. We further show that generation quality continues to improve, albeit with diminishing gains, as the MCTS initialization budget increases, validating the effectiveness of our search-based approach.
DLMs adapt the diffusion paradigm from continuous domains (e.g., image generation) to discrete text sequences. Let $x_0 = (x_0^1, \ldots, x_0^L)$ denote a token sequence of length $L$ sampled from the data distribution. The core idea is to define a forward noising process that progressively corrupts $x_0$ into increasingly noisy sequences $\{x_t\}_{t=1}^{T}$, and to train a neural model to learn the corresponding reverse denoising process that reconstructs $x_0$ from noise.
Forward process. In discrete DLMs, the forward process is typically defined by a time-dependent transition matrix $Q_t$ over the vocabulary (Li et al., 2025b). At each time $t$, the probability of a state $x_t$ given an initial state $x_0$ is given by a categorical distribution, applied independently to each position: $q(x_t^i \mid x_0^i) = \operatorname{Cat}\!\big(x_t^i;\ p = x_0^i\, \bar{Q}_t\big)$, where $x_0^i$ is the one-hot encoding of the original token and $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ is the cumulative transition matrix.
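For concreteness, a common instance of this family is the absorbing-state ("mask") process, in which each token is independently replaced by a special [MASK] symbol with a probability that increases with $t$. The sketch below assumes this instance with a linear schedule and an illustrative `mask_id`; it is not taken from any specific model's implementation.

```python
import torch

def forward_mask(x0: torch.LongTensor, t: int, T: int, mask_id: int) -> torch.LongTensor:
    """Corrupt x0 by independently replacing tokens with a [MASK] symbol.

    Assumes an absorbing-state ("mask") forward process with a linear
    schedule: at step t, each token is masked with probability t / T.
    """
    mask_prob = t / T
    corrupt = torch.rand(x0.shape) < mask_prob   # independent per-token coin flips
    xt = x0.clone()
    xt[corrupt] = mask_id
    return xt

# Example: corrupt a length-8 sequence halfway through the forward process.
x0 = torch.randint(5, 1000, (8,))
xt = forward_mask(x0, t=5, T=10, mask_id=0)
```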
Reverse process. The reverse process learns to invert the corruption by predicting the original token distribution $p_\theta(x_0^i \mid x_t)$ at each masked position $i$, given the corrupted sequence $x_t$.
Here, $p_\theta$ is parameterized by a transformer, trained to minimize a cross-entropy objective over masked positions, and $U_t$ denotes the set of positions unmasked at timestep $t$.
Generation. During inference, generation begins from a fully masked sequence $x_T = (\texttt{[MASK]}, \ldots, \texttt{[MASK]})$ of length $L$.
At each denoising step $t$, the model outputs a distribution over the vocabulary for every position. A subset of tokens with the highest confidence is selected, unmasked, and fixed, while the remaining positions stay masked. The process then advances to step $t-1$, where the model re-predicts distributions conditioned on both the fixed tokens and the still-masked positions (Zhu et al., 2025; Nie et al., 2025). This iterative refinement continues until all positions are resolved, yielding the final sequence $x_0$. The generation can be viewed as a sequence of partially completed states, where the model progressively transitions from a fully masked input to a coherent, fully unmasked output.
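To make the confidence-based selection concrete, the sketch below performs a single denoising step under a simple top-1-probability confidence criterion; `model` is a hypothetical callable returning per-position logits, and the whole function is an illustrative approximation of the generic procedure rather than any specific model's decoder.

```python
import torch

@torch.no_grad()
def greedy_unmask_step(model, x: torch.LongTensor, mask_id: int, k: int) -> torch.LongTensor:
    """One confidence-based denoising step.

    `model(x)` is assumed to return logits of shape [L, |V|] for the
    partially masked sequence x. The k most confident masked positions
    are committed to their argmax tokens; all other positions stay masked.
    """
    logits = model(x)                       # [L, |V|]
    probs = torch.softmax(logits, dim=-1)   # per-position distributions
    conf, pred = probs.max(dim=-1)          # top-1 confidence and token per position

    masked = (x == mask_id).nonzero(as_tuple=True)[0]
    if masked.numel() == 0:
        return x
    order = conf[masked].argsort(descending=True)   # rank masked positions by confidence
    chosen = masked[order[:k]]

    x = x.clone()
    x[chosen] = pred[chosen]
    return x
```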
MCTS is a general algorithm for decision-making in large combinatorial search spaces, where exhaustive enumeration of all trajectories is infeasible (Browne et al., 2012). The goal of MCTS is to evaluate possible decision paths efficiently, balancing the exploration of new options with the refinement of promising ones. The search tree begins at the root node, representing the initial state of the problem, and grows toward leaf nodes, which correspond to unexplored frontier states. Each iteration of MCTS consists of four steps: (i) Selection: traverse from the root to a leaf according to a selection policy, such as the upper confidence bound (UCB); (ii) Expansion: add one or more child nodes at the leaf; (iii) Simulation: evaluate the newly expanded node by performing rollouts or applying heuristic approximations; (iv) Backpropagation: propagate the result upward to update the statistics of visited nodes.
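For readers less familiar with the procedure, a minimal, generic MCTS loop is sketched below; the problem-specific pieces (`legal_actions`, `apply_action`, and `rollout_value`) are placeholders and purely illustrative.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}           # action -> child Node
        self.visits, self.value = 0, 0.0

def ucb(parent, child, c=1.4):
    """Upper confidence bound used by the selection policy."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root, legal_actions, apply_action, rollout_value, n_iters=100):
    for _ in range(n_iters):
        # (i) Selection: descend from the root to a leaf via UCB.
        node = root
        while node.children:
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
        # (ii) Expansion: add a child node for each legal action at the leaf.
        for a in legal_actions(node.state):
            node.children[a] = Node(apply_action(node.state, a), parent=node)
        leaf = random.choice(list(node.children.values())) if node.children else node
        # (iii) Simulation: estimate the leaf's value with a rollout (or a heuristic).
        reward = rollout_value(leaf.state)
        # (iv) Backpropagation: update statistics along the path back to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return root
```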
In this section, we present our framework MEDAL for enhancing DLM inference (Figure 1). We first define the notation used throughout the section. We then introduce our key contributions, confidence score filtering (Section 3.1.1) and the information-gain reward (Section 3.1.2), which enable MCTS to be effectively applied to DLMs. Building on these foundations, we introduce our MCTS-guided initialization strategy to efficiently explore promising candidates during the early stages of generation. Finally, we describe how task decomposition via prompt guidance can further reduce uncertainty and improve generation quality.
We use $\mathcal{V}$ for the vocabulary (of size $|\mathcal{V}|$); $L$ for the target sequence length; $x = (x^1, \ldots, x^L)$ for a (partially) formed sequence; $x^i$ for the token at position $i$; $x^{\setminus i}$ for the sequence with position $i$ masked; $t \in \{0, \ldots, T\}$ for the reverse denoising step index with $T$ total steps; and $\mathcal{M}_t$ for the set of masked positions at step $t$.
To bridge MCTS with DLMs, we construct the search tree over partially unmasked sequences, where the root corresponds to the initial masked input and the leaves correspond to partially unmasked sequences at the search frontier.
Classical MCTS aims to balance exploitation and exploration, but in DLMs the search space spans the full vocabulary at every masked position, making naïve search intractable. To address this, we introduce a confidence filtering strategy that restricts the search over masked positions and tokens to a far smaller action set. We define an action as a position-token pair $a^i_v = (i, v)$, where $i$ is a masked position and $v$ is the candidate token chosen for it; an action thus represents a specific unmasking decision of the search process. Next, we describe how to build the action set.
Given the model's logits $\ell^i \in \mathbb{R}^{|\mathcal{V}|}$ at a masked position $i$ of the input sequence, we first convert $\ell^i$ into probabilities $p^i = \operatorname{softmax}(\ell^i)$.
For a candidate token $v$ at position $i$, we then define the confidence-adjusted score $s^i_v$ by combining the token probability $p^i_v$ with two position-level factors, the entropy penalty $\phi^i_{\mathrm{ent}}$ and the top-2 margin $\phi^i_{\mathrm{mar}}$, defined below:
• Entropy Penalty. We use an entropy penalty $\phi^i_{\mathrm{ent}}$ computed from the predictive entropy of $p^i$, with a small constant $\varepsilon$ ensuring numerical stability. Positions with higher entropy (low confidence) are down-weighted.
• Top-2 Margin. Let $p^i_{(1)} \ge p^i_{(2)}$ be the top-two probabilities. The margin $\Delta^i = p^i_{(1)} - p^i_{(2)}$ is mapped to $\phi^i_{\mathrm{mar}} = \sigma(\gamma \Delta^i)$, where $\gamma$ is a hyperparameter and $\sigma$ is the sigmoid function.
For each masked position $i \in \mathcal{M}_t$, we retain the top-$K_1$ candidate tokens ranked by the confidence-adjusted score $s^i_v$, forming the action candidate set $\mathcal{A}_t = \big\{\, a^i_v = (i, v) : i \in \mathcal{M}_t,\ v \in \operatorname{top}\text{-}K_1(s^i)\,\big\}$.
From the action candidate set $\mathcal{A}_t$, we then select the top-$K_2$ actions with the highest scores, yielding the filtered set $\mathcal{A}'_t \subseteq \mathcal{A}_t$.
These $K_2$ highest-scoring actions are then passed to the simulation step of MCTS (Section 3.1.3).
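A minimal sketch of the filtering stage is given below. The exact way the entropy penalty and top-2 margin enter the score is not pinned down here, so the multiplicative combination $s^i_v = p^i_v \cdot \phi^i_{\mathrm{ent}} \cdot \phi^i_{\mathrm{mar}}$ and the inverse-entropy penalty are illustrative assumptions, as are all function and variable names.

```python
import torch

def candidate_actions(logits: torch.Tensor, masked_positions, k1: int, k2: int,
                      gamma: float = 5.0, eps: float = 1e-8):
    """Build the filtered action set A'_t from per-position logits.

    logits: [L, |V|]; masked_positions: iterable of masked indices.
    Returns the top-k2 (score, position, token) triples. The score
    s^i_v = p^i_v * phi_ent(i) * phi_mar(i) is an assumed combination.
    """
    probs = torch.softmax(logits, dim=-1)
    actions = []
    for i in masked_positions:
        i = int(i)
        p = probs[i]
        entropy = -(p * torch.log(p + eps)).sum()
        phi_ent = 1.0 / (entropy + eps)                        # down-weight high-entropy positions
        top2 = torch.topk(p, 2).values
        phi_mar = torch.sigmoid(gamma * (top2[0] - top2[1]))   # top-2 margin factor
        top_p, top_v = torch.topk(p, k1)                       # top-K1 candidate tokens
        for pv, v in zip(top_p, top_v):
            score = float(pv * phi_ent * phi_mar)
            actions.append((score, i, int(v)))
    actions.sort(reverse=True)          # highest confidence-adjusted score first
    return actions[:k2]                 # the K2 best actions form A'_t
```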
Given the selected action set $\mathcal{A}'_t$, the next step is to evaluate the impact of each candidate action. An effective action should not only specify a token at one masked position, but also provide contextual information that improves the model's confidence over the remaining masked positions. In other words, a good choice at position $i$ should make the subsequent predictions for $\mathcal{M}_t \setminus \{i\}$ more confident.

Algorithm 1: Confidence-Guided MCTS for DLM Inference (CGMCTS)
Require: prompt tokens $x_{1:|P|}$, target length $L$, MCTS initialization length $L_c < L$, candidate token number $K_1$, highest-scoring action number $K_2$, candidate size $C$, margin scale $\gamma$.
while fewer than $C$ candidates of length $L_c$ have been collected do
    $x \leftarrow$ NodeSelection(root)  ▷ UCB selection
    Identify the masked positions $\mathcal{M}_t$ in $x$
    Score tokens: compute $s^i_v$ for every $i \in \mathcal{M}_t$ and candidate token $v$ (Section 3.1.1)
    Build $\mathcal{A}_t$ from the top-$K_1$ tokens per position and select the top-$K_2$ actions $\mathcal{A}'_t$
    for all $a \in \mathcal{A}'_t$ do
        $x_a \leftarrow$ apply action $a$ to $x$
        Node $\leftarrow$ ExpandNode($x_a$)
        Rollout to obtain $r_{\mathrm{IG}}(a)$
        Backpropagate $r_{\mathrm{IG}}(a)$  ▷ Update search tree
    end for
end while
return the $C$ collected candidates of length $L_c$
Our objective is therefore to design a reward function that quantifies how much an action improves the model's confidence in the unresolved positions. Formally, for an action $a^i_v = (i, v)$ applied to the current state $x_t$, we define the information-gain reward as the entropy reduction across the remaining positions:
$$r_{\mathrm{IG}}(a^i_v) \;=\; \sum_{j \in \mathcal{M}_t \setminus \{i\}} H_\theta(j \mid x_t) \;-\; \sum_{j \in \mathcal{M}_t \setminus \{i\}} H_\theta\big(j \mid x_t \oplus a^i_v\big),$$
where $H_\theta(j \mid \cdot)$ denotes the predictive entropy at position $j$ and $x_t \oplus a^i_v$ is the sequence obtained by applying the action. Position $i$ is excluded from the second summation: once it has been filled, its entropy is zero regardless of the token chosen, and including it would contribute only a constant term.
Finally, each action in $\mathcal{A}'_t$ is rolled out and assigned a reward value, which serves as the feedback signal to guide tree selection. This design ensures that the search prioritizes actions that not only fill a mask but also help improve model confidence for future predictions.
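A small sketch of how the reward could be computed in practice is shown below: it evaluates the model's predictive entropies over the remaining masked positions before and after applying an action. `model` and `mask_id` are the same illustrative assumptions as in the earlier sketches.

```python
import torch

@torch.no_grad()
def info_gain_reward(model, x: torch.LongTensor, action, mask_id: int, eps: float = 1e-8) -> float:
    """r_IG(a) = entropy over remaining masked positions before the action
    minus the same quantity after the action is applied (position i excluded)."""
    pos, tok = action

    def remaining_entropy(seq):
        probs = torch.softmax(model(seq), dim=-1)            # [L, |V|]
        ent = -(probs * torch.log(probs + eps)).sum(dim=-1)  # per-position entropy, [L]
        rest = (seq == mask_id).nonzero(as_tuple=True)[0]
        rest = rest[rest != pos]                             # exclude position i
        return ent[rest].sum().item()

    before = remaining_entropy(x)
    x_after = x.clone()
    x_after[pos] = tok
    after = remaining_entropy(x_after)
    return before - after
```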
Building on the description above, we detail how the four traditional steps of MCTS (Selection, Expansion, Simulation, and Backpropagation) are adapted. An overview is given in Algorithm 1, and an example is shown in Section A.3.
Selection. In this phase, we traverse the tree from the root to a leaf using UCB (Yoon et al., 2025; Kocsis and Szepesvári, 2006), selecting the child node that maximizes $Q(x, a) + c \sqrt{\frac{\ln N(x)}{N(x, a)}}$, where $x$ represents the current partially masked sequence state, $Q(x, a)$ is the mean reward of action $a$, $N(x)$ is the visit count of node $x$, $N(x, a)$ is the visit count of the child node, and $c$ is the exploration constant.
Expansion. We build the filtered action set $\mathcal{A}'_t$ using confidence score filtering (Section 3.1.1). We then pass each action $a^i_v \in \mathcal{A}'_t$ to the simulation step to evaluate its reward.
Simulation. The simulation step evaluates the impact of each action $a^i_v \in \mathcal{A}'_t$ by rolling out the remaining masked positions with the DLM. Specifically, after applying action $a^i_v$ to fill position $i$ with token $v$, we let the DLM fill the remaining masked positions $\mathcal{M}'_t = \mathcal{M}_t \setminus \{i\}$ by sampling from the model's predicted distribution. This yields a completed sequence. The information-gain reward $r_{\mathrm{IG}}(a^i_v)$ is then computed based on the entropy reduction across the remaining masked positions, as described in Section 3.1.2.
Backpropagation. After the simulation step, the reward obtained from evaluating the complete sequence is backpropagated through the tree to update the value estimates of all parent nodes along the path to the root.
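Putting the four adapted phases together, one iteration of the confidence-guided search might look like the sketch below. It reuses the illustrative `Node`, `candidate_actions`, and `info_gain_reward` helpers from the earlier sketches; none of the names or details are taken from the paper's reference implementation.

```python
import math

def cgmcts_iteration(root, model, mask_id, k1, k2, c=1.4):
    # Selection: walk from the root to a leaf with UCB.
    node = root
    while node.children:
        node = max(
            node.children.values(),
            key=lambda ch: float("inf") if ch.visits == 0
            else ch.value / ch.visits + c * math.sqrt(math.log(node.visits) / ch.visits),
        )
    x = node.state
    masked = (x == mask_id).nonzero(as_tuple=True)[0].tolist()
    if not masked:
        return
    # Expansion: filter actions with the confidence-adjusted score.
    for _, i, v in candidate_actions(model(x), masked, k1, k2):
        child_state = x.clone()
        child_state[i] = v
        child = Node(child_state, parent=node)
        node.children[(i, v)] = child
        # Simulation: score the action with the information-gain reward.
        reward = info_gain_reward(model, x, (i, v), mask_id)
        # Backpropagation: update statistics along the path to the root.
        n = child
        while n is not None:
            n.visits += 1
            n.value += reward
            n = n.parent
```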
While MCTS provides principled exploration, repeated rollouts over $T$ denoising steps incur exponential cost, making full-sequence MCTS impractical for large-scale DLM inference. To balance search quality with efficiency, we restrict MCTS to the initialization phase. Specifically, we terminate the MCTS once a candidate set of $C$ partially unmasked sequences, each of length $L_c$, has been formed. Each candidate is then fully rolled out by the DLM to compute its information-gain reward, and the candidate with the highest reward is selected as the initialization. Once this partially resolved sequence is obtained, the remaining tokens are filled without further tree search: at each step, we directly apply the confidence-adjusted score to select high-confidence tokens until no masks remain. This strategy enables test-time scaling during the critical early decision stages through structured search over generation trajectories, while maintaining tractable inference via confidence-guided decoding in later stages.
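A high-level sketch of the resulting two-phase decoding loop is given below; it builds on the illustrative helpers defined earlier (`Node`, `cgmcts_iteration`, `greedy_unmask_step`), uses total remaining entropy as a stand-in for the information-gain ranking of candidates, and commits one token per greedy step, all of which are simplifying assumptions rather than the paper's exact procedure.

```python
import torch

@torch.no_grad()
def medal_decode(model, prompt, gen_len, mask_id, L_c, C, k1, k2, max_iters=200):
    """Sketch of MCTS-guided initialization followed by confidence-guided filling."""
    x = torch.cat([prompt, torch.full((gen_len,), mask_id, dtype=torch.long)])
    root = Node(x)

    def collect(node, out):
        # A candidate is any node whose continuation has at least L_c committed tokens.
        if int((node.state != mask_id).sum()) - len(prompt) >= L_c:
            out.append(node.state)
        for ch in node.children.values():
            collect(ch, out)

    # Phase 1: confidence-guided MCTS until C partial candidates are collected.
    candidates = []
    for _ in range(max_iters):
        cgmcts_iteration(root, model, mask_id, k1, k2)
        candidates = []
        collect(root, candidates)
        if len(candidates) >= C:
            break

    def remaining_entropy(seq):
        # Lower remaining entropy ~ higher information gain (illustrative criterion).
        p = torch.softmax(model(seq), dim=-1)
        ent = -(p * torch.log(p + 1e-8)).sum(dim=-1)
        return ent[seq == mask_id].sum().item()

    if candidates:
        x = min(candidates, key=remaining_entropy)

    # Phase 2: fill the remaining masks with confidence-guided greedy decoding.
    while bool((x == mask_id).any()):
        x = greedy_unmask_step(model, x, mask_id, k=1)
    return x
```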
In this section, we present our method for decomposing complex tasks into simpler sub-tasks, improving confidence, and providing guidance for the DLM during reasoning.
Given an input prompt P, instead of asking the model to directly generate an answer, we guide the model to break the problem down into a series of manageable steps. Specifically, we provide the model with an augmented prompt P including illustrative two-shot examples showing how a complex question can be divided into a sequence of subtasks. Each subtask is framed with a distinct goal, such as understanding the input, identifying relevant information, or synthesizing the final response, guiding the model to solve the problem in a structured manner.
At inference time, the model is encouraged to produce its own decomposition for the given input, guided by the examples provided in the prompt P. It then solves the subtasks sequentially, with intermediate outputs serving as context for subsequent steps. The full algorithm of MEDAL is summarized in Algorithm 2.
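For illustration, a decomposition prompt in the spirit described above might look like the following template; the wording, labels, and worked example are invented for this sketch and are not the paper's actual prompt.

```python
# Illustrative template only; the actual prompt used in the paper may differ.
DECOMPOSITION_PROMPT = """\
Solve the problem by first breaking it into 3 subtasks.

Example:
Question: Sarah has 12 apples and gives away a quarter of them. How many remain?
Subtask 1 (understand the input): identify the starting quantity (12) and the fraction given away (1/4).
Subtask 2 (identify relevant information): a quarter of 12 is 3.
Subtask 3 (synthesize the answer): 12 - 3 = 9, so 9 apples remain.

(A second worked example of the same form would follow in a two-shot prompt.)

Now decompose and solve the new question in the same way.
Question: {question}
"""

prompt = DECOMPOSITION_PROMPT.format(
    question="A train travels 60 km in 1.5 hours. What is its average speed?"
)
```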
In this section, we provide a theoretical guarantee for MEDAL. At each decoding step $i$, when revealing a set $z_i$ of tokens independently, the total error decomposes into a model term and a joint-dependence term $\mathrm{KL}\big(q(x_{z_i} \mid x_{z_{<i}}) \,\big\|\, \prod_{\ell \in z_i} q(x_\ell \mid x_{z_{<i}})\big)$. Following Ben-Hamu et al. (2025), this dependence error is upper-bounded by the entropy-gap surrogate $B(z_i \mid x_{z_{<i}})$, defined in Appendix A.4.
Our MCTS initialization explicitly minimizes the surrogate cost $J(z_{1:K}) = \sum_{i=1}^{K} B(z_i)$ across candidate schedules, thus selecting a prefix that achieves the smallest available upper bound on cumulative dependence error among explored options. This establishes that the proposed initialization is theoretically grounded: although we cannot minimize the true dependence error directly, MCTS optimizes a computable surrogate that provably controls it. A full derivation is in Appendix A.4.
In this section, we conduct extensive experiments to evaluate MEDAL, guided by the following questions. RQ1: How does our proposed method perform on various datasets compared to state-of-the-art baselines? RQ2: What is the contribution of each component in our framework to the overall performance? RQ3: How does the model's performance vary with different hyperparameter settings?
We conduct experiments on six widely used datasets: GSM8K (Cobbe et al., 2021), ARC-C (Clark et al., 2018), HumanEval (Chen et al., 2021), MMLU (Hendrycks et al., 2020), DROP (Dua et al., 2019), and Countdown (Pan et al., 2025). Experiments are conducted on models of similar scale (7B-8B parameters) to ensure a fair comparison. We evaluate our method on three backbone DLMs, LLaDA-8B-Instruct (Nie et al., 2025), LLaDA1.5-8B (Zhu et al., 2025), and Dream-7B (Ye et al., 2025), and compare their performance against (i) the original models, (ii) the original models with a Best-of-5 decoding strategy (generating five samples and selecting the majority answer; Bst5), and (iii) the original models with our method. We additionally report results for Llama-3 8B (Touvron et al., 2023) as an autoregressive LLM baseline. We use accuracy for GSM8K, ARC-C, and MMLU; Exact Match for Countdown; pass@1 for HumanEval; and F1 for DROP. For MCTS initialization on all datasets, we set the initialization candidate length $L_c$ to 20, $K_1 = 3$, $K_2 = 5$, and the stability constant $\varepsilon = 10^{-8}$. The generation length for DLMs is 256. For task-decomposition prompting, we set the number of subtasks to 3, and we set the candidate size $C = 3$.
All the experiments are conducted on 2 NVIDIA A100 GPUs with 40GB memory, and the random seed is set to 1. Detailed settings are provided in Appendix A.1, and computational costs in Appendix A.2.
We evaluate our method on the five datasets using three different backbone DLMs. The results are shown in Table 1. We observe that our method consistently improves the performance of all backbone models across all datasets, with up to an 18.2% average improvement and an 8.1-point absolute improvement, indicating the effectiveness and generality of our approach. Notably, even a model that underperforms Llama (i.e., LLaDA, which lags behind Llama on 4 out of 5 benchmarks in Table 1) achieves comparable or superior results on most datasets when equipped with our method. These results highlight the potential of DLMs when guided with appropriate strategies. The standard deviation is reported in Table 6.
To understand the contribution of each component in our framework, we conduct ablation studies on ARC-C, HumanEval, and DROP with LLaDA as the backbone model: we remove the task-decomposition module and perform MCTS initialization with the original prompt (W/o T. Dcp.), and we remove the confidence-adjusted score, using only the top-2 margin (W/o Cf.+t2) to select candidate tokens during MCTS; the results are presented in Table 2. We further analyze the impact of key hyperparameters in our framework, including the number of decomposed subtasks, the number of candidate actions $K_2$ selected during MCTS, and the MCTS initialization candidate length $L_c$. We first vary the number of decomposed subtasks and evaluate the performance on ARC-C, HumanEval, and DROP using LLaDA as the backbone model. The results are shown in Table 3. We observe that decomposing the task into 3 subtasks yields the best performance, while decomposing into too few (1) or too many (5 or 10) subtasks leads to a performance drop, indicating that an appropriate level of decomposition is crucial for balancing complexity and guidance.
Then we vary the number of candidate actions selected during MCTS and evaluate the performance on the same datasets. The results are shown in Table 4. We find that selecting 5 candidate actions during MCTS achieves the best performance across all datasets. Selecting too few (1 or 3) makes the exploration space too small, limiting the potential for finding better generation paths, whereas selecting too many (10) introduces noise, making it harder for the model to focus on the most promising paths. Selecting 5 candidates therefore strikes a good balance between exploration and exploitation during MCTS.
Finally, we vary the MCTS initialization candidate length $L_c$ and evaluate the performance on ARC-C, HumanEval, DROP, and Countdown when generating sequences of length 256. The results are shown in Figure 2. We observe that increasing $L_c$ improves performance, with the gains saturating after about 20 steps. This indicates that using MCTS only for initialization is sufficient and is a feasible approach for effective generation. It also suggests that the most critical generation-trajectory decisions occur early, so extended search beyond this point yields diminishing returns relative to the computational cost.
In this section, we present a case study exploring the potential of using DLMs in an agentic setting. We integrate our method with ADAS (Hu et al., 2024), which utilizes an LLM to automatically invent building blocks and design powerful agentic systems to solve given tasks. We replace the LLM in ADAS with our DLM (LLaDA with our method) and compare the performance with the original ADAS using LLaDA and Llama as backbones. The results on DROP and MMLU are shown in Table 5. We observe that integrating our method with ADAS leads to further performance improvements compared to using LLaDA or Llama, demonstrating our method's ability to enhance the reasoning and planning capabilities of DLMs in complex agentic settings. These findings suggest that DLMs, when equipped with effective strategies like ours, can be powerful tools for building intelligent agents capable of solving challenging tasks.
Diffusion Language Models. Diffusion models are a powerful class of generative models (Podell et al., 2023; Rombach et al., 2022; Li et al., 2025b; Zhang et al., 2023; Li et al., 2025a). Building on their success, DLMs have emerged as promising alternatives for text generation. One line of work adapts continuous diffusion to discrete text via continuous relaxations (Li et al., 2022; Strudel et al., 2022), while another operates directly in the discrete token space, corrupting text by masking or token replacement (He et al., 2022; Austin et al., 2021). Leveraging mature scaling techniques, large-scale DLMs have been developed (Nie et al., 2025; Gong et al., 2024; Nie et al., 2024), achieving performance competitive with AR models (Touvron et al., 2023). Despite strong generative ability, DLMs often struggle with controllability and reasoning when deciding the order in which tokens are unmasked during inference. To address this, we propose a principled search-based framework with MCTS to enhance DLM generation capabilities.
Reasoning with Diffusion Language Models. To improve the reasoning performance of DLMs, Kim et al. (2025) propose incorporating model uncertainty into the diffusion process to enhance generation quality. Entropy-based planning methods have been introduced (Ben-Hamu et al., 2025; Ye et al., 2025) to better capture the model's confidence. Planner-guided generation is studied in (Peng et al., 2025), enabling token-level refinement during sampling. More advanced reinforcement learning (RL) approaches further boost performance (Zekri and Boullé, 2025; Zhu et al., 2025; Gong et al., 2025).
Particularly, Diffu-GRPO (Zhao et al., 2025) applies policy-gradient RL to DLMs via a mean-field approximation and prompt masking, while TraDo (Wang et al., 2025) aligns DLM inference trajectories with training objectives by optimizing the sampling process. Additionally, wd1 (Tang et al., 2025) reformulates the RL objective as a weighted likelihood to avoid biased policy ratios, improving reasoning accuracy. While effective, these methods rely on additional training that involves substantial computational cost and engineering overhead. In contrast, we enhance DLM reasoning purely at inference time, making our approach more efficient and practical for real-world applications.
In this work, we introduced MEDAL, a framework that casts diffusion language model inference as a structured search problem and integrates it with MCTS. By applying MCTS during initialization, MEDAL explores promising unmasking trajectories before refinement. Its confidence-guided filtering and information-gain reward enable efficient, targeted search and improve global certainty. Across multiple benchmarks, MEDAL consistently outperforms existing inference strategies, achieving gains of up to 22.0%.
While our study demonstrates the effectiveness of confidence-guided MCTS initialization for diffusion language models, several limitations remain. First, we primarily evaluate unimodal text-based DLMs; extending our framework to multimodal DLMs (e.g., vision-language or audio-language models) would provide a more comprehensive assessment of its generality. Second, our current experiments focus on standalone inference; integrating the method into agentic settings, where reasoning and decision-making unfold over multiple steps and interactions, poses both challenges and opportunities for future exploration.
This work does not introduce additional ethical or societal risks beyond those associated with existing large language models. While our method utilizes test-time scaling to enhance generation quality, it operates by allocating compute to resolve uncertainty within the fixed model distribution. Consequently, it does not alter the model’s fundamental objectives, training data, or safety alignment.
A.1 Experiment Setting
In this section, we provide more details about the experimental settings. We use three different backbone models: LLaDA-8B-Instruct (Nie et al., 2025), LLaDA1.5-8B (Zhu et al., 2025), and Dream-7B (Ye et al., 2025), and compare the results with the original models without our method.
We also include Llama3-8B (Touvron et al., 2023) as an AR LLM baseline for comparison. The details of the models are as follows:
• LLaDA-8B-Instruct.
• Dream-7B. Dream-7B is a diffusion large language model that generates text by iteratively refining sequences in parallel rather than sequentially like autoregressive models. It is trained using two key techniques: initializing its weights from a pre-trained AR model and employing a context-adaptive token-level noise rescheduling mechanism. While achieving performance competitive with leading AR models such as Qwen2.5-7B on general, mathematical, and coding benchmarks, Dream-7B demonstrates substantial abilities on complex planning tasks such as Sudoku and trip planning.
The data we used for evaluation includes six widely used datasets: GSM8K (Cobbe et al., 2021), ARC-C (Clark et al., 2018), HumanEval (Chen et al., 2021), MMLU (Hendrycks et al., 2020), DROP (Dua et al., 2019) and Countdown (Pan et al., 2025). The details of the datasets are as follows:
• GSM8K. A dataset to evaluate the mathematical and scientific reasoning capabilities of large language models. The dataset consists of high-quality, grade-school-level math word problems that require multiple steps to solve. These problems are designed to test a model's ability to perform multi-step arithmetic reasoning.
• Countdown. A benchmark that contains arithmetic reasoning problems where the model is given three or four numbers and must construct an expression that uses each number exactly once with basic operations (+, -, ×, ÷) to reach a target integer. Each example includes a list of numbers and a target.
The prompts we used for each dataset are shown in Figures 4, 5, 6, 7, and 8.
In this section, we report the standard deviation and the computing overhead.
Standard Deviation. We report the standard deviation of all methods across the datasets in Table 6. Overall, the results show that incorporating our method generally stabilizes model performance by reducing variance across benchmarks, particularly on reasoning-intensive tasks such as HumanEval, MMLU, and DROP.
Computational Costs. We evaluate computational overhead by comparing a standard LLaDA run with LLaDA using Best-of-15 (Bst15) decoding on GSM8K. The results are reported in Table 7. As shown, MEDAL achieves comparable runtime to Bst15 while delivering better accuracy, indicating that MEDAL allocates inference-time compute more effectively under constrained resources.
We provide a single-step illustrative example of applying MCTS to DLM inference in Figure 3. (1) Selection: among three partially unmasked sequence nodes, we pick the one with the highest UCB score (Node 1). (2) Expansion: given the model logits at Node 1, we compute confidence-adjusted scores for all masked positions and construct the candidate action set $\mathcal{A}'_t$. For each candidate $a \in \mathcal{A}'_t$ (e.g., inserting the token "quick" at position 1), we apply the edit and create a new child node. (3) Simulation: from each expanded node, we unmask the remaining tokens using the DLM to obtain a completed sequence. (4) Backpropagation: we compute the information-gain reward (Equation 9) for each simulation outcome and propagate it up the tree, updating value estimates for all parent nodes along the path to the root.
Setup and notation. Let $x = (x_1, \ldots, x_n)$ be a sequence of discrete tokens. We denote by $q$ the (unknown) true data distribution and by $p_\theta$ the model distribution. Decoding proceeds in steps $i = 1, 2, \ldots$; at step $i$ we reveal a (possibly multi-token) index set $z_i \subseteq \{1, \ldots, n\}$, and write $x_{z_{<i}}$ for all tokens revealed before step $i$.
The term $B(U \mid x_{z_{<i}})$ measures the total uncertainty of the tokens in $U$, penalized by subtracting the largest single-token entropy: $B(U \mid x_{z_{<i}}) = \sum_{\ell \in U} H(x_\ell \mid x_{z_{<i}}) - \max_{\ell \in U} H(x_\ell \mid x_{z_{<i}})$. Intuitively, it is small when one token in $U$ dominates the uncertainty and large when multiple tokens are simultaneously high-entropy, making it a useful surrogate for the dependence error incurred by unmasking them together. To simplify notation, we write $B(z_i) := B(z_i \mid x_{z_{<i}})$ when the context is clear.
Per-step decomposition. When sampling $x_{z_i}$ independently given $x_{z_{<i}}$, the per-step error decomposes as
$$\mathrm{Err}_i \;=\; \underbrace{\sum_{\ell \in z_i} \mathrm{KL}\big(q(x_\ell \mid x_{z_{<i}}) \,\|\, p_\theta(x_\ell \mid x_{z_{<i}})\big)}_{\text{model error}} \;+\; \underbrace{\mathrm{KL}\Big(q(x_{z_i} \mid x_{z_{<i}}) \,\Big\|\, \textstyle\prod_{\ell \in z_i} q(x_\ell \mid x_{z_{<i}})\Big)}_{\text{joint-dependence error } \mathrm{DepErr}_i},$$
where the first term, the model error, sums the KL divergence for each $\ell \in z_i$, quantifying the per-token discrepancy between the model's conditional distribution and the true conditional given the current context, and the joint-dependence error quantifies the penalty from sampling all tokens in $z_i$ independently, measuring how far the true joint conditional $q(x_{z_i} \mid x_{z_{<i}})$ is from the product of its marginals. It captures the correlations among tokens in $z_i$ that are ignored when they are sampled independently rather than jointly conditioned on one another. Our analysis focuses on the dependence term $\mathrm{DepErr}_i$.
Standing assumption.
Assumption 1 (Entropy-gap upper bound). For every step $i$ and context $x_{z_{<i}}$, $\mathrm{DepErr}_i \;\le\; B(z_i \mid x_{z_{<i}})$.
Our contribution is to show how the MCTS initializer is designed to minimize the RHS (a computable surrogate), thereby controlling the cumulative dependence error.
Cumulative bound.
Lemma 1 (Prefix dependence is bounded by cumulative entropy gaps). For any $K \ge 1$, $\sum_{i=1}^{K} \mathrm{DepErr}_i \;\le\; \sum_{i=1}^{K} B(z_i)$.
Proof. Apply Assumption 1 to each step and sum over i = 1, . . . , K.
Search space and surrogate objective. Fix $K \ge 1$ and let $\mathcal{S}_K$ be the set of feasible $K$-step schedules $z_{1:K} = (z_1, \ldots, z_K)$ (e.g., obeying any architectural or mask constraints). Define the surrogate cost of a schedule by $J(z_{1:K}) := \sum_{i=1}^{K} B(z_i)$.
(Optionally, one may add a tokenwise uncertainty term; see Remark 1 below.)
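As a concrete illustration of how the surrogate is evaluated, the entropy-gap term and schedule cost can be computed directly from per-token predictive entropies; the snippet below follows the verbal definition of $B$ given above and uses made-up numbers.

```python
def entropy_gap(entropies):
    """B(U | context): total entropy of the revealed set minus its largest single-token entropy."""
    return sum(entropies) - max(entropies)

def surrogate_cost(schedule_entropies):
    """J(z_{1:K}) = sum_i B(z_i), given one list of token entropies per step."""
    return sum(entropy_gap(step) for step in schedule_entropies)

# Example: a two-step schedule revealing 3 tokens, then 2 tokens.
print(surrogate_cost([[0.9, 0.1, 0.4], [0.2, 0.7]]))  # (1.4 - 0.9) + (0.9 - 0.7) = 0.7
```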
MCTS estimator and selection rule. Given a budget of $N$ simulations, MCTS constructs an empirical estimate $J_N(z_{1:K})$ for candidate schedules and returns $z^{(N)}_{1:K} \in \arg\min_{z_{1:K} \in \mathcal{S}_K^{\mathrm{explored}}} J_N(z_{1:K})$.
We adopt the UCT tree policy (Kocsis and Szepesvári, 2006), which is asymptotically consistent: under standard assumptions (bounded costs and unbiased rollout estimates), the empirical estimates $J_N(z_{1:K})$ converge to the true surrogate cost $J(z_{1:K})$ as $N \to \infty$.
Table 2: Ablation study results on ARC-C, HumanEval, and DROP using LLaDA as the backbone model.