Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods.
While prior work relies on surrogates such as the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound of the true log-likelihood.
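For context, a standard policy gradient needs the exact log-likelihood of each sampled sequence, which is intractable for dLLMs. The display below is the textbook REINFORCE identity (our notation, not the paper's), showing where an ELBO surrogate enters:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big],
\qquad
\mathcal{L}_{\mathrm{ELBO}}(y \mid x) \;\le\; \log \pi_\theta(y \mid x).
$$

Substituting $\nabla_\theta \mathcal{L}_{\mathrm{ELBO}}(y \mid x)$ for $\nabla_\theta \log \pi_\theta(y \mid x)$ yields a gradient that is biased from only one side, which motivates sandwiching the log-likelihood between a lower and an upper bound.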
Experiments show that SPG significantly outperforms baselines based on the ELBO or one-step estimation. Specifically, SPG improves accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
The Sandwiched Policy Gradient algorithm addresses a critical challenge of applying RL to dLLMs: the intractability of the log-likelihood. SPG computes a more robust and less biased policy gradient by sandwiching the true log-likelihood between an upper and a lower bound, rather than relying on a single one-sided surrogate.
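The minimal PyTorch-style sketch below illustrates one way such a sandwiched surrogate can be formed: samples with positive advantage are pushed up through a lower bound of the log-likelihood, while samples with negative advantage are pushed down through an upper bound. The exact estimators and weighting used by SPG are defined in the paper; `logp_lower`, `logp_upper`, and the hard positive/negative split here are illustrative assumptions.

```python
import torch

def sandwiched_pg_loss(logp_lower, logp_upper, advantages):
    """Sketch of a sandwiched policy-gradient surrogate (not the official code).

    logp_lower: lower bound (e.g. ELBO-style) on log pi(y|x), shape [B]
    logp_upper: upper bound on log pi(y|x), shape [B]
    advantages: per-sample advantages, shape [B]
    """
    pos = (advantages > 0).float()
    # Positive-advantage samples use the lower bound (push likelihood up);
    # negative-advantage samples use the upper bound (push likelihood down).
    surrogate = advantages * (pos * logp_lower + (1.0 - pos) * logp_upper)
    return -surrogate.mean()  # minimize the negative surrogate
```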
The SPG algorithm alternates between sampling a group of trajectories and updating the policy. The key innovation is the use of both upper and lower bounds in the policy gradient estimate, which provides a more accurate learning signal than ELBO-based or one-step-estimation RL methods.
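A rough sketch of one such iteration is shown below, reusing `sandwiched_pg_loss` from the previous snippet. The helper names (`policy.sample`, `policy.log_prob_bounds`, `reward_fn`) and the group-normalized advantage are illustrative assumptions, not the released API.

```python
import torch

def spg_train_step(policy, optimizer, prompts, reward_fn, group_size=8):
    """One illustrative SPG-style iteration: sample, score, update."""
    for prompt in prompts:
        # 1) Sample a group of trajectories from the current policy.
        completions = [policy.sample(prompt) for _ in range(group_size)]
        rewards = torch.tensor([reward_fn(prompt, c) for c in completions])

        # 2) Group-relative advantages (one common choice of baseline).
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # 3) Lower/upper bounds on each completion's log-likelihood.
        logp_lower, logp_upper = policy.log_prob_bounds(prompt, completions)

        # 4) Sandwiched policy-gradient update.
        loss = sandwiched_pg_loss(logp_lower, logp_upper, advantages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```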
SPG consistently achieves significant improvements over the baselines across diverse reasoning benchmarks, including mathematical reasoning (GSM8K, MATH500) and logical reasoning (Countdown, Sudoku). Specifically, SPG improves accuracy over the previous state-of-the-art by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
We analyze the reward dynamics throughout training. SPG shows a rapid and steady increase in reward over the optimization steps, further demonstrating its efficiency and robustness.
To assess the robustness of SPG, which is trained with the block-wise masking strategy, we conduct ablation studies over different inference strategies: combinations of decoding orders (semi-autoregressive (semi-AR) decoding with varying block sizes, and full-sequence decoding) and unmasking rules (confidence-based and random unmasking). Although SPG is trained under confidence-based semi-AR decoding, it consistently outperforms all baselines by a substantial margin across all inference strategies, demonstrating its robustness and strong generalizability.
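To make the inference-strategy terminology concrete, the snippet below sketches how a single unmasking step inside one block could choose positions under confidence-based versus random unmasking. It is a generic illustration of these two rules, not the evaluation code used in the paper.

```python
import torch

def select_positions_to_unmask(logits, is_masked, k, strategy="confidence"):
    """Pick k masked positions in the current block to reveal next.

    logits:    [block_len, vocab] model logits for the block
    is_masked: [block_len] bool tensor, True where the token is still masked
    strategy:  "confidence" reveals the most confident predictions first,
               "random" reveals masked positions uniformly at random.
    """
    probs = logits.softmax(dim=-1)
    confidence, predicted_tokens = probs.max(dim=-1)
    masked_idx = is_masked.nonzero(as_tuple=True)[0]

    if strategy == "confidence":
        order = confidence[masked_idx].argsort(descending=True)
    else:
        order = torch.randperm(masked_idx.numel())

    chosen = masked_idx[order[:k]]
    return chosen, predicted_tokens[chosen]
```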
Please refer to the full paper for more results and ablations.
If you find SPG useful in your research, please cite our paper:
@article{wang2025spg,
  title={SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models},
  author={Wang, Chenyu and Rashidinejad, Paria and Su, DiJia and Jiang, Song and Wang, Sid and Zhao, Siyan and Zhou, Cai and Shen, Shannon Zejiang and Chen, Feiyu and Jaakkola, Tommi and Tian, Yuandong and Liu, Bo},
  journal={arXiv preprint arXiv:2510.09541},
  year={2025}
}