SPG: Sandwiched Policy Gradient for
Masked Diffusion Language Models

Chenyu Wang1,2, Paria Rashidinejad1,3, DiJia Su1, Song Jiang1, Sid Wang1,
Siyan Zhao1,4, Cai Zhou2, Shannon Zejiang Shen1,2, Feiyu Chen1,
Tommi Jaakkola2, Yuandong Tian1, Bo Liu1

1 Meta Superintelligence Labs, 2 MIT, 3 USC, 4 UCLA

Overview

TL;DR: We propose SPG, a new policy gradient algorithm for masked diffusion language models. SPG reduces policy-gradient bias by optimizing reward-weighted sandwiched variational bounds on the log-likelihood, and uses a block-wise masking technique to improve training efficiency and stability.

Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods.

While prior work relies on surrogates such as the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound of the true log-likelihood.
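To make the idea concrete, here is a schematic of how the two bounds can be combined (our notation; the exact bound constructions and objective are given in the paper). Assume per-sequence estimates \(\mathcal{L}_\theta(y) \le \log p_\theta(y \mid x) \le \mathcal{U}_\theta(y)\) and group-normalized advantages \(A_i\) for sampled completions \(y_1, \dots, y_G\):

```latex
% Schematic sandwiched surrogate (notation ours): lower bound for positively-advantaged
% samples, upper bound for negatively-advantaged ones.
\hat{J}_{\mathrm{SPG}}(\theta)
  = \frac{1}{G}\sum_{i=1}^{G} A_i
    \Big[\mathbb{1}\{A_i \ge 0\}\,\mathcal{L}_\theta(y_i)
       + \mathbb{1}\{A_i < 0\}\,\mathcal{U}_\theta(y_i)\Big]
  \;\le\; \frac{1}{G}\sum_{i=1}^{G} A_i \log p_\theta(y_i \mid x_i).
```

Because \(A_i\,\mathcal{L}_\theta(y_i) \le A_i \log p_\theta(y_i \mid x_i)\) when \(A_i \ge 0\) and \(A_i\,\mathcal{U}_\theta(y_i) \le A_i \log p_\theta(y_i \mid x_i)\) when \(A_i < 0\), such a surrogate never over-estimates the reward-weighted log-likelihood, whereas an ELBO-only surrogate can over-estimate it on negatively-advantaged samples.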

Experiments show that SPG significantly outperforms baselines based on the ELBO or one-step estimation. Specifically, SPG improves accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.

Performance comparison of SPG with baselines on GSM8K, MATH500, Countdown, and Sudoku
Figure 1: SPG yields the best performance among all RL methods on four mathematical and logical reasoning benchmarks, using LLaDA-8B-Instruct as the base model. All methods are evaluated with a generation length of 256 and 128 denoising steps.

Method

Overview of SPG

The Sandwiched Policy Gradient algorithm addresses a central challenge of applying RL to dLLMs: the intractability of the log-likelihood. SPG computes a more robust, less biased policy gradient via two components: a sandwiched variational objective that combines a lower and an upper bound of the true log-likelihood, and a block-wise masking technique that improves training efficiency and stability.

Illustration of SPG algorithm
Figure 2: Illustration of the SPG algorithm. SPG optimizes sandwiched variational bounds based on reward and utilizes a block-wise masking technique to improve training efficiency and stability.

Algorithm Details

The SPG algorithm operates by alternating between sampling a group of trajectories and updating the policy. The key innovation is the use of both upper and lower bounds in the policy gradient estimation, which provides a more accurate signal than ELBO-based or one-step estimation-based RL methods.

Algorithm of SPG
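For readers who prefer code, the sketch below shows the overall shape of one SPG update, assuming group-relative (GRPO-style) advantages and hypothetical helpers `sample_fn`, `reward_fn`, `logp_lower_fn`, and `logp_upper_fn` for sampling from the dLLM and estimating the likelihood bounds. It is a minimal illustration of the sandwiched objective, not the authors' implementation.

```python
# Minimal sketch of one SPG update (hypothetical helper names, not the authors' code).
import torch

def spg_step(policy, prompts, sample_fn, reward_fn,
             logp_lower_fn, logp_upper_fn, optimizer, group_size=8):
    """Sample a group of completions per prompt, score them, and maximize a
    sandwiched surrogate of the reward-weighted log-likelihood."""
    surrogate_terms = []
    for prompt in prompts:
        # Sample G completions from the current masked-diffusion policy.
        completions = [sample_fn(policy, prompt) for _ in range(group_size)]
        rewards = torch.tensor([reward_fn(prompt, c) for c in completions],
                               dtype=torch.float32)
        # Group-relative advantages (GRPO-style normalization); no gradient flows here.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        for a, c in zip(adv, completions):
            if a >= 0:
                # Positive advantage: weight a *lower* bound of log p_theta(c | prompt),
                # e.g. an ELBO-style Monte Carlo estimate over random maskings.
                logp = logp_lower_fn(policy, prompt, c)
            else:
                # Negative advantage: weight an *upper* bound, so the overall surrogate
                # still lower-bounds the true reward-weighted objective.
                logp = logp_upper_fn(policy, prompt, c)
            surrogate_terms.append(a * logp)
    loss = -torch.stack(surrogate_terms).mean()  # gradient ascent on the surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```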

Practical Considerations
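The block-wise masking technique used during training is detailed in the paper. As a rough illustration of the general idea (our own sketch, mirroring the semi-AR decoding used at inference, and not necessarily the paper's exact scheme), masking for likelihood estimation can be applied block by block: earlier blocks stay fully visible, later blocks are fully masked, and the current block is partially masked at a random ratio.

```python
# Illustrative block-wise masking (our assumption about the general recipe, not the
# authors' exact scheme): keep earlier blocks visible, fully mask later blocks, and
# partially mask the "current" block at a random ratio.
import torch

def blockwise_mask(completion_ids, mask_id, block_size=32):
    """Return a block-wise masked copy of `completion_ids` (1-D LongTensor)."""
    seq_len = completion_ids.numel()
    num_blocks = (seq_len + block_size - 1) // block_size
    current = torch.randint(num_blocks, (1,)).item()  # block treated as "being generated"
    masked = completion_ids.clone()
    for b in range(num_blocks):
        start, end = b * block_size, min((b + 1) * block_size, seq_len)
        if b > current:
            masked[start:end] = mask_id            # future blocks: fully masked
        elif b == current:
            ratio = torch.rand(1).item()           # random masking ratio for this block
            drop = torch.rand(end - start) < ratio
            masked[start:end][drop] = mask_id      # partial masking in the current block
        # b < current: earlier blocks remain fully visible
    return masked
```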

Experimental Results

Main Results

SPG consistently achieves significant improvements over the baselines across diverse reasoning benchmarks, including mathematical reasoning (GSM8K, MATH500) and logical reasoning (Countdown, Sudoku). Specifically, SPG improves accuracy over the previous state of the art by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.

Main results of SPG

Training Dynamics

We analyze the reward dynamics throughout training. SPG shows a rapid and steady increase in reward over the optimization steps, further demonstrating its efficiency and robustness.

Reward curve of SPG
Figure 3: Reward dynamics of SPG w/ Mixture during RL training, compared with D1, WD1, and UniGRPO. SPG consistently converges faster and reaches a higher reward.

Ablations on Inference Strategies

To assess the robustness of SPG trained with the block-wise masking strategy, we conduct ablation studies over different inference strategies, covering combinations of decoding orders (semi-autoregressive (semi-AR) decoding with varying block sizes, and full-sequence decoding) and unmasking rules (confidence-based and random unmasking). Although trained under confidence-based semi-AR decoding, SPG consistently outperforms all baselines by a substantial margin across all inference strategies, demonstrating its robustness and strong generalizability. A small sketch of the two unmasking rules appears after Figure 4.

Ablation on inference strategies
Figure 4: Ablation study on different inference strategies. SPG consistently outperforms all baselines by a substantial margin across all inference strategies.
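For reference, the two unmasking rules compared in this ablation are standard dLLM decoding choices. The sketch below (a hypothetical helper, not the evaluation code) shows how a denoising step would pick which masked positions to reveal under each rule.

```python
# Confidence-based vs. random unmasking for one denoising step (illustrative helper).
import torch

def select_positions_to_unmask(logits, masked_positions, k, confidence_based=True):
    """Pick k of the still-masked positions to reveal at this step.

    logits: [seq_len, vocab_size] model outputs for the partially masked sequence.
    masked_positions: 1-D LongTensor of positions that are still masked.
    """
    k = min(k, masked_positions.numel())
    if confidence_based:
        # Confidence-based: reveal the positions whose top prediction is most certain.
        probs = logits[masked_positions].softmax(dim=-1)
        confidence = probs.max(dim=-1).values
        return masked_positions[confidence.topk(k).indices]
    # Random: reveal k masked positions chosen uniformly at random.
    perm = torch.randperm(masked_positions.numel())[:k]
    return masked_positions[perm]
```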

Please refer to the full paper for more results and ablations.

Citation

If you find SPG useful in your research, please cite our paper:

@article{wang2025spg,
  title={SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models},
  author={Wang, Chenyu and Rashidinejad, Paria and Su, DiJia and Jiang, Song and Wang, Sid and Zhao, Siyan and Zhou, Cai and Shen, Shannon Zejiang and Chen, Feiyu and Jaakkola, Tommi and Tian, Yuandong and Liu, Bo},
  journal={arXiv preprint arXiv:2510.09541},
  year={2025}
}