NeurIPS 2025 | Learning Diffusion Models with
Flexible Representation Guidance

Chenyu Wang1,*, Cai Zhou1,*, Sharut Gupta1, Zongyu Lin2, Stefanie Jegelka1,3, Stephen Bates1, Tommi Jaakkola1

* Equal contribution

1 MIT, 2 UCLA, 3 TU Munich

Overview

REED (Representation-Enhanced Elucidation of Diffusion) is a theoretical and practical framework for guiding diffusion models with high-quality pretrained representations.

Theoretical Characterizations

We introduce a general theoretical framework for representation-enhanced diffusion models. Building on the DDPM framework, we explore how representations can offer additional information and enhance the diffusion generation process. We provide a brief overview below. For more details, please refer to Section 2 in our paper.

Variational Bound with Decomposed Probabilistic Framework. We build a variational framework for diffusion models augmented with pretrained representations. By decomposing the joint probability $p(x_{0:T}, z)$ into conditional components and introducing a timestep weighting schedule $\alpha_t$, the framework controls when and how the representation $z$ influences the reverse diffusion process. Injecting representations earlier yields more accurate conditional transitions but may degrade the approximation of $p(z|x_t)$; REED therefore uses a weighted combination of conditional and unconditional paths. The resulting hybrid distribution $\tilde{p}_\theta(x_{t-1} \mid x_t, z; \alpha_t)$ gradually phases in representation guidance during training.
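
For intuition, here is a minimal sketch of one natural instantiation of this hybrid transition, written as a convex mixture of the conditional and unconditional paths (the exact form entering the bound is given in Section 2 of the paper):

$$
\tilde{p}_\theta(x_{t-1} \mid x_t, z; \alpha_t) \;=\; \alpha_t \, p_\theta(x_{t-1} \mid x_t, z) \;+\; (1 - \alpha_t)\, p_\theta(x_{t-1} \mid x_t), \qquad \alpha_t \in [0, 1].
$$

Setting $\alpha_t = 0$ recovers the unconditional transition, while $\alpha_t = 1$ conditions fully on $z$; the schedule interpolates between the two across timesteps.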

Multi-Latent Representation Structure. The expressiveness of latent representations can be significantly enhanced through a multi-latent structure, as shown in prior work such as NVAE. Instead of modeling a single representation $z$, we extend our formulation to a series of representations $\{z_l\}_{l=1}^L$, each capturing a different level of abstraction. This motivates us to leverage multimodal representations in REED as multi-level latents.
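
Concretely, an NVAE-style hierarchy factorizes the latents autoregressively across levels; a minimal sketch, where the top-down conditioning order is our assumption:

$$
p(x_{0:T}, z_{1:L}) \;=\; p(x_{0:T} \mid z_{1:L}) \prod_{l=1}^{L} p(z_l \mid z_{<l}).
$$

In REED, the levels $z_l$ can then be instantiated by representations of different auxiliary modalities, e.g., visual features at one level and caption embeddings at another.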

Connecting Existing Approaches: RCG and REPA. Our theoretical formulation provides a general and flexible framework for representation-enhanced generation by introducing two key components: a customizable weighting schedule $\{\alpha_t\}_{t=1}^T$ and a hierarchical latent structure $\{z_l\}_{l=1}^L$. This framework recovers several existing approaches as special cases. In particular, RCG corresponds to conditioning on the representation throughout the reverse process, paired with a separate generative model over $z$, while REPA corresponds to the linear weighting schedule with a single latent ($L = 1$).

Our REED method builds upon the REPA setting, adopting the linear weighting schedule. Importantly, we move beyond the original $L=1$ setting to the more general case of $L>1$, enabling the novel use of richer multimodal representations through synthetic data. Additionally, inspired by our variational bound formulation, we propose an improved training curriculum that dynamically balances the data generation and representation modeling objectives, further enhancing model performance and flexibility.
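
To make the curriculum concrete, here is a minimal PyTorch-style sketch of a joint training step. The names (`diffusion_loss`, `alignment_loss`, `curriculum_weight`) and the linear ramp are our illustrative assumptions, not the paper's released code:

```python
# Illustrative sketch only: names and the linear ramp are assumptions,
# not REED's actual implementation.
import torch

def curriculum_weight(step: int, total_steps: int, lam_max: float = 1.0) -> float:
    """Phase in representation guidance gradually, mirroring the role of
    the weighting schedule alpha_t in the variational bound."""
    return lam_max * min(step / max(total_steps, 1), 1.0)

def training_step(model, x0, reps, step, total_steps):
    """Jointly optimize the denoising objective and a representation
    modeling objective, with the balance set by the curriculum."""
    b = x0.shape[0]
    t = torch.randint(0, model.num_timesteps, (b,), device=x0.device)
    gen_loss = model.diffusion_loss(x0, t)        # data generation term
    rep_loss = model.alignment_loss(x0, t, reps)  # representation term
    lam = curriculum_weight(step, total_steps)
    return gen_loss + lam * rep_loss
```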

Algorithm Details

Building on the theoretical insights, we make two key innovations to drive REED's success:

1. Multimodal representations from synthetic data: each data modality is paired with synthetic auxiliary modalities, such as captions for images, folded structures for protein sequences, and joint 2D-3D views for molecules, and the diffusion model is aligned with their rich pretrained embeddings.
2. A representation-alignment training curriculum: motivated by the variational bound, the balance between the data generation and representation modeling objectives is scheduled dynamically over training.

Illustration of REED's multimodal alignment framework showing image, protein, and molecule modalities
Figure 1: High-level illustration of the REED framework. Diffusion models are aligned with rich representations of synthetic auxiliary data modalities, generated from vision-language models (left), folding models for protein sequences (top right), and multimodal models for molecular structures (bottom right). The joint training pipeline leverages synthetic captions, sequence & structure embeddings, and 2D-3D molecular embeddings.

REED implements three practical instantiations of representation guidance, one per domain (a sketch of the alignment objective follows the list):

- Image generation: aligning SiT-XL/2 with DINOv2 visual features and Qwen2-VL embeddings of synthetic captions.
- Protein inverse folding: aligning discrete diffusion models with sequence and structure embeddings from AlphaFold3.
- Molecule generation: pairing flow-matching models with pretrained joint 2D-3D molecular embeddings.
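
As a rough sketch of what the alignment objective can look like across several frozen target representations (the projection heads, names, and negative-cosine form below are our assumptions; REPA uses a similar similarity-based loss):

```python
# Illustrative sketch: aligning intermediate diffusion features with
# multiple frozen pretrained representations (one per auxiliary modality).
import torch
import torch.nn.functional as F

def multimodal_alignment_loss(hidden, targets, heads):
    """hidden:  [B, D] features from an intermediate layer of the backbone.
    targets: list of frozen representation tensors, e.g. DINOv2 features
             and caption embeddings from a vision-language model.
    heads:   list of projection modules mapping D to each target's dim."""
    loss = hidden.new_zeros(())
    for head, z in zip(heads, targets):
        h = F.normalize(head(hidden), dim=-1)
        z = F.normalize(z.detach(), dim=-1)        # targets stay frozen
        loss = loss - (h * z).sum(dim=-1).mean()   # negative cosine similarity
    return loss / len(heads)
```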

Results

REED delivers consistent improvements across diverse generative modeling tasks, including image generation, protein inverse folding, and molecule generation.

Training speed improvements on ImageNet and protein inverse folding
Figure 2: On image generation, REED (green) reaches low FID much faster than SiT-XL/2 (blue) and REPA (red). On protein inverse folding, REED (green) accelerates diffusion training by 3.6×, achieving higher sequence recovery in fewer epochs.

Image Generation

On the class-conditional ImageNet 256×256 benchmark, aligning SiT-XL/2 with DINOv2 and Qwen2-VL representations under REED's curriculum accelerates training and improves FID. REED achieves a 23.3× training speedup over the original SiT-XL, reaching FID 8.2 in just 300K iterations without classifier-free guidance, and a further speedup over REPA, matching REPA's FID of 1.80 with only 200 epochs of training.

Image generation results showing FID improvements

We provide several qualitative samples generated by the SiT-XL/2+REED model after 1M training iterations below.

sampled images from REED
Figure 3: Selected samples on ImageNet 256×256 generated by the SiT-XL/2+REED model after 1M training iterations. We use classifier-free guidance with w = 1.275.
sampled class-conditioned images from REED
sampled class-conditioned images from REED
Figure 4: Selected samples on ImageNet 256×256 generated by the SiT-XL/2+REED model after 1M training iterations. We use classifier-free guidance with w = 4.0. Each row corresponds to the same class label. From top to bottom, the class labels are: “macaw” (88), “flamingo” (130), “borzoi, Russian wolfhound” (169), “Samoyed” (258), “Egyptian cat” (285), “otter” (360), “dogsled” (537), “jigsaw puzzle” (611), “laptop” (620), “school bus” (779), “cheeseburger” (933), “cliff” (972), “earthstar” (995), and “toilet tissue” (999), respectively.

Protein Inverse Folding

On protein inverse folding, aligning discrete diffusion models with structure and sequence embeddings from AlphaFold3 accelerates training by 3.6× and improves metrics such as sequence recovery, RMSD, and pLDDT.

Protein inverse folding results

We provide several qualitative samples generated by REED below.

sampled protein sequences from REED
sampled protein sequences from REED
Figure 5: Selected samples on protein inverse folding. Green denotes the ground-truth structure, and blue denotes the generated sequence folded by ESMFold.

Molecule Generation

On molecule generation tasks, REED improves molecular stability and validity and reduces energy and strain by pairing flow-matching models with pretrained joint 2D-3D embeddings.

Molecule generation results

We provide several qualitative samples generated by REED below.

sampled molecule structures from REED
Figure 6: Selected samples on molecule generation.

Citation

If you find REED useful in your research, please cite the paper:

@article{wang2025learning,
  title={Learning Diffusion Models with Flexible Representation Guidance},
  author={Wang, Chenyu and Zhou, Cai and Gupta, Sharut and Lin, Zongyu and Jegelka, Stefanie and Bates, Stephen and Jaakkola, Tommi},
  journal={The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2025}
}