Diffusion Models: A Frontier-Lab Interview Deep Dive

Why this exists. Diffusion is the dominant paradigm for image, video, and 3D generation, and it is increasingly applied to other modalities. Interviewers probe the forward and reverse processes, why we predict noise, classifier-free guidance, latent vs pixel-space diffusion, and flow matching. This document covers the math without the dense Bayesian notation.

1. The big picture

A diffusion model has two processes:

Forward (fixed): progressively add noise to data over $T$ steps, transforming $x_{0}$ (data) into $x_{T}$ (pure Gaussian noise).

Reverse (learned): progressively denoise, starting from $x_{T}$ (pure noise) and producing $x_{0}$ (data sample).

The model learns the reverse process. At sampling time, run the reverse process with a fresh random Gaussian; you get a fresh sample from the learned data distribution.

Why this works: the forward process has a simple form (add Gaussian noise; tractable mathematically). The reverse process — which is what generates data — can be learned by training the model to undo each forward step.

2. The forward process

$q (x_{t} ∣ x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)$

At each step, mix $x_{t - 1}$ with Gaussian noise of variance $β_{t}$ . Iterate $T$ steps. $β_{t}$ follows a schedule (linear, cosine, etc.) — typically small early, larger later.

Closed form

A key property: you can sample $x_{t}$ directly from $x_{0}$ without iterating:

$q (x_{t} ∣ x_{0}) = N (x_{t}; \sqrt{\bar{α}_{t}} x_{0}, (1 - \bar{α}_{t}) I)$

where $α_{t} = 1 - β_{t}$ and $\bar{α}_{t} = \prod_{s = 1}^{t} α_{s}$ . So:

$x_{t} = \sqrt{\bar{α}_{t}} x_{0} + \sqrt{1 - \bar{α}_{t}} ε, \quad ε \sim N (0, I)$

This direct sampling is critical: during training, you don't need to iterate the forward process — you sample a random $t$ and a random $ε$ and compute $x_{t}$ directly.
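As a minimal sketch of the closed form (pure Python, no framework; the helper names `linear_betas`, `alpha_bars`, and `q_sample` are illustrative and not taken from any library), the snippet below builds a linear $β_{t}$ schedule and draws $x_{t}$ from $x_{0}$ in one shot:

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (list index 0 holds beta_1)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is 1-indexed. Returns (x_t, eps)."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

betas = linear_betas()
abar = alpha_bars(betas)
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]            # toy "data" vector (assumption)
xt, eps = q_sample(x0, 500, abar, rng)
```

Because $\bar{α}_{T}$ is close to zero under this schedule, $x_{T}$ retains almost nothing of $x_{0}$, which is why sampling can start from pure Gaussian noise.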

Variance schedule

Linear: $β_{t}$ linearly interpolated from $β_{1} = 1 0^{- 4}$ to $β_{T} = 0.02$ over $T = 1000$ steps. Original DDPM choice.

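Cosine (Nichol & Dhariwal 2021): $\bar{α}_{t} = f (t) / f (0)$ with $f (t) = \cos^{2} \left( \frac{t / T + s}{1 + s} \cdot \frac{π}{2} \right)$ . Adds noise more gradually than the linear schedule, which destroys information too quickly late in the forward process; it was proposed mainly to help at low resolutions such as 32×32 and 64×64.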

Variance-preserving (VP) vs variance-exploding (VE): different parameterizations of the diffusion process. VP keeps $Var (x_{t})$ near 1; VE lets it grow. VP (DDPM-style) is the more common choice.
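A hedged sketch of the cosine schedule follows. The offset `s = 0.008` and the clipping of $β_{t}$ at 0.999 follow the values reported by Nichol & Dhariwal; the function names are illustrative.

```python
import math

def cosine_alpha_bars(T=1000, s=0.008):
    """alpha_bar_t for t = 0..T under the cosine schedule (index = t)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def betas_from_alpha_bars(abar, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for stability."""
    return [min(1.0 - abar[t] / abar[t - 1], max_beta) for t in range(1, len(abar))]

cos_abar = cosine_alpha_bars(1000)
cos_betas = betas_from_alpha_bars(cos_abar)
```

The clipping matters only near $t = T$ , where $\bar{α}_{t}$ approaches zero and the ratio would otherwise push $β_{t}$ toward 1.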

3. The reverse process

We want to learn $p_{θ} (x_{t - 1} ∣ x_{t})$ . The true posterior $q (x_{t - 1} ∣ x_{t}, x_{0})$ is also Gaussian (Bayes on the forward Markov chain), with mean derivable in terms of $x_{0}$ and $x_{t}$ . The training trick: parameterize the model to predict either $x_{0}$ , $ε$ (the noise), or the score (gradient of log-density).

Predicting noise (the standard choice)

DDPM (Ho et al. 2020) parameterizes the reverse mean as:

$μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} \left( x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α}_{t}}} ε_{θ} (x_{t}, t) \right)$

where $ε_{θ}$ is the model's prediction of the noise that was added to get to $x_{t}$ .

The training loss simplifies dramatically:

$L = E_{t, x_{0}, ε} [∥ ε - ε_{θ} (x_{t}, t) ∥^{2}]$

MSE between predicted and actual noise. That's the entire training objective.
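To make the objective concrete, here is a minimal sketch of one Monte Carlo sample of the loss for a single data point. The "models" are plain Python callables standing in for a neural network, and the oracle model (which peeks at $x_{0}$ ) exists only to show that a perfect noise predictor drives the loss to zero; all names are assumptions for illustration.

```python
import math
import random

def ddpm_simple_loss(eps_model, x0, abar, rng):
    """One sample of ||eps - eps_model(x_t, t)||^2 for a single x0."""
    t = rng.randint(1, len(abar))               # uniform timestep, 1-indexed
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # the noise we will try to predict
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    pred = eps_model(xt, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

# Toy linear schedule with T = 100 (assumption, for a quick demo).
T = 100
abar, prod = [], 1.0
for i in range(T):
    prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    abar.append(prod)

x0 = [0.3, -1.2]

def zero_model(xt, t):
    """Untrained stand-in: always predicts zero noise."""
    return [0.0 for _ in xt]

def oracle_model(xt, t):
    """Cheats by knowing x0, so it recovers eps exactly."""
    a = abar[t - 1]
    return [(x - math.sqrt(a) * d) / math.sqrt(1.0 - a) for x, d in zip(xt, x0)]
```

In real training the expectation is approximated by averaging this quantity over a minibatch, with an independent $t$ and $ε$ per example.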

Why predict noise specifically?

Mathematically equivalent options:

Predict $x_{0}$ directly.
Predict $ε$ .
Predict the score $\nabla_{x} \log p_{t} (x)$ (Tweedie's formula).

Empirically, predicting $ε$ works best because:

The target has constant scale (normalized).
Loss is well-conditioned across all timesteps.
Easier to train than $x_{0}$ prediction (which needs to span full data range).

Score matching connection

Predicting $ε$ is equivalent to predicting the score (gradient of log density):

$ε \approx - \sqrt{1 - \bar{α}_{t}} \nabla_{x} \log p_{t} (x)$

So diffusion models are score-based generative models — they learn to follow gradients of log-density. This is the Song & Ermon line of work; DDPM (Ho et al.) and score-matching (Song & Ermon) are equivalent up to parameterization.
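The three parameterizations are related by simple algebra on the closed form for $x_{t}$ . The sketch below (illustrative helper names, scalar $\bar{α}_{t}$ passed in as `abar_t`) converts between them. Note the score computed here is that of the conditional $q (x_{t} ∣ x_{0})$ , which is the regression target in denoising score matching; the trained network approximates the marginal score.

```python
import math

def x0_from_eps(xt, eps, abar_t):
    """Invert x_t = sqrt(abar) x0 + sqrt(1 - abar) eps for x0."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t) for x, e in zip(xt, eps)]

def eps_from_x0(xt, x0, abar_t):
    """Invert the same relation for eps."""
    return [(x - math.sqrt(abar_t) * d) / math.sqrt(1.0 - abar_t) for x, d in zip(xt, x0)]

def score_from_eps(eps, abar_t):
    """Score of N(sqrt(abar) x0, (1 - abar) I) at x_t, written in terms of eps."""
    return [-e / math.sqrt(1.0 - abar_t) for e in eps]
```

For example, with `abar_t = 0.36`, `x0 = [1.0, -2.0]`, and `eps = [0.5, 0.25]`, the noisy point is `xt = [1.0, -1.0]`, and `score_from_eps` agrees with the Gaussian score $- (x_{t} - \sqrt{\bar{α}_{t}} x_{0}) / (1 - \bar{α}_{t})$ .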

4. Sampling

Once trained, sample from $p (x_{0})$ by running the reverse process:

Start with $x_{T} \sim N (0, I)$ .
For $t$ from $T$ down to $1$ :
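- Predict noise: $\hat{ε} = ε_{θ} (x_{t}, t)$ .
- Compute mean: $μ = \frac{1}{\sqrt{α_{t}}} \left( x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α}_{t}}} \hat{ε} \right)$ .
- Add Gaussian noise: $x_{t - 1} = μ + σ_{t} z, z \sim N (0, I)$ , where a common choice is $σ_{t}^{2} = β_{t}$ and no noise is added at the final step $t = 1$ .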
Return $x_{0}$ .
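The loop above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: $σ_{t}^{2} = β_{t}$ , no noise at $t = 1$ , and an analytic "oracle" noise predictor for a data distribution concentrated on a single point, which replaces a trained network so the result can be checked exactly.

```python
import math
import random

def ddpm_sample(eps_model, betas, dim, rng):
    """Ancestral sampling: x_T ~ N(0, I), then t = T, ..., 1."""
    alphas = [1.0 - b for b in betas]
    abar, prod = [], 1.0
    for a in alphas:
        prod *= a
        abar.append(prod)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):
        b, a, ab = betas[t - 1], alphas[t - 1], abar[t - 1]
        eps_hat = eps_model(x, t)
        mean = [(xi - b / math.sqrt(1.0 - ab) * e) / math.sqrt(a)
                for xi, e in zip(x, eps_hat)]
        sigma = math.sqrt(b) if t > 1 else 0.0   # sigma_t^2 = beta_t; no noise at t = 1
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

def point_mass_oracle(target, betas):
    """Exact eps predictor when all data equals `target` (toy assumption)."""
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)
    def eps_model(x, t):
        ab = abar[t - 1]
        return [(xi - math.sqrt(ab) * c) / math.sqrt(1.0 - ab) for xi, c in zip(x, target)]
    return eps_model
```

With the oracle, every run ends at `target` regardless of the starting noise, because the final step's mean reduces to the predicted $x_{0}$ .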

For $T = 1000$ , this is 1000 model forward passes per sample. Slow.

DDIM: deterministic sampling with fewer steps

DDIM (Song et al. 2021) reformulates the reverse process to be deterministic and to allow skipping steps:

$x_{t - k} = \sqrt{\bar{α}_{t - k}} \cdot \frac{x_{t} - \sqrt{1 - \bar{α}_{t}} \hat{ε}}{\sqrt{\bar{α}_{t}}} + \sqrt{1 - \bar{α}_{t - k}} \hat{ε}$

Same model, different sampling. With DDIM you can sample in 50–100 steps with quality comparable to 1000-step DDPM. Standard for production diffusion.
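A minimal sketch of the deterministic DDIM update with step skipping follows. The evenly strided timestep subsequence and the convention $\bar{α}_{0} = 1$ are assumptions made for illustration; `eps_model` is any callable `(x, t) -> eps_hat`.

```python
import math

def ddim_step(xt, eps_hat, abar_t, abar_prev):
    """One deterministic DDIM update from noise level abar_t to abar_prev."""
    # First term: the model's current estimate of x0.
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(xt, eps_hat)]
    # Re-noise that estimate to the earlier level using the same eps_hat.
    return [math.sqrt(abar_prev) * p + math.sqrt(1.0 - abar_prev) * e
            for p, e in zip(x0_pred, eps_hat)]

def ddim_sample(eps_model, abar, x_T, num_steps):
    """Run DDIM over an evenly strided subsequence of the T training steps."""
    T = len(abar)
    ts = [T * (i + 1) // num_steps for i in range(num_steps)][::-1]
    x = list(x_T)
    for i, t in enumerate(ts):
        t_prev = ts[i + 1] if i + 1 < len(ts) else 0
        abar_prev = abar[t_prev - 1] if t_prev > 0 else 1.0   # abar_0 = 1 (clean data)
        x = ddim_step(x, eps_model(x, t), abar[t - 1], abar_prev)
    return x
```

With `T = 1000` and `num_steps = 50` the visited timesteps are 1000, 980, ..., 20, so the network is evaluated 50 times instead of 1000.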

Even faster sampling

DPM-Solver, DPM-Solver++ (Lu et al.): ODE/SDE solvers that exploit the structure of the diffusion ODE. ~20 steps with strong quality.
Consistency Models (Song et al. 2023): distill diffusion into a model that goes from noise to data in 1–2 steps. Sacrifices some quality for speed.
Rectified Flow / Flow Matching: learn a straighter trajectory; sample in fewer steps.

The active research direction is reducing sampling steps from 1000 to 1–4 while preserving quality.

5. The ELBO and the loss derivation

For interview-grade understanding (often asked):

The ELBO for diffusion models:

$\log p_{θ} (x_{0}) \geq E_{q} [\log p_{θ} (x_{0} ∣ x_{1})] - \sum_{t > 1} E_{q} [KL (q (x_{t - 1} ∣ x_{t}, x_{0}) ∥ p_{θ} (x_{t - 1} ∣ x_{t}))] - KL (q (x_{T} ∣ x_{0}) ∥ p (x_{T}))$

After algebra (dropping irrelevant constants), each KL term reduces to:

$L_{t} = E_{x_{0}, ε} \left[ \frac{β_{t}^{2}}{2 σ_{t}^{2} α_{t} (1 - \bar{α}_{t})} ∥ ε - ε_{θ} (x_{t}, t) ∥^{2} \right]$

DDPM uses the simplified loss (drop the prefactor):

$L_{simple} = E_{t, x_{0}, ε} [∥ ε - ε_{θ} (x_{t}, t) ∥^{2}]$

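Empirically, the simplified loss gives better sample quality than the weighted ELBO. The ELBO prefactor is largest at small $t$ (low noise), so dropping it down-weights those easy terms and lets the network focus on the harder denoising tasks at larger $t$ (Ho et al. 2020).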
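To see how uneven the ELBO weighting is, the sketch below evaluates the prefactor $\frac{β_{t}^{2}}{2 σ_{t}^{2} α_{t} (1 - \bar{α}_{t})}$ under the assumption $σ_{t}^{2} = β_{t}$ and the linear schedule with $T = 1000$ . The function name is illustrative.

```python
def elbo_weights(betas):
    """Per-timestep ELBO prefactor, assuming sigma_t^2 = beta_t."""
    weights, abar = [], 1.0
    for b in betas:
        alpha = 1.0 - b
        abar *= alpha
        sigma_sq = b
        weights.append(b * b / (2.0 * sigma_sq * alpha * (1.0 - abar)))
    return weights

betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
w = elbo_weights(betas)
ratio_first_to_last = w[0] / w[-1]   # how much more t = 1 is weighted than t = T
```

Under these assumptions the weight at $t = 1$ is about 0.5 while the weight at $t = T$ is about 0.01, so the exact ELBO spends most of its emphasis on nearly clean inputs.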

6. Classifier-free guidance (CFG)

A critical technique for conditional generation (text-to-image, etc.).

Setup

The model is trained jointly:

Conditional: $ε_{θ} (x_{t}, t, c)$ where $c$ is the conditioning (e.g., text embedding).
Unconditional: $ε_{θ} (x_{t}, t, \emptyset)$ — replace $c$ with a null embedding 10–20% of the time during training.

At sampling

Combine the two predictions:

$\hat{ε}_{guided} = ε_{θ} (x_{t}, t, \emptyset) + w \cdot (ε_{θ} (x_{t}, t, c) - ε_{θ} (x_{t}, t, \emptyset))$

$w$ is the guidance scale (typically 1.5–7.5). $w = 1$ means no guidance (just use conditional). $w > 1$ amplifies the conditional signal.

Why this works

The difference $ε_{cond} - ε_{uncond}$ is a direction that "points toward the condition" in score space. Amplifying it pushes the sample more strongly toward the condition.
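Both halves of CFG fit in a few lines. The sketch below is illustrative only: `maybe_drop_condition` shows the training-time swap to a null condition, and `cfg_combine` applies the guidance formula to two noise predictions given as plain lists. Names and the list representation are assumptions.

```python
import random

def maybe_drop_condition(cond, null_cond, p_uncond, rng):
    """Training time: with probability p_uncond, train on the null condition."""
    return null_cond if rng.random() < p_uncond else cond

def cfg_combine(eps_uncond, eps_cond, w):
    """Sampling time: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
plain_conditional = cfg_combine(eps_u, eps_c, 1.0)   # [1.0, 3.0]
extrapolated = cfg_combine(eps_u, eps_c, 2.0)        # [2.0, 5.0]
```

Each guided sampling step needs two network evaluations (conditional and unconditional), which are usually batched together.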

Trade-offs

High $w$ : stronger adherence to condition, but may produce overexposed / oversaturated images. Sample diversity drops.
Low $w$ : more diverse samples, weaker adherence to condition.
Stable Diffusion typically uses $w = 7.5$ .

CFG is ubiquitous in text-to-image. Almost every paper since 2022 uses it.

Classifier guidance (older)

The earlier guidance method (Dhariwal & Nichol 2021): use a separate classifier's gradient $\nabla_{x} \log p (c ∣ x_{t})$ to push samples toward the condition. Largely replaced by CFG, which doesn't need a separate classifier.
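In noise-prediction form, one common way to write classifier guidance is the following, where the classifier $p_{φ}$ and the scale $s$ are notation introduced here for illustration:

$\hat{ε} = ε_{θ} (x_{t}, t) - s \sqrt{1 - \bar{α}_{t}} \nabla_{x_{t}} \log p_{φ} (c ∣ x_{t})$

This follows from the score relation $ε \approx - \sqrt{1 - \bar{α}_{t}} \nabla_{x} \log p_{t} (x)$ together with Bayes' rule, $\nabla_{x} \log p (x_{t} ∣ c) = \nabla_{x} \log p (x_{t}) + \nabla_{x} \log p (c ∣ x_{t})$ : adding the classifier gradient to the score is the same as subtracting its scaled version from the predicted noise. The classifier must be trained on noisy inputs $x_{t}$ , which is the extra cost CFG avoids.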

7. Latent diffusion (Stable Diffusion)

The problem

Pixel-space diffusion is expensive. A 512×512 RGB image has about 262K pixels, or 786K values across three channels. Running a UNet at that resolution for every denoising step, in both training and sampling, is slow.

The fix

Latent diffusion (Rombach et al. 2022, Stable Diffusion):

Encode image to a smaller latent $z$ via a pretrained autoencoder (4–8x downsampling per spatial dimension).
Run diffusion in the latent space $z$ (much smaller, faster).
Decode back to pixels at the end.

encode: $x_{pixel} \to z_{latent}$ (VAE encoder)
diffuse + denoise $z_{latent}$
decode: $z_{latent} \to x_{pixel}$ (VAE decoder)
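A quick size calculation shows what the latent buys. The sketch assumes the SD 1.x configuration (8x downsampling per side, 4 latent channels); the function names are illustrative, and element count is only a rough proxy for compute.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)

def num_values(shape):
    """Total number of scalar entries in a tensor of this shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixel_values = num_values((3, 512, 512))             # 786432
latent_values = num_values(latent_shape(512, 512))   # 4 * 64 * 64 = 16384
compression = pixel_values / latent_values           # 48.0
```

So the denoiser operates on a 64×64×4 tensor rather than a 512×512×3 one, about 48x fewer values per pass.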

Why it works

Most "perceptual" content (textures, semantics) is captured in the latent.
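Diffusion in latent space is far cheaper: with 8x downsampling per side, a 512×512×3 image becomes a 64×64×4 latent in SD 1.x, roughly 48x fewer values per denoising pass.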
Final image quality limited by the VAE's reconstruction quality, but in practice this is fine.

Stable Diffusion family

SD 1.x, SD 2.x, SDXL, SD 3 — all latent diffusion. Differences: VAE quality, UNet vs Transformer (DiT), training data, conditioning model (CLIP vs T5), schedules.

8. Architecture: UNet vs DiT

UNet (DDPM, Stable Diffusion 1.x/2.x)

Convolutional U-shape with skip connections. Down-sampling encoder + up-sampling decoder. Cross-attention layers for conditioning. Standard for diffusion until ~2023.

DiT (Diffusion Transformer, Peebles & Xie 2022)

Replace the UNet with a transformer over image patches. Same idea as ViT. Better scaling properties; SD 3, FLUX use DiT.
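The "transformer over image patches" step can be sketched as a patchify function that turns a latent into a token sequence. This is a pure-Python illustration on nested lists; the within-token element ordering is an arbitrary choice made here, and in a real DiT each token is then passed through a learned linear projection with positional embeddings added.

```python
def patchify(latent, patch):
    """Split a [C][H][W] nested list into (H/patch * W/patch) flat tokens."""
    C, H, W = len(latent), len(latent[0]), len(latent[0][0])
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    tokens = []
    for i in range(0, H, patch):            # patch rows, top to bottom
        for j in range(0, W, patch):        # patch columns, left to right
            tokens.append([latent[c][i + di][j + dj]
                           for di in range(patch)
                           for dj in range(patch)
                           for c in range(C)])
    return tokens

# Toy 4-channel 8x8 latent whose entries encode their own position.
toy = [[[c * 100 + i * 8 + j for j in range(8)] for i in range(8)] for c in range(4)]
toy_tokens = patchify(toy, 2)   # 16 tokens, each of length 2 * 2 * 4 = 16
```

For a 64×64 latent with patch size 2 this gives (64 / 2)² = 1024 tokens, so halving the patch size quadruples the sequence length and raises attention cost accordingly.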

Why DiT wins at scale

Transformers scale predictably with parameters and data. Convolutional UNets have hand-crafted inductive biases that limit scalability. As diffusion models grow, DiT-style architectures dominate.

9. Flow Matching and Rectified Flow (recent)

A reformulation of diffusion that's becoming dominant:

Learn a velocity field $v_{θ} (x_{t}, t)$ that transforms noise → data along a continuous path.

Key ideas:

Straighter paths: flow matching produces ODEs with straighter trajectories than diffusion. Fewer sampling steps for equivalent quality.
Simpler training: regress the network output onto a target velocity along a path between noise and data. The loss is an MSE, like noise prediction, but with a velocity target.
Same model in practice: the network architecture is the same as for diffusion (UNet or DiT); only the training objective and the sampler differ.

Used in Stable Diffusion 3, FLUX, recent video models. Likely to replace pure diffusion as the dominant paradigm.
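A minimal sketch of the rectified-flow flavor of flow matching follows, assuming the linear path $x_{t} = (1 - t) ε + t x$ with $t = 0$ at noise and $t = 1$ at data (some papers use the reverse time convention). The target velocity is then $x - ε$ , and sampling is plain Euler integration of the learned ODE. Function names are illustrative.

```python
def fm_training_pair(data, noise, t):
    """Return (x_t, v_target) for the linear path from noise (t=0) to data (t=1)."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    v_target = [d - n for n, d in zip(noise, data)]   # constant along the path
    return x_t, v_target

def euler_sample(v_fn, noise, num_steps):
    """Integrate dx/dt = v_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(noise), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = v_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def point_mass_velocity(target):
    """Exact velocity field when all data equals `target` (toy assumption)."""
    def v_fn(x, t):
        return [(c - xi) / (1.0 - t) for xi, c in zip(x, target)]
    return v_fn
```

In this toy case the trajectories are perfectly straight, so even a single Euler step lands on the target. Real data distributions give curved marginal flows, which is why a few steps (or a rectification/distillation stage) are still needed in practice.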

Diffusion for text

Text is discrete; diffusion is naturally continuous.
Workarounds: diffuse in embedding space, or use special discrete diffusion processes.

Recent: SEDD (Score Entropy Discrete Diffusion), Diffusion-LM. Promising but not at frontier-LLM scale yet.

For text generation, autoregressive models still dominate.

12. Common interview gotchas

Gotcha	Strong answer
"Why predict noise instead of $x_{0}$ ?"	MSE loss is well-conditioned across timesteps. $ε$ targets have constant scale; $x_{0}$ targets span full data range.
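"Is diffusion training computationally expensive?"	Each training step is one forward/backward pass on a noisy image at a single random timestep. The closed form for $x_{t}$ means the forward chain is never simulated sequentially, so training parallelizes across the batch. Per-step cost is comparable to other generative models.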
"Why is sampling slow?"	Need many denoising steps (1000 for DDPM, 50–100 for DDIM). Recent: consistency models can do it in 1–4 steps.
"What's CFG?"	Combine conditional and unconditional predictions during sampling; amplify the conditional direction. Standard for text-to-image.
"Why latent diffusion?"	Diffuse in compressed latent space (via VAE encoder), much cheaper than pixel space. Stable Diffusion innovation.
"DiT vs UNet?"	DiT (transformer) scales better than UNet (conv). Modern flagship models use DiT.
"What's flow matching?"	Reformulation with straighter trajectories; fewer sampling steps. Used in SD3, FLUX. Likely to replace pure diffusion.
"Diffusion vs GANs?"	Diffusion: stable training, good mode coverage, slower sampling. GANs: fast sampling, harder training, prone to mode collapse. Diffusion has largely displaced GANs for high-fidelity image synthesis.
"Is diffusion an MLE?"	Approximately, via the ELBO. Simplified loss is not exactly MLE but works better empirically.

13. The 10 most-asked diffusion interview questions

What's the forward process? Add Gaussian noise over $T$ steps. Closed-form: $x_{t} = \sqrt{\bar{α}_{t}} x_{0} + \sqrt{1 - \bar{α}_{t}} ε$ .
What's the reverse process? Learned denoising. Predict noise $\hat{ε} = ε_{θ} (x_{t}, t)$ and use it to compute $μ$ for $x_{t - 1}$ .
Why predict noise not data? Better-conditioned loss; constant scale across timesteps.
What's DDIM? Deterministic sampler that allows fewer steps (50–100 vs 1000). Same trained model.
What's classifier-free guidance? Train conditional + unconditional jointly; combine at sampling: $\hat{ε} = ε_{unc} + w \cdot (ε_{cond} - ε_{unc})$ . Standard for text-to-image.
Why latent diffusion? Diffuse in a compressed latent space; with 4–8x downsampling per side, each denoising pass handles far fewer values than in pixel space.
DiT vs UNet? DiT (transformer) scales better; modern flagship models use it.
What's flow matching? Reformulation with straighter paths; fewer sampling steps. Likely future of diffusion.
What's the ELBO for diffusion? Sum of KL terms across timesteps. Simplified MSE loss works better empirically.
Connection between score matching and diffusion? Equivalent up to parameterization. Predicting $ε \approx$ predicting $- σ_{t} \nabla_{x} \log p_{t} (x)$ , with $σ_{t} = \sqrt{1 - \bar{α}_{t}}$ . Diffusion models are score-based generative models.

14. Drill plan

Memorize the closed-form $x_{t} = \sqrt{\bar{α}_{t}} x_{0} + \sqrt{1 - \bar{α}_{t}} ε$ .
Memorize the simplified DDPM loss: MSE on noise prediction.
Know CFG: train cond+uncond, combine at sampling.
Know latent diffusion's role (Stable Diffusion).
Know DiT and flow matching as the modern direction.
Drill INTERVIEW_GRILL.md.

15. Further reading

Sohl-Dickstein et al., "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (2015) — original diffusion idea.
Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (DDPM, 2020).
Song & Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution" (score-based, 2019).
Song et al., "Denoising Diffusion Implicit Models" (DDIM, 2021).
Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis" (classifier guidance, 2021).
Ho & Salimans, "Classifier-Free Diffusion Guidance" (2022).
Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion, 2022).
Peebles & Xie, "Scalable Diffusion Models with Transformers" (DiT; arXiv 2022, ICCV 2023).
Lipman et al., "Flow Matching for Generative Modeling" (2023).
Liu et al., "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (Rectified Flow, 2022).

ML & LLM Interview Prep — Deep Dives