Diffusion Models: Comprehensive Interview Q&A

Q1: What are diffusion models? How do they work?

Answer:

Diffusion Models:

  • Generative models that learn to reverse a gradual noising process
  • Work by iteratively removing noise from data, starting from pure noise
  • Achieve state-of-the-art results in image generation (DALL-E, Stable Diffusion)

How They Work:

1. Forward Process (Fixed):

  • Gradually add Gaussian noise to data
  • q(x_t | x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_t I)
  • After T steps, data becomes pure noise N(0, I)

2. Reverse Process (Learned):

  • Learn to remove noise step by step
  • p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
  • Neural network predicts how to denoise

3. Training:

  • Predict the noise that was added
  • Loss: L = E[||ε - ε_θ(x_t, t)||²]
  • Model learns: given noisy data x_t and timestep t, predict noise ε

4. Generation:

  • Start from pure noise x_T ~ N(0, I)
  • Iteratively apply reverse process: x_T → x_{T-1} → ... → x_0
  • End with clean, generated sample

Key Insight:

  • Break down complex generation into many simple denoising steps
  • Each step only needs to remove small amount of noise
  • Much easier to learn than generating directly

Q2: What is the mathematical formulation of diffusion models?

Answer:

Forward Diffusion Process:

Given data x₀, define sequence x₁, x₂, ..., x_T where:

q(x_t | x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_t I)

Where:

  • β_t is variance schedule (0 < β_t < 1)
  • √(1-β_t) preserves signal
  • β_t adds noise

Closed-Form Expression:

Can sample x_t directly from x₀:

q(x_t | x_0) = N(x_t; √(ᾱ_t)x_0, (1-ᾱ_t)I)

Where:

  • α_t = 1 - β_t
  • ᾱ_t = ∏_{s=1}^t α_s (cumulative product)

Reverse Diffusion Process:

Learn to reverse the process:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

Training Objective:

Predict noise that was added:

L = E_{t,x_0,ε} [||ε - ε_θ(x_t, t)||²]

Where:

  • x_t = √(ᾱ_t)x_0 + √(1-ᾱ_t)ε (noisy data)
  • ε_θ(x_t, t) is predicted noise
  • ε is actual noise

Sampling:

Given predicted noise ε_θ, compute:

μ_θ = (1/√(α_t))(x_t - (β_t/√(1-ᾱ_t))ε_θ)
x_{t-1} ~ N(μ_θ, Σ_t)

Q3: How do you train a diffusion model?

Answer:

Training Algorithm:

1. Setup:

  • Define variance schedule β_t (linear or cosine)
  • Precompute α_t, ᾱ_t for efficiency

2. Training Loop:

For each batch:
  a. Sample data: x_0 ~ q(x_0)
  b. Sample timestep: t ~ Uniform({1, 2, ..., T})
  c. Sample noise: ε ~ N(0, I)
  d. Create noisy data: x_t = √(ᾱ_t)x_0 + √(1-ᾱ_t)ε
  e. Predict noise: ε_pred = ε_θ(x_t, t)
  f. Compute loss: L = ||ε - ε_pred||²
  g. Update: θ ← θ - α∇_θ L

Key Points:

  • Randomly sample timesteps (not sequential)
  • Model sees all noise levels during training
  • Simple MSE loss (predict noise)

Best Practices:

  • Learning rate: 1e-4 to 1e-3
  • Use learning rate scheduling (cosine annealing)
  • Gradient clipping (norm = 1.0)
  • Monitor loss and generate samples during training
  • More timesteps = better quality but slower

Variance Schedule:

  • Linear: β_t = (β_max - β_min) * (t/T) + β_min
  • Cosine: ᾱ_t = cos²(π/2 * (t/T)) (often better)

Q4: How do you generate samples from a diffusion model?

Answer:

Generation Algorithm:

1. Start from Noise:

x_T ~ N(0, I)  # Pure noise

2. Reverse Diffusion:

For t = T, T-1, ..., 1:
  a. Predict noise: ε_t = ε_θ(x_t, t)
  b. Compute mean: μ_t = (1/√(α_t))(x_t - (β_t/√(1-ᾱ_t))ε_t)
  c. Sample: x_{t-1} ~ N(μ_t, Σ_t)

3. Return:

x_0  # Generated sample

Step-by-Step:

At step t:

  • Model predicts noise ε_t that was added
  • Use this to compute predicted mean μ_t of x_{t-1}
  • Sample x_{t-1} from N(μ_t, Σ_t)
  • This gives slightly less noisy version
  • Repeat until clean data x_0

Variance:

  • Can be fixed: Σ_t = β_t I
  • Or learned: Σ_t = Σ_θ(x_t, t)

Conditional Generation:

  • Condition on text, class, etc.
  • Model becomes: ε_θ(x_t, t, c)
  • Same process but with conditioning

Q5: What are discrete diffusion models? How do they work for NLP?

Answer:

The Challenge:

  • Standard diffusion works on continuous data (images)
  • Text is discrete (tokens), need adaptation

Discrete Forward Process:

Instead of Gaussian noise, use transition matrix:

q(x_t | x_{t-1}) = Categorical(x_t; Q_t x_{t-1})

Where Q_t is transition matrix defining how tokens are corrupted.

Common Approaches:

1. Absorbing State:

  • Have special [MASK] token
  • At each step, tokens transition to [MASK] with probability β_t
  • After T steps, all tokens become [MASK]

2. Uniform Transition:

  • Tokens can transition to any other token uniformly
  • More general but harder to learn

Discrete Reverse Process:

Learn to predict original token:

p_θ(x_{t-1} | x_t) = Categorical(x_{t-1}; p_θ(x_t, t))

Where p_θ(x_t, t) is probability distribution over vocabulary.

Training:

  • Loss: Cross-entropy between predicted and original tokens
  • L = E[-log p_θ(x_0 | x_t, t)]

Advantages for NLP:

  • Non-autoregressive (can generate in parallel)
  • Better for editing tasks (text inpainting)
  • More flexible control

Example: Text Inpainting

  • Mask some tokens: "The [MASK] sat on the [MASK]"
  • Use reverse diffusion to fill in masked tokens
  • Better than autoregressive for editing

Q6: How do you evaluate diffusion models?

Answer:

For Images:

1. FID (Frechet Inception Distance):

  • Measures quality and diversity
  • Lower is better
  • Compares feature distributions from Inception network

2. IS (Inception Score):

  • Measures quality and diversity
  • Higher is better (typically 1-10)
  • Based on Inception network predictions

3. Reconstruction Error:

  • Test if model can recover original from noisy version
  • Lower is better
  • Measures denoising capability

For Text:

1. BLEU Score:

  • Measures n-gram overlap with reference
  • Higher is better (0-1)
  • Standard NLP metric

2. Perplexity:

  • Measures how well model predicts tokens
  • Lower is better
  • Standard language modeling metric

3. Diversity Metrics:

  • Distinct-n: Ratio of unique n-grams
  • Self-BLEU: Average BLEU between generated samples
  • Higher distinct = more diverse

Diffusion-Specific:

1. Denoising Accuracy:

  • Test accuracy at each timestep
  • Measures how well model denoises
  • Should improve with training

2. Sample Quality:

  • Visual inspection (for images)
  • Human evaluation (for text)
  • Compare with real data

Best Practices:

  • Use multiple metrics
  • Compare with baselines
  • Generate samples during training
  • Monitor metrics over time

Q7: What are the advantages and disadvantages of diffusion models?

Answer:

Advantages:

1. High Quality:

  • State-of-the-art results in image generation
  • Better than GANs in many cases
  • Very realistic samples

2. Stable Training:

  • More stable than GANs
  • No mode collapse
  • Simple loss function (MSE)

3. Flexible:

  • Can condition on various inputs
  • Good for editing tasks
  • Non-autoregressive (for discrete)

4. Theoretical Foundation:

  • Well-grounded in probability theory
  • Clear mathematical formulation
  • Understandable process

Disadvantages:

1. Slow Generation:

  • Requires many steps (1000-4000)
  • Much slower than GANs or autoregressive models
  • Each step requires forward pass

2. Memory:

  • Need to store intermediate states
  • Can be memory intensive
  • Especially for high-resolution images

3. Training:

  • Can be slow (many timesteps)
  • Need to sample random timesteps
  • More complex than some methods

4. Discrete Data:

  • Need special handling for discrete data
  • Transition matrices can be complex
  • Less natural than continuous

Comparison:

AspectDiffusionGANsAutoregressive
QualityExcellentVery GoodGood
SpeedSlowFastMedium
TrainingStableUnstableStable
FlexibilityHighMediumLow

Q8: What are use cases of diffusion models in NLP?

Answer:

1. Non-Autoregressive Text Generation:

  • Generate all tokens in parallel
  • Faster than autoregressive models
  • Better for controlled generation

2. Text Inpainting:

  • Fill in masked tokens
  • Edit specific parts of text
  • Example: "The [MASK] sat on the [MASK]" → "The cat sat on the mat"

3. Text-to-Image:

  • DALL-E, Stable Diffusion
  • Generate images from text descriptions
  • Multimodal understanding

4. Text Editing:

  • Style transfer
  • Paraphrasing
  • Rewriting with constraints

5. Controllable Generation:

  • Generate with specific attributes
  • Control length, style, topic
  • More flexible than autoregressive

6. Text Completion:

  • Complete partial text
  • Better than simple next-token prediction
  • Can consider full context

7. Data Augmentation:

  • Generate training data
  • Improve model robustness
  • Handle rare patterns

8. Multimodal Tasks:

  • Text-to-image
  • Image-to-text
  • Cross-modal generation

Industry Examples:

  • DALL-E: Text-to-image generation
  • Stable Diffusion: Open-source text-to-image
  • Research: Non-autoregressive text generation
  • Editing tools: Text inpainting and rewriting

Q9: Compare diffusion models with autoregressive models (GPT) for text generation.

Answer:

Generation Process:

Autoregressive (GPT):

  • Generate left-to-right, one token at a time
  • Each token depends on all previous tokens
  • Sequential: t₁ → t₂ → t₃ → ...

Diffusion:

  • Generate all tokens in parallel (discrete diffusion)
  • Iteratively refine all tokens together
  • Parallel: All tokens refined simultaneously

Advantages:

Autoregressive:

  • Faster single-pass generation
  • Simpler implementation
  • Better for long sequences
  • More established for text

Diffusion:

  • Non-autoregressive (parallel)
  • Better for editing tasks
  • More flexible control
  • Can edit specific parts

Disadvantages:

Autoregressive:

  • Sequential (can't parallelize)
  • Hard to edit specific parts
  • Less flexible control

Diffusion:

  • Slower (many iterations)
  • More complex
  • Less established for text
  • May need more training data

When to Use:

Autoregressive:

  • Standard text generation
  • Long sequences
  • When speed is important
  • Established use cases

Diffusion:

  • Text editing/inpainting
  • Controlled generation
  • When need parallel generation
  • Research/experimental

Current State:

  • Autoregressive (GPT) dominates text generation
  • Diffusion better for images
  • Discrete diffusion promising for text
  • Active area of research

Q10: How do you implement a simple diffusion model from scratch?

Answer:

Key Components:

1. Variance Schedule:

def linear_beta_schedule(timesteps):
    return torch.linspace(0.0001, 0.02, timesteps)

2. Forward Diffusion:

def q_sample(x_0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
    noise = torch.randn_like(x_0)
    return sqrt_alphas_cumprod[t] * x_0 + sqrt_one_minus_alphas_cumprod[t] * noise

3. Model (Noise Predictor):

class DiffusionModel(nn.Module):
    def forward(self, x_t, t):
        # Predict noise given noisy data and timestep
        return noise_pred

4. Training:

# Sample timestep and noise
t = torch.randint(0, timesteps, (batch_size,))
noise = torch.randn_like(x_0)
x_t = q_sample(x_0, t, ...)

# Predict and compute loss
noise_pred = model(x_t, t)
loss = F.mse_loss(noise_pred, noise)

5. Sampling:

# Start from noise
x = torch.randn(shape)

# Reverse diffusion
for t in reversed(range(timesteps)):
    noise_pred = model(x, t)
    x = denoise_step(x, noise_pred, t)

See diffusion_code.py for complete implementation!


Summary

Diffusion models are powerful generative models that learn to reverse a gradual noising process. They achieve state-of-the-art results in image generation and are increasingly applied to NLP tasks through discrete diffusion. Key advantages include high quality, stable training, and flexibility, though they are slower than alternatives. For NLP, discrete diffusion enables non-autoregressive generation, text inpainting, and better control over generation, making them valuable for editing and controlled generation tasks.