Topic 44: Normalization Techniques (Batch Norm & Layer Norm)

🔥 For interviews, read these first:

NORMALIZATION_DEEP_DIVE.md — frontier-lab interview deep dive: BN/LN/RMSNorm/GroupNorm, why BN fails for transformers, pre-LN vs post-LN, the affine transform, the loss-landscape-smoothing argument (and why "internal covariate shift" is wrong).

INTERVIEW_GRILL.md — 40 active-recall questions with strong answers.

What You'll Learn

This topic covers the two normalization techniques you are most likely to be asked about, and how they differ:

Batch Normalization (BatchNorm)
Layer Normalization (LayerNorm)
Mathematical formulations
Differences and when to use each
Why transformers use LayerNorm
Implementation details

Why We Need This

Interview Importance

Common question: "Explain BatchNorm vs LayerNorm"
Transformer understanding: Why transformers use LayerNorm
Implementation: May ask to implement from scratch

Real-World Application

CNNs: Often use BatchNorm
Transformers: Use LayerNorm or close variants such as RMSNorm
Training stability: Critical for deep networks
Convergence: Helps training converge faster

Industry Use Cases

1. Batch Normalization

Use Case: CNNs, image classification

Normalizes each feature (each channel, in a CNN) across the batch dimension
Requires batch statistics during training and running averages at inference
Works well with large batches
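
For convolutional feature maps, "across the batch dimension" in practice means one mean and one variance per channel, pooled over every image and every spatial position. The sketch below illustrates only the pooling step in plain Python; the tensor values and the (N, C, H, W) layout are assumptions chosen for illustration.

```python
# Toy feature map with shape (N=2 images, C=2 channels, H=2, W=2).
# All values are made up for illustration.
images = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[50.0, 60.0], [70.0, 80.0]]],
]

num_channels = len(images[0])
channel_means = []
for c in range(num_channels):
    # Pool channel c over every image and every spatial position.
    values = [pixel for image in images for row in image[c] for pixel in row]
    channel_means.append(sum(values) / len(values))

print(channel_means)  # [4.5, 45.0]
```

Each channel ends up with a single statistic shared by all images in the batch, which is why the estimate gets more reliable as the batch grows.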

2. Layer Normalization

Use Case: Transformers, RNNs, NLP

Normalizes across feature dimension
Independent of batch size, because statistics come from a single example
Works even with a batch of one

Core Intuition

Normalization helps training by controlling activation scale and making optimization more stable.

The intuition is not just "make values smaller."

It is:

keep activations in a reasonable range
reduce sensitivity to scale changes across layers
make optimization easier and more stable

BatchNorm

BatchNorm uses statistics across examples in the batch.

That makes it work well in settings like CNNs where:

batch statistics are meaningful
batch sizes are usually large enough

LayerNorm

LayerNorm uses statistics within each example across its feature dimension.

That makes it useful when:

batch size is small or variable
sequence models need consistent behavior per token/sample

This is why LayerNorm fits transformers more naturally than BatchNorm.
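
The difference between the two comes down to which axis the statistics are computed over. A minimal sketch in plain Python, using a made-up 3x4 activation matrix (rows are examples, columns are features):

```python
# Toy activations: 3 examples (rows) x 4 features (columns).
# The values are made up purely for illustration.
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
]

def mean(values):
    return sum(values) / len(values)

# BatchNorm-style statistics: one mean per feature, computed down each column
# (across the examples in the batch).
batchnorm_means = [mean(column) for column in zip(*batch)]

# LayerNorm-style statistics: one mean per example, computed along each row
# (across that example's features).
layernorm_means = [mean(row) for row in batch]

print(batchnorm_means)  # [2.0, 4.0, 6.0, 8.0]
print(layernorm_means)  # [2.5, 5.0, 7.5]
```

BatchNorm produces as many statistics as there are features; LayerNorm produces as many as there are examples.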

Technical Details Interviewers Often Want

Why Transformers Prefer LayerNorm

Transformers often use variable sequence lengths, small effective batches, and token-wise computations.

LayerNorm is attractive because it:

does not depend on batch statistics
computes the same per-example statistics at training and inference
fits sequence modeling well

Why BatchNorm Can Be Awkward in NLP

BatchNorm depends on batch-level statistics. With variable sequence lengths or small batches those statistics become noisy, and in autoregressive decoding the batch seen at inference rarely resembles the batches seen during training.
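
The batch dependence can be seen directly in a small experiment. The sketch below is an illustration under simplifying assumptions: it omits the learnable scale and shift, uses training-mode statistics only, and the helper names (`normalize`, `batch_norm`, `layer_norm`) are hypothetical.

```python
import math

EPS = 1e-5  # assumed small constant for numerical stability

def normalize(values):
    """Standardize a list of numbers to zero mean and roughly unit variance."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + EPS) for v in values]

def batch_norm(batch):
    """Training-mode BatchNorm without gamma/beta: normalize each feature column."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """LayerNorm without gamma/beta: normalize each example row on its own."""
    return [normalize(row) for row in batch]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [10.0, -5.0, 7.0]]

# The same sample gets different BatchNorm outputs depending on its batch-mates...
print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])  # False
# ...but identical LayerNorm outputs.
print(layer_norm(batch_a)[0] == layer_norm(batch_b)[0])  # True

# With a batch of one, every value equals its own batch mean, so
# training-mode BatchNorm collapses the example to zeros.
print(batch_norm([sample]))  # [[0.0, 0.0, 0.0]]
```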

Learnable Parameters Matter

Both BatchNorm and LayerNorm typically use learnable scale and shift parameters.

Why?

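normalization forces every feature to zero mean and unit variance; the learnable scale (γ) and shift (β) let the model restore whatever scale and offset the next layer needs, up to and including undoing the normalization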
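
One way to see why the scale and shift matter: with the right values they can cancel the normalization completely, so adding a normalization layer never removes representational capacity. A small worked sketch, with made-up activations and an assumed ε of 1e-5:

```python
import math

EPS = 1e-5  # assumed stability constant

x = [2.0, 4.0, 6.0, 8.0]  # made-up activations of one feature across a batch

mu = sum(x) / len(x)
var = sum((v - mu) ** 2 for v in x) / len(x)
x_hat = [(v - mu) / math.sqrt(var + EPS) for v in x]

# If training drove gamma to sqrt(var + eps) and beta to mu, the affine
# step would cancel the normalization and return the original values.
gamma = math.sqrt(var + EPS)
beta = mu
y = [gamma * v + beta for v in x_hat]

print(all(math.isclose(a, b) for a, b in zip(x, y)))  # True
```

In practice the learned values land somewhere between "fully normalized" and "fully undone", which is the flexibility the parameters exist to provide.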

Common Failure Modes

explaining normalization only as "faster training" without mechanism
forgetting that BatchNorm behaves differently at training and inference
ignoring batch-size dependence in BatchNorm
not being able to explain why LayerNorm is common in transformers

Edge Cases and Follow-Up Questions

Why does BatchNorm depend on batch size?
Why is LayerNorm more natural for transformers?
Why do normalized activations still need learnable scale and bias?
Why can BatchNorm become awkward with very small batches?
Why is training-time vs inference-time behavior different for BatchNorm?
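
The training-versus-inference question is easier to answer with a concrete picture. The toy class below is a sketch, not a reference implementation: `ToyBatchNorm`, the momentum of 0.1, and the exponential-moving-average update are assumptions chosen for illustration, the scale and shift are omitted, and details such as the variance estimator vary between libraries.

```python
import math

class ToyBatchNorm:
    """Hypothetical per-feature BatchNorm with running statistics (no gamma/beta)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features
        self.training = True

    def __call__(self, batch):
        out_columns = []
        for j, col in enumerate(zip(*batch)):
            if self.training:
                # Training: use this batch's statistics and fold them
                # into the running averages.
                mu = sum(col) / len(col)
                var = sum((v - mu) ** 2 for v in col) / len(col)
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mu
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Inference: ignore the current batch, use the stored averages.
                mu, var = self.running_mean[j], self.running_var[j]
            out_columns.append([(v - mu) / math.sqrt(var + self.eps) for v in col])
        return [list(row) for row in zip(*out_columns)]

bn = ToyBatchNorm(num_features=2)
for _ in range(50):
    bn([[1.0, 10.0], [3.0, 30.0]])  # training-mode calls update the running stats

bn.training = False
single = bn([[3.0, 30.0]])                   # a batch of one works at inference
paired = bn([[3.0, 30.0], [100.0, -100.0]])  # batch-mates no longer matter
print(single[0] == paired[0])  # True
```

At inference the output for an example depends only on the stored statistics, which is what makes predictions deterministic regardless of how requests are batched.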

What to Practice Saying Out Loud

The difference between normalizing across batch vs across features
Why LayerNorm is standard in transformer architectures
Why normalization is really about optimization stability, not just value scaling

Theory

Batch Normalization

What it is:

Normalizes activations across the batch dimension
Uses batch statistics (mean, variance)
Helps with training stability and convergence

Mathematical Formulation:

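μ_B = (1/m) ∑_{i=1..m} x_i
σ²_B = (1/m) ∑_{i=1..m} (x_i - μ_B)²
x̂_i = (x_i - μ_B) / √(σ²_B + ε)
y_i = γ * x̂_i + β

Here x_i is the value of one feature for example i, m is the batch size, ε is a small constant for numerical stability, and γ and β are learned per feature.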
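
The four BatchNorm equations translate almost line for line into code. The following is a minimal training-mode sketch in plain Python rather than a complete layer: the function name `batch_norm_forward`, the default ε, and the toy inputs are assumptions, and running statistics for inference are left out.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm for a batch of m examples with d features each.

    gamma and beta are per-feature lists of length d.
    """
    m = len(batch)
    out_columns = []
    for j, column in enumerate(zip(*batch)):
        mu = sum(column) / m                           # batch mean of feature j
        var = sum((x - mu) ** 2 for x in column) / m   # batch variance of feature j
        inv_std = 1.0 / math.sqrt(var + eps)
        # Normalize, then apply the learnable scale and shift.
        out_columns.append([gamma[j] * (x - mu) * inv_std + beta[j] for x in column])
    return [list(row) for row in zip(*out_columns)]

# Two features on very different scales (made-up values).
batch = [[1.0, 50.0], [2.0, 60.0], [3.0, 70.0], [4.0, 80.0]]
out = batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])

# After normalization each feature column has mean close to 0 and
# variance close to 1, regardless of its original scale.
for column in zip(*out):
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    print(abs(mean) < 1e-9, abs(var - 1.0) < 1e-3)
```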

Layer Normalization

What it is:

Normalizes activations across the feature dimension
Uses per-sample statistics
Independent of batch size

Mathematical Formulation:

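μ_L = (1/d) ∑_{i=1..d} x_i
σ²_L = (1/d) ∑_{i=1..d} (x_i - μ_L)²
x̂_i = (x_i - μ_L) / √(σ²_L + ε)
y_i = γ_i * x̂_i + β_i

Here x_i is the i-th feature of a single example, d is the feature dimension, and γ_i and β_i are learned per feature.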
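
LayerNorm applies the same arithmetic to one example at a time. A minimal sketch in plain Python, where the function name `layer_norm_forward`, the default ε, and the toy token vectors are assumptions for illustration:

```python
import math

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """LayerNorm for a single example x, given as a list of d features.

    gamma and beta are per-feature lists of length d.
    """
    d = len(x)
    mu = sum(x) / d                           # mean over this example's features
    var = sum((v - mu) ** 2 for v in x) / d   # variance over this example's features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Two made-up token vectors; the second is the first scaled by 10.
tokens = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
gamma = [1.0] * 4
beta = [0.0] * 4

outputs = [layer_norm_forward(t, gamma, beta) for t in tokens]

# Each token is normalized on its own, so the two outputs nearly coincide
# even though the inputs differ in scale by a factor of 10.
print(all(abs(a - b) < 1e-4 for a, b in zip(outputs[0], outputs[1])))  # True
```

No batch is involved anywhere in the function, which is why the same code serves training and inference and works for any batch size.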

Industry-Standard Boilerplate Code

See detailed files for complete implementations:

normalization_implementations.py: Complete implementations from scratch
normalization_theory.md: Detailed theoretical explanations
normalization_qa.md: Comprehensive interview Q&A

Exercises

Implement BatchNorm from scratch
Implement LayerNorm from scratch
Compare BatchNorm vs LayerNorm
Understand why transformers use LayerNorm
Test with different batch sizes

Next Steps

Review transformer architecture
Understand training dynamics
Explore other normalization techniques
