Topic 44: Normalization Techniques (Batch Norm & Layer Norm)
🔥 For interviews, read these first:
NORMALIZATION_DEEP_DIVE.md— frontier-lab interview deep dive: BN/LN/RMSNorm/GroupNorm, why BN fails for transformers, pre-LN vs post-LN, the affine transform, the loss-landscape-smoothing argument (and why "internal covariate shift" is wrong).INTERVIEW_GRILL.md— 40 active-recall questions with strong answers.
What You'll Learn
This topic teaches you normalization techniques comprehensively:
- Batch Normalization (BatchNorm)
- Layer Normalization (LayerNorm)
- Mathematical formulations
- Differences and when to use each
- Why transformers use LayerNorm
- Implementation details
Why We Need This
Interview Importance
- Common question: "Explain BatchNorm vs LayerNorm"
- Transformer understanding: Why transformers use LayerNorm
- Implementation: May ask to implement from scratch
Real-World Application
- CNNs: Often use BatchNorm
- Transformers: Use LayerNorm
- Training stability: Critical for deep networks
- Convergence: Helps training converge faster
Industry Use Cases
1. Batch Normalization
Use Case: CNNs, image classification
- Normalizes across batch dimension
- Requires batch statistics
- Works well with large batches
2. Layer Normalization
Use Case: Transformers, RNNs, NLP
- Normalizes across feature dimension
- Independent of batch size
- Works with any batch size
Core Intuition
Normalization helps training by controlling activation scale and making optimization more stable.
The intuition is not just "make values smaller."
It is:
- keep activations in a reasonable range
- reduce sensitivity to scale changes across layers
- make optimization easier and more stable
BatchNorm
BatchNorm uses statistics across examples in the batch.
That makes it work well in settings like CNNs where:
- batch statistics are meaningful
- batch sizes are usually large enough
LayerNorm
LayerNorm uses statistics within each example across its feature dimension.
That makes it useful when:
- batch size is small or variable
- sequence models need consistent behavior per token/sample
This is why transformers use LayerNorm much more naturally than BatchNorm.
Technical Details Interviewers Often Want
Why Transformers Prefer LayerNorm
Transformers often use variable sequence lengths, small effective batches, and token-wise computations.
LayerNorm is attractive because it:
- does not depend on batch statistics
- behaves consistently across training and inference
- fits sequence modeling well
Why BatchNorm Can Be Awkward in NLP
BatchNorm depends on batch-level statistics, which can be less stable or less natural in autoregressive and sequence-heavy settings, especially with variable lengths or small batches.
Learnable Parameters Matter
Both BatchNorm and LayerNorm typically use learnable scale and shift parameters.
Why?
- after normalization, the model still needs flexibility to represent useful scales and offsets
Common Failure Modes
- explaining normalization only as "faster training" without mechanism
- forgetting that BatchNorm behaves differently at training and inference
- ignoring batch-size dependence in BatchNorm
- not being able to explain why LayerNorm is common in transformers
Edge Cases and Follow-Up Questions
- Why does BatchNorm depend on batch size?
- Why is LayerNorm more natural for transformers?
- Why do normalized activations still need learnable scale and bias?
- Why can BatchNorm become awkward with very small batches?
- Why is training-time vs inference-time behavior different for BatchNorm?
What to Practice Saying Out Loud
- The difference between normalizing across batch vs across features
- Why LayerNorm is standard in transformer architectures
- Why normalization is really about optimization stability, not just value scaling
Theory
Batch Normalization
What it is:
- Normalizes activations across the batch dimension
- Uses batch statistics (mean, variance)
- Helps with training stability and convergence
Mathematical Formulation:
μ_B = (1/m) ∑ x_i
σ²_B = (1/m) ∑ (x_i - μ_B)²
x̂ = (x - μ_B) / √(σ²_B + ε)
y = γ * x̂ + β
Layer Normalization
What it is:
- Normalizes activations across the feature dimension
- Uses per-sample statistics
- Independent of batch size
Mathematical Formulation:
μ_L = (1/d) ∑ x_i
σ²_L = (1/d) ∑ (x_i - μ_L)²
x̂ = (x - μ_L) / √(σ²_L + ε)
y = γ * x̂ + β
Industry-Standard Boilerplate Code
See detailed files for complete implementations:
normalization_implementations.py: Complete implementations from scratchnormalization_theory.md: Detailed theoretical explanationsnormalization_qa.md: Comprehensive interview Q&A
Exercises
- Implement BatchNorm from scratch
- Implement LayerNorm from scratch
- Compare BatchNorm vs LayerNorm
- Understand why transformers use LayerNorm
- Test with different batch sizes
Next Steps
- Review transformer architecture
- Understand training dynamics
- Explore other normalization techniques