Topic 12: Comprehensive Theory
What You'll Learn
This topic provides comprehensive theory on:
- Classical ML theory
- LLM theory
- LLM inference theory
- Bias-variance tradeoff
- Regularization theory
- Optimization theory
Why We Need This
Interview Importance
- Theory questions: "Explain bias-variance tradeoff"
- Deep understanding: Theory helps answer "why"
- Problem-solving: Theory guides solutions
Real-World Application
- Decision-making: Theory helps choose approaches
- Debugging: Understand why things work/don't work
- Innovation: Build on theoretical foundations
Key Theory Topics
1. Bias-Variance Tradeoff
Bias: Error from oversimplifying model
- High bias = Underfitting
- Low bias = Model can learn complex patterns
Variance: Error from model sensitivity to training data
- High variance = Overfitting
- Low variance = Model generalizes well
Tradeoff: Can't minimize both simultaneously
- Simple model: High bias, low variance
- Complex model: Low bias, high variance
- Goal: Find balance
2. Regularization Theory
Why Regularization Works:
- Prevents overfitting
- Improves generalization
- Controls model complexity
L1 vs L2:
- L1: Promotes sparsity (feature selection)
- L2: Shrinks weights (smoother)
3. Optimization Theory
Convex vs Non-convex:
- Convex: One global minimum
- Non-convex: Multiple local minima
- Deep learning: Non-convex optimization
Gradient Descent:
- Converges to local minimum
- Learning rate critical
- Momentum helps escape local minima
4. LLM Theory
Transformer Architecture:
- Self-attention: Relate all positions
- Position encoding: Add position info
- Layer normalization: Stabilize training
Attention Mechanism:
- Query-Key-Value paradigm
- Scaled dot-product
- Multi-head for different subspaces
Generation:
- Autoregressive: One token at a time
- KV caching: Avoid recomputation
- Sampling: Control randomness
Core Intuition
Theory matters in interviews because it helps you explain why a method works, when it fails, and what trade-off it is making.
If implementation tells you "how," theory usually tells you:
- what is being optimized
- what assumption is being made
- what failure mode to expect
Bias-Variance
This is one of the most important mental models in ML.
- Bias means your model family is too restrictive or systematically wrong
- Variance means the model reacts too strongly to sample-specific noise
The goal is not minimizing one of them alone. The goal is minimizing generalization error.
Regularization
Regularization is best understood as inductive bias.
It tells the learning algorithm:
- prefer smaller weights
- prefer simpler explanations
- prefer more stable solutions
This is stronger and more precise than just saying "regularization prevents overfitting."
Optimization
Optimization theory matters because training behavior depends on geometry.
Interviewers often want to hear:
- whether the objective is convex or not
- why learning rate matters
- why conditioning affects convergence
- why adaptive optimizers behave differently
LLM Theory
For LLMs, theory often shows up as a chain of concepts:
- language modeling objective
- tokenization
- transformer attention
- positional information
- decoding and inference trade-offs
Technical Details Interviewers Often Want
Bias-Variance Is About Expected Behavior
Many people explain bias and variance too informally.
More precise intuition:
- bias is about average systematic error across datasets
- variance is about sensitivity to which sample you saw
Convex vs Non-Convex
Convex problems are easier to reason about because local minima are global minima.
Deep learning is non-convex, but that does not mean optimization is hopeless. It means the geometry and initialization matter more, and guarantees are weaker.
Why Theory Helps Debugging
If a model fails, theory gives you a checklist:
- underfitting or overfitting?
- optimization issue or capacity issue?
- objective mismatch or metric mismatch?
- variance problem or bias problem?
Common Failure Modes
- using theory terms loosely without mechanism
- treating bias-variance as only a cartoon instead of a real modeling trade-off
- claiming regularization is always good
- ignoring objective mismatch when discussing LLM quality
- mixing optimization failure with generalization failure
Edge Cases and Follow-Up Questions
- Why can a lower training loss still mean a worse model?
- Why can a more complex model generalize better with enough data?
- Why is regularization an inductive bias?
- Why can optimization succeed but downstream quality still fail?
- Why is theory useful even when deep learning is non-convex and messy?
What to Practice Saying Out Loud
- The difference between optimization and generalization
- The difference between bias and variance
- Why theory guides debugging and model choice
Detailed Explanations
See individual theory files for detailed explanations.
Exercises
- Derive gradient formulas
- Prove bias-variance decomposition
- Analyze convergence rates
- Understand attention complexity
Next Steps
- Topic 13: Interview Q&A
- Review all topics