Topic 12: Comprehensive Theory

What You'll Learn

This topic provides comprehensive theory on:

Classical ML theory
LLM theory
LLM inference theory
Bias-variance tradeoff
Regularization theory
Optimization theory

Why We Need This

Interview Importance

Theory questions: "Explain bias-variance tradeoff"
Deep understanding: Theory helps answer "why"
Problem-solving: Theory guides solutions

Real-World Application

Decision-making: Theory helps choose approaches
Debugging: Understand why things work/don't work
Innovation: Build on theoretical foundations

Key Theory Topics

1. Bias-Variance Tradeoff

Bias: Error from overly simple or wrong model assumptions

High bias = Underfitting
Low bias = Model can learn complex patterns

Variance: Error from the model's sensitivity to the particular training sample

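High variance = Overfitting
Low variance = Predictions are stable across training sets (this alone does not guarantee good generalization, since bias can still be high)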

Tradeoff: For a fixed amount of data, reducing one usually increases the other

Simple model: High bias, low variance
Complex model: Low bias, high variance
Goal: Find balance
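
The trade-off between a simple and a complex model can be checked numerically. This is a minimal sketch, not a definitive experiment: it assumes a sine target, a fixed grid of inputs, Gaussian noise, and polynomial models, none of which come from this topic. The function name bias_variance and all of its parameters are hypothetical.

```python
import numpy as np

def bias_variance(degree, n_datasets=200, n_points=30, noise=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)       # fixed inputs (assumption)
    x_test = np.linspace(0.0, 1.0, 50)
    true_f = np.sin(2 * np.pi * x_test)       # assumed ground-truth function
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        # Each simulated dataset differs only in its noise draw.
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, n_points)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f) ** 2))   # systematic error
    variance = float(np.mean(preds.var(axis=0)))          # spread across datasets
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}, variance={v:.4f}")
```

Under these assumptions, the degree-1 fit should show the larger squared bias and the degree-9 fit the larger variance.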

2. Regularization Theory

Why Regularization Works:

Reduces overfitting
Improves generalization
Controls model complexity

L1 vs L2:

L1: Promotes sparsity by driving some weights exactly to zero (implicit feature selection)
L2: Shrinks all weights toward zero without zeroing them (smoother solutions)
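
The difference between L1 and L2 is easiest to see in one dimension. The sketch below uses the closed-form minimizers of a single-weight penalized squared error; the input vector z and the names l1_shrink and l2_shrink are hypothetical, chosen only for illustration.

```python
import numpy as np

def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2 (uniform shrinkage)."""
    return z / (1.0 + lam)

z = np.array([3.0, -0.4, 0.2, -2.0])   # hypothetical unregularized weights
w_l1 = l1_shrink(z, lam=0.5)
w_l2 = l2_shrink(z, lam=0.5)
print("L1:", w_l1)
print("L2:", w_l2)
```

With L1, every weight whose magnitude is below lam becomes exactly zero. With L2, every weight is scaled by the same factor and none becomes zero.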

3. Optimization Theory

Convex vs Non-convex:

Convex: Every local minimum is a global minimum
Non-convex: Can have multiple local minima and saddle points
Deep learning: Non-convex optimization

Gradient Descent:

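Converges to a stationary point under a suitable step size (the global minimum if the objective is convex)
Learning rate is critical: too large diverges, too small converges slowly
Momentum speeds progress along consistent gradient directions and can carry the iterate past shallow local minima and saddle points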
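
Why the learning rate is critical can be shown on the simplest possible objective. This is a sketch assuming f(x) = x**2, where each step multiplies x by (1 - 2*lr); the function name and default values are hypothetical.

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x**2 (gradient 2*x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * x    # x is multiplied by (1 - 2*lr) each step
    return x

small = gradient_descent(lr=0.1)   # factor 0.8 per step: shrinks toward 0
large = gradient_descent(lr=1.1)   # factor -1.2 per step: grows in magnitude
print(f"lr=0.1 -> x={small:.2e}")
print(f"lr=1.1 -> x={large:.2e}")
```

For this objective any learning rate above 1.0 makes the factor exceed 1 in magnitude, so the iterates diverge.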

4. LLM Theory

Transformer Architecture:

Self-attention: Lets every position attend to every other position
Position encoding: Adds position info, since attention alone is order-agnostic
Layer normalization: Stabilizes training

Attention Mechanism:

Query-Key-Value paradigm
Scaled dot-product
Multi-head for different subspaces
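
The query-key-value computation above can be sketched for a single head in NumPy. The shapes and random inputs are hypothetical; a real transformer would also apply learned projections, masking, and multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                      # hypothetical sizes
q = rng.normal(size=(seq_len, d_k))
k = rng.normal(size=(seq_len, d_k))
v = rng.normal(size=(seq_len, d_k))
out, weights = attention(q, k, v)
print(out.shape, weights.shape)
```

The score matrix has one entry per pair of positions, which is where the quadratic cost in sequence length comes from.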

Generation:

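Autoregressive: Generates one token at a time, each conditioned on the tokens before it
KV caching: Reuses the keys and values of earlier tokens instead of recomputing them
Sampling: Temperature, top-k, and top-p control randomness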
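
What KV caching saves can be shown with a toy single-layer, single-head decoder. This is a sketch under stated assumptions: the token representations x and the projection matrices w_q, w_k, w_v are random and hypothetical, and decode_step is an illustrative helper, not a library function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_t, keys, values, w_q):
    """Attention output for the newest token given all keys/values so far."""
    q = x_t @ w_q
    scores = keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))    # hypothetical token representations
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Without a cache: re-project every earlier token at every step.
no_cache = [decode_step(x[t], x[: t + 1] @ w_k, x[: t + 1] @ w_v, w_q)
            for t in range(len(x))]

# With a cache: project only the new token and append it.
k_cache, v_cache, cached = [], [], []
for t in range(len(x)):
    k_cache.append(x[t] @ w_k)
    v_cache.append(x[t] @ w_v)
    cached.append(decode_step(x[t], np.stack(k_cache), np.stack(v_cache), w_q))

print(np.allclose(np.stack(no_cache), np.stack(cached)))
```

Both paths produce the same outputs. The cached path performs one key/value projection per step instead of one per earlier token, at the cost of memory that grows with sequence length.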

Core Intuition

Theory matters in interviews because it helps you explain why a method works, when it fails, and what trade-off it is making.

If implementation tells you "how," theory usually tells you:

what is being optimized
what assumption is being made
what failure mode to expect

Bias-Variance

This is one of the most important mental models in ML.

Bias means your model family is too restrictive or systematically wrong
Variance means the model reacts too strongly to sample-specific noise

The goal is not minimizing one of them alone. The goal is minimizing generalization error.

Regularization

Regularization is best understood as inductive bias.

It tells the learning algorithm:

prefer smaller weights
prefer simpler explanations
prefer more stable solutions

This is stronger and more precise than just saying "regularization prevents overfitting."
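
One way to make the inductive-bias view concrete is to write the penalty into the objective. The following is a sketch for linear regression with squared error; the symbols w, x_i, y_i, \lambda, and \Omega are notation introduced here, not taken from this topic.

```latex
\hat{w} = \arg\min_{w} \; \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 + \lambda\, \Omega(w),
\qquad
\Omega(w) = \|w\|_1 \ \text{(L1)} \quad \text{or} \quad \Omega(w) = \|w\|_2^2 \ \text{(L2)}
```

With the L2 penalty, this is the MAP estimate under Gaussian noise of variance \sigma^2 and a zero-mean Gaussian prior on the weights with variance \tau^2, where \lambda = \sigma^2 / \tau^2. The prior is the preference for smaller weights stated explicitly.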

Optimization

Optimization theory matters because training behavior depends on geometry.

Interviewers often want to hear:

whether the objective is convex or not
why learning rate matters
why conditioning affects convergence
why adaptive optimizers behave differently
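
The effect of conditioning on convergence can be sketched on a quadratic whose curvature differs by direction. The objective, tolerance, and function name below are hypothetical choices for illustration.

```python
import numpy as np

def steps_to_converge(curvatures, tol=1e-6, max_steps=100_000):
    """Gradient descent on f(w) = 0.5 * sum(c_i * w_i**2); steps until ||w|| < tol."""
    c = np.asarray(curvatures, dtype=float)
    lr = 1.0 / c.max()          # step size limited by the steepest direction
    w = np.ones_like(c)
    for step in range(1, max_steps + 1):
        w = w - lr * c * w      # the gradient of f is c * w
        if np.linalg.norm(w) < tol:
            return step
    return max_steps

well = steps_to_converge([1.0, 1.0])      # condition number 1
ill = steps_to_converge([1.0, 100.0])     # condition number 100
print(f"well-conditioned: {well} steps, ill-conditioned: {ill} steps")
```

The step size has to stay small enough for the steepest direction, so the flattest direction shrinks by only a factor of (1 - 1/condition number) per step. Rescaling each coordinate separately, which is roughly what adaptive optimizers attempt, would remove that mismatch in this diagonal example.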

LLM Theory

For LLMs, theory often shows up as a chain of concepts:

language modeling objective
tokenization
transformer attention
positional information
decoding and inference trade-offs
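
The first link in that chain, the language modeling objective, is usually written as next-token negative log-likelihood. In the sketch below, x_1, ..., x_T denotes a token sequence and p_\theta the model's predicted distribution; the notation is introduced here for illustration.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

This objective rewards predicting the next token well. It does not directly measure downstream qualities such as helpfulness or factuality, which is one source of objective mismatch.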

Technical Details Interviewers Often Want

Bias-Variance Is About Expected Behavior

Many people explain bias and variance too informally.

More precise intuition:

bias is about average systematic error across datasets
variance is about sensitivity to which sample you saw
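
For squared error, this intuition corresponds to the standard decomposition. The sketch assumes y = f(x) + \varepsilon with zero-mean noise of variance \sigma^2, and \hat{f}_D denotes the model trained on dataset D; these symbols are introduced here for illustration.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\left(y - \hat{f}_D(x)\right)^2\right]
= \underbrace{\left(\mathbb{E}_D\!\left[\hat{f}_D(x)\right] - f(x)\right)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\left(\hat{f}_D(x) - \mathbb{E}_D\!\left[\hat{f}_D(x)\right]\right)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Both expectations are taken over training sets, so bias and variance are properties of the learning procedure, not of a single fitted model.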

Convex vs Non-Convex

Convex problems are easier to reason about because local minima are global minima.

Deep learning is non-convex, but that does not mean optimization is hopeless. It means the geometry and initialization matter more, and guarantees are weaker.
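
A one-dimensional sketch shows why initialization matters on a non-convex objective. The function f below is a hypothetical example with two local minima; it is not drawn from any particular model.

```python
def f(x):
    return x**4 - 3 * x**2 + x      # hypothetical non-convex objective

def grad(x):
    return 4 * x**3 - 6 * x + 1

def descend(x0, lr=0.01, steps=1000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)
right = descend(2.0)
print(f"from -2.0: x={left:.3f}, f(x)={f(left):.3f}")
print(f"from +2.0: x={right:.3f}, f(x)={f(right):.3f}")
```

Both runs reach a point where the gradient is essentially zero, but they end in different minima with different objective values. Only the starting point differs.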

Why Theory Helps Debugging

If a model fails, theory gives you a checklist:

underfitting or overfitting?
optimization issue or capacity issue?
objective mismatch or metric mismatch?
variance problem or bias problem?

Common Failure Modes

using theory terms loosely without mechanism
treating bias-variance as only a cartoon instead of a real modeling trade-off
claiming regularization is always good
ignoring objective mismatch when discussing LLM quality
mixing optimization failure with generalization failure

Edge Cases and Follow-Up Questions

Why can a lower training loss still mean a worse model?
Why can a more complex model generalize better with enough data?
Why is regularization an inductive bias?
Why can optimization succeed but downstream quality still fail?
Why is theory useful even when deep learning is non-convex and messy?

What to Practice Saying Out Loud

The difference between optimization and generalization
The difference between bias and variance
Why theory guides debugging and model choice

Detailed Explanations

See individual theory files for detailed explanations.

Exercises

Derive gradient formulas
Prove bias-variance decomposition
Analyze convergence rates
Work out the time and memory complexity of self-attention as a function of sequence length

Next Steps

Topic 13: Interview Q&A
Review all topics