MLE and MAP: Interview Q&A
Q1: Derive MLE for a coin flip (Bernoulli distribution).
Answer:
Setup:
- Data: n flips, k heads, (n-k) tails
- Model: P(heads) = θ, P(tails) = 1-θ
- Goal: Find θ that maximizes likelihood
Step 1: Likelihood
L(θ) = θᵏ × (1-θ)ⁿ⁻ᵏ
Step 2: Log-Likelihood
log L(θ) = k log θ + (n-k) log(1-θ)
Step 3: Take Derivative
d/dθ [log L(θ)] = k/θ - (n-k)/(1-θ)
Step 4: Set to Zero
k/θ - (n-k)/(1-θ) = 0
k(1-θ) = (n-k)θ
k = nθ
θ = k/n
Result:
θ̂_MLE = k/n = (number of heads) / (total flips)
Intuition: The MLE is simply the observed proportion!
Q2: Derive MLE for linear regression.
Answer:
Setup:
- Model: y = Xw + ε, where ε ~ N(0, σ²)
- Data: {(x₁, y₁), ..., (xₙ, yₙ)}
- Goal: Find w that maximizes likelihood
Step 1: Likelihood
L(w, σ²) = ∏ᵢ (1/√(2πσ²)) exp(-(yᵢ - xᵢᵀw)²/(2σ²))
Step 2: Log-Likelihood
log L(w, σ²) = -n/2 log(2πσ²) - (1/(2σ²)) Σᵢ(yᵢ - xᵢᵀw)²
Step 3: Maximize with respect to w
Since σ² doesn't depend on w:
argmax_w log L(w, σ²) = argmin_w [Σᵢ(yᵢ - xᵢᵀw)²]
= argmin_w ||y - Xw||²
Step 4: Take Derivative and Set to Zero
∂/∂w [||y - Xw||²] = -2Xᵀ(y - Xw) = 0
Xᵀy = XᵀXw
w = (XᵀX)⁻¹Xᵀy
Result:
ŵ_MLE = (XᵀX)⁻¹Xᵀy (Ordinary Least Squares!)
Key Insight: MLE for linear regression with Gaussian noise = OLS!
Q3: Explain the connection between MLE and MAP.
Answer:
MLE:
θ̂_MLE = argmax_θ log P(D|θ)
MAP:
θ̂_MAP = argmax_θ [log P(D|θ) + log P(θ)]
= argmax_θ [log L(θ) + log P(θ)]
= MLE + log(prior)
Key Relationship:
MAP = MLE + Prior
When they're the same:
- Uniform prior: P(θ) = constant → log P(θ) = constant → MAP = MLE
- Large dataset: Data overwhelms prior → MAP ≈ MLE
When they differ:
- Small dataset: Prior has more influence
- Strong prior: MAP pulled toward prior mean
Regularization Connection:
- L2 (Ridge): MAP with Gaussian prior
- L1 (Lasso): MAP with Laplace prior
- Regularization = Bayesian prior
Q4: Derive MAP for linear regression with Gaussian prior (Ridge).
Answer:
Setup:
- Model: y = Xw + ε, where ε ~ N(0, σ²)
- Prior: w ~ N(0, σ²_prior I)
- Goal: Find w that maximizes posterior
Step 1: Prior
P(w) ∝ exp(-||w||²/(2σ²_prior))
Step 2: Likelihood
L(w) ∝ exp(-||y - Xw||²/(2σ²))
Step 3: Log-Posterior
log P(w|D) = log L(w) + log P(w) + constant
= -||y - Xw||²/(2σ²) - ||w||²/(2σ²_prior) + constant
Step 4: Maximize
Take derivative:
∂/∂w [log P(w|D)] = -1/σ² × Xᵀ(y - Xw) - 1/σ²_prior × w
Set to zero:
-1/σ² × Xᵀ(y - Xw) - 1/σ²_prior × w = 0
Xᵀ(y - Xw) = -σ²/σ²_prior × w
Xᵀy = (XᵀX + λI)w (where λ = σ²/σ²_prior)
w = (XᵀX + λI)⁻¹Xᵀy
Result:
ŵ_MAP = (XᵀX + λI)⁻¹Xᵀy (Ridge regression!)
Key Insight: MAP with Gaussian prior = Ridge regression (L2 regularization)!
Q5: Why do we use log-likelihood instead of likelihood?
Answer:
Reasons:
1. Numerical Stability:
- Likelihood: Product of many small probabilities → very small numbers
- Log-likelihood: Sum of logs → more stable
- Example: 0.1 × 0.1 × ... × 0.1 (100 times) = 10⁻¹⁰⁰ (underflow!)
- Log: log(0.1) + ... + log(0.1) = -100 × log(10) (manageable)
2. Mathematical Convenience:
- Products become sums: log(∏ᵢ Pᵢ) = Σᵢ log Pᵢ
- Derivatives easier: d/dθ log f(θ) = (1/f(θ)) × f'(θ)
- No product rule needed
3. Monotonicity:
- log is monotonic: maximizing log L(θ) = maximizing L(θ)
- Same maximum, easier optimization
4. Additive Properties:
- Log-likelihoods can be added: log L₁(θ) + log L₂(θ) = log(L₁(θ) × L₂(θ))
- Useful for combining datasets
Example:
L(θ) = 0.1 × 0.1 × 0.1 = 0.001 (hard to work with)
log L(θ) = log(0.1) + log(0.1) + log(0.1) = -3 × log(10) ≈ -6.91 (easier!)
Q6: What's the difference between MLE and MAP in practice?
Answer:
MLE:
- Approach: Frequentist
- Prior: None (or uniform)
- Result: Point estimate
- Use when: Large dataset, no prior knowledge
- Example: θ̂ = k/n for coin flip
MAP:
- Approach: Bayesian
- Prior: Informative prior
- Result: Point estimate (mode of posterior)
- Use when: Small dataset, have prior knowledge, need regularization
- Example: θ̂ = (k+α-1)/(n+α+β-2) for coin flip with Beta prior
Practical Differences:
1. Small Data:
- MLE: Can be extreme (e.g., 3/3 heads → θ = 1.0)
- MAP: More reasonable (incorporates prior)
2. Regularization:
- MLE: No regularization (can overfit)
- MAP: Natural regularization (prior prevents overfitting)
3. Uncertainty:
- MLE: No uncertainty estimate
- MAP: Can estimate uncertainty (though full Bayesian is better)
4. Computation:
- MLE: Usually simpler
- MAP: Similar complexity (just add prior term)
When to use:
- MLE: Large data, simple models, no prior knowledge
- MAP: Small data, need regularization, have domain knowledge
Q7: How does MAP relate to regularization?
Answer:
Key Insight:
Regularization = Bayesian Prior
Regularized Loss = -log Posterior
L2 Regularization (Ridge):
Loss = MSE + λ||w||²
= -log L(w) + λ||w||²
= -[log L(w) + log P(w)] (where P(w) ∝ exp(-λ||w||²))
This is MAP with Gaussian prior!
L1 Regularization (Lasso):
Loss = MSE + λ||w||₁
= -[log L(w) + log P(w)] (where P(w) ∝ exp(-λ||w||₁))
This is MAP with Laplace prior!
Interpretation:
- Regularization strength λ: Controls prior strength
- Higher λ: Stronger prior → more regularization
- Lower λ: Weaker prior → less regularization
Why this matters:
- Understanding regularization as prior helps choose λ
- Bayesian interpretation provides theoretical foundation
- Can use Bayesian methods to learn λ from data
Summary
Key Points:
- MLE: Maximizes P(data|parameters) - frequentist approach
- MAP: Maximizes P(parameters|data) - Bayesian approach
- Relationship: MAP = MLE + log(prior)
- Regularization: MAP with appropriate prior
- Use MLE: Large data, no prior knowledge
- Use MAP: Small data, need regularization, have prior knowledge
Understanding MLE and MAP is fundamental to understanding:
- How models learn from data
- Why regularization works
- Bayesian vs Frequentist thinking
- Connection between different ML approaches