MLE and MAP: Detailed Derivations with Intuitive Explanations
Table of Contents
- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori (MAP)
- Connection: MLE vs MAP
- Examples
- When to Use Each
Maximum Likelihood Estimation (MLE)
Intuitive Explanation
The Core Idea:
Imagine you're a detective trying to figure out what happened. You have some evidence (data), and you want to figure out what scenario (parameters) is most likely to have produced that evidence.
MLE asks: "Given the data I observed, what parameter values make this data most probable?"
Example:
- You flip a coin 10 times and get 7 heads, 3 tails
- MLE asks: "What probability of heads makes 7 heads out of 10 flips most likely?"
- Answer: p = 0.7 (the observed proportion)
Key Insight: MLE finds the parameter values that maximize the probability of observing the data you actually saw.
Mathematical Setup
Given:
- Data: D = {x₁, x₂, ..., xₙ}
- Model: P(x|θ) (probability of data given parameters)
- Goal: Find θ that maximizes P(D|θ)
Likelihood Function:
L(θ) = P(D|θ) = P(x₁, x₂, ..., xₙ | θ)
Key Point: Likelihood is a function of parameters θ, not data. We treat data as fixed and vary parameters.
Detailed Derivation
Step 1: Write the Likelihood
For independent observations:
L(θ) = P(x₁|θ) × P(x₂|θ) × ... × P(xₙ|θ)
= ∏ᵢ P(xᵢ|θ)
Why product? Each observation is independent, so probability of all observations = product of individual probabilities.
Step 2: Log-Likelihood (Why We Use It)
Problem: Products are hard to work with (numerical issues, derivatives)
Solution: Take logarithm
log L(θ) = log ∏ᵢ P(xᵢ|θ)
= Σᵢ log P(xᵢ|θ)
Why this works:
- log is monotonic: maximizing log L(θ) = maximizing L(θ)
- Products become sums (easier!)
- More numerically stable
Step 3: Maximize Log-Likelihood
MLE estimate:
θ̂_MLE = argmax_θ log L(θ)
= argmax_θ Σᵢ log P(xᵢ|θ)
How to find it:
- Take derivative with respect to θ
- Set derivative to zero
- Solve for θ
Mathematical:
∂/∂θ [Σᵢ log P(xᵢ|θ)] = 0
This gives us the maximum likelihood estimate.
Example 1: MLE for Coin Flip (Bernoulli)
Setup:
- Data: n flips, k heads, (n-k) tails
- Model: P(heads) = θ, P(tails) = 1-θ
- Goal: Find θ that maximizes likelihood
Step 1: Likelihood
L(θ) = θᵏ × (1-θ)ⁿ⁻ᵏ
Why?
- k heads: each has probability θ → θᵏ
- (n-k) tails: each has probability (1-θ) → (1-θ)ⁿ⁻ᵏ
- Independent: multiply them
Step 2: Log-Likelihood
log L(θ) = log[θᵏ × (1-θ)ⁿ⁻ᵏ]
= k log θ + (n-k) log(1-θ)
Step 3: Take Derivative
d/dθ [log L(θ)] = d/dθ [k log θ + (n-k) log(1-θ)]
= k/θ - (n-k)/(1-θ)
Step 4: Set to Zero
k/θ - (n-k)/(1-θ) = 0
k/θ = (n-k)/(1-θ)
k(1-θ) = (n-k)θ
k - kθ = nθ - kθ
k = nθ
θ = k/n
Result:
θ̂_MLE = k/n = (number of heads) / (total flips)
Intuition: The MLE estimate is simply the observed proportion! This makes sense: if you see 7 heads out of 10, the best estimate is 0.7.
Example 2: MLE for Normal Distribution
Setup:
- Data: {x₁, x₂, ..., xₙ} from N(μ, σ²)
- Goal: Find μ and σ² that maximize likelihood
Step 1: Likelihood
L(μ, σ²) = ∏ᵢ (1/√(2πσ²)) exp(-(xᵢ-μ)²/(2σ²))
= (1/√(2πσ²))ⁿ exp(-Σᵢ(xᵢ-μ)²/(2σ²))
Step 2: Log-Likelihood
log L(μ, σ²) = -n/2 log(2πσ²) - Σᵢ(xᵢ-μ)²/(2σ²)
Step 3: Maximize with respect to μ
Take derivative with respect to μ:
∂/∂μ [log L(μ, σ²)] = ∂/∂μ [-n/2 log(2πσ²) - Σᵢ(xᵢ-μ)²/(2σ²)]
= -1/(2σ²) × ∂/∂μ [Σᵢ(xᵢ-μ)²]
= -1/(2σ²) × Σᵢ 2(xᵢ-μ)(-1)
= 1/σ² × Σᵢ(xᵢ-μ)
Set to zero:
1/σ² × Σᵢ(xᵢ-μ) = 0
Σᵢ(xᵢ-μ) = 0
Σᵢ xᵢ - nμ = 0
μ = (1/n) Σᵢ xᵢ = x̄
Result:
μ̂_MLE = x̄ = (1/n) Σᵢ xᵢ (sample mean)
Step 4: Maximize with respect to σ²
Take derivative with respect to σ²:
∂/∂σ² [log L(μ, σ²)] = -n/(2σ²) + Σᵢ(xᵢ-μ)²/(2(σ²)²)
Set to zero:
-n/(2σ²) + Σᵢ(xᵢ-μ)²/(2(σ²)²) = 0
-nσ² + Σᵢ(xᵢ-μ)² = 0
σ² = (1/n) Σᵢ(xᵢ-μ)²
Result:
σ̂²_MLE = (1/n) Σᵢ(xᵢ-μ̂)² (sample variance, biased)
Note: This is the biased estimator. The unbiased version divides by (n-1) instead of n.
Intuition:
- MLE for mean = sample mean (makes sense!)
- MLE for variance = average squared deviation from mean
Example 3: MLE for Linear Regression
Setup:
- Model: y = Xw + ε, where ε ~ N(0, σ²)
- Data: {(x₁, y₁), ..., (xₙ, yₙ)}
- Goal: Find w that maximizes likelihood
Step 1: Likelihood
For each data point:
P(yᵢ|xᵢ, w, σ²) = (1/√(2πσ²)) exp(-(yᵢ - xᵢᵀw)²/(2σ²))
For all data (independent):
L(w, σ²) = ∏ᵢ (1/√(2πσ²)) exp(-(yᵢ - xᵢᵀw)²/(2σ²))
Step 2: Log-Likelihood
log L(w, σ²) = -n/2 log(2πσ²) - (1/(2σ²)) Σᵢ(yᵢ - xᵢᵀw)²
Step 3: Maximize with respect to w
Since σ² doesn't depend on w, we can ignore the first term:
argmax_w log L(w, σ²) = argmax_w [- (1/(2σ²)) Σᵢ(yᵢ - xᵢᵀw)²]
= argmin_w [Σᵢ(yᵢ - xᵢᵀw)²]
= argmin_w ||y - Xw||²
Result:
ŵ_MLE = (XᵀX)⁻¹Xᵀy (ordinary least squares!)
Key Insight: MLE for linear regression with Gaussian noise = Ordinary Least Squares (OLS)!
Intuition:
- We assume errors are normally distributed
- Maximizing likelihood = minimizing sum of squared errors
- This is exactly what OLS does!
Maximum A Posteriori (MAP)
Intuitive Explanation
The Core Idea:
MLE asks: "What parameters make the data most likely?"
MAP asks: "What parameters are most likely given both the data AND my prior beliefs?"
Example:
- You flip a coin 3 times and get 3 heads
- MLE says: p = 1.0 (100% heads) - but this seems wrong!
- MAP says: "Wait, I know coins are usually fair (p ≈ 0.5), so maybe p = 0.7" - incorporates prior knowledge
Key Insight: MAP = MLE + Prior Beliefs
Mathematical Setup
Bayes' Theorem:
P(θ|D) = P(D|θ) × P(θ) / P(D)
Where:
- P(θ|D): Posterior (what we want)
- P(D|θ): Likelihood (same as MLE)
- P(θ): Prior (our beliefs before seeing data)
- P(D): Evidence (normalizing constant, doesn't depend on θ)
MAP Estimate:
θ̂_MAP = argmax_θ P(θ|D)
= argmax_θ P(D|θ) × P(θ) (P(D) doesn't depend on θ)
= argmax_θ [log P(D|θ) + log P(θ)]
= argmax_θ [log L(θ) + log P(θ)]
Key Insight: MAP = MLE + log(prior)
Detailed Derivation
Step 1: Write the Posterior
P(θ|D) ∝ P(D|θ) × P(θ)
∝ L(θ) × P(θ)
Why proportional? P(D) is constant with respect to θ, so we can ignore it when maximizing.
Step 2: Log-Posterior
log P(θ|D) = log L(θ) + log P(θ) + constant
Step 3: Maximize Log-Posterior
θ̂_MAP = argmax_θ [log L(θ) + log P(θ)]
How to find it:
- Take derivative with respect to θ
- Set derivative to zero
- Solve for θ
Mathematical:
∂/∂θ [log L(θ) + log P(θ)] = 0
Example 1: MAP for Coin Flip with Beta Prior
Setup:
- Data: n flips, k heads
- Prior: θ ~ Beta(α, β) (conjugate prior for Bernoulli)
- Goal: Find θ that maximizes posterior
Step 1: Prior
P(θ) = (θ^(α-1) × (1-θ)^(β-1)) / B(α, β)
Where B(α, β) is the Beta function (normalizing constant)
Step 2: Likelihood (same as MLE)
L(θ) = θᵏ × (1-θ)ⁿ⁻ᵏ
Step 3: Posterior
P(θ|D) ∝ L(θ) × P(θ)
∝ θᵏ × (1-θ)ⁿ⁻ᵏ × θ^(α-1) × (1-θ)^(β-1)
∝ θ^(k+α-1) × (1-θ)^(n-k+β-1)
This is Beta(k+α, n-k+β)!
Step 4: MAP Estimate
Mode of Beta(α', β') is:
θ_mode = (α' - 1) / (α' + β' - 2)
So:
θ̂_MAP = (k + α - 1) / (n + α + β - 2)
Special Case: Uniform Prior (α=1, β=1)
θ̂_MAP = (k + 1 - 1) / (n + 1 + 1 - 2)
= k / n
= θ̂_MLE
Intuition:
- With uniform prior (no prior beliefs), MAP = MLE
- With informative prior, MAP incorporates prior knowledge
- Prior acts like "pseudo-observations": α-1 heads, β-1 tails
Example 2: MAP for Linear Regression with Gaussian Prior
Setup:
- Model: y = Xw + ε, where ε ~ N(0, σ²)
- Prior: w ~ N(0, σ²_prior I)
- Goal: Find w that maximizes posterior
Step 1: Prior
P(w) ∝ exp(-||w||²/(2σ²_prior))
Step 2: Likelihood (same as MLE)
L(w) ∝ exp(-||y - Xw||²/(2σ²))
Step 3: Log-Posterior
log P(w|D) = log L(w) + log P(w) + constant
= -||y - Xw||²/(2σ²) - ||w||²/(2σ²_prior) + constant
Step 4: Maximize
Take derivative:
∂/∂w [log P(w|D)] = -1/σ² × Xᵀ(y - Xw) - 1/σ²_prior × w
Set to zero:
-1/σ² × Xᵀ(y - Xw) - 1/σ²_prior × w = 0
Xᵀ(y - Xw) = -σ²/σ²_prior × w
Xᵀy - XᵀXw = -λw (where λ = σ²/σ²_prior)
Xᵀy = (XᵀX + λI)w
w = (XᵀX + λI)⁻¹Xᵀy
Result:
ŵ_MAP = (XᵀX + λI)⁻¹Xᵀy (Ridge regression!)
Key Insight: MAP for linear regression with Gaussian prior = Ridge regression (L2 regularization)!
Intuition:
- Prior: w ~ N(0, σ²_prior) means we believe parameters should be small
- This is exactly L2 regularization!
- λ = σ²/σ²_prior controls regularization strength
Connection: MLE vs MAP
Mathematical Relationship
MLE:
θ̂_MLE = argmax_θ log L(θ)
= argmax_θ log P(D|θ)
MAP:
θ̂_MAP = argmax_θ [log L(θ) + log P(θ)]
= argmax_θ [log P(D|θ) + log P(θ)]
= MLE + Prior
Key Insight: MAP = MLE + log(prior)
When They're the Same
1. Uniform Prior:
- P(θ) = constant
- log P(θ) = constant
- MAP = MLE (prior doesn't affect optimization)
2. Large Dataset:
- Data overwhelms prior
- MAP ≈ MLE (data dominates)
When They Differ
1. Small Dataset:
- Prior has more influence
- MAP can be very different from MLE
2. Strong Prior:
- Prior strongly influences result
- MAP pulled toward prior mean
Regularization Connection
L2 Regularization (Ridge):
Loss = MSE + λ||w||²
= -log L(w) + λ||w||²
= -[log L(w) + log P(w)] (where P(w) ∝ exp(-λ||w||²))
This is exactly MAP with Gaussian prior!
L1 Regularization (Lasso):
Loss = MSE + λ||w||₁
= -[log L(w) + log P(w)] (where P(w) ∝ exp(-λ||w||₁))
This is exactly MAP with Laplace prior!
Key Insight: Regularization = Bayesian prior Regularized optimization = MAP estimation
Examples
Example 1: Coin Flip Comparison
Data: 3 flips, 3 heads
MLE:
θ̂_MLE = 3/3 = 1.0 (100% heads)
MAP with Beta(2, 2) prior (slightly favors fair coin):
θ̂_MAP = (3 + 2 - 1) / (3 + 2 + 2 - 2) = 4/5 = 0.8
MAP with Beta(10, 10) prior (strongly favors fair coin):
θ̂_MAP = (3 + 10 - 1) / (3 + 10 + 10 - 2) = 12/21 ≈ 0.57
Intuition:
- MLE: Only looks at data → extreme estimate
- MAP: Incorporates prior → more reasonable estimate
- Stronger prior → closer to prior mean (0.5)
Example 2: Linear Regression Comparison
Data: Small dataset, many features
MLE (OLS):
ŵ_MLE = (XᵀX)⁻¹Xᵀy
- Can overfit with small data
- Parameters can be very large
MAP (Ridge):
ŵ_MAP = (XᵀX + λI)⁻¹Xᵀy
- Prior shrinks parameters toward 0
- Prevents overfitting
- More stable with small data
When to Use Each
Use MLE When:
- Large dataset: Data overwhelms any prior
- No prior knowledge: Don't have strong beliefs
- Computational simplicity: MLE is simpler
- Frequentist approach: Want point estimates only
Use MAP When:
- Small dataset: Need to incorporate prior knowledge
- Strong prior beliefs: Have domain knowledge
- Regularization needed: Want to prevent overfitting
- Bayesian approach: Want to incorporate uncertainty
Practical Guidelines:
For most ML problems:
- Large data: MLE or MAP with weak prior (≈ MLE)
- Small data: MAP with informative prior
- Regularization: MAP (regularization = prior)
For research:
- Frequentist: MLE
- Bayesian: MAP (or full Bayesian inference)
Summary
MLE:
- Maximizes P(data|parameters)
- Frequentist approach
- No prior beliefs
- Simple, data-driven
MAP:
- Maximizes P(parameters|data)
- Bayesian approach
- Incorporates prior beliefs
- MAP = MLE + Prior
Connection:
- Regularization = Bayesian prior
- Ridge = MAP with Gaussian prior
- Lasso = MAP with Laplace prior
Key Insight: Understanding MLE and MAP helps understand:
- How models learn (MLE)
- Why regularization works (MAP)
- Bayesian vs Frequentist thinking