MLE and MAP: Detailed Derivations with Intuitive Explanations

Maximum Likelihood Estimation (MLE)
Maximum A Posteriori (MAP)
Connection: MLE vs MAP
Examples
When to Use Each

Maximum Likelihood Estimation (MLE)

Intuitive Explanation

The Core Idea:

Imagine you're a detective trying to figure out what happened. You have some evidence (data), and you want to figure out what scenario (parameters) is most likely to have produced that evidence.

MLE asks: "Given the data I observed, what parameter values make this data most probable?"

Example:

You flip a coin 10 times and get 7 heads, 3 tails
MLE asks: "What probability of heads makes 7 heads out of 10 flips most likely?"
Answer: p = 0.7 (the observed proportion)

Key Insight: MLE finds the parameter values that maximize the probability of observing the data you actually saw.

Mathematical Setup

Given:

Data: D = {x₁, x₂, ..., xₙ}
Model: P(x|θ) (probability of data given parameters)
Goal: Find θ that maximizes P(D|θ)

Likelihood Function:

L(θ) = P(D|θ) = P(x₁, x₂, ..., xₙ | θ)

Key Point: Likelihood is a function of parameters θ, not data. We treat data as fixed and vary parameters.

Detailed Derivation

Step 1: Write the Likelihood

For independent observations:

L(θ) = P(x₁|θ) × P(x₂|θ) × ... × P(xₙ|θ)
     = ∏ᵢ P(xᵢ|θ)

Why product? Each observation is independent, so probability of all observations = product of individual probabilities.

Step 2: Log-Likelihood (Why We Use It)

Problem: Products are hard to work with (numerical issues, derivatives)

Solution: Take logarithm

log L(θ) = log ∏ᵢ P(xᵢ|θ)
         = Σᵢ log P(xᵢ|θ)

Why this works:

log is monotonic: maximizing log L(θ) = maximizing L(θ)
Products become sums (easier!)
More numerically stable

Step 3: Maximize Log-Likelihood

MLE estimate:

θ̂_MLE = argmax_θ log L(θ)
      = argmax_θ Σᵢ log P(xᵢ|θ)

How to find it:

Take derivative with respect to θ
Set derivative to zero
Solve for θ

Mathematical:

∂/∂θ [Σᵢ log P(xᵢ|θ)] = 0

This gives us the maximum likelihood estimate.

Example 1: MLE for Coin Flip (Bernoulli)

Setup:

Data: n flips, k heads, (n-k) tails
Model: P(heads) = θ, P(tails) = 1-θ
Goal: Find θ that maximizes likelihood

Step 1: Likelihood

L(θ) = θᵏ × (1-θ)ⁿ⁻ᵏ

Why?
- k heads: each has probability θ → θᵏ
- (n-k) tails: each has probability (1-θ) → (1-θ)ⁿ⁻ᵏ
- Independent: multiply them

Step 2: Log-Likelihood

log L(θ) = log[θᵏ × (1-θ)ⁿ⁻ᵏ]
         = k log θ + (n-k) log(1-θ)

Step 3: Take Derivative

d/dθ [log L(θ)] = d/dθ [k log θ + (n-k) log(1-θ)]
                = k/θ - (n-k)/(1-θ)

Step 4: Set to Zero

k/θ - (n-k)/(1-θ) = 0
k/θ = (n-k)/(1-θ)
k(1-θ) = (n-k)θ
k - kθ = nθ - kθ
k = nθ
θ = k/n

Result:

θ̂_MLE = k/n = (number of heads) / (total flips)

Intuition: The MLE estimate is simply the observed proportion! This makes sense: if you see 7 heads out of 10, the best estimate is 0.7.

Example 2: MLE for Normal Distribution

Setup:

Data: {x₁, x₂, ..., xₙ} from N(μ, σ²)
Goal: Find μ and σ² that maximize likelihood

Step 1: Likelihood

L(μ, σ²) = ∏ᵢ (1/√(2πσ²)) exp(-(xᵢ-μ)²/(2σ²))
         = (1/√(2πσ²))ⁿ exp(-Σᵢ(xᵢ-μ)²/(2σ²))

Step 2: Log-Likelihood

log L(μ, σ²) = -n/2 log(2πσ²) - Σᵢ(xᵢ-μ)²/(2σ²)

Step 3: Maximize with respect to μ

Take derivative with respect to μ:

∂/∂μ [log L(μ, σ²)] = ∂/∂μ [-n/2 log(2πσ²) - Σᵢ(xᵢ-μ)²/(2σ²)]
                    = -1/(2σ²) × ∂/∂μ [Σᵢ(xᵢ-μ)²]
                    = -1/(2σ²) × Σᵢ 2(xᵢ-μ)(-1)
                    = 1/σ² × Σᵢ(xᵢ-μ)

Set to zero:

1/σ² × Σᵢ(xᵢ-μ) = 0
Σᵢ(xᵢ-μ) = 0
Σᵢ xᵢ - nμ = 0
μ = (1/n) Σᵢ xᵢ = x̄

Result:

μ̂_MLE = x̄ = (1/n) Σᵢ xᵢ  (sample mean)

Step 4: Maximize with respect to σ²

Take derivative with respect to σ²:

∂/∂σ² [log L(μ, σ²)] = -n/(2σ²) + Σᵢ(xᵢ-μ)²/(2(σ²)²)

Set to zero:

-n/(2σ²) + Σᵢ(xᵢ-μ)²/(2(σ²)²) = 0
-nσ² + Σᵢ(xᵢ-μ)² = 0
σ² = (1/n) Σᵢ(xᵢ-μ)²

Result:

σ̂²_MLE = (1/n) Σᵢ(xᵢ-μ̂)²  (sample variance, biased)

Note: This is the biased estimator. The unbiased version divides by (n-1) instead of n.

Intuition:

MLE for mean = sample mean (makes sense!)
MLE for variance = average squared deviation from mean

Example 3: MLE for Linear Regression

Setup:

Model: y = Xw + ε, where ε ~ N(0, σ²)
Data: {(x₁, y₁), ..., (xₙ, yₙ)}
Goal: Find w that maximizes likelihood

Step 1: Likelihood

For each data point:

P(yᵢ|xᵢ, w, σ²) = (1/√(2πσ²)) exp(-(yᵢ - xᵢᵀw)²/(2σ²))

For all data (independent):

L(w, σ²) = ∏ᵢ (1/√(2πσ²)) exp(-(yᵢ - xᵢᵀw)²/(2σ²))

Step 2: Log-Likelihood

log L(w, σ²) = -n/2 log(2πσ²) - (1/(2σ²)) Σᵢ(yᵢ - xᵢᵀw)²

Step 3: Maximize with respect to w

Since σ² doesn't depend on w, we can ignore the first term:

argmax_w log L(w, σ²) = argmax_w [- (1/(2σ²)) Σᵢ(yᵢ - xᵢᵀw)²]
                     = argmin_w [Σᵢ(yᵢ - xᵢᵀw)²]
                     = argmin_w ||y - Xw||²

Result:

ŵ_MLE = (XᵀX)⁻¹Xᵀy  (ordinary least squares!)

Key Insight: MLE for linear regression with Gaussian noise = Ordinary Least Squares (OLS)!

Intuition:

We assume errors are normally distributed
Maximizing likelihood = minimizing sum of squared errors
This is exactly what OLS does!

Maximum A Posteriori (MAP)

Intuitive Explanation

The Core Idea:

MLE asks: "What parameters make the data most likely?"

MAP asks: "What parameters are most likely given both the data AND my prior beliefs?"

Example:

You flip a coin 3 times and get 3 heads
MLE says: p = 1.0 (100% heads) - but this seems wrong!
MAP says: "Wait, I know coins are usually fair (p ≈ 0.5), so maybe p = 0.7" - incorporates prior knowledge

Key Insight: MAP = MLE + Prior Beliefs

Mathematical Setup

Bayes' Theorem:

P(θ|D) = P(D|θ) × P(θ) / P(D)

Where:
- P(θ|D): Posterior (what we want)
- P(D|θ): Likelihood (same as MLE)
- P(θ): Prior (our beliefs before seeing data)
- P(D): Evidence (normalizing constant, doesn't depend on θ)

MAP Estimate:

θ̂_MAP = argmax_θ P(θ|D)
      = argmax_θ P(D|θ) × P(θ)  (P(D) doesn't depend on θ)
      = argmax_θ [log P(D|θ) + log P(θ)]
      = argmax_θ [log L(θ) + log P(θ)]

Key Insight: MAP = MLE + log(prior)

Detailed Derivation

Step 1: Write the Posterior

P(θ|D) ∝ P(D|θ) × P(θ)
      ∝ L(θ) × P(θ)

Why proportional? P(D) is constant with respect to θ, so we can ignore it when maximizing.

Step 2: Log-Posterior

log P(θ|D) = log L(θ) + log P(θ) + constant

Step 3: Maximize Log-Posterior

θ̂_MAP = argmax_θ [log L(θ) + log P(θ)]

How to find it:

Take derivative with respect to θ
Set derivative to zero
Solve for θ

Mathematical:

∂/∂θ [log L(θ) + log P(θ)] = 0

Example 1: MAP for Coin Flip with Beta Prior

Setup:

Data: n flips, k heads
Prior: θ ~ Beta(α, β) (conjugate prior for Bernoulli)
Goal: Find θ that maximizes posterior

Step 1: Prior

P(θ) = (θ^(α-1) × (1-θ)^(β-1)) / B(α, β)

Where B(α, β) is the Beta function (normalizing constant)

Step 2: Likelihood (same as MLE)

L(θ) = θᵏ × (1-θ)ⁿ⁻ᵏ

Step 3: Posterior

P(θ|D) ∝ L(θ) × P(θ)
      ∝ θᵏ × (1-θ)ⁿ⁻ᵏ × θ^(α-1) × (1-θ)^(β-1)
      ∝ θ^(k+α-1) × (1-θ)^(n-k+β-1)

This is Beta(k+α, n-k+β)!

Step 4: MAP Estimate

Mode of Beta(α', β') is:

θ_mode = (α' - 1) / (α' + β' - 2)

So:

θ̂_MAP = (k + α - 1) / (n + α + β - 2)

Special Case: Uniform Prior (α=1, β=1)

θ̂_MAP = (k + 1 - 1) / (n + 1 + 1 - 2)
      = k / n
      = θ̂_MLE

Intuition:

With uniform prior (no prior beliefs), MAP = MLE
With informative prior, MAP incorporates prior knowledge
Prior acts like "pseudo-observations": α-1 heads, β-1 tails

Example 2: MAP for Linear Regression with Gaussian Prior

Setup:

Model: y = Xw + ε, where ε ~ N(0, σ²)
Prior: w ~ N(0, σ²_prior I)
Goal: Find w that maximizes posterior

Step 1: Prior

P(w) ∝ exp(-||w||²/(2σ²_prior))

Step 2: Likelihood (same as MLE)

L(w) ∝ exp(-||y - Xw||²/(2σ²))

Step 3: Log-Posterior

log P(w|D) = log L(w) + log P(w) + constant
           = -||y - Xw||²/(2σ²) - ||w||²/(2σ²_prior) + constant

Step 4: Maximize

Take derivative:

∂/∂w [log P(w|D)] = -1/σ² × Xᵀ(y - Xw) - 1/σ²_prior × w

Set to zero:

-1/σ² × Xᵀ(y - Xw) - 1/σ²_prior × w = 0
Xᵀ(y - Xw) = -σ²/σ²_prior × w
Xᵀy - XᵀXw = -λw  (where λ = σ²/σ²_prior)
Xᵀy = (XᵀX + λI)w
w = (XᵀX + λI)⁻¹Xᵀy

Result:

ŵ_MAP = (XᵀX + λI)⁻¹Xᵀy  (Ridge regression!)

Key Insight: MAP for linear regression with Gaussian prior = Ridge regression (L2 regularization)!

Intuition:

Prior: w ~ N(0, σ²_prior) means we believe parameters should be small
This is exactly L2 regularization!
λ = σ²/σ²_prior controls regularization strength

Connection: MLE vs MAP

Mathematical Relationship

MLE:

θ̂_MLE = argmax_θ log L(θ)
      = argmax_θ log P(D|θ)

MAP:

θ̂_MAP = argmax_θ [log L(θ) + log P(θ)]
      = argmax_θ [log P(D|θ) + log P(θ)]
      = MLE + Prior

Key Insight: MAP = MLE + log(prior)

When They're the Same

1. Uniform Prior:

P(θ) = constant
log P(θ) = constant
MAP = MLE (prior doesn't affect optimization)

2. Large Dataset:

Data overwhelms prior
MAP ≈ MLE (data dominates)

When They Differ

1. Small Dataset:

Prior has more influence
MAP can be very different from MLE

2. Strong Prior:

Prior strongly influences result
MAP pulled toward prior mean

Regularization Connection

L2 Regularization (Ridge):

Loss = MSE + λ||w||²
     = -log L(w) + λ||w||²
     = -[log L(w) + log P(w)]  (where P(w) ∝ exp(-λ||w||²))

This is exactly MAP with Gaussian prior!

L1 Regularization (Lasso):

Loss = MSE + λ||w||₁
     = -[log L(w) + log P(w)]  (where P(w) ∝ exp(-λ||w||₁))

This is exactly MAP with Laplace prior!

Key Insight: Regularization = Bayesian prior Regularized optimization = MAP estimation

Examples

Example 1: Coin Flip Comparison

Data: 3 flips, 3 heads

MLE:

θ̂_MLE = 3/3 = 1.0  (100% heads)

MAP with Beta(2, 2) prior (slightly favors fair coin):

θ̂_MAP = (3 + 2 - 1) / (3 + 2 + 2 - 2) = 4/5 = 0.8

MAP with Beta(10, 10) prior (strongly favors fair coin):

θ̂_MAP = (3 + 10 - 1) / (3 + 10 + 10 - 2) = 12/21 ≈ 0.57

Intuition:

MLE: Only looks at data → extreme estimate
MAP: Incorporates prior → more reasonable estimate
Stronger prior → closer to prior mean (0.5)

Example 2: Linear Regression Comparison

Data: Small dataset, many features

MLE (OLS):

ŵ_MLE = (XᵀX)⁻¹Xᵀy

Can overfit with small data
Parameters can be very large

MAP (Ridge):

ŵ_MAP = (XᵀX + λI)⁻¹Xᵀy

Prior shrinks parameters toward 0
Prevents overfitting
More stable with small data

When to Use Each

Use MLE When:

Large dataset: Data overwhelms any prior
No prior knowledge: Don't have strong beliefs
Computational simplicity: MLE is simpler
Frequentist approach: Want point estimates only

Use MAP When:

Small dataset: Need to incorporate prior knowledge
Strong prior beliefs: Have domain knowledge
Regularization needed: Want to prevent overfitting
Bayesian approach: Want to incorporate uncertainty

Practical Guidelines:

For most ML problems:

Large data: MLE or MAP with weak prior (≈ MLE)
Small data: MAP with informative prior
Regularization: MAP (regularization = prior)

For research:

Frequentist: MLE
Bayesian: MAP (or full Bayesian inference)

Summary

MLE:

Maximizes P(data|parameters)
Frequentist approach
No prior beliefs
Simple, data-driven

MAP:

Maximizes P(parameters|data)
Bayesian approach
Incorporates prior beliefs
MAP = MLE + Prior

Connection:

Regularization = Bayesian prior
Ridge = MAP with Gaussian prior
Lasso = MAP with Laplace prior

Key Insight: Understanding MLE and MAP helps understand:

How models learn (MLE)
Why regularization works (MAP)
Bayesian vs Frequentist thinking

ML & LLM Interview Prep — Deep Dives

MLE and MAP: Detailed Derivations with Intuitive Explanations

Table of Contents

Maximum Likelihood Estimation (MLE)

Intuitive Explanation

Mathematical Setup

Detailed Derivation

Step 1: Write the Likelihood

Step 2: Log-Likelihood (Why We Use It)

Step 3: Maximize Log-Likelihood

Example 1: MLE for Coin Flip (Bernoulli)

Example 2: MLE for Normal Distribution

Example 3: MLE for Linear Regression

Maximum A Posteriori (MAP)

Intuitive Explanation

Mathematical Setup

Detailed Derivation

Step 1: Write the Posterior

Step 2: Log-Posterior

Step 3: Maximize Log-Posterior

Example 1: MAP for Coin Flip with Beta Prior

Example 2: MAP for Linear Regression with Gaussian Prior

Connection: MLE vs MAP

Mathematical Relationship

When They're the Same

When They Differ

Regularization Connection

Examples

Example 1: Coin Flip Comparison

Example 2: Linear Regression Comparison

When to Use Each

Use MLE When:

Use MAP When:

Practical Guidelines:

Summary