MLE and MAP Estimation — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

This topic underpins almost everything in classical and modern ML. Cross-entropy, ridge/lasso, Bayesian deep learning, RLHF reward modeling — all are MLE/MAP under specific likelihoods and priors. Senior interviews probe whether you can derive these cleanly, not just recognize them.

1. The likelihood function

Given iid data $X_{1}, \dots, X_{n} \sim p(\cdot \mid θ)$ and a parametric family $\{p(\cdot \mid θ) : θ \in Θ\}$:

$L(θ) = \prod_{i=1}^{n} p(x_{i} \mid θ), \qquad ℓ(θ) = \log L(θ) = \sum_{i=1}^{n} \log p(x_{i} \mid θ)$

The likelihood treats $θ$ as the variable and the data as fixed — opposite of how $p$ is usually written.

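Why log? The log turns products into sums, which are easier to differentiate and numerically stable (no underflow from multiplying many small probabilities). Because log is monotone, the maximizer is unchanged, and for many models (e.g. exponential families in their natural parameters) $ℓ$ is concave, so maximizing it is a tractable convex optimization problem.

MLE: $\hat{θ}_{MLE} = \arg\max_{θ} ℓ(θ)$.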
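The underflow point can be seen numerically. The sketch below is illustrative only: it uses a Bernoulli model (derived in the next section) and a made-up data set of 5000 flips; the function names are hypothetical.

```python
import math

def likelihood(xs, theta):
    """Raw Bernoulli likelihood: product of per-sample probabilities."""
    out = 1.0
    for x in xs:
        out *= theta if x == 1 else 1.0 - theta
    return out

def log_likelihood(xs, theta):
    """Bernoulli log-likelihood: sum of per-sample log-probabilities."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

# Hypothetical data set: 5000 flips, 60% ones.
xs = [1] * 3000 + [0] * 2000

raw = likelihood(xs, 0.6)      # product of 5000 numbers below 1 underflows
ll = log_likelihood(xs, 0.6)   # the sum of logs stays finite

print(raw, ll)
```

The raw product collapses to zero in double precision, so it cannot be used to compare parameter values, while the log-likelihood still ranks them correctly.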

2. Worked MLE derivations

Bernoulli

$x_{i} \in \{0, 1\}$, $p(x \mid θ) = θ^{x} (1 - θ)^{1 - x}$.

$ℓ(θ) = \sum_{i} [x_{i} \log θ + (1 - x_{i}) \log(1 - θ)] = s \log θ + (n - s) \log(1 - θ)$

where $s = \sum x_{i}$ . Setting $\partial ℓ / \partial θ = 0$ :

$\frac{s}{θ} - \frac{n - s}{1 - θ} = 0 \implies \hat{θ}_{MLE} = \frac{s}{n} = \bar{x}$

Pure intuition: MLE for Bernoulli is the empirical frequency.
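A quick numerical sanity check of the closed form, as a sketch: maximize the Bernoulli log-likelihood over a grid and compare with $s/n$. The counts (7 successes in 20 trials) and the grid resolution are arbitrary choices for illustration.

```python
import math

def bernoulli_log_likelihood(s, n, theta):
    """l(theta) = s log(theta) + (n - s) log(1 - theta)."""
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

s, n = 7, 20  # hypothetical data: 7 successes in 20 trials

# Brute-force search over the open interval (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(s, n, t))

theta_closed = s / n  # closed-form MLE

print(theta_grid, theta_closed)
```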

Gaussian (mean and variance unknown)

$ℓ(μ, σ^{2}) = -\frac{n}{2} \log(2π) - \frac{n}{2} \log σ^{2} - \frac{1}{2σ^{2}} \sum_{i} (x_{i} - μ)^{2}$

$\partial ℓ / \partial μ = 0$: $\hat{μ} = \bar{x}$.

$\partial ℓ / \partial σ^{2} = 0$: $\hat{σ}^{2} = \frac{1}{n} \sum_{i} (x_{i} - \bar{x})^{2}$.

The variance MLE divides by $n$, not $n - 1$. It is biased low: $E[\hat{σ}^{2}] = \frac{n - 1}{n} σ^{2}$. Bessel's correction (dividing by $n - 1$ instead) removes the bias.
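The bias can be checked exactly on a toy case. The sketch below enumerates every equally likely sample of size 3 from a fair coin (true variance $1/4$) and averages the $1/n$ variance estimate. The bias factor $(n-1)/n$ does not depend on the distribution, so the coin stands in for the Gaussian here; the setup is an illustrative assumption, not part of the derivation above.

```python
from fractions import Fraction
from itertools import product

def variance_mle(xs):
    """Variance estimate that divides by n (the MLE form)."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / n

n = 3
samples = list(product([0, 1], repeat=n))  # all 8 equally likely samples

# Exact expectation of the MLE variance over the sampling distribution.
expected_mle = sum(variance_mle(s) for s in samples) / len(samples)

true_var = Fraction(1, 4)  # variance of a fair coin
expected_corrected = expected_mle * Fraction(n, n - 1)  # Bessel's correction

print(expected_mle, expected_corrected, true_var)
```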

Multinomial

$x_{i}$ is a one-hot category among $K$ classes. Parameters $θ_{k}$ , $\sum_{k} θ_{k} = 1$ .

$ℓ(θ) = \sum_{i} \sum_{k} x_{i, k} \log θ_{k} = \sum_{k} n_{k} \log θ_{k}$

where $n_{k} = \sum_{i} x_{i, k}$ . With Lagrangian for the simplex constraint:

$\hat{θ}_{k} = n_{k} / n$

Empirical frequency of each category.
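For completeness, the Lagrangian step can be written out as follows. The multiplier symbol $ν$ is a notational choice made here, to avoid clashing with $λ$ used elsewhere.

$\mathcal{L}(θ, ν) = \sum_{k} n_{k} \log θ_{k} + ν (1 - \sum_{k} θ_{k})$

$\partial \mathcal{L} / \partial θ_{k} = \frac{n_{k}}{θ_{k}} - ν = 0 \implies θ_{k} = \frac{n_{k}}{ν}$

$\sum_{k} θ_{k} = 1 \implies ν = \sum_{k} n_{k} = n \implies \hat{θ}_{k} = \frac{n_{k}}{n}$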

Poisson

$p(x \mid λ) = e^{-λ} λ^{x} / x!$. $ℓ(λ) = \sum_{i} [-λ + x_{i} \log λ - \log x_{i}!]$.

Setting $\partial ℓ / \partial λ = -n + \frac{1}{λ} \sum_{i} x_{i} = 0$ gives $\hat{λ}_{MLE} = \bar{x}$.

Linear regression

$y_{i} = w^{⊤} x_{i} + ϵ_{i}$ , $ϵ_{i} \sim N (0, σ^{2})$ . Likelihood is Gaussian:

$ℓ(w) = -\frac{1}{2σ^{2}} \sum_{i} (y_{i} - w^{⊤} x_{i})^{2} + \text{const}$

Maximizing $ℓ$ = minimizing squared error = OLS. When $X^{⊤} X$ is invertible:

$\hat{w}_{MLE} = (X^{⊤} X)^{-1} X^{⊤} y$

Key insight: OLS is MLE under Gaussian noise. The choice of squared loss isn't arbitrary — it's the negative log-likelihood of a Gaussian.
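A small check of this equivalence, sketched for the scalar case (one feature, no intercept), where $(X^{⊤}X)^{-1}X^{⊤}y$ reduces to $\sum x_{i} y_{i} / \sum x_{i}^{2}$. The data and the value of `sigma` are made up for illustration.

```python
import math

# Hypothetical 1-D regression data (no intercept).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigma = 1.0  # assumed noise scale; it does not change the argmax over w

def neg_log_likelihood(w):
    """Gaussian negative log-likelihood of the data for slope w."""
    n = len(xs)
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse / (2 * sigma ** 2)

# Scalar form of the normal equations.
w_ols = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# The NLL is higher on either side of the OLS solution.
print(w_ols, neg_log_likelihood(w_ols),
      neg_log_likelihood(w_ols - 0.01), neg_log_likelihood(w_ols + 0.01))
```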

Logistic regression

$y_{i} \in \{0, 1\}$, $p(y = 1 \mid x) = σ(w^{⊤} x)$, where $σ(\cdot)$ here is the logistic sigmoid, not a noise scale.

$ℓ(w) = \sum_{i} [y_{i} \log σ(w^{⊤} x_{i}) + (1 - y_{i}) \log(1 - σ(w^{⊤} x_{i}))]$

This is the negative of the binary cross-entropy loss, so MLE = minimize cross-entropy. There is no closed form, but $ℓ$ is concave, so iteratively reweighted least squares (IRLS, i.e. Newton's method) or gradient descent on $-ℓ$ works.
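As a rough sketch of the gradient route, the snippet below runs plain gradient ascent on the logistic log-likelihood for one feature plus a bias. The data set, step size, and iteration count are arbitrary assumptions; the classes are deliberately overlapping so that a finite maximizer exists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D data with overlapping classes.
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

def log_likelihood(w, b):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def gradient(w, b):
    """Gradient of the log-likelihood: sum of (y - p) times the input."""
    gw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
    gb = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys))
    return gw, gb

w, b = 0.0, 0.0
lr = 0.1  # assumed step size, small enough for this data
for _ in range(5000):
    gw, gb = gradient(w, b)
    w += lr * gw  # ascent: move along the gradient of l
    b += lr * gb

print(w, b, log_likelihood(w, b))
```

At the MLE the gradient vanishes, which is the score equation $\sum_{i} (y_{i} - σ(w^{⊤} x_{i})) x_{i} = 0$.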

3. Asymptotic theory of MLE

Under regularity conditions (smooth likelihood, identifiable, true $θ_{0}$ in interior of parameter space):

Consistency: $\hat{θ}_{n} \to_{p} θ_{0}$ .

Asymptotic normality:

$\sqrt{n} (\hat{θ}_{n} - θ_{0}) \to_{d} N(0, I(θ_{0})^{-1})$

where $I(θ) = -E_{x} [\partial^{2} \log p(x \mid θ) / \partial θ^{2}]$ is the Fisher information per observation. (Defining it on the joint log-likelihood $ℓ = \sum_{i} \log p(x_{i} \mid θ)$ would scale with $n$ and contradict the formula above; per-observation is correct.)

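Asymptotic efficiency: the asymptotic variance $I(θ_{0})^{-1}$ attains the Cramér-Rao lower bound.

Invariance: if $\hat{θ}_{MLE}$ is the MLE of $θ$, then $g(\hat{θ}_{MLE})$ is the MLE of $g(θ)$. So the MLE of the standard deviation is the square root of the MLE of the variance. Unlike the other properties, this one holds at any sample size.

These properties make MLE the default estimator in classical ML, but consistency, normality, and efficiency are only asymptotic guarantees. In finite samples the MLE can be biased and can overfit, and the likelihood itself can be unbounded (e.g. a Gaussian mixture component collapsing onto a single data point).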
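A simulation can make the normality statement concrete. For a Bernoulli model the per-observation Fisher information is $I(θ) = 1 / (θ (1 - θ))$, so the scaled error $\sqrt{n} (\hat{θ}_{n} - θ_{0})$ should have variance close to $θ_{0} (1 - θ_{0})$. The sketch below checks this; the seed, sample size, and replication count are arbitrary choices.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable
theta0 = 0.3     # assumed true parameter
n = 400          # sample size per simulated data set
reps = 2000      # number of simulated data sets

scaled_errors = []
for _ in range(reps):
    s = sum(1 for _ in range(n) if random.random() < theta0)
    theta_hat = s / n  # Bernoulli MLE
    scaled_errors.append((n ** 0.5) * (theta_hat - theta0))

empirical_mean = statistics.fmean(scaled_errors)
empirical_var = statistics.pvariance(scaled_errors)
asymptotic_var = theta0 * (1 - theta0)  # inverse Fisher information

print(empirical_mean, empirical_var, asymptotic_var)
```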

4. Bayesian setup and MAP

Bayes' theorem applied to a parameter:

$p (θ ∣ x) = \frac{p ( x ∣ θ ) p ( θ )}{p ( x )}$

The MAP estimate maximizes the posterior:

$\hat{θ}_{MAP} = \arg\max_{θ} p(θ \mid x) = \arg\max_{θ} [\log p(x \mid θ) + \log p(θ)]$

Equivalent to MLE plus a regularizer that comes from the log-prior. Several important consequences:

Gaussian prior → Ridge

$p(w) = N(0, τ^{2} I)$. Then $\log p(w) = -\frac{1}{2τ^{2}} \|w\|^{2} + \text{const}$.

For linear regression with Gaussian likelihood:

$\hat{w}_{MAP} = \arg\min_{w} [\frac{1}{2σ^{2}} \|y - Xw\|^{2} + \frac{1}{2τ^{2}} \|w\|^{2}]$

Multiplying through by $2σ^{2}$ gives the ridge objective $\|y - Xw\|^{2} + λ \|w\|^{2}$ with $λ = σ^{2} / τ^{2}$.
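A scalar sketch of the ridge/MAP correspondence (one feature, no intercept), where the closed form is $\sum x_{i} y_{i} / (\sum x_{i}^{2} + λ)$. The data and the values of `sigma2` and `tau2` are assumptions made for illustration.

```python
# Hypothetical 1-D regression data (no intercept).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigma2 = 1.0  # assumed noise variance
tau2 = 0.5    # assumed prior variance on w
lam = sigma2 / tau2  # implied ridge strength

def neg_log_posterior(w):
    """Negative log-posterior up to constants: NLL plus Gaussian log-prior term."""
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * sigma2) + w * w / (2 * tau2)

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

w_map = sxy / (sxx + lam)  # ridge solution
w_mle = sxy / sxx          # unregularized OLS solution

print(w_mle, w_map)
```

The MAP estimate is pulled toward zero relative to the MLE, and the pull grows as the prior variance shrinks.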

Laplace prior → Lasso

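$p(w_{j}) \propto \exp(-|w_{j}| / b)$, independently for each coordinate. The log-prior is $-\frac{1}{b} \sum_{j} |w_{j}| + \text{const} = -\frac{1}{b} \|w\|_{1} + \text{const}$, an $ℓ_{1}$ penalty. Lasso = MAP under a Laplace prior.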
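Written out for linear regression with Gaussian noise of variance $σ^{2}$, the MAP objective and the implied penalty strength look like this. The value of $λ$ depends on the scaling convention; the one below assumes the common $\frac{1}{2} \|y - Xw\|^{2} + λ \|w\|_{1}$ form.

$\hat{w}_{MAP} = \arg\min_{w} [\frac{1}{2σ^{2}} \|y - Xw\|^{2} + \frac{1}{b} \|w\|_{1}]$

$= \arg\min_{w} [\frac{1}{2} \|y - Xw\|^{2} + λ \|w\|_{1}], \qquad λ = \frac{σ^{2}}{b}$

A tighter prior (smaller $b$) or noisier data (larger $σ^{2}$) both increase the penalty.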

Beta prior + Bernoulli → Smoothed estimate

Prior $θ \sim Beta (α, β)$ . Posterior $θ ∣ x \sim Beta (α + s, β + n - s)$ .

MAP: mode of Beta = $\frac{α + s - 1}{α + β + n - 2}$ (when $α, β > 1$ ).

Posterior mean: $\frac{α + s}{α + β + n}$ — gives a smoothed estimate. With $α = β = 1$ (uniform prior), posterior mean is $(s + 1) / (n + 2)$ — Laplace smoothing.

This is exactly what NLP people call add-one smoothing.
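The update and the two point estimates can be sketched with exact fractions. The helper names are hypothetical, and the example (3 successes in 3 trials) is chosen because the MLE there is an overconfident 1.

```python
from fractions import Fraction

def beta_bernoulli_posterior(alpha, beta, s, n):
    """Beta(alpha, beta) prior + s successes in n trials -> Beta posterior parameters."""
    return alpha + s, beta + (n - s)

def beta_mean(a, b):
    return Fraction(a, a + b)

def beta_mode(a, b):
    """Mode of Beta(a, b); only defined this way when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return Fraction(a - 1, a + b - 2)

s, n = 3, 3  # hypothetical data: 3 successes in 3 trials, MLE = 1

# Uniform prior Beta(1, 1): the posterior mean is Laplace's (s + 1) / (n + 2).
a1, b1 = beta_bernoulli_posterior(1, 1, s, n)
laplace_estimate = beta_mean(a1, b1)

# An assumed Beta(3, 3) prior: MAP is the posterior mode.
a3, b3 = beta_bernoulli_posterior(3, 3, s, n)
map_estimate = beta_mode(a3, b3)
mean_estimate = beta_mean(a3, b3)

print(laplace_estimate, map_estimate, mean_estimate)
```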

5. Conjugate priors — the catalog

Likelihood	Conjugate prior	Posterior
Bernoulli	Beta( $α, β$ )	Beta( $α + s, β + n - s$ )
Multinomial	Dirichlet( $α$ )	Dirichlet( $α + n$ ), where $n = (n_{1}, \dots, n_{K})$ are per-category counts
Poisson	Gamma( $α, β$ ), rate $β$	Gamma( $α + \sum x_{i}, β + n$ )
Gaussian (unknown mean, known variance)	Gaussian	Gaussian
Gaussian (unknown variance, known mean)	Inverse-Gamma	Inverse-Gamma
Gaussian (both unknown)	Normal-Inverse-Gamma (or Normal-Inverse-Wishart in the multivariate case)	Same family
Exponential (rate $λ$ )	Gamma( $α, β$ ), rate $β$	Gamma( $α + n, β + \sum x_{i}$ )

Conjugate priors give closed-form posteriors. They also yield clean intuition: hyperparameters of the prior look like pseudo-counts — the prior acts as if you'd seen some imaginary data before.
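The pseudo-count reading can be illustrated with the Gamma-Poisson row of the table, assuming the rate parameterization (mean $α / β$). The prior values and observed counts below are made up.

```python
from fractions import Fraction

def gamma_poisson_update(alpha, beta, xs):
    """Gamma(alpha, rate=beta) prior + Poisson counts xs -> Gamma posterior parameters."""
    return alpha + sum(xs), beta + len(xs)

# Hypothetical prior: behaves like 2 events already seen over 1 interval.
alpha0, beta0 = 2, 1
xs = [4, 2, 3, 5, 1]  # hypothetical observed counts

alpha_n, beta_n = gamma_poisson_update(alpha0, beta0, xs)

prior_mean = Fraction(alpha0, beta0)
posterior_mean = Fraction(alpha_n, beta_n)  # (alpha0 + sum x) / (beta0 + n)
mle = Fraction(sum(xs), len(xs))            # sample mean

print(prior_mean, posterior_mean, mle)
```

The posterior mean sits between the prior mean and the MLE, and moves toward the MLE as real observations outnumber the pseudo-observations.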

6. MLE vs MAP vs Bayesian — the spectrum

Method	Output	Captures uncertainty?	Computational cost	When
MLE	Point estimate	No	Cheap (optimization)	Lots of data, no strong prior
MAP	Point estimate	No (just the mode)	Cheap (optimization with regularizer)	Want regularization with Bayesian interpretation
Bayesian	Full posterior	Yes	Expensive (MCMC/VI)	Need uncertainty, decision-theoretic problems

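Modern deep learning is almost entirely MLE (cross-entropy, MSE) plus MAP-style regularization (weight decay as a Gaussian prior). Dropout is better read as an approximate-Bayesian technique than as a literal prior. True Bayesian deep learning is a research area (Bayesian NNs, dropout-as-Bayes-approx, deep ensembles for posterior approximation).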

7. Why MLE = minimum cross-entropy = minimum forward KL

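For data drawn from a true distribution $p^{*}$, the average log-likelihood $\frac{1}{n} ℓ(θ)$ converges to $E_{x \sim p^{*}} [\log p(x \mid θ)]$ as $n$ grows, so in the large-sample limit MLE solves:

$\arg\max_{θ} E_{x \sim p^{*}} [\log p(x \mid θ)] = \arg\min_{θ} E_{x \sim p^{*}} [-\log p(x \mid θ)]$

The expectation being minimized on the right is the cross-entropy $H(p^{*}, p_{θ})$ between $p^{*}$ and $p_{θ}$. Since $H(p^{*}, p_{θ}) = KL(p^{*} \| p_{θ}) + H(p^{*})$, this is equivalent to:

$= \arg\min_{θ} [KL(p^{*} \| p_{θ}) + H(p^{*})]$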

The $H (p^{*})$ term doesn't depend on $θ$ , so MLE = minimize forward KL from $p^{*}$ to model. This is the "mode-covering" KL — penalizes putting low probability on regions where $p^{*}$ is high. (Reverse KL would be mode-seeking; that's what variational inference uses.)
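The decomposition is easy to verify numerically on a small discrete example. The two distributions below are arbitrary; the function names are illustrative.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), skipping zero-probability outcomes of p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    """Forward KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_star = [0.5, 0.3, 0.2]   # hypothetical true distribution
q_model = [0.4, 0.4, 0.2]  # hypothetical model distribution

lhs = cross_entropy(p_star, q_model)
rhs = kl(p_star, q_model) + entropy(p_star)

print(lhs, rhs)
```

Because the entropy term is fixed by the data distribution, any change in the model that lowers cross-entropy lowers the forward KL by exactly the same amount.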

8. Common interview gotchas

Question	Common wrong answer	Right answer
Is MLE always unbiased?	Yes	No — Gaussian variance MLE is biased; many MLEs are biased in finite samples
What's the relationship between MAP and regularization?	They're different	MAP = MLE + log-prior; weight decay = Gaussian prior; lasso = Laplace prior
What does cross-entropy minimize?	Cross-entropy	Forward KL (with constant offset $H (p^{*})$ )
MLE objective for OLS?	Minimize squared loss	MLE under Gaussian noise → minimizing squared loss
Why log-likelihood instead of likelihood?	Same thing	Numerics + sums vs products + matches concavity for many models
Why is MLE for variance biased?	It isn't	Plug-in $\bar{x}$ is closer to the data than $μ$ , so $\sum (x - \bar{x})^{2}$ is too small
MAP = mean of posterior?	Yes	No, it's the mode. Posterior mean is a different point estimator

9. Eight most-asked interview questions

Derive MLE for a Gaussian (both parameters). (Set partials to zero; recognize that variance MLE is biased.)
Show that OLS equals MLE under Gaussian noise. (Write log-likelihood, drop constants, recognize squared loss.)
Show that ridge equals MAP under Gaussian prior. (Write log-posterior, recognize $ℓ_{2}$ penalty with $λ = σ^{2} / τ^{2}$ .)
What's the relationship between cross-entropy and MLE? (CE = negative log-likelihood; minimizing CE = MLE.)
Bayesian smoothing for Bernoulli — derive Laplace's rule of succession. (Beta(1,1) prior + observed data → posterior mean $(s + 1) / (n + 2)$ .)
What's a conjugate prior and why is it useful? (Same-family posterior; closed-form updates; pseudo-count intuition.)
What are the asymptotic properties of MLE? (Consistent, asymp. normal, efficient, invariant.)
MAP vs Bayesian inference? (MAP gives a point estimate (mode); Bayesian gives full posterior + uncertainty; cost increases.)

10. Drill plan

Derive MLE for: Bernoulli, Gaussian, Poisson, multinomial, exponential. 5 minutes each.
Derive MAP for: linear regression with Gaussian prior (ridge), with Laplace prior (lasso).
Derive Beta-Bernoulli posterior. Recite mean and mode.
Recognize: OLS = MLE Gaussian, ridge = MAP Gaussian, lasso = MAP Laplace, cross-entropy = MLE general.
Practice writing log-likelihoods cleanly without dropping constants until the end.

11. Further reading

Murphy, Machine Learning: A Probabilistic Perspective, ch. 4–5.
Bishop, Pattern Recognition and Machine Learning, ch. 1, 2.
Wasserman, All of Statistics, ch. 9, 11.
MacKay, Information Theory, Inference, and Learning Algorithms — beautiful Bayesian framing.

ML & LLM Interview Prep — Deep Dives