MLE and MAP Estimation — Deep Dive
Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.
This topic underpins almost everything in classical and modern ML. Cross-entropy, ridge/lasso, Bayesian deep learning, RLHF reward modeling — all are MLE/MAP under specific likelihoods and priors. Senior interviews probe whether you can derive these cleanly, not just recognize them.
1. The likelihood function
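Given iid data $x_1, \dots, x_n$ and a parametric family $\{p(x \mid \theta)\}$:
$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \qquad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
The likelihood treats $\theta$ as the variable and the data as fixed, the opposite of how $p(x \mid \theta)$ is usually read.
Why log? Sums are easier to differentiate than products, they are numerically stable (no underflow), and for many models $-\ell(\theta)$ is convex, so convex optimization applies.
MLE: $\hat\theta_{\text{MLE}} = \arg\max_\theta \ell(\theta)$.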
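The underflow point is easy to see numerically. The sketch below uses a hypothetical Bernoulli model with made-up parameter `theta = 0.3` and 5000 simulated samples; the function names are illustrative only.

```python
import math
import random

random.seed(0)
theta = 0.3  # hypothetical parameter, chosen only for this demo
xs = [1 if random.random() < theta else 0 for _ in range(5000)]

def likelihood(theta, xs):
    """Raw product of per-sample probabilities."""
    out = 1.0
    for x in xs:
        out *= theta if x == 1 else 1.0 - theta
    return out

def log_likelihood(theta, xs):
    """Sum of per-sample log-probabilities."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

raw = likelihood(theta, xs)      # every factor is <= 0.7, so this underflows to 0.0
ll = log_likelihood(theta, xs)   # stays a finite negative number
print(raw, ll)
```

Once the product has underflowed, every candidate parameter looks equally bad, which is why optimizers work with the log-likelihood instead.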
2. Worked MLE derivations
Bernoulli
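$x_i \in \{0, 1\}$, $p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$.
$$\ell(\theta) = k \log\theta + (n-k)\log(1-\theta)$$
where $k = \sum_i x_i$. Setting $\partial\ell/\partial\theta = 0$:
$$\frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \quad\Longrightarrow\quad \hat\theta = \frac{k}{n}.$$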
Pure intuition: MLE for Bernoulli is the empirical frequency.
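A quick numerical sanity check of the closed form, assuming hypothetical counts of 37 successes in 100 trials: a brute-force grid search over the Bernoulli log-likelihood should land on the empirical frequency.

```python
import math

def bernoulli_log_lik(theta, k, n):
    """Log-likelihood of k successes in n Bernoulli trials."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

k, n = 37, 100  # hypothetical counts for illustration
grid = [i / 1000 for i in range(1, 1000)]  # open interval (0, 1)
theta_grid = max(grid, key=lambda t: bernoulli_log_lik(t, k, n))
theta_closed = k / n
print(theta_grid, theta_closed)
```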
Gaussian (mean and variance unknown)
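$x_i \sim \mathcal{N}(\mu, \sigma^2)$, with log-likelihood
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
$\partial\ell/\partial\mu = 0$: $\hat\mu = \frac{1}{n}\sum_i x_i = \bar{x}$.
$\partial\ell/\partial\sigma^2 = 0$: $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$.
The variance MLE has a $1/n$, not $1/(n-1)$. It's biased (too small): $\mathbb{E}[\hat\sigma^2] = \frac{n-1}{n}\sigma^2$. Bessel's correction unbiases it.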
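The bias shows up clearly in simulation. This is a minimal sketch with made-up settings (sample size 5, true standard deviation 2, 20000 repetitions): the average of the $1/n$ estimator should sit near $\frac{n-1}{n}\sigma^2 = 3.2$ rather than the true variance 4.

```python
import random

random.seed(0)
n, trials = 5, 20000   # small n makes the bias visible
sigma = 2.0            # hypothetical true standard deviation (variance 4.0)

def var_mle(xs):
    """Variance MLE: divides by n, not n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

mle_vals = []
for _ in range(trials):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    mle_vals.append(var_mle(xs))

avg_mle = sum(mle_vals) / trials        # should be close to (n-1)/n * 4.0 = 3.2
avg_bessel = avg_mle * n / (n - 1)      # Bessel-corrected, should be close to 4.0
print(avg_mle, avg_bessel)
```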
Multinomial
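$x_i$ is a one-hot category among $K$ classes. Parameters $\theta = (\theta_1, \dots, \theta_K)$, $\sum_k \theta_k = 1$.
$$\ell(\theta) = \sum_{k=1}^{K} n_k \log\theta_k$$
where $n_k$ is the number of observations in category $k$. With Lagrangian for the simplex constraint:
$$\hat\theta_k = \frac{n_k}{n}.$$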
Empirical frequency of each category.
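The Lagrangian step can be written out as follows. This is a sketch in which $\theta_k$ is the probability of category $k$, $n_k$ its observed count, $n = \sum_k n_k$, and $\lambda$ the multiplier for the simplex constraint (the symbol names are chosen here for illustration).

```latex
\begin{align*}
\mathcal{L}(\theta, \lambda) &= \sum_{k=1}^{K} n_k \log\theta_k + \lambda\Bigl(1 - \sum_{k=1}^{K}\theta_k\Bigr) \\
\frac{\partial\mathcal{L}}{\partial\theta_k} &= \frac{n_k}{\theta_k} - \lambda = 0
  \;\Longrightarrow\; \theta_k = \frac{n_k}{\lambda} \\
\sum_{k=1}^{K}\theta_k = 1 &\;\Longrightarrow\; \lambda = \sum_{k=1}^{K} n_k = n
  \;\Longrightarrow\; \hat\theta_k = \frac{n_k}{n}
\end{align*}
```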
Poisson
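$p(x \mid \lambda) = \lambda^{x} e^{-\lambda} / x!$. $\ell(\lambda) = \sum_i x_i \log\lambda - n\lambda - \sum_i \log x_i!$.
$\partial\ell/\partial\lambda = \frac{1}{\lambda}\sum_i x_i - n = 0 \;\Rightarrow\; \hat\lambda = \bar{x}$.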
Linear regression
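$y_i = w^\top x_i + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Likelihood is Gaussian:
$$\ell(w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^\top x_i)^2$$
Maximizing = minimizing squared error = OLS:
$$\hat{w} = (X^\top X)^{-1} X^\top y \quad \text{(when } X^\top X \text{ is invertible).}$$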
Key insight: OLS is MLE under Gaussian noise. The choice of squared loss isn't arbitrary — it's the negative log-likelihood of a Gaussian.
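To see the equivalence concretely, the sketch below fits OLS on synthetic data and checks that the gradient of the Gaussian negative log-likelihood vanishes at the OLS solution. The data, `w_true`, and `sigma` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])   # hypothetical ground-truth weights
sigma = 0.3                           # hypothetical noise standard deviation
y = X @ w_true + sigma * rng.normal(size=n)

def gaussian_nll_grad(w):
    """Gradient of the Gaussian negative log-likelihood with respect to w."""
    return -(X.T @ (y - X @ w)) / sigma**2

# OLS two ways: a least-squares solver and the normal equations.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# The OLS solution is a stationary point of the Gaussian NLL.
grad_at_ols = gaussian_nll_grad(w_ols)
print(w_ols, np.linalg.norm(grad_at_ols))
```

Note that `sigma` only rescales the gradient, so the maximizer over `w` does not depend on the noise level.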
Logistic regression
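$y_i \in \{0, 1\}$, $p(y_i = 1 \mid x_i, w) = s(w^\top x_i)$, where $s(z) = 1/(1 + e^{-z})$ is the sigmoid.
$$\ell(w) = \sum_{i=1}^{n} \bigl[\, y_i \log s(w^\top x_i) + (1 - y_i)\log\bigl(1 - s(w^\top x_i)\bigr) \bigr]$$
This is the negative of the binary cross-entropy loss, so MLE = minimize cross-entropy. No closed form; use iteratively reweighted least squares (IRLS) or gradient descent.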
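A minimal IRLS (Newton's method) sketch for the logistic MLE, on synthetic data. The function name `fit_logistic_irls`, the tiny `jitter` term added to the Hessian for numerical stability, and the data-generating weights are all assumptions made for this illustration, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, n_iter=25, jitter=1e-8):
    """Newton / IRLS updates for the logistic log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p)                 # gradient of the log-likelihood
        s = p * (1.0 - p)                    # per-sample IRLS weights
        H = X.T @ (X * s[:, None]) + jitter * np.eye(X.shape[1])
        w = w + np.linalg.solve(H, grad)     # Newton step
    return w

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
w_true = np.array([-0.5, 1.0, -2.0])                        # hypothetical weights
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_logistic_irls(X, y)
grad_at_hat = X.T @ (y - sigmoid(X @ w_hat))  # should be ~0 at the MLE
print(w_hat)
```

Plain Newton steps can diverge on separable data, where the MLE does not exist; a production fit would add a line search or regularization.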
3. Asymptotic theory of MLE
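Under regularity conditions (smooth likelihood, identifiable model, true parameter $\theta_0$ in the interior of the parameter space):
Consistency: $\hat\theta_n \xrightarrow{p} \theta_0$.
Asymptotic normality:
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}\bigl(0,\; I(\theta_0)^{-1}\bigr)$$
where $I(\theta) = -\mathbb{E}\bigl[\nabla_\theta^2 \log p(x \mid \theta)\bigr]$ is the Fisher information per observation. (Defining it on the joint log-likelihood would scale with $n$ and contradict the formula above; per-observation is correct.)
Asymptotic efficiency: the asymptotic variance achieves the Cramér-Rao bound.
Invariance: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$. So the MLE of the standard deviation is the square root of the MLE of the variance.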
These properties make MLE the default estimator in classical ML — but only asymptotically. Finite-sample MLE can be biased, can overfit, and can be unbounded.
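Asymptotic normality can be checked by simulation. The sketch below uses the exponential distribution with a hypothetical rate of 2, where the MLE of the rate is the reciprocal of the sample mean and the per-observation Fisher information is $1/\lambda^2$, so the variance of $\sqrt{n}(\hat\lambda - \lambda)$ should approach $\lambda^2 = 4$. Sample size and repetition count are arbitrary choices.

```python
import random
import statistics

random.seed(0)
lam = 2.0       # hypothetical true rate
n = 200         # samples per experiment
trials = 4000   # number of repeated experiments

scaled_errors = []
for _ in range(trials):
    xs = [random.expovariate(lam) for _ in range(n)]
    lam_hat = n / sum(xs)                        # MLE of the rate: 1 / sample mean
    scaled_errors.append(n ** 0.5 * (lam_hat - lam))

emp_var = statistics.pvariance(scaled_errors)    # empirical variance of sqrt(n) * error
asym_var = lam ** 2                              # inverse Fisher information
print(emp_var, asym_var)
```

The match is only approximate at finite $n$: this particular MLE is biased upward in small samples, which is one instance of the finite-sample caveat above.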
4. Bayesian setup and MAP
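Bayes' theorem applied to a parameter:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)$$
The MAP estimate maximizes the posterior:
$$\hat\theta_{\text{MAP}} = \arg\max_\theta \bigl[\log p(D \mid \theta) + \log p(\theta)\bigr]$$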
Equivalent to MLE plus a regularizer that comes from the log-prior. Several important consequences:
Gaussian prior → Ridge
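$w \sim \mathcal{N}(0, \tau^2 I)$. Then $\log p(w) = -\frac{1}{2\tau^2}\|w\|_2^2 + \text{const}$.
For linear regression with Gaussian likelihood (noise variance $\sigma^2$):
$$\hat{w}_{\text{MAP}} = \arg\min_w \; \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^\top x_i)^2 + \frac{1}{2\tau^2}\|w\|_2^2$$
Multiply through by $2\sigma^2$: ridge regression with $\lambda = \sigma^2/\tau^2$.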
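As a check on the ridge correspondence, the sketch below solves ridge in closed form with $\lambda = \sigma^2/\tau^2$ and verifies that the gradient of the negative log-posterior is zero there. The data and the values of `sigma` and `tau` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
sigma, tau = 0.5, 1.0                 # hypothetical noise std and prior std
X = rng.normal(size=(n, d))
w_true = tau * rng.normal(size=d)     # weights drawn from the Gaussian prior
y = X @ w_true + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2               # ridge penalty implied by the prior
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of -log p(y | X, w) - log p(w), dropping constants."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

grad_at_ridge = neg_log_posterior_grad(w_ridge)  # ~0: ridge solution is the MAP
print(w_ridge, np.linalg.norm(grad_at_ridge))
```

A tighter prior (smaller `tau`) raises `lam` and shrinks the weights harder, which is the usual regularization-strength intuition.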
Laplace prior → Lasso
$p(w) \propto \exp(-\|w\|_1 / b)$. Log-prior is $-\|w\|_1 / b + \text{const}$ → $\ell_1$ penalty. Lasso = MAP under Laplace prior.
Beta prior + Bernoulli → Smoothed estimate
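Prior $\theta \sim \text{Beta}(\alpha, \beta)$. After observing $k$ successes in $n$ trials, the posterior is $\text{Beta}(\alpha + k,\; \beta + n - k)$.
MAP: mode of Beta = $\frac{\alpha + k - 1}{\alpha + \beta + n - 2}$ (when $\alpha + k > 1$ and $\beta + n - k > 1$).
Posterior mean: $\frac{\alpha + k}{\alpha + \beta + n}$, which gives a smoothed estimate. With $\alpha = \beta = 1$ (uniform prior), the posterior mean is $\frac{k + 1}{n + 2}$: Laplace smoothing.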
This is exactly what NLP people call add-one smoothing.
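The update is a few lines of arithmetic. The sketch below uses hypothetical data (7 successes in 10 trials) and a uniform Beta(1, 1) prior; the helper names are illustrative.

```python
def beta_bernoulli_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n Bernoulli trials."""
    return alpha + k, beta + (n - k)

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    """Mode of Beta(a, b); only defined in the interior when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

k, n = 7, 10                                   # hypothetical data
a_post, b_post = beta_bernoulli_update(1, 1, k, n)
mle = k / n                                    # empirical frequency
post_mode = beta_mode(a_post, b_post)          # MAP; equals the MLE under a uniform prior
post_mean = beta_mean(a_post, b_post)          # (k + 1) / (n + 2), pulled toward 1/2
print(mle, post_mode, post_mean)
```

With the uniform prior the MAP coincides with the MLE, while the posterior mean is the smoothed add-one estimate; the difference matters most when $k = 0$ or $k = n$.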
5. Conjugate priors — the catalog
| Likelihood | Conjugate prior | Posterior |
|---|---|---|
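| Bernoulli | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$), with $k$ successes in $n$ trials |
| Multinomial | Dirichlet($\alpha_1, \dots, \alpha_K$) | Dirichlet($\alpha_1 + n_1, \dots, \alpha_K + n_K$), where $n_k$ are per-category counts |
| Poisson | Gamma($\alpha, \beta$) (shape, rate) | Gamma($\alpha + \sum_i x_i,\ \beta + n$) |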
| Gaussian (mean, variance known) | Gaussian | Gaussian |
| Gaussian (variance, mean known) | Inverse-Gamma | Inverse-Gamma |
| Gaussian (both unknown) | Normal-Inverse-Gamma (or Normal-Inverse-Wishart) | Same family |
| Exponential | Gamma | Gamma |
Conjugate priors give closed-form posteriors. They also yield clean intuition: hyperparameters of the prior look like pseudo-counts — the prior acts as if you'd seen some imaginary data before.
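The pseudo-count reading is visible in the Gamma-Poisson pair. This sketch assumes the shape-rate parameterization, a made-up prior Gamma(3, 1), and made-up counts; under that parameterization the prior behaves like 3 events seen over 1 prior observation.

```python
def gamma_poisson_update(alpha, beta, xs):
    """Posterior (shape, rate) for a Gamma(shape=alpha, rate=beta) prior on a Poisson rate."""
    return alpha + sum(xs), beta + len(xs)

alpha0, beta0 = 3.0, 1.0          # hypothetical prior: mean 3.0
xs = [2, 0, 3, 1, 4]              # hypothetical Poisson counts, sample mean 2.0

a_post, b_post = gamma_poisson_update(alpha0, beta0, xs)
prior_mean = alpha0 / beta0
mle = sum(xs) / len(xs)
post_mean = a_post / b_post       # lies between the MLE and the prior mean
print(prior_mean, mle, post_mean)
```

As more data arrive, the counts dominate the pseudo-counts and the posterior mean moves toward the MLE.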
6. MLE vs MAP vs Bayesian — the spectrum
| Method | Output | Captures uncertainty? | Computational cost | When |
|---|---|---|---|---|
| MLE | Point estimate | No | Cheap (optimization) | Lots of data, no strong prior |
| MAP | Point estimate | No (just the mode) | Cheap (optimization with regularizer) | Want regularization with Bayesian interpretation |
| Bayesian | Full posterior | Yes | Expensive (MCMC/VI) | Need uncertainty, decision-theoretic problems |
Modern deep learning is almost entirely MLE (cross-entropy, MSE) plus MAP-style regularization (weight decay corresponds to a Gaussian prior; dropout has only an approximate Bayesian reading). True Bayesian deep learning is a research area (Bayesian NNs, dropout-as-Bayes-approx, deep ensembles for posterior approximation).
7. Why MLE = minimum cross-entropy = minimum forward KL
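For data drawn from true distribution $p^*$, with model $p_\theta(x) = p(x \mid \theta)$:
$$\hat\theta_{\text{MLE}} = \arg\max_\theta \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i) \;\approx\; \arg\min_\theta\; \mathbb{E}_{x \sim p^*}\bigl[-\log p_\theta(x)\bigr]$$
The expectation on the right is the cross-entropy $H(p^*, p_\theta)$ of $p_\theta$ relative to $p^*$. Equivalently:
$$H(p^*, p_\theta) = H(p^*) + \mathrm{KL}(p^* \,\|\, p_\theta)$$
The $H(p^*)$ term doesn't depend on $\theta$, so MLE = minimize the forward KL, $\mathrm{KL}(p^* \,\|\, p_\theta)$, from $p^*$ to the model. This is the "mode-covering" KL: it penalizes putting low probability on regions where $p^*$ is high. (Reverse KL would be mode-seeking; that's what variational inference uses.)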
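The decomposition of cross-entropy into entropy plus forward KL can be verified on a small discrete example. The two distributions below are made up for illustration.

```python
import math

p_star = [0.5, 0.3, 0.2]   # hypothetical "true" distribution
q = [0.4, 0.4, 0.2]        # hypothetical model distribution

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

lhs = cross_entropy(p_star, q)
rhs = entropy(p_star) + kl(p_star, q)   # same quantity, decomposed
forward = kl(p_star, q)                 # what MLE minimizes
reverse = kl(q, p_star)                 # generally a different number
print(lhs, rhs, forward, reverse)
```

Since the entropy term is fixed by the data distribution, lowering the cross-entropy can only come from lowering the forward KL.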
8. Common interview gotchas
| Question | Common wrong answer | Right answer |
|---|---|---|
| Is MLE always unbiased? | Yes | No — Gaussian variance MLE is biased; many MLEs are biased in finite samples |
| What's the relationship between MAP and regularization? | They're different | MAP = MLE + log-prior; weight decay = Gaussian prior; lasso = Laplace prior |
| What does cross-entropy minimize? | Cross-entropy | Forward KL (with constant offset $H(p^*)$, the entropy of the data distribution) |
| MLE objective for OLS? | Minimize squared loss | MLE under Gaussian noise → minimizing squared loss |
| Why log-likelihood instead of likelihood? | Same thing | Numerics + sums vs products + matches concavity for many models |
| Why is MLE for variance biased? | It isn't | Plug-in $\bar{x}$ is closer to the data than $\mu$, so $\sum_i (x_i - \bar{x})^2$ is too small |
| MAP = mean of posterior? | Yes | No, it's the mode. Posterior mean is a different point estimator |
9. Eight most-asked interview questions
- Derive MLE for a Gaussian (both parameters). (Set partials to zero; recognize that variance MLE is biased.)
- Show that OLS equals MLE under Gaussian noise. (Write log-likelihood, drop constants, recognize squared loss.)
- Show that ridge equals MAP under Gaussian prior. (Write log-posterior, recognize $\ell_2$ penalty with $\lambda = \sigma^2/\tau^2$, the noise variance over the prior variance.)
- What's the relationship between cross-entropy and MLE? (CE = negative log-likelihood; minimizing CE = MLE.)
- Bayesian smoothing for Bernoulli: derive Laplace's rule of succession. (Beta(1,1) prior + $k$ successes in $n$ trials → posterior mean $\frac{k+1}{n+2}$.)
- What's a conjugate prior and why is it useful? (Same-family posterior; closed-form updates; pseudo-count intuition.)
- What are the asymptotic properties of MLE? (Consistent, asymp. normal, efficient, invariant.)
- MAP vs Bayesian inference? (MAP gives a point estimate (mode); Bayesian gives full posterior + uncertainty; cost increases.)
10. Drill plan
- Derive MLE for: Bernoulli, Gaussian, Poisson, multinomial, exponential. 5 minutes each.
- Derive MAP for: linear regression with Gaussian prior (ridge), with Laplace prior (lasso).
- Derive Beta-Bernoulli posterior. Recite mean and mode.
- Recognize: OLS = MLE Gaussian, ridge = MAP Gaussian, lasso = MAP Laplace, cross-entropy = MLE general.
- Practice writing log-likelihoods cleanly without dropping constants until the end.
11. Further reading
- Murphy, Machine Learning: A Probabilistic Perspective, ch. 4–5.
- Bishop, Pattern Recognition and Machine Learning, ch. 1, 2.
- Wasserman, All of Statistics, ch. 9, 11.
- MacKay, Information Theory, Inference, and Learning Algorithms — beautiful Bayesian framing.