MLE and MAP Estimation — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

This topic underpins almost everything in classical and modern ML. Cross-entropy, ridge/lasso, Bayesian deep learning, RLHF reward modeling — all are MLE/MAP under specific likelihoods and priors. Senior interviews probe whether you can derive these cleanly, not just recognize them.


1. The likelihood function

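Given iid data $\mathcal{D} = \{x_1, \dots, x_n\}$ and a parametric family $\{p(x \mid \theta) : \theta \in \Theta\}$:

$$L(\theta) = p(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \qquad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta).$$

The likelihood treats $\theta$ as the variable and the data as fixed, the opposite of how $p(x \mid \theta)$ is usually written.

Why log? Sums are easier than products, numerically stable (no underflow), and convex programming on $-\ell(\theta)$ is often tractable.

MLE: $\hat\theta_{\text{MLE}} = \arg\max_\theta \ell(\theta)$.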
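
The underflow point is easy to see numerically. Below is a minimal sketch, assuming a Bernoulli model and a made-up dataset of 5000 coin flips; the function names are illustrative, not from any library. The raw product of probabilities collapses to zero in float64, while the sum of logs stays finite and still identifies the maximizer.

```python
import numpy as np

def bernoulli_likelihood(theta, x):
    """Raw likelihood: product of per-sample probabilities."""
    return np.prod(np.where(x == 1, theta, 1.0 - theta))

def bernoulli_log_likelihood(theta, x):
    """Log-likelihood: sum of per-sample log-probabilities."""
    k = x.sum()
    n = x.size
    return k * np.log(theta) + (n - k) * np.log1p(-theta)

# Hypothetical data: 5000 flips, 30% heads (deterministic, no RNG needed).
x = np.array([1] * 1500 + [0] * 3500)

raw = bernoulli_likelihood(0.3, x)          # underflows to 0.0 in float64
log_lik = bernoulli_log_likelihood(0.3, x)  # finite, roughly -3054

# The maximizer is still recoverable from the log form via a grid search.
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([bernoulli_log_likelihood(t, x) for t in grid])]
```

The grid search is only for illustration; in practice one would use the closed form or a gradient-based optimizer on the log-likelihood.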


2. Worked MLE derivations

Bernoulli

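$x_i \in \{0, 1\}$, $p(x_i \mid \theta) = \theta^{x_i}(1 - \theta)^{1 - x_i}$.

$$\ell(\theta) = k \log\theta + (n - k)\log(1 - \theta)$$

where $k = \sum_i x_i$. Setting $\ell'(\theta) = k/\theta - (n - k)/(1 - \theta) = 0$:

$$\hat\theta_{\text{MLE}} = \frac{k}{n}.$$

Pure intuition: MLE for Bernoulli is the empirical frequency.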

Gaussian (mean and variance unknown)

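$x_i \sim \mathcal{N}(\mu, \sigma^2)$, so

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$

$\partial\ell/\partial\mu = 0$: $\hat\mu = \frac{1}{n}\sum_i x_i = \bar{x}$.

$\partial\ell/\partial\sigma^2 = 0$: $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \hat\mu)^2$.

The variance MLE has a $1/n$, not $1/(n-1)$. It's biased (too small); Bessel's correction unbiases it.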
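
The bias is visible in simulation. The sketch below assumes Gaussian data with a made-up true variance of 4 and a deliberately tiny sample size of 5, then averages both variance estimators over many repeated samples. The helper `gaussian_mle` is illustrative.

```python
import numpy as np

def gaussian_mle(x):
    """Return (mu_hat, sigma2_hat) maximizing the Gaussian log-likelihood."""
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # divides by n, not n - 1
    return mu_hat, sigma2_hat

# Tiny hand-checkable case: mean 2.5, MLE variance 1.25.
mu_hat, sigma2_hat = gaussian_mle(np.array([1.0, 2.0, 3.0, 4.0]))

rng = np.random.default_rng(0)
n, true_sigma2 = 5, 4.0

# Average both variance estimators over many small samples.
samples = rng.normal(loc=1.0, scale=np.sqrt(true_sigma2), size=(200_000, n))
mean_mle = samples.var(axis=1, ddof=0).mean()     # expected (n-1)/n * 4.0 = 3.2
mean_bessel = samples.var(axis=1, ddof=1).mean()  # expected 4.0
```

The MLE is short by exactly the factor $(n-1)/n$ in expectation, which is why the bias disappears as $n$ grows.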

Multinomial

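$x_i$ is a one-hot category among $K$ classes. Parameters $\pi = (\pi_1, \dots, \pi_K)$, $\sum_k \pi_k = 1$.

$$\ell(\pi) = \sum_{k=1}^{K} n_k \log \pi_k$$

where $n_k = \sum_i x_{ik}$. With Lagrangian for the simplex constraint:

$$\hat\pi_k = \frac{n_k}{n}.$$

Empirical frequency of each category.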
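
The Lagrangian step is the part people fumble at the whiteboard, so here it is written out. Notation assumed for this sketch: $n_k$ is the count of category $k$, $n = \sum_k n_k$, and $\lambda$ is the Lagrange multiplier for the constraint $\sum_k \pi_k = 1$.

```latex
\begin{aligned}
\mathcal{L}(\pi, \lambda) &= \sum_{k=1}^{K} n_k \log \pi_k + \lambda\Big(1 - \sum_{k=1}^{K} \pi_k\Big) \\
\frac{\partial \mathcal{L}}{\partial \pi_k} &= \frac{n_k}{\pi_k} - \lambda = 0
  \;\Longrightarrow\; \pi_k = \frac{n_k}{\lambda} \\
\sum_{k=1}^{K} \pi_k = 1 &\;\Longrightarrow\; \lambda = \sum_{k=1}^{K} n_k = n \\
\hat\pi_k &= \frac{n_k}{n}
\end{aligned}
```

The nonnegativity constraints never bind because $\log \pi_k \to -\infty$ as $\pi_k \to 0$ for any category with $n_k > 0$.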

Poisson

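$x_i \sim \text{Poisson}(\lambda)$ with $p(x \mid \lambda) = \lambda^{x} e^{-\lambda} / x!$. $\ell(\lambda) = \big(\sum_i x_i\big)\log\lambda - n\lambda - \sum_i \log x_i!$.

$\hat\lambda_{\text{MLE}} = \frac{1}{n}\sum_i x_i = \bar{x}$.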

Linear regression

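$y_i = w^\top x_i + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Likelihood is Gaussian:

$$\ell(w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^\top x_i)^2$$

Maximizing = minimizing squared error = OLS:

$$\hat{w}_{\text{MLE}} = \arg\min_w \sum_{i=1}^{n}(y_i - w^\top x_i)^2 = (X^\top X)^{-1} X^\top y.$$

Key insight: OLS is MLE under Gaussian noise. The choice of squared loss isn't arbitrary; it's the negative log-likelihood of a Gaussian.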
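
A quick numerical check of the equivalence, as a sketch on synthetic data (the true weights and noise level are made up for illustration). It fits OLS with `np.linalg.lstsq`, then confirms that the gradient of the Gaussian negative log-likelihood vanishes there and that moving away from the OLS solution raises the NLL.

```python
import numpy as np

def gaussian_nll(w, X, y, sigma2=1.0):
    """Negative log-likelihood of y = Xw + eps, eps ~ N(0, sigma2)."""
    n = y.size
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])          # hypothetical ground truth
y = X @ w_true + 0.3 * rng.normal(size=200)

# OLS via least squares (numerically safer than forming (X^T X)^{-1}).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The NLL gradient is -X^T (y - Xw) / sigma2; it vanishes at the OLS solution.
grad_at_ols = -X.T @ (y - X @ w_ols)

# Perturbing the OLS solution increases the NLL.
nll_ols = gaussian_nll(w_ols, X, y)
nll_perturbed = gaussian_nll(w_ols + 0.01, X, y)
```

Note that $\sigma^2$ scales the NLL but does not move its minimizer in $w$, which is why OLS never needs to know the noise level.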

Logistic regression

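$y_i \in \{0, 1\}$, $p(y_i = 1 \mid x_i, w) = \sigma(w^\top x_i)$ with $\sigma(z) = 1/(1 + e^{-z})$.

$$\ell(w) = \sum_{i=1}^{n}\Big[y_i \log \sigma(w^\top x_i) + (1 - y_i)\log\big(1 - \sigma(w^\top x_i)\big)\Big]$$

This is the negative of the binary cross-entropy loss, so MLE = minimize cross-entropy. No closed form; use iteratively reweighted least squares (IRLS) or gradient descent.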
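
Since there is no closed form, here is a minimal sketch of the IRLS route (Newton's method on the log-likelihood). The data is synthetic, the function names are illustrative, and the tiny diagonal jitter is an added assumption for numerical safety rather than part of the textbook algorithm. It also assumes the data is not linearly separable, in which case the MLE does not exist.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, n_iter=25, jitter=1e-8):
    """Newton's method (IRLS) on the logistic log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        s = p * (1.0 - p)                     # per-sample IRLS weights
        hess = X.T @ (X * s[:, None])         # negative Hessian of the log-likelihood
        hess += jitter * np.eye(X.shape[1])   # small jitter (assumption, for stability)
        w = w + np.linalg.solve(hess, grad)   # Newton step
    return w

def cross_entropy(w, X, y):
    """Mean binary cross-entropy, i.e. -log-likelihood / n."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
w_true = np.array([-0.5, 1.0, -1.5])          # hypothetical ground truth
y = (rng.uniform(size=500) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_logistic_irls(X, y)
score = X.T @ (y - sigmoid(X @ w_hat))        # gradient at the fit, near zero at the MLE
```

The fixed iteration count keeps the sketch short; a real implementation would stop on a gradient-norm or step-size tolerance.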


3. Asymptotic theory of MLE

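Under regularity conditions (smooth likelihood, identifiable model, true parameter $\theta^*$ in the interior of the parameter space):

Consistency: $\hat\theta_n \xrightarrow{p} \theta^*$.

Asymptotic normality:

$$\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}\big(0,\; I(\theta^*)^{-1}\big)$$

where $I(\theta) = -\mathbb{E}\big[\nabla_\theta^2 \log p(x \mid \theta)\big]$ is the Fisher information per observation. (Defining it on the joint log-likelihood would scale with $n$ and contradict the formula above; per-observation is correct.)

Asymptotic efficiency: the asymptotic variance achieves the Cramér-Rao bound.

Invariance: if $\hat\theta$ estimates $\theta$, then $g(\hat\theta)$ estimates $g(\theta)$. So the MLE of the standard deviation is the square root of the MLE of the variance.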

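These properties make MLE the default estimator in classical ML, but they hold only asymptotically. In finite samples the MLE can be biased, can overfit, and the likelihood itself can be unbounded (for example, a Gaussian mixture component collapsing onto a single data point).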
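
Asymptotic normality can be sanity-checked by simulation. The sketch below assumes a Bernoulli model with a made-up true parameter of 0.3, where the per-observation Fisher information is $1/(\theta(1-\theta))$, and compares the empirical variance of the MLE across repeated experiments with the predicted $1/(n I(\theta))$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, n, n_trials = 0.3, 400, 50_000

# Each trial: draw n Bernoulli samples and compute the MLE k / n.
k = rng.binomial(n, theta_true, size=n_trials)
theta_hat = k / n

fisher_info = 1.0 / (theta_true * (1.0 - theta_true))  # per observation
predicted_var = 1.0 / (n * fisher_info)                 # theta (1 - theta) / n
empirical_var = theta_hat.var()

# Standardized errors should be roughly N(0, 1), so about 95% fall within 1.96.
z = np.sqrt(n * fisher_info) * (theta_hat - theta_true)
coverage_95 = np.mean(np.abs(z) < 1.96)
```

Coverage lands slightly under 95% here because the binomial count is discrete; the normal approximation is only exact in the limit.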


4. Bayesian setup and MAP

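Bayes' theorem applied to a parameter $\theta$ given data $\mathcal{D}$:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \theta)\, p(\theta)$$

The MAP estimate maximizes the posterior:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \big[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\big]$$

Equivalent to MLE plus a regularizer that comes from the log-prior. Several important consequences: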

Gaussian prior → Ridge

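$w \sim \mathcal{N}(0, \tau^2 I)$. Then $\log p(w) = -\frac{1}{2\tau^2}\|w\|_2^2 + \text{const}$.

For linear regression with Gaussian likelihood:

$$\hat{w}_{\text{MAP}} = \arg\min_w \; \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^\top x_i)^2 + \frac{1}{2\tau^2}\|w\|_2^2$$

Multiply through by $2\sigma^2$: ridge regression with $\lambda = \sigma^2/\tau^2$.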
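
To make the ridge correspondence concrete, the sketch below uses synthetic data with made-up noise variance $\sigma^2$ and prior variance $\tau^2$. It solves ridge in closed form with $\lambda = \sigma^2/\tau^2$ and checks that the gradient of the negative log-posterior is zero at that solution. Function names are illustrative.

```python
import numpy as np

def neg_log_posterior(w, X, y, sigma2, tau2):
    """Gaussian likelihood plus N(0, tau2 * I) prior, additive constants dropped."""
    resid = y - X @ w
    return resid @ resid / (2 * sigma2) + w @ w / (2 * tau2)

def ridge_closed_form(X, y, lam):
    """argmin_w ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=50)

sigma2, tau2 = 1.0, 0.25        # hypothetical noise and prior variances
lam = sigma2 / tau2             # ridge strength implied by the prior
w_map = ridge_closed_form(X, y, lam)

# Gradient of the negative log-posterior, evaluated at the ridge solution.
grad = -X.T @ (y - X @ w_map) / sigma2 + w_map / tau2

# For comparison: the unregularized (MLE) solution has a larger norm.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

A tighter prior (smaller $\tau^2$) means a larger $\lambda$ and more shrinkage toward zero; as $\tau^2 \to \infty$ the MAP estimate approaches the MLE.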

Laplace prior → Lasso

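$w_j \sim \text{Laplace}(0, b)$, i.e. $p(w_j) = \frac{1}{2b}\exp(-|w_j|/b)$. Log-prior is an $\ell_1$ penalty: $\log p(w) = -\frac{1}{b}\|w\|_1 + \text{const}$. Lasso = MAP under Laplace prior.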
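
Written out in the same style as the ridge case, under the assumed notation of Gaussian noise variance $\sigma^2$ and Laplace scale $b$ on each weight, the negative log-posterior and the implied lasso strength are:

```latex
\begin{aligned}
-\log p(w \mid \mathcal{D})
  &= \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2
     + \frac{1}{b}\,\|w\|_1 + \text{const} \\
\hat{w}_{\text{MAP}}
  &= \arg\min_w \; \sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 + \lambda\,\|w\|_1,
  \qquad \lambda = \frac{2\sigma^2}{b}
\end{aligned}
```

The factor of 2 relative to the ridge case comes from the Laplace log-density having no $\tfrac{1}{2}$ in front of its penalty. The exact constant depends on how the lasso objective is normalized, so state your convention before quoting $\lambda$.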

Beta prior + Bernoulli → Smoothed estimate

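Prior $\theta \sim \text{Beta}(\alpha, \beta)$. Posterior $\theta \mid \mathcal{D} \sim \text{Beta}(\alpha + k, \beta + n - k)$, where $k$ is the number of successes in $n$ trials.

MAP: mode of Beta = $\frac{\alpha + k - 1}{\alpha + \beta + n - 2}$ (when $\alpha + k > 1$ and $\beta + n - k > 1$).

Posterior mean: $\frac{\alpha + k}{\alpha + \beta + n}$, which gives a smoothed estimate. With $\alpha = \beta = 1$ (uniform prior), posterior mean is $\frac{k + 1}{n + 2}$: Laplace smoothing.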

This is exactly what NLP people call add-one smoothing.
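
The three point estimates are easy to mix up, so here is a small sketch that computes all of them exactly with rational arithmetic. The function name and the two example datasets are made up for illustration.

```python
from fractions import Fraction

def beta_bernoulli_summary(k, n, alpha=1, beta=1):
    """Posterior parameters and point estimates for k successes in n trials."""
    a_post = alpha + k
    b_post = beta + (n - k)
    mle = Fraction(k, n)
    post_mean = Fraction(a_post, a_post + b_post)
    # The interior-mode formula is only valid when both posterior parameters exceed 1.
    if a_post > 1 and b_post > 1:
        post_mode = Fraction(a_post - 1, a_post + b_post - 2)
    else:
        post_mode = None
    return {"posterior": (a_post, b_post), "mle": mle, "map": post_mode, "mean": post_mean}

# Uniform prior, 3 successes in 10 trials: MAP equals the MLE, mean is (k+1)/(n+2).
uniform = beta_bernoulli_summary(k=3, n=10)

# Beta(2, 2) prior, 3 successes in 3 trials: the MLE says 1, the posterior hedges.
informative = beta_bernoulli_summary(k=3, n=3, alpha=2, beta=2)
```

With the uniform prior the result is MLE 3/10, MAP 3/10, posterior mean 1/3. With the Beta(2, 2) prior and three straight successes the MLE is 1, while the MAP is 4/5 and the posterior mean is 5/7.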


5. Conjugate priors — the catalog

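| Likelihood | Conjugate prior | Posterior |
|---|---|---|
| Bernoulli | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$) |
| Multinomial | Dirichlet($\alpha$) | Dirichlet($\alpha + n$), where $n = (n_1, \dots, n_K)$ are per-category counts |
| Poisson | Gamma($\alpha, \beta$) (shape, rate) | Gamma($\alpha + \sum_i x_i,\ \beta + n$) |
| Gaussian (mean, variance known) | Gaussian | Gaussian |
| Gaussian (variance, mean known) | Inverse-Gamma | Inverse-Gamma |
| Gaussian (both unknown) | Normal-Inverse-Gamma (or Normal-Inverse-Wishart) | Same family |
| Exponential | Gamma | Gamma |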

Conjugate priors give closed-form posteriors. They also yield clean intuition: hyperparameters of the prior look like pseudo-counts — the prior acts as if you'd seen some imaginary data before.
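
The pseudo-count reading can be checked directly. The sketch below assumes the Gamma-Poisson pair in the shape/rate parameterization, with made-up prior hyperparameters and counts. The posterior mean comes out as a weighted average of the prior mean and the MLE, with the prior rate $\beta$ acting like a number of imaginary observations.

```python
import numpy as np

def gamma_poisson_update(alpha, beta, x):
    """Gamma(shape=alpha, rate=beta) prior -> posterior after Poisson counts x."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + x.size

alpha0, beta0 = 6.0, 2.0          # hypothetical prior: like 2 prior observations totalling 6
counts = [4, 7, 5, 6, 8]          # hypothetical data: n = 5, sum = 30

alpha_n, beta_n = gamma_poisson_update(alpha0, beta0, counts)

prior_mean = alpha0 / beta0       # 3.0
mle = np.mean(counts)             # 6.0
post_mean = alpha_n / beta_n      # 36 / 7

# The posterior mean is a convex combination of the MLE and the prior mean.
weight_on_data = len(counts) / (beta0 + len(counts))   # 5 / 7
blend = weight_on_data * mle + (1 - weight_on_data) * prior_mean
```

As $n$ grows the weight on the data goes to 1, so the prior washes out and the posterior mean approaches the MLE.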


6. MLE vs MAP vs Bayesian — the spectrum

| Method | Output | Captures uncertainty? | Computational cost | When |
|---|---|---|---|---|
| MLE | Point estimate | No | Cheap (optimization) | Lots of data, no strong prior |
| MAP | Point estimate | No (just the mode) | Cheap (optimization with regularizer) | Want regularization with Bayesian interpretation |
| Bayesian | Full posterior | Yes | Expensive (MCMC/VI) | Need uncertainty, decision-theoretic problems |

Modern deep learning is almost entirely MLE (cross-entropy, MSE) plus MAP-style regularization (weight decay as a Gaussian prior; dropout only loosely, through its approximate-Bayesian reading). True Bayesian deep learning is a research area (Bayesian NNs, dropout-as-Bayes-approx, deep ensembles for posterior approximation).


7. Why MLE = minimum cross-entropy = minimum forward KL

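For data drawn from true distribution $p^*$:

$$\arg\max_\theta \frac{1}{n}\sum_{i=1}^{n}\log p_\theta(x_i) \;\approx\; \arg\max_\theta \mathbb{E}_{x \sim p^*}\big[\log p_\theta(x)\big] \;=\; \arg\min_\theta \; -\mathbb{E}_{x \sim p^*}\big[\log p_\theta(x)\big]$$

The right-hand expression is the cross-entropy $H(p^*, p_\theta)$ of $p_\theta$ relative to $p^*$. Equivalent:

$$H(p^*, p_\theta) = H(p^*) + \mathrm{KL}(p^* \,\|\, p_\theta)$$

The term $H(p^*)$ doesn't depend on $\theta$, so MLE = minimize forward KL from $p^*$ to model. This is the "mode-covering" KL: it penalizes putting low probability on regions where $p^*$ is high. (Reverse KL would be mode-seeking; that's what variational inference uses.)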
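
The decomposition of cross-entropy into entropy plus forward KL can be verified on a toy discrete distribution. In the sketch below the two distributions are made up; it also checks that the average negative log-likelihood of samples from the true distribution approaches the cross-entropy, which is the link back to MLE.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy of model q relative to true distribution p."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """Forward KL divergence KL(p || q)."""
    return np.sum(p * np.log(p / q))

p_true = np.array([0.5, 0.3, 0.2])    # hypothetical true distribution
q_model = np.array([0.4, 0.4, 0.2])   # hypothetical model

lhs = cross_entropy(p_true, q_model)
rhs = entropy(p_true) + kl(p_true, q_model)

# Average NLL of samples from p_true under the model approximates the cross-entropy.
rng = np.random.default_rng(0)
draws = rng.choice(3, size=200_000, p=p_true)
avg_nll = -np.mean(np.log(q_model[draws]))
```

Setting the model equal to the true distribution makes the KL term zero, so the cross-entropy bottoms out at the entropy of the data distribution and never reaches zero unless that distribution is deterministic.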


8. Common interview gotchas

| Question | Common wrong answer | Right answer |
|---|---|---|
| Is MLE always unbiased? | Yes | No. Gaussian variance MLE is biased; many MLEs are biased in finite samples |
| What's the relationship between MAP and regularization? | They're different | MAP = MLE + log-prior; weight decay = Gaussian prior; lasso = Laplace prior |
| What does cross-entropy minimize? | Cross-entropy | Forward KL (with constant offset $H(p^*)$) |
| MLE objective for OLS? | Minimize squared loss | MLE under Gaussian noise → minimizing squared loss |
| Why log-likelihood instead of likelihood? | Same thing | Numerics + sums vs products + matches concavity for many models |
| Why is MLE for variance biased? | It isn't | Plug-in $\hat\mu$ is closer to the data than the true $\mu$, so $\hat\sigma^2$ is too small |
| MAP = mean of posterior? | Yes | No, it's the mode. Posterior mean is a different point estimator |

9. Eight most-asked interview questions

  1. Derive MLE for a Gaussian (both parameters). (Set partials to zero; recognize that variance MLE is biased.)
  2. Show that OLS equals MLE under Gaussian noise. (Write log-likelihood, drop constants, recognize squared loss.)
  3. Show that ridge equals MAP under Gaussian prior. (Write log-posterior, recognize $\ell_2$ penalty with $\lambda = \sigma^2/\tau^2$.)
  4. What's the relationship between cross-entropy and MLE? (CE = negative log-likelihood; minimizing CE = MLE.)
  5. Bayesian smoothing for Bernoulli: derive Laplace's rule of succession. (Beta(1,1) prior + observed data → posterior mean $\frac{k+1}{n+2}$.)
  6. What's a conjugate prior and why is it useful? (Same-family posterior; closed-form updates; pseudo-count intuition.)
  7. What are the asymptotic properties of MLE? (Consistent, asymp. normal, efficient, invariant.)
  8. MAP vs Bayesian inference? (MAP gives a point estimate (mode); Bayesian gives full posterior + uncertainty; cost increases.)

10. Drill plan

  • Derive MLE for: Bernoulli, Gaussian, Poisson, multinomial, exponential. 5 minutes each.
  • Derive MAP for: linear regression with Gaussian prior (ridge), with Laplace prior (lasso).
  • Derive Beta-Bernoulli posterior. Recite mean and mode.
  • Recognize: OLS = MLE Gaussian, ridge = MAP Gaussian, lasso = MAP Laplace, cross-entropy = MLE general.
  • Practice writing log-likelihoods cleanly without dropping constants until the end.

11. Further reading

  • Murphy, Machine Learning: A Probabilistic Perspective, ch. 4–5.
  • Bishop, Pattern Recognition and Machine Learning, ch. 1, 2.
  • Wasserman, All of Statistics, ch. 9, 11.
  • MacKay, Information Theory, Inference, and Learning Algorithms — beautiful Bayesian framing.