MLE and MAP Estimation — Interview Grill

60 questions (50 numbered plus 10 quick-fire) on MLE, MAP, conjugate priors, and the connections to standard ML losses. Drill until you can answer 40+ cold.

A. Likelihood basics

1. Define likelihood and log-likelihood. $L(θ) = \prod_{i} p(x_{i} ∣ θ)$, $ℓ(θ) = \sum_{i} \log p(x_{i} ∣ θ)$. Treats $θ$ as the variable, data as fixed.

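2. Why log? Sums beat products numerically (no underflow). Log is monotone, so the argmax is unchanged. Many log-likelihoods are concave even when the likelihood is not, and differentiating a sum is easier than differentiating a product.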

3. MLE definition? $\hat{θ}_{MLE} = \arg\max_{θ} ℓ(θ)$.

4. Why is MLE the default in ML? Consistent and asymptotically efficient under regularity conditions. Reduces to standard losses (cross-entropy, MSE) under standard distributions. Simple to derive and optimize.
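To make the underflow point in question 2 concrete, here is a minimal sketch. The 2000 observations with likelihood 0.5 each are a made-up example, not data from this guide.

```python
import math

# 2000 hypothetical i.i.d. observations, each with likelihood 0.5 under the model.
probs = [0.5] * 2000

# Direct product: 0.5**2000 is far below the smallest positive double, so it underflows.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Sum of logs: the same quantity on a scale that floats represent comfortably.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)       # 0.0
print(log_likelihood)   # roughly -1386.29, i.e. 2000 * log(0.5)
```

Once the product has collapsed to zero, every value of $θ$ looks equally bad, so there is nothing left to maximize. The log-likelihood keeps the ordering intact.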

B. Standard MLE derivations

5. Derive MLE for Bernoulli. $ℓ(θ) = s \log θ + (n - s) \log(1 - θ)$, where $s = \sum_{i} x_{i}$ is the number of successes. Set the derivative to zero: $s/θ - (n - s)/(1 - θ) = 0$, so $\hat{θ} = s/n = \bar{x}$.

6. Derive MLE for Gaussian (mean only, variance known). $\hat{μ} = \bar{x}$.

7. MLE for Gaussian variance? $\hat{σ}^{2} = \frac{1}{n} \sum_{i} (x_{i} - \bar{x})^{2}$. Biased: the divisor must be $n - 1$ for an unbiased estimate.

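8. Why is MLE for variance biased? $\bar{x}$ is fit to the sample, so it sits closer to the data than the true $μ$ does. $\bar{x}$ minimizes the sum of squared deviations, hence $\sum_{i} (x_{i} - \bar{x})^{2} \le \sum_{i} (x_{i} - μ)^{2}$, and its expectation is $(n - 1) σ^{2}$ rather than $n σ^{2}$.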

9. MLE for Poisson rate? $\hat{λ} = \bar{x}$.

10. MLE for multinomial? $\hat{θ}_{k} = n_{k} / n$ — empirical class frequency.

11. MLE for linear regression: what loss does it correspond to? Squared error. $\arg\max_{w} ℓ(w)$ under Gaussian noise = $\arg\min_{w} \sum_{i} (y_{i} - w^{⊤} x_{i})^{2}$ = OLS.

12. MLE for logistic regression: what loss? Cross-entropy / log loss, the negative of $ℓ(w) = \sum_{i} [y_{i} \log σ(w^{⊤} x_{i}) + (1 - y_{i}) \log(1 - σ(w^{⊤} x_{i}))]$. No closed form.

13. Why does logistic regression have no closed-form MLE? The score equation is non-linear in $w$ (sigmoid). Need iterative solver: IRLS, gradient descent, Newton-Raphson.
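As an illustration of the iterative solvers named in question 13, the sketch below runs Newton-Raphson on a logistic model with one feature and an intercept. The function names and the toy data are hypothetical, and the data is deliberately non-separable so that a finite MLE exists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_newton(xs, ys, iters=25):
    """Maximize the Bernoulli log-likelihood for p(y=1|x) = sigmoid(b + w*x)."""
    b, w = 0.0, 0.0
    for _ in range(iters):
        g_b = g_w = 0.0                 # gradient (score) of the log-likelihood
        h_bb = h_bw = h_ww = 0.0        # entries of the negative Hessian (2x2)
        for x, y in zip(xs, ys):
            p = sigmoid(b + w * x)
            g_b += y - p
            g_w += (y - p) * x
            s = p * (1.0 - p)
            h_bb += s
            h_bw += s * x
            h_ww += s * x * x
        # Newton step: solve (negative Hessian) @ delta = gradient via Cramer's rule.
        det = h_bb * h_ww - h_bw * h_bw
        b += (h_ww * g_b - h_bw * g_w) / det
        w += (h_bb * g_w - h_bw * g_b) / det
    return b, w

# Hypothetical toy data: labels overlap at x = -1, 0, 1, so the classes are not separable.
xs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b, w = fit_logistic_newton(xs, ys)
print(b, w)
```

Each iteration re-linearizes the score equation around the current weights. That repeated re-linearization is what a closed-form solution would have to avoid, and the sigmoid makes that impossible.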

C. Asymptotic theory

14. Asymptotic distribution of MLE? $\sqrt{n}\,(\hat{θ} - θ_{0}) \to N(0, I(θ_{0})^{-1})$ in distribution, where $I$ is the per-observation Fisher information.

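15. What's Fisher information? $I(θ) = -E[\partial^{2} \log p(x ∣ θ) / \partial θ^{2}]$, equivalently the variance of the score. It is the expected curvature of the log-likelihood per observation, and it measures how sharply peaked the log-likelihood is around the true value.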

16. Why is MLE asymptotically efficient? Its asymptotic variance attains the Cramér-Rao lower bound $1 / (n I(θ))$, with $I$ the per-observation Fisher information. No unbiased estimator has lower variance.

17. Invariance of MLE: what does it mean? $\widehat{g(θ)}_{MLE} = g(\hat{θ}_{MLE})$ for any function $g$. So the MLE of the standard deviation is $\sqrt{\hat{σ}^{2}_{MLE}}$.

18. When does asymptotic theory fail? Boundary parameters (e.g., $θ = 0$ when domain is $[0, \infty)$ ), non-identifiable models, infinite Fisher information, non-iid data.
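The asymptotic statements in questions 14 to 16 can be checked by simulation. The sketch below assumes a Bernoulli model, for which the per-observation Fisher information is $1 / (θ(1 - θ))$. The values $θ_{0} = 0.3$, $n = 400$ and the number of repetitions are arbitrary choices for illustration.

```python
import math
import random

def simulate_scaled_errors(theta0=0.3, n=400, reps=2000, seed=0):
    """Draw `reps` datasets of size n and return sqrt(n) * (theta_hat - theta0) for each."""
    rng = random.Random(seed)
    scaled = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < theta0)
        theta_hat = successes / n          # Bernoulli MLE = sample mean
        scaled.append(math.sqrt(n) * (theta_hat - theta0))
    return scaled

z = simulate_scaled_errors()
mean_z = sum(z) / len(z)
var_z = sum((v - mean_z) ** 2 for v in z) / len(z)

inv_fisher = 0.3 * (1 - 0.3)   # I(theta0)^{-1} = theta0 * (1 - theta0) = 0.21
print(mean_z, var_z, inv_fisher)
```

The empirical variance of the scaled errors should land near 0.21 and their mean near zero, which is what $N(0, I(θ_{0})^{-1})$ predicts.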

D. MAP

19. Define MAP. $\hat{θ}_{MAP} = \arg\max_{θ} p(θ ∣ x) = \arg\max_{θ} [\log p(x ∣ θ) + \log p(θ)]$.

20. MAP vs MLE: key relationship? MAP objective = log-likelihood + log-prior. The log-prior acts as a penalty added to the MLE objective.

21. MAP equals MLE when? Under a flat prior (improper if the parameter space is unbounded): the log-prior is constant and has no effect on the argmax.

22. MAP vs posterior mean: same? No. MAP is the mode; posterior mean is the expectation. They coincide when the posterior is symmetric and unimodal (e.g. Gaussian) and generally differ otherwise.
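Questions 20 to 22 can be seen side by side on a small example. The sketch below assumes a Bernoulli likelihood with a Beta($α, β$) prior, chosen here only because all three estimates then have closed forms. The function name is hypothetical.

```python
def bernoulli_point_estimates(s, n, alpha, beta):
    """Return (MLE, MAP, posterior mean) for s successes in n Bernoulli trials
    under a Beta(alpha, beta) prior.

    The MAP formula is the mode of Beta(alpha + s, beta + n - s) and assumes
    both posterior parameters exceed 1, so the mode lies inside (0, 1).
    """
    mle = s / n
    map_estimate = (alpha + s - 1) / (alpha + beta + n - 2)
    posterior_mean = (alpha + s) / (alpha + beta + n)
    return mle, map_estimate, posterior_mean

# Informative prior Beta(2, 2): MAP and posterior mean are both pulled toward 0.5.
print(bernoulli_point_estimates(5, 8, 2, 2))   # MLE 0.625, MAP 0.6, mean about 0.583

# Flat prior Beta(1, 1): MAP equals the MLE, but the posterior mean still differs.
print(bernoulli_point_estimates(5, 8, 1, 1))   # MLE 0.625, MAP 0.625, mean 0.6
```

The second call shows two answers at once: a flat prior makes MAP and MLE agree, yet the posterior mean still differs from both because the Beta(6, 4) posterior is skewed.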

E. Priors as regularizers

23. Gaussian prior on weights → what regularizer? $ℓ_{2}$. $\log N(w; 0, τ^{2} I) = -∥w∥^{2} / (2τ^{2}) + \text{const}$.

24. Show ridge regression = MAP under Gaussian prior. Likelihood Gaussian, prior Gaussian. $\log p(w ∣ X, y) = -\frac{1}{2σ^{2}} ∥y - Xw∥^{2} - \frac{1}{2τ^{2}} ∥w∥^{2} + \text{const}$. Multiplying by $-2σ^{2}$ turns the maximization into minimizing $∥y - Xw∥^{2} + λ∥w∥^{2}$, i.e. ridge with $λ = σ^{2} / τ^{2}$.

25. Laplace prior → what regularizer? $ℓ_{1}$. $\log \mathrm{Laplace}(w; 0, b) = -∣w∣ / b + \text{const}$.

26. Why does $ℓ_{1}$ produce sparsity? The $ℓ_{1}$ ball has corners on the axes. Geometrically, the loss contours tend to first touch the constraint set at a corner, where some weights are exactly zero.

27. Why does $ℓ_{2}$ not produce sparsity? The $ℓ_{2}$ ball is round, with no corners on the axes, so the loss contours generically touch it at a point off the axes. Weights shrink toward zero but stay non-zero.

28. What does early stopping correspond to? Approximately MAP with a Gaussian prior: stopping early limits how far weights move from the (zero) initialization. The correspondence is tightest for linear models with squared loss (Friedman, Hastie & Tibshirani).
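The contrast in questions 26 and 27 shows up clearly in one dimension. The sketch below assumes an orthonormal design, so that each coefficient can be regularized independently, starting from its unregularized estimate $z$. The coefficient values and $λ$ are made up for illustration.

```python
def lasso_1d(z, lam):
    """argmin_w 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if abs(z) <= lam:
        return 0.0                       # small coefficients are set exactly to zero
    return z - lam if z > 0 else z + lam

def ridge_1d(z, lam):
    """argmin_w 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)

ols_coefs = [3.0, 0.4, -0.2, -1.5]       # hypothetical unregularized estimates
lam = 0.5

lasso_coefs = [lasso_1d(z, lam) for z in ols_coefs]
ridge_coefs = [ridge_1d(z, lam) for z in ols_coefs]

print(lasso_coefs)   # [2.5, 0.0, 0.0, -1.0]
print(ridge_coefs)   # every entry shrinks by the factor 1/(1 + lam); none becomes zero
```

The $ℓ_{1}$ penalty subtracts a fixed amount and clips at zero, while the $ℓ_{2}$ penalty rescales. Only the first can produce exact zeros.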

F. Conjugate priors

29. What's a conjugate prior? Prior whose posterior stays in the same family. Enables closed-form Bayesian updates.

30. Conjugate of Bernoulli/Binomial? Beta.

31. Conjugate of multinomial/categorical? Dirichlet.

32. Conjugate of Poisson? Gamma.

33. Conjugate of Gaussian (mean, variance known)? Gaussian.

34. Beta-Bernoulli: start from a Beta(2, 2) prior and observe 5 successes and 3 failures. What's the posterior? Beta(2 + 5, 2 + 3) = Beta(7, 5).

35. Beta-Bernoulli posterior mean? $(α + s) / (α + β + n)$ .

36. With $α = β = 1$ , what does the posterior mean become? $(s + 1) / (n + 2)$ — Laplace's rule of succession / add-one smoothing.

37. What's the "pseudo-count" interpretation? Beta( $α, β$ ) = $α$ pseudo-successes, $β$ pseudo-failures. The prior acts like imaginary data.

38. Dirichlet prior as smoothing: why does NLP use add-$α$ smoothing? Put a symmetric Dirichlet($α$) prior on the $N$-gram probabilities, with observed counts $n_{w}$ and vocabulary size $V$. The posterior mean probability of word $w$ is $(n_{w} + α) / (\sum_{v} n_{v} + Vα)$. This prevents zero probabilities for unseen tokens.
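The add-$α$ formula from question 38 takes only a few lines of code. This is a minimal unigram sketch; the function name, vocabulary, and token list are hypothetical, and it assumes every observed token belongs to the vocabulary.

```python
from collections import Counter

def add_alpha_probs(tokens, vocab, alpha=1.0):
    """Posterior-mean word probabilities under a symmetric Dirichlet(alpha) prior.

    Assumes every token is in `vocab`, so len(tokens) equals the sum of counts.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

vocab = ["the", "cat", "sat", "mat"]
tokens = ["the", "cat", "the", "sat"]     # "mat" never appears

probs = add_alpha_probs(tokens, vocab, alpha=1.0)
print(probs["the"], probs["mat"])         # 0.375 0.125
```

With `alpha=0` the same function returns the MLE $n_{w} / n$, which assigns probability zero to the unseen word. Any positive $α$ acts as $α$ pseudo-counts per word and keeps every probability positive.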

G. Connections to standard ML

39. Cross-entropy minimization equals what? MLE in general (negative log-likelihood). Specifically, minimizing CE = minimizing forward KL from data to model (up to data-entropy constant).

40. Forward KL vs reverse KL — which does MLE minimize? Forward: $KL (p^{*} ∥ p_{θ})$ . Mode-covering. (VI minimizes reverse KL.)

41. Why is squared loss the right loss for regression? Under Gaussian noise assumption, MLE = squared loss. Other noise models give other losses (Huber for heavy-tailed, MAE for Laplace noise).

42. RLHF reward model — what's the MLE? Bradley-Terry: $p (y_{w} ≻ y_{l} ∣ x) = σ (r (x, y_{w}) - r (x, y_{l}))$ . MLE is logistic regression on (preferred, rejected) pairs.

43. SFT loss = MLE of what? Conditional language model: $p(y ∣ x; θ)$. Minimize $-\sum_{(x, y)} \log p_{θ}(y ∣ x)$ = MLE.

44. DPO loss derivation starting point? Substitute the optimal RLHF policy ( $π^{*} (y ∣ x) \propto π_{ref} (y ∣ x) exp (r / β)$ ) into the Bradley-Terry MLE, eliminating the reward — yields a closed-form classification objective on preferences.
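Questions 42 and 44 share one likelihood, which the sketch below makes explicit for a single preference pair. The function names and the scalar log-probability inputs are hypothetical stand-ins for model outputs. Solving the optimal-policy expression for the reward gives $r = β \log(π / π_{ref})$ plus a term that depends only on $x$, and that term cancels in the difference.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_nll(r_chosen, r_rejected):
    """Negative log-likelihood of one (preferred, rejected) pair under Bradley-Terry."""
    return -log_sigmoid(r_chosen - r_rejected)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: Bradley-Terry NLL with implicit rewards
    r = beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    return bradley_terry_nll(r_chosen, r_rejected)

# When the policy equals the reference, both implicit rewards are zero.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))   # log(2), about 0.693

# Raising the chosen response's log-probability relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

Averaging `bradley_terry_nll` over a dataset with a learned reward gives the reward-model objective. Averaging `dpo_loss` gives the same logistic-regression form with the reward replaced by policy log-ratios.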

H. Subtleties

45. Is MLE always unbiased? No. MLE for Gaussian variance is biased; many other MLEs are biased in finite samples.

46. Is MAP always unbiased? Almost never. MAP introduces deliberate bias to reduce variance.

47. Why might you prefer MAP over MLE? Small data + strong prior → MAP regularizes against overfitting. Equivalent to standard regularization.

48. Why might you prefer Bayesian inference over MAP? Need uncertainty estimates, want credible intervals, decision-theoretic problems with non-symmetric loss. MAP throws away the posterior shape.

49. When does MAP become a poor summary of the posterior? When the posterior is multimodal or highly skewed, and whenever the parameterization is arbitrary: MAP is not invariant under reparameterization (the MLE is), so the MAP point shifts under a change of variables.

50. Why is MAP not invariant under reparameterization? Under a transformation $θ \to ϕ = g (θ)$ , the prior density transforms by a Jacobian. The mode of $p (ϕ ∣ x)$ is generally not $g (\hat{θ}_{MAP})$ .
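The Jacobian effect in question 50 can be checked numerically. The sketch below assumes a Beta(7, 5) posterior over $θ$ and the reparameterization $ϕ = \mathrm{logit}(θ)$, both chosen for illustration. It finds each mode by a simple grid search rather than in closed form.

```python
import math

def log_post_theta(theta, a, b):
    """Log density (up to a constant) of a Beta(a, b) posterior over theta."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def log_post_phi(phi, a, b):
    """Same posterior over phi = logit(theta), including the Jacobian
    |d theta / d phi| = theta * (1 - theta)."""
    theta = 1.0 / (1.0 + math.exp(-phi))
    return log_post_theta(theta, a, b) + math.log(theta) + math.log(1 - theta)

a, b = 7, 5

theta_grid = [i / 10000 for i in range(1, 10000)]      # theta in (0, 1)
map_theta = max(theta_grid, key=lambda t: log_post_theta(t, a, b))

phi_grid = [i / 1000 for i in range(-5000, 5001)]      # phi in [-5, 5]
map_phi = max(phi_grid, key=lambda p: log_post_phi(p, a, b))
map_phi_as_theta = 1.0 / (1.0 + math.exp(-map_phi))    # map the phi-mode back to theta

print(map_theta)          # close to (a - 1) / (a + b - 2) = 0.6
print(map_phi_as_theta)   # close to a / (a + b), about 0.583
```

The two numbers describe the same posterior but disagree on its "most probable" value of $θ$. The MLE has no such issue because the likelihood carries no Jacobian factor.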

Quick fire

51. MLE Bernoulli? Sample mean.
52. MLE Gaussian variance divisor? $n$ (biased).
53. Unbiased Gaussian variance divisor? $n - 1$.
54. OLS = MLE under what? Gaussian noise.
55. Ridge = MAP under what? Gaussian prior.
56. Lasso = MAP under what? Laplace prior.
57. Conjugate of Bernoulli? Beta.
58. Beta($α, β$) mean? $α / (α + β)$.
59. Beta(1,1) is? Uniform on $[0, 1]$.
60. MLE achieves what bound? Cramér-Rao.

Self-grading

If you can't answer 1-15, you don't know MLE. If you can't answer 16-35, you'll struggle on every Bayesian/regularization question. If you can't answer 36-50, frontier-lab questions on RLHF/DPO/loss design will go past you.

Aim for 40+/60 cold.

ML & LLM Interview Prep — Deep Dives