Probability for ML — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

Probability is the substrate of ML. Senior interviews use probability to test whether you understand uncertainty, can do clean derivations, and can apply Bayesian reasoning under pressure. This deep dive nails the foundations.


1. Probability axioms and basic identities

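A probability is a function $P$ on the events of a sample space $\Omega$ satisfying:

  • $P(\Omega) = 1$, $P(\emptyset) = 0$.
  • $0 \le P(A) \le 1$ for any event $A$.
  • Countable additivity: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$ for disjoint $A_1, A_2, \dots$.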

Identities to know cold

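  • Complement: $P(A^c) = 1 - P(A)$.
  • Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
  • Union bound: $P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$.
  • Conditional: $P(A \mid B) = P(A \cap B) / P(B)$, defined when $P(B) > 0$.
  • Multiplication: $P(A \cap B) = P(A \mid B)\,P(B)$.
  • Independence: $P(A \cap B) = P(A)\,P(B)$ iff $A$ and $B$ are independent.
  • Law of total probability: $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ for a partition $\{B_i\}$ of $\Omega$.
  • Bayes' theorem: $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$.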
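
The identities above can be checked exactly on a small finite sample space. The sketch below is only an illustration: the two-dice sample space and the events `is_sum_seven` and `first_is_three` are assumptions chosen for convenience, not part of the original notes. It uses exact fractions so the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair six-sided dice (36 equally likely points).
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event) under the uniform measure on OMEGA; `event` is a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def is_sum_seven(w):  # event A (illustrative choice)
    return w[0] + w[1] == 7


def first_is_three(w):  # event B (illustrative choice)
    return w[0] == 3


p_a = prob(is_sum_seven)
p_b = prob(first_is_three)
p_ab = prob(lambda w: is_sum_seven(w) and first_is_three(w))
p_a_or_b = prob(lambda w: is_sum_seven(w) or first_is_three(w))

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
assert p_a_or_b == p_a + p_b - p_ab

# Bayes: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a
assert p_a_given_b == p_b_given_a * p_a / p_b

# Independence: for this particular pair, P(A and B) = P(A) P(B).
assert p_ab == p_a * p_b
```

Swapping in a different pair of events (for example "sum is 8" and "first die is 3") is a quick way to see independence fail while the other identities still hold.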

2. Random variables, expectations, variance

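A random variable is a function $X: \Omega \to \mathbb{R}$ with an induced distribution.

PMF/PDF: $p(x) = P(X = x)$ for discrete; a density $f(x)$ with $P(a \le X \le b) = \int_a^b f(x)\,dx$ for continuous.

CDF: $F(x) = P(X \le x)$.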

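Expectation

$$\mathbb{E}[X] = \sum_x x\,p(x) \quad \text{(discrete, PMF } p\text{)}, \qquad \mathbb{E}[X] = \int x\,f(x)\,dx \quad \text{(continuous, PDF } f\text{)}$$

Linearity (always, even for dependent RVs): $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$.

Law of the unconscious statistician: $\mathbb{E}[g(X)] = \sum_x g(x)\,p(x)$ (discrete) or $\int g(x)\,f(x)\,dx$ (continuous).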

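Variance and covariance

$$\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2, \qquad \mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$

Variance of a sum: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.

For independent $X, Y$: $\mathrm{Cov}(X, Y) = 0$ → variance adds. (Note: $\mathrm{Cov}(X, Y) = 0$ does NOT imply independence in general, only for jointly Gaussian.)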

Conditional expectation and variance

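Tower (law of total expectation): $\mathbb{E}[X] = \mathbb{E}\big[\mathbb{E}[X \mid Y]\big]$.

Law of total variance: $\mathrm{Var}(X) = \mathbb{E}\big[\mathrm{Var}(X \mid Y)\big] + \mathrm{Var}\big(\mathbb{E}[X \mid Y]\big)$.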

These are constantly useful in ML problems involving hierarchical or latent models (e.g., bias-variance decomposition arguments).
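
A small simulation can make the law of total variance concrete. The sketch below assumes a hypothetical two-level model (a Bernoulli(0.3) latent $Y$ selecting between $\mathcal{N}(0, 1)$ and $\mathcal{N}(5, 2^2)$ for $X$), chosen only for illustration. Analytically the within-group term is $0.7 \cdot 1 + 0.3 \cdot 4 = 1.9$ and the between-group term is $25 \cdot 0.3 \cdot 0.7 = 5.25$.

```python
import random
import statistics


def sample_hierarchical(n, seed=0):
    """Draw (y, x) pairs: y ~ Bernoulli(0.3); x | y ~ N(5, 2^2) if y == 1 else N(0, 1)."""
    rng = random.Random(seed)
    ys, xs = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0
        mu, sigma = (5.0, 2.0) if y == 1 else (0.0, 1.0)
        ys.append(y)
        xs.append(rng.gauss(mu, sigma))
    return ys, xs


def variance_decomposition(ys, xs):
    """Return (E[Var(X|Y)], Var(E[X|Y])) computed under the empirical distribution."""
    n = len(xs)
    groups = {}
    for y, x in zip(ys, xs):
        groups.setdefault(y, []).append(x)
    overall_mean = statistics.fmean(xs)
    # Within-group term: weighted average of the per-group (population) variances.
    within = sum(len(g) / n * statistics.pvariance(g) for g in groups.values())
    # Between-group term: weighted variance of the per-group means.
    between = sum(
        len(g) / n * (statistics.fmean(g) - overall_mean) ** 2 for g in groups.values()
    )
    return within, between


ys, xs = sample_hierarchical(20_000)
within, between = variance_decomposition(ys, xs)
total = statistics.pvariance(xs)
print(f"E[Var(X|Y)] = {within:.3f}, Var(E[X|Y]) = {between:.3f}, Var(X) = {total:.3f}")
```

With population variances and empirical group weights, the two terms sum to the overall variance up to floating-point error; the Monte Carlo noise only affects how close each term is to its analytic value.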


3. Common distributions — what to know

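| Distribution | PMF/PDF | Mean | Variance | When |
|---|---|---|---|---|
| Bernoulli($p$) | $p^x (1-p)^{1-x}$, $x \in \{0, 1\}$ | $p$ | $p(1-p)$ | Binary outcome |
| Binomial($n, p$) | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | Sum of Bernoullis |
| Geometric($p$) | $(1-p)^{k-1} p$, $k = 1, 2, \dots$ | $1/p$ | $(1-p)/p^2$ | Trials until first success |
| Poisson($\lambda$) | $\lambda^k e^{-\lambda} / k!$ | $\lambda$ | $\lambda$ | Rare events, count data |
| Uniform($a, b$) | $1/(b-a)$ on $[a, b]$ | $(a+b)/2$ | $(b-a)^2/12$ | No info, bounded |
| Normal($\mu, \sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$ | $\sigma^2$ | CLT, continuous |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x \ge 0$ | $1/\lambda$ | $1/\lambda^2$ | Time to event, memoryless |
| Gamma($k, \lambda$) | $\lambda^k x^{k-1} e^{-\lambda x} / \Gamma(k)$ | $k/\lambda$ | $k/\lambda^2$ | Sum of exponentials |
| Beta($\alpha, \beta$) | $x^{\alpha-1}(1-x)^{\beta-1} / B(\alpha, \beta)$ | $\alpha/(\alpha+\beta)$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | Probability of probability |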

Key relationships

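  • Sum of $n$ iid Bernoulli($p$) → Binomial($n, p$).
  • Limit of Binomial($n, p$) as $n \to \infty$ with $np = \lambda$ fixed → Poisson($\lambda$).
  • Sum of independent Poissons → Poisson with summed rate.
  • Sum of $k$ iid Exponential($\lambda$) → Gamma($k, \lambda$).
  • $\chi^2_k$ = sum of $k$ squared independent standard normals.
  • t-distribution: ratio of a standard normal to $\sqrt{\chi^2_k / k}$, with numerator and denominator independent.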
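
As a quick numerical check of the Binomial-to-Poisson limit, the sketch below compares the two PMFs for one illustrative setting ($n = 1000$, $p = 0.003$, so $\lambda = np = 3$). These values are assumptions picked for the example. The comparison is restricted to $k < 50$, where essentially all the mass lies, to keep the floating-point arithmetic in range.

```python
import math


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n, p = 1000, 0.003  # illustrative: large n, small p
lam = n * p

for k in range(7):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))

# Largest pointwise gap between the two PMFs over the range that carries the mass.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(50))
print("max pointwise gap:", max_gap)
```

Rerunning with a larger $p$ at the same $\lambda$ (for example $n = 10$, $p = 0.3$) shows the approximation degrading, which is the practical meaning of "rare events".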

4. The Gaussian — workhorse distribution

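PDF: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.

Multivariate Gaussian

For $x \in \mathbb{R}^d$ with mean $\mu$ and covariance $\Sigma$:

$$f(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$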

Properties to memorize

  • Affine transformations: if $X \sim \mathcal{N}(\mu, \Sigma)$, then $AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top)$.
  • Marginals are Gaussian: any marginal of a multivariate Gaussian is Gaussian.
  • Conditionals are Gaussian: $X_1 \mid X_2$, where $(X_1, X_2)$ is jointly Gaussian, is Gaussian, with mean and variance computable in closed form.
  • Sum of independent Gaussians is Gaussian.
  • Uncorrelated jointly Gaussian = independent. (Special property — does NOT hold in general.)

Conditioning formula

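For $\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$:

$$X_1 \mid X_2 = x_2 \;\sim\; \mathcal{N}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)$$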

This is the foundation of Gaussian processes, Kalman filters, Bayesian linear regression, and many other methods.
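
For intuition, here is the standard Gaussian conditioning formula specialized to the bivariate case, where every block is a scalar and $\Sigma_{12}\Sigma_{22}^{-1}$ reduces to a single regression-style gain. The function name and the numbers in the usage line are illustrative assumptions.

```python
def condition_bivariate_gaussian(mu1, mu2, var1, var2, cov12, x2):
    """Mean and variance of X1 | X2 = x2 for a bivariate Gaussian (scalar blocks)."""
    gain = cov12 / var2                    # Sigma_12 Sigma_22^{-1}
    cond_mean = mu1 + gain * (x2 - mu2)    # shift the prior mean by the observed surprise
    cond_var = var1 - gain * cov12         # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    return cond_mean, cond_var


# Illustrative numbers: observing x2 three units above its mean.
cond_mean, cond_var = condition_bivariate_gaussian(
    mu1=1.0, mu2=2.0, var1=4.0, var2=9.0, cov12=3.0, x2=5.0
)
print(cond_mean, cond_var)  # approximately 2.0 and 3.0
```

Note that the conditional variance does not depend on the observed value `x2`, only on the covariance structure; this is the property Kalman filters and Gaussian processes exploit.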


5. Convergence and limit theorems

Law of large numbers (LLN)

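For iid $X_1, \dots, X_n$ with finite mean $\mu$ and sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$:

  • Weak LLN: $\bar{X}_n \xrightarrow{p} \mu$ (convergence in probability).
  • Strong LLN: $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$ (almost-sure convergence).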

The empirical mean converges to the true mean. This is why Monte Carlo estimation works.
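
A minimal Monte Carlo sketch of the LLN in action: estimating $\mathbb{E}[U^2] = 1/3$ for $U \sim \mathrm{Uniform}(0, 1)$ by averaging samples. The target quantity and sample sizes are assumptions chosen so the true answer is known.

```python
import random


def mc_mean_of_square(n, seed=0):
    """Monte Carlo estimate of E[U^2] for U ~ Uniform(0, 1); the true value is 1/3."""
    rng = random.Random(seed)
    return sum(rng.random() ** 2 for _ in range(n)) / n


for n in (100, 10_000, 100_000):
    estimate = mc_mean_of_square(n)
    print(f"n = {n:>7}: estimate = {estimate:.4f}, abs error = {abs(estimate - 1/3):.4f}")
```

The error typically shrinks like $1/\sqrt{n}$, so each extra digit of accuracy costs roughly 100 times more samples.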

Central limit theorem (CLT)

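For iid $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$:

$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

The sample mean is approximately Gaussian, $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$, for large $n$, regardless of the underlying distribution. This is why so many statistical tests assume normality of the sample mean.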

Caveats:

  • Need finite variance: the classical CLT fails for heavy-tailed distributions like Cauchy.
  • Convergence rate depends on the third moment (Berry-Esseen); very skewed distributions need larger $n$.
  • For small samples with an estimated variance, use the $t$-distribution instead of the normal for inference.
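
The CLT can be sanity-checked by simulation. The sketch below assumes Exponential(1) data (skewed, with $\mu = \sigma = 1$) and checks how often the standardized sample mean lands inside $\pm 1.96$, which should be close to 95% if the Gaussian approximation is good. The sample size and replicate count are illustrative.

```python
import math
import random


def standardized_means(n, reps, seed=0):
    """Return z = sqrt(n) * (mean - mu) / sigma for Exponential(1) samples (mu = sigma = 1)."""
    rng = random.Random(seed)
    zs = []
    for _ in range(reps):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append(math.sqrt(n) * (mean - 1.0))
    return zs


zs = standardized_means(n=50, reps=5000)
coverage = sum(abs(z) < 1.96 for z in zs) / len(zs)
print(f"fraction of |z| < 1.96: {coverage:.3f} (Gaussian value is about 0.950)")
```

Dropping `n` to 2 or 3 makes the skew of the exponential visible in the standardized means, which illustrates the convergence-rate caveat.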

6. Bayes' theorem — key applications

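Naive Bayes classifier

$$P(y \mid x_1, \dots, x_d) \propto P(y) \prod_{j=1}^{d} P(x_j \mid y)$$

The "naive" assumption: features are conditionally independent given the class. Often a surprisingly competitive baseline.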
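
To make the naive factorization concrete, here is a minimal Bernoulli naive Bayes sketch in plain Python. Everything in it is an assumption for illustration: the function names, the Laplace smoothing constant `alpha`, and the toy dataset (two binary features, two classes). It is not a substitute for a library implementation.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log P(class) and P(feature_j = 1 | class) with Laplace smoothing."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(y))
        theta = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (log_prior, theta)
    return model


def predict(model, x):
    """Return the class maximizing log P(c) + sum_j log P(x_j | c)."""
    def log_joint(c):
        log_prior, theta = model[c]
        return log_prior + sum(
            math.log(t if xj == 1 else 1.0 - t) for xj, t in zip(x, theta)
        )
    return max(model, key=log_joint)


# Toy data (hypothetical): feature 0 tends to fire for class 1, feature 1 for class 0.
X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = fit_bernoulli_nb(X, y)
print(predict(model, [1, 0]), predict(model, [0, 1]))
```

Working in log space avoids underflow when the product runs over many features, and the smoothing term keeps an unseen feature value from zeroing out a class.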

Medical testing (the canonical interview question)

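Disease prevalence $P(D) = 0.01$. Test sensitivity $P(+ \mid D) = 0.95$, specificity $P(- \mid \neg D) = 0.95$.

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.059} \approx 0.16$$

Even with a test that is 95% sensitive and 95% specific, only about 16% of positives have the disease. Ignoring the low prior is the base rate fallacy, and the same arithmetic is why rare-event detection is hard in ML.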
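
The calculation is short enough to wrap in a function and rerun with different inputs. The sketch below assumes the setting that reproduces the 16% figure (1% prevalence, 95% sensitivity, 95% specificity); the function name is illustrative.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)


for prevalence in (0.001, 0.01, 0.1):
    post = posterior_given_positive(prevalence, sensitivity=0.95, specificity=0.95)
    print(f"prevalence = {prevalence:<5}: P(D | +) = {post:.3f}")
```

Sweeping the prevalence shows how strongly the posterior depends on the base rate even when the test itself is unchanged.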

Bayesian update

Prior $p(\theta)$ + likelihood $p(x \mid \theta)$ → posterior $p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)$. Sequential data: posterior becomes prior for next observation.
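
A conjugate Beta-Bernoulli model is a convenient way to see "posterior becomes prior" in code. The sketch below assumes a uniform Beta(1, 1) prior and a made-up sequence of coin flips; with a Beta prior and Bernoulli likelihood, each observation simply increments one of the two parameters.

```python
def update_beta(alpha, beta, observation):
    """One Bayesian update of a Beta(alpha, beta) prior on a Bernoulli success probability."""
    return (alpha + 1, beta) if observation == 1 else (alpha, beta + 1)


data = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical coin flips (1 = success)

# Sequential: feed observations one at a time, reusing the posterior as the next prior.
alpha, beta = 1, 1  # uniform prior, Beta(1, 1)
for obs in data:
    alpha, beta = update_beta(alpha, beta, obs)

# Batch: update once with the total counts. Both routes give the same posterior.
batch = (1 + sum(data), 1 + len(data) - sum(data))
posterior_mean = alpha / (alpha + beta)
print((alpha, beta), batch, posterior_mean)
```

The agreement between the sequential and batch results is the practical content of the statement that the order of conditionally independent observations does not matter.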


7. Joint, marginal, conditional

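For two RVs $X, Y$ with joint distribution:

  • Joint PMF/PDF: $p(x, y)$.
  • Marginal: $p(x) = \sum_y p(x, y)$ or $p(x) = \int p(x, y)\,dy$.
  • Conditional: $p(x \mid y) = p(x, y) / p(y)$.

Independence: $p(x, y) = p(x)\,p(y)$ for all $x, y$.

Conditional independence: $X \perp Y \mid Z$ iff $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$. Different from unconditional independence: neither implies the other.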
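
The gap between conditional and unconditional independence is easy to verify on an explicit joint table. The sketch below builds a hypothetical common-cause model (a fair coin $Z$ driving two noisy binary readings $X$ and $Y$) and checks both properties with exact fractions. The specific probabilities are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)
INDEX = {"x": 0, "y": 1, "z": 2}


def p_one_given_z(z):
    """P(reading = 1 | Z = z): each reading agrees with Z with probability 9/10."""
    return Fraction(9, 10) if z == 1 else Fraction(1, 10)


def bern(p, value):
    """PMF of a Bernoulli(p) variable evaluated at value in {0, 1}."""
    return p if value == 1 else 1 - p


# joint[(x, y, z)] = P(Z = z) * P(X = x | Z = z) * P(Y = y | Z = z)
joint = {
    (x, y, z): HALF * bern(p_one_given_z(z), x) * bern(p_one_given_z(z), y)
    for x, y, z in product((0, 1), repeat=3)
}


def marginal(**fixed):
    """Sum the joint over every variable not pinned in `fixed` (keys 'x', 'y', 'z')."""
    return sum(
        p for key, p in joint.items()
        if all(key[INDEX[name]] == value for name, value in fixed.items())
    )


# X and Y are independent given Z ...
cond_indep = all(
    marginal(x=x, y=y, z=z) / marginal(z=z)
    == (marginal(x=x, z=z) / marginal(z=z)) * (marginal(y=y, z=z) / marginal(z=z))
    for x, y, z in product((0, 1), repeat=3)
)
# ... but not independent once Z is marginalized out.
marg_indep = all(
    marginal(x=x, y=y) == marginal(x=x) * marginal(y=y)
    for x, y in product((0, 1), repeat=2)
)
print(cond_indep, marg_indep)
```

Here $P(X{=}1, Y{=}1) = 41/100$ while $P(X{=}1)\,P(Y{=}1) = 1/4$: seeing one reading is informative about the other until $Z$ is known.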


8. Common interview gotchas

| Question | Common wrong answer | Right answer |
|---|---|---|
| Cov = 0 implies independence? | Yes | Only for jointly Gaussian; in general no |
| CLT requires iid? | Yes (strict) | Identical distribution can be relaxed (Lindeberg conditions); finite variance critical |
| $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$ implies independence? | Yes | Only that they're uncorrelated |
| Variance of a sum equals the sum of variances? | Yes (always) | Yes for independent RVs if the variances exist; for dependent RVs must include $2\,\mathrm{Cov}(X, Y)$ |
| Memoryless property? | Geometric and exponential | Yes; specifically, these are the only memoryless distributions (geometric among discrete, exponential among continuous) |
| Bayes' theorem requires prior to be informative? | Yes | No; a uniform prior is fine, since Bayes is about belief update |
| If $X, Y$ are jointly Gaussian, is $aX + bY$ Gaussian? | Maybe | Yes; affine transformations of Gaussians are Gaussian |

9. Eight most-asked interview questions

  1. Walk through Bayes' theorem with the medical-testing example. (Lock down the base-rate-fallacy intuition.)
  2. Derive CLT informally. (Sum of zero-mean RVs scaled by $1/\sqrt{n}$ converges to Gaussian; characteristic function argument.)
  3. State the law of total expectation and law of total variance. (Tower property; bias-variance decomposition uses this.)
  4. What's the difference between uncorrelated and independent? (Cov = 0 vs joint = product of marginals; Gaussian is the special case where they coincide.)
  5. How do you sample from a non-uniform distribution? (Inverse CDF, rejection sampling, MCMC; understand each.)
  6. Compute the marginal of a 2D Gaussian. (Marginalize one variable; result is Gaussian with the corresponding marginal mean and variance.)
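  7. Explain conditional independence. (Different from independence; canonical examples: effects of a common cause are dependent but become independent given the cause, while causes of a common effect are independent but become dependent given the effect.)
  8. What does Poisson approximate? (Binomial with $n$ large, $p$ small, $np = \lambda$ fixed; rare events.)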
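
For question 5, inverse-CDF sampling is the simplest of the three methods to demonstrate. The sketch below assumes an Exponential target, whose CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form; the function name and rate are illustrative.

```python
import math
import random


def sample_exponential(rate, n, seed=0):
    """Inverse-CDF sampling: F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is always defined.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]


samples = sample_exponential(rate=2.0, n=100_000)
sample_mean = sum(samples) / len(samples)
print(f"sample mean = {sample_mean:.4f} (theory: 1/rate = 0.5)")
```

The method needs an invertible CDF; when no closed-form inverse exists, rejection sampling or MCMC are the usual fallbacks.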

10. Drill plan

  • For each common distribution: PMF/PDF, mean, variance, generating story. 1 minute each.
  • Bayes' problem: medical test → recompute given different prevalence/sensitivity. Until automatic.
  • Derive Gaussian conditional formula from the joint density manipulation.
  • Practice 5 problems where you apply the law of total variance.
  • Compute Var(sum) and Var(mean) for iid and non-iid cases.

11. Further reading

  • Casella & Berger, Statistical Inference, ch. 1–4.
  • Wasserman, All of Statistics, ch. 1–4.
  • Pitman, Probability: a friendly introduction.
  • 3Blue1Brown, "But what is the Central Limit Theorem?": beautiful visual intuition.