Probability for ML — Deep Dive
Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.
Probability is the substrate of ML. Senior interviews use probability to test whether you understand uncertainty, can do clean derivations, and can apply Bayesian reasoning under pressure. This deep dive nails the foundations.
1. Probability axioms and basic identities
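A probability $P$ is a function on the events of a sample space $\Omega$ satisfying:
- $P(\Omega) = 1$, $P(\emptyset) = 0$.
- $P(A) \ge 0$ for any event $A$.
- Countable additivity: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$ for disjoint $A_1, A_2, \dots$.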
Identities to know cold
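- Complement: $P(A^c) = 1 - P(A)$.
- Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
- Union bound: $P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$.
- Conditional: $P(A \mid B) = P(A \cap B) / P(B)$, for $P(B) > 0$.
- Multiplication: $P(A \cap B) = P(A \mid B)\,P(B)$.
- Independence: $P(A \cap B) = P(A)\,P(B)$ iff $A, B$ independent.
- Law of total probability: $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ for partition $\{B_i\}$.
- Bayes' theorem: $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$.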
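A quick way to check these identities is exact enumeration on a small sample space. The sketch below uses two fair dice as an assumed toy example. The events `A` and `B` and the helper `prob` are illustrative names, not part of the original notes.

```python
from fractions import Fraction
from itertools import product

# Assumed toy sample space: the 36 equally likely outcomes of two fair dice.
OMEGA = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure on OMEGA; `event` is a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def A(w):
    return w[0] + w[1] == 8   # the two dice sum to 8

def B(w):
    return w[0] == w[1]       # doubles

p_a, p_b = prob(A), prob(B)
p_ab = prob(lambda w: A(w) and B(w))
p_a_or_b = prob(lambda w: A(w) or B(w))

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
inclusion_exclusion_holds = p_a_or_b == p_a + p_b - p_ab

# Conditional probability, then Bayes' theorem as a consistency check
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a
bayes_holds = p_a_given_b == p_b_given_a * p_a / p_b

# Law of total probability, partitioning on the value of the first die
p_a_total = sum(
    prob(lambda w, i=i: A(w) and w[0] == i) / prob(lambda w, i=i: w[0] == i)
    * prob(lambda w, i=i: w[0] == i)
    for i in range(1, 7)
)

print(p_a, p_a_given_b, inclusion_exclusion_holds, bayes_holds, p_a_total == p_a)
```

Using `Fraction` keeps every probability exact, so the identities hold with `==` rather than up to floating-point tolerance.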
2. Random variables, expectations, variance
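A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ with an induced distribution.
PMF/PDF: $p(x) = P(X = x)$ for discrete; a density $f(x)$ with $P(a \le X \le b) = \int_a^b f(x)\,dx$ for continuous.
CDF: $F(x) = P(X \le x)$.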
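Expectation
$E[X] = \sum_x x\,p(x)$ (discrete) or $E[X] = \int x\,f(x)\,dx$ (continuous).
Linearity (always, even for dependent RVs): $E[aX + bY] = a\,E[X] + b\,E[Y]$.
Law of the unconscious statistician: $E[g(X)] = \sum_x g(x)\,p(x)$, or $\int g(x)\,f(x)\,dx$ in the continuous case.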
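Variance and covariance
$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$ and $\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$.
Variance of a sum: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
For independent $X, Y$: $\mathrm{Cov}(X, Y) = 0$ → variance adds. (Note: $\mathrm{Cov}(X, Y) = 0$ does NOT imply independence in general, only for jointly Gaussian.)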
Conditional expectation and variance
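Tower (law of total expectation): $E[X] = E\big[E[X \mid Y]\big]$.
Law of total variance: $\mathrm{Var}(X) = E\big[\mathrm{Var}(X \mid Y)\big] + \mathrm{Var}\big(E[X \mid Y]\big)$.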
These are constantly useful in ML problems involving hierarchical or latent models (e.g., bias-variance decomposition arguments).
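As a concrete check, the law of total variance says the overall variance is the average within-group variance plus the variance of the group means. The sketch below applies it to a hypothetical two-group latent model. The weights, means, and variances in `groups` are made-up illustrative numbers.

```python
# Hypothetical latent model: Y selects a group, and X | Y = k has the listed mean and variance.
groups = [
    # (P(Y = k), E[X | Y = k], Var(X | Y = k))
    (0.3, 0.0, 1.0),
    (0.7, 5.0, 4.0),
]

def total_mean(groups):
    """Tower property: E[X] = sum_k P(Y = k) * E[X | Y = k]."""
    return sum(w * m for w, m, _ in groups)

def total_variance(groups):
    """Law of total variance: E[Var(X | Y)] + Var(E[X | Y])."""
    mean = total_mean(groups)
    within = sum(w * v for w, _, v in groups)                   # E[Var(X | Y)]
    between = sum(w * (m - mean) ** 2 for w, m, _ in groups)    # Var(E[X | Y])
    return within + between

def direct_variance(groups):
    """Cross-check via E[X^2] - E[X]^2, using E[X^2 | Y = k] = Var + mean^2."""
    second_moment = sum(w * (v + m ** 2) for w, m, v in groups)
    return second_moment - total_mean(groups) ** 2

print(total_mean(groups), total_variance(groups), direct_variance(groups))
```

Both routes should agree up to floating-point rounding, which is a useful sanity check when doing these decompositions by hand.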
3. Common distributions — what to know
| Distribution | PMF/PDF | Mean | Variance | When |
|---|---|---|---|---|
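| Bernoulli($p$) | $p^x (1-p)^{1-x}$, $x \in \{0, 1\}$ | $p$ | $p(1-p)$ | Binary outcome |
| Binomial($n, p$) | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | Sum of Bernoullis |
| Geometric($p$) | $(1-p)^{k-1} p$, $k \ge 1$ | $1/p$ | $(1-p)/p^2$ | Trials until first success |
| Poisson($\lambda$) | $e^{-\lambda} \lambda^k / k!$ | $\lambda$ | $\lambda$ | Rare events, count data |
| Uniform($a, b$) | $1/(b-a)$ on $[a, b]$ | $(a+b)/2$ | $(b-a)^2/12$ | No info, bounded |
| Normal($\mu, \sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$ | $\sigma^2$ | CLT, continuous |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x \ge 0$ | $1/\lambda$ | $1/\lambda^2$ | Time to event, memoryless |
| Gamma($k, \lambda$) | $\frac{\lambda^k}{\Gamma(k)} x^{k-1} e^{-\lambda x}$, $x \ge 0$ | $k/\lambda$ | $k/\lambda^2$ | Sum of exponentials |
| Beta($\alpha, \beta$) | $\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ on $[0, 1]$ | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | Probability of probability |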
Key relationships
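- Sum of $n$ iid Bernoulli($p$) → Binomial($n, p$).
- Limit of Binomial($n, p$) as $n \to \infty$ with $np = \lambda$ fixed → Poisson($\lambda$).
- Sum of independent Poissons → Poisson with summed rate.
- Sum of $k$ iid Exponential($\lambda$) → Gamma($k, \lambda$).
- $\chi^2_k$ = sum of $k$ squared independent standard normals.
- t-distribution: ratio of a standard normal to $\sqrt{\chi^2_k / k}$, with numerator and denominator independent.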
4. The Gaussian — workhorse distribution
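PDF: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
Multivariate Gaussian
For $x \in \mathbb{R}^d$ with mean $\mu$ and covariance $\Sigma$: $f(x) = (2\pi)^{-d/2} \det(\Sigma)^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$.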
Properties to memorize
- Affine transformations: if $X \sim \mathcal{N}(\mu, \Sigma)$, then $AX + b \sim \mathcal{N}(A\mu + b, A \Sigma A^\top)$.
- Marginals are Gaussian: any marginal of a multivariate Gaussian is Gaussian.
- Conditionals are Gaussian: $X_1 \mid X_2$, where $(X_1, X_2)$ is jointly Gaussian, is Gaussian, with mean and variance computable in closed form.
- Sum of independent Gaussians is Gaussian.
- Uncorrelated jointly Gaussian = independent. (Special property — does NOT hold in general.)
Conditioning formula
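For $\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$:

$$X_1 \mid X_2 = x_2 \;\sim\; \mathcal{N}\left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)$$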
This is the foundation of Gaussian processes, Kalman filters, Bayesian linear regression, and many other methods.
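In the bivariate case every block of the covariance is a scalar, so the conditional mean is the prior mean plus a gain times the observed deviation, and the conditional variance is the prior variance minus the gain times the covariance. A minimal sketch of that scalar case follows. The function name and argument names are illustrative.

```python
def condition_bivariate_gaussian(mu1, mu2, s11, s12, s22, x2):
    """Return (mean, variance) of X1 | X2 = x2 for a bivariate Gaussian.

    s11 = Var(X1), s22 = Var(X2), s12 = Cov(X1, X2). Assumes s22 > 0.
    """
    gain = s12 / s22                        # scalar analogue of Sigma_12 Sigma_22^{-1}
    cond_mean = mu1 + gain * (x2 - mu2)     # shift the mean toward the observation
    cond_var = s11 - gain * s12             # observing X2 can only reduce variance
    return cond_mean, cond_var

# Unit variances with correlation 0.8, observing X2 = 2.
mean, var = condition_bivariate_gaussian(0.0, 0.0, 1.0, 0.8, 1.0, 2.0)
print(mean, var)
```

Note that the conditional variance does not depend on the observed value, only on the covariance structure. With zero covariance the conditional equals the marginal.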
5. Convergence and limit theorems
Law of large numbers (LLN)
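For iid $X_1, \dots, X_n$ with finite mean $\mu$, and sample mean $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$:
- Weak LLN: $\bar{X}_n \xrightarrow{p} \mu$ (convergence in probability).
- Strong LLN: $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$ (almost sure convergence).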
The empirical mean converges to the true mean. This is why Monte Carlo estimation works.
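A minimal Monte Carlo sketch of this idea: estimate $\pi$ as four times the fraction of uniform points in the unit square that land inside the quarter disc. The function name, the seed, and the sample sizes are illustrative choices.

```python
import random

def mc_estimate_pi(n_samples, seed=0):
    """Estimate pi as 4 * P(U1^2 + U2^2 <= 1) for independent U1, U2 ~ Uniform[0, 1)."""
    rng = random.Random(seed)   # local generator so the estimate is reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

for n in (100, 10_000, 200_000):
    print(n, mc_estimate_pi(n))
```

The estimate is an empirical mean of indicator variables, so the LLN guarantees convergence, while the error shrinks only at the rate $1/\sqrt{n}$.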
Central limit theorem (CLT)
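For iid $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$: $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.
The sample mean is approximately Gaussian for large $n$, regardless of the underlying distribution. This is why so many statistical tests assume normality of the sample mean.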
Caveats:
- Need finite variance — fails for heavy-tailed distributions like Cauchy.
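- Convergence rate depends on the third moment (Berry-Esseen); very skewed distributions need larger $n$.
- For small samples with the variance estimated from the data, use the $t$-distribution instead of the normal for inference (exact when the data are Gaussian).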
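The CLT can be checked by simulation: standardize many sample means from a skewed distribution and see how often they fall within $\pm 1.96$, which should be close to 95% if the Gaussian approximation holds. The sketch below uses Exponential(1) draws. The function names, seed, and trial counts are illustrative.

```python
import math
import random

def standardized_means(n, trials, seed=0):
    """Return sqrt(n) * (xbar - mu) / sigma for `trials` samples of size n from Exponential(1)."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0    # mean and standard deviation of Exponential(1)
    zs = []
    for _ in range(trials):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append(math.sqrt(n) * (xbar - mu) / sigma)
    return zs

def coverage(zs, z=1.96):
    """Fraction of standardized means inside [-z, z]; close to 0.95 under a standard normal."""
    return sum(1 for v in zs if abs(v) <= z) / len(zs)

for n in (2, 10, 50):
    print(n, coverage(standardized_means(n, trials=4000)))
```

Because the exponential is right-skewed, the two tails are not equally well approximated at small `n`, which is the skewness caveat above.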
6. Bayes' theorem — key applications
Naive Bayes classifier
$P(y \mid x_1, \dots, x_d) \propto P(y) \prod_{j=1}^d P(x_j \mid y)$.
The "naive" assumption: features are conditionally independent given the class. Surprisingly competitive baseline.
Medical testing (the canonical interview question)
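Disease prevalence $P(D) = 0.01$. Test sensitivity $P(+ \mid D) = 0.95$, specificity $P(- \mid \neg D) = 0.95$.
$P(D \mid +) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.059} \approx 0.16$.
Even with a 95% accurate test, only 16% of positives have the disease. This is the base rate fallacy, and the reason rare-event detection is hard in ML.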
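The calculation is short enough to wrap in a function and rerun under different assumptions. The inputs below (1% prevalence, 95% sensitivity, 95% specificity) are assumed values, chosen because they reproduce the roughly 16% figure quoted above. The function name is illustrative.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                     # P(+ | D) P(D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)    # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)

# Assumed inputs: 1% prevalence, 95% sensitivity, 95% specificity.
print(round(posterior_given_positive(0.01, 0.95, 0.95), 3))   # 0.161

# Same test, varying the base rate.
for prevalence in (0.001, 0.01, 0.1, 0.5):
    print(prevalence, round(posterior_given_positive(prevalence, 0.95, 0.95), 3))
```

Sweeping the prevalence shows that the posterior is driven mostly by the base rate: at 50% prevalence the same test gives a posterior equal to its sensitivity.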
Bayesian update
Prior $P(\theta)$ + likelihood $P(D \mid \theta)$ → posterior $P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$. Sequential data: posterior becomes prior for next observation.
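A standard illustration of sequential updating is the Beta-Bernoulli model, where the posterior stays in the Beta family and each observation just increments one parameter. This is a sketch under that assumed conjugate model, with illustrative function names and a uniform Beta(1, 1) prior as the default.

```python
def update_beta(alpha, beta, observation):
    """One Bayesian update of a Beta(alpha, beta) prior on a coin's bias after a 0/1 observation."""
    if observation == 1:
        return alpha + 1, beta
    return alpha, beta + 1

def sequential_posterior(observations, alpha=1.0, beta=1.0):
    """Feed observations one at a time; each posterior becomes the next prior."""
    for obs in observations:
        alpha, beta = update_beta(alpha, beta, obs)
    return alpha, beta

alpha, beta = sequential_posterior([1, 0, 1, 1])
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, posterior_mean)
```

The result depends only on the counts of ones and zeros, not on their order, so processing the data in one batch gives the same posterior.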
7. Joint, marginal, conditional
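For two RVs $X, Y$ with joint distribution:
- Joint PMF/PDF: $p(x, y)$.
- Marginal: $p(x) = \sum_y p(x, y)$ or $p(x) = \int p(x, y)\,dy$.
- Conditional: $p(x \mid y) = p(x, y) / p(y)$.
Independence: $p(x, y) = p(x)\,p(y)$ for all $x, y$.
Conditional independence: $X \perp Y \mid Z$ iff $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$. Different from unconditional independence: neither implies the other.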
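The gap between independence and conditional independence shows up in a small common-cause model, where two variables are dependent marginally but independent once the shared cause is known. The model below (a fair coin `z` that sets the success probability of `x` and `y`) is a hypothetical example with made-up probabilities.

```python
from fractions import Fraction
from itertools import product

def p_joint(x, y, z):
    """Hypothetical common-cause model: Z ~ Bernoulli(1/2); given Z, X and Y are
    independent Bernoulli draws with success probability 9/10 if Z = 1, else 1/10."""
    q = Fraction(9, 10) if z == 1 else Fraction(1, 10)
    px = q if x == 1 else 1 - q
    py = q if y == 1 else 1 - q
    return Fraction(1, 2) * px * py

def marginal(**fixed):
    """Sum the joint over every variable not pinned in `fixed` (keys: 'x', 'y', 'z')."""
    total = Fraction(0)
    for x, y, z in product((0, 1), repeat=3):
        point = {"x": x, "y": y, "z": z}
        if all(point[name] == value for name, value in fixed.items()):
            total += p_joint(x, y, z)
    return total

# Marginally dependent: P(X=1, Y=1) differs from P(X=1) P(Y=1).
p_xy = marginal(x=1, y=1)
p_x, p_y = marginal(x=1), marginal(y=1)

# Conditionally independent given Z = 1.
p_z = marginal(z=1)
p_xy_given_z = marginal(x=1, y=1, z=1) / p_z
p_x_given_z = marginal(x=1, z=1) / p_z
p_y_given_z = marginal(y=1, z=1) / p_z

print(p_xy, p_x * p_y)                          # 41/100 vs 1/4
print(p_xy_given_z, p_x_given_z * p_y_given_z)  # 81/100 vs 81/100
```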
8. Common interview gotchas
| Question | Common wrong answer | Right answer |
|---|---|---|
| Cov = 0 implies independence? | Yes | Only for jointly Gaussian; in general no |
| CLT requires iid? | Yes (strict) | Identical distribution can be relaxed (Lindeberg conditions); finite variance critical |
| $E[XY] = E[X]\,E[Y]$ implies independence? | Yes | Only that they're uncorrelated |
| Variance of a sum = sum of variances? | Yes (always) | Yes for independent (or uncorrelated) RVs with finite variance; for dependent RVs must include $2\,\mathrm{Cov}(X, Y)$ |
| Which distributions are memoryless? | Exponential only | Geometric (discrete) and exponential (continuous); these are the only memoryless distributions |
| Bayes' theorem requires prior to be informative? | Yes | No — uniform prior is fine; Bayes is about belief update |
| If $X, Y$ are jointly Gaussian, is $X + Y$ Gaussian? | Maybe | Yes: affine transformations of Gaussians are Gaussian |
9. Eight most-asked interview questions
- Walk through Bayes' theorem with the medical-testing example. (Lock down the base-rate-fallacy intuition.)
- Derive CLT informally. (Sum of $n$ zero-mean iid RVs scaled by $1/\sqrt{n}$ converges to Gaussian; characteristic function argument.)
- State the law of total expectation and law of total variance. (Tower property; bias-variance decomposition uses this.)
- What's the difference between uncorrelated and independent? (Cov = 0 vs joint = product of marginals; Gaussian is the special case where they coincide.)
- How do you sample from a non-uniform distribution? (Inverse CDF, rejection sampling, MCMC; understand each.)
- Compute the marginal of a 2D Gaussian. (Marginalize one variable; result is Gaussian with the corresponding marginal mean and variance.)
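- Explain conditional independence. (Different from independence; canonical examples: effects of a common cause are dependent but become independent given the cause, while causes of a common effect are independent but become dependent given the effect.)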
- What does Poisson approximate? (Binomial with $n$ large, $p$ small, $np$ fixed; rare events.)
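For the sampling question above, inverse-CDF sampling is the simplest of the three methods to sketch: draw a uniform $u$ and solve $F(x) = u$. The example below does this for the exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form. The function name and seed are illustrative.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-CDF sampling: solve F(x) = 1 - exp(-rate * x) = u for x."""
    u = rng.random()                    # u ~ Uniform[0, 1), so 1 - u lies in (0, 1]
    return -math.log(1.0 - u) / rate    # x = F^{-1}(u)

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)   # should be close to 1 / rate = 0.5
```

The method needs an invertible CDF. When the inverse is not available in closed form, rejection sampling or MCMC are the usual fallbacks.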
10. Drill plan
- For each common distribution: PMF/PDF, mean, variance, generating story. 1 minute each.
- Bayes' problem: medical test → recompute given different prevalence/sensitivity. Until automatic.
- Derive Gaussian conditional formula from the joint density manipulation.
- Practice 5 problems where you apply the law of total variance.
- Compute Var(sum) and Var(mean) for iid and non-iid cases.
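For the last drill, the variance of a sample mean follows directly from the covariance matrix: it is the sum of all entries divided by $n^2$. The sketch below compares the iid case with an equicorrelated case. The helper names and the numbers are illustrative.

```python
def variance_of_mean(cov):
    """Var of the mean of n RVs given their n x n covariance matrix (list of lists).

    Var(sum) is the sum of all covariance entries; dividing by n^2 gives Var(mean).
    """
    n = len(cov)
    return sum(sum(row) for row in cov) / n ** 2

def equicorrelated_cov(n, sigma2, rho):
    """Covariance with Var = sigma2 on the diagonal and Cov = rho * sigma2 off it."""
    return [[sigma2 if i == j else rho * sigma2 for j in range(n)] for i in range(n)]

n, sigma2 = 4, 2.0
iid_var = variance_of_mean(equicorrelated_cov(n, sigma2, rho=0.0))    # sigma2 / n
corr_var = variance_of_mean(equicorrelated_cov(n, sigma2, rho=0.5))   # sigma2 / n * (1 + (n - 1) * rho)
print(iid_var, corr_var)
```

With positive correlation the variance of the mean no longer shrinks to zero as $n$ grows; it approaches $\rho\,\sigma^2$, which is why averaging correlated estimators helps less than averaging independent ones.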
11. Further reading
- Casella & Berger, Statistical Inference, ch. 1–4.
- Wasserman, All of Statistics, ch. 1–4.
- Pitman, Probability — friendly introduction.
- 3blue1brown, "But what is the Central Limit Theorem?" — beautiful visual intuition.