Probability for ML — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

Probability is the substrate of ML. Senior interviews use probability to test whether you understand uncertainty, can do clean derivations, and can apply Bayesian reasoning under pressure. This deep dive nails the foundations.

1. Probability axioms and basic identities

A probability measure is a function $P$ on the events (subsets) of a sample space $Ω$ satisfying:

$P (Ω) = 1$ , $P (\emptyset) = 0$ .
$P (A) \in [0, 1]$ for any event $A$ .
Countable additivity: $P (⋃ A_{i}) = \sum P (A_{i})$ for disjoint $A_{i}$ .

Identities to know cold

Complement: $P (A^{c}) = 1 - P (A)$ .
Inclusion-exclusion: $P (A \cup B) = P (A) + P (B) - P (A \cap B)$ .
Union bound: $P (⋃ A_{i}) \leq \sum P (A_{i})$ .
Conditional: $P (A ∣ B) = P (A \cap B) / P (B)$ .
Multiplication: $P (A \cap B) = P (A ∣ B) P (B)$ .
Independence: $P (A \cap B) = P (A) P (B)$ iff $A, B$ independent.
Law of total probability: $P (A) = \sum_{i} P (A ∣ B_{i}) P (B_{i})$ for a partition $\{B_{i}\}$ .
Bayes' theorem: $P (A ∣ B) = P (B ∣ A) P (A) / P (B)$ .
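The identities above can be sanity-checked by brute-force enumeration on a small sample space. The sketch below assumes a fair six-sided die and two illustrative events (`A` = even roll, `B` = roll greater than 3); these events and the helper names `prob` and `cond` are not from the text, they are only chosen to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Illustrative sample space: one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

def cond(a, b):
    """Conditional probability P(a | b) = P(a and b) / P(b)."""
    return prob(a & b) / prob(b)

A = {2, 4, 6}      # even roll (assumed example event)
B = {4, 5, 6}      # roll greater than 3 (assumed example event)
Bc = omega - B     # complement of B

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
lhs_ie = prob(A | B)
rhs_ie = prob(A) + prob(B) - prob(A & B)

# Law of total probability over the partition {B, Bc}
total = cond(A, B) * prob(B) + cond(A, Bc) * prob(Bc)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
bayes = cond(B, A) * prob(A) / prob(B)

print(lhs_ie, rhs_ie, total, bayes)
```

Using `Fraction` keeps every comparison exact, so the identities hold with `==` rather than up to floating-point tolerance.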

2. Random variables, expectations, variance

A random variable $X$ is a function $Ω \to \mathbb{R}$ with an induced distribution.

PMF/PDF: $p_{X} (x)$ for discrete; $f_{X} (x)$ for continuous.

CDF: $F_{X} (x) = P (X \leq x)$ .

Expectation

$E [X] = \sum_{x} x p (x)$ (discrete) or $\int x f (x) d x$ (continuous)

Linearity (always — even for dependent RVs):

$E [a X + bY] = a E [X] + b E [Y]$

Law of the unconscious statistician: $E [g (X)] = \sum_{x} g (x) p (x)$ (or $\int g (x) f (x) d x$ in the continuous case).

Variance and covariance

$Var (X) = E [(X - μ)^{2}] = E [X^{2}] - E [X]^{2}$

$Cov (X, Y) = E [(X - μ_{X}) (Y - μ_{Y})] = E [X Y] - E [X] E [Y]$

Variance of a sum:

$Var (a X + bY) = a^{2} Var (X) + b^{2} Var (Y) + 2 ab Cov (X, Y)$

For independent $X, Y$ : $Cov = 0$ → variance adds. (Note: $Cov = 0$ does NOT imply independence in general, only for jointly Gaussian.)
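A standard counterexample makes the note concrete. The sketch below assumes $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^{2}$ (an illustrative choice, not from the text): $Y$ is fully determined by $X$, yet the covariance is exactly zero.

```python
from fractions import Fraction

# Assumed example: X uniform on {-1, 0, 1}, Y = X**2 (a deterministic function of X).
xs = [-1, 0, 1]
p = Fraction(1, 3)

E_X = sum(p * x for x in xs)
E_Y = sum(p * x**2 for x in xs)
E_XY = sum(p * x * x**2 for x in xs)   # E[X * Y] = E[X^3]
cov = E_XY - E_X * E_Y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = p          # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p    # P(X=0) * P(Y=0) = 1/3 * 1/3 = 1/9

print(cov, p_joint, p_product)
```

The covariance vanishes because the distribution of $X$ is symmetric around zero, while the joint probability clearly does not factorize.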

Conditional expectation and variance

Tower (law of total expectation):

$E [X] = E [E [X ∣ Y]]$

Law of total variance:

$Var (X) = E [Var (X ∣ Y)] + Var (E [X ∣ Y])$

These are constantly useful in ML problems involving hierarchical or latent models (e.g., bias-variance decomposition arguments).
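To see both laws at work, here is a minimal sketch on a hypothetical two-level model: a latent $Y$ picks one of two components, then $X$ is drawn from that component. All probabilities and support points are made up for illustration; the point is that the marginal mean and variance of $X$ match the tower and total-variance decompositions exactly.

```python
from fractions import Fraction as F

# Hypothetical latent-variable model (all numbers are illustrative).
p_y = {0: F(3, 4), 1: F(1, 4)}
x_given_y = {
    0: {0: F(1, 2), 2: F(1, 2)},   # X | Y=0 is uniform on {0, 2}
    1: {4: F(1, 2), 8: F(1, 2)},   # X | Y=1 is uniform on {4, 8}
}

def mean(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def var(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

# Marginal of X, obtained by summing the joint over Y.
marginal = {}
for y, py in p_y.items():
    for x, px in x_given_y[y].items():
        marginal[x] = marginal.get(x, 0) + py * px

cond_means = {y: mean(d) for y, d in x_given_y.items()}
cond_vars = {y: var(d) for y, d in x_given_y.items()}

# Tower property: E[X] = E[E[X | Y]]
grand_mean = sum(p_y[y] * cond_means[y] for y in p_y)

# Law of total variance: Var(X) = E[Var(X | Y)] + Var(E[X | Y])
within = sum(p_y[y] * cond_vars[y] for y in p_y)
between = sum(p_y[y] * (cond_means[y] - grand_mean) ** 2 for y in p_y)

print(mean(marginal), grand_mean)
print(var(marginal), within + between)
```

The `within` term is the average noise inside each component and the `between` term is the spread of the component means, which is the same split used in mixture-model and bias-variance arguments.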

3. Common distributions — what to know

Distribution	PMF/PDF	Mean	Variance	When
Bernoulli( $p$ )	$p^{x} (1 - p)^{1 - x}$	$p$	$p (1 - p)$	Binary outcome
Binomial( $n, p$ )	$\binom{n}{x} p^{x} (1 - p)^{n - x}$	$n p$	$n p (1 - p)$	Sum of Bernoullis
Geometric( $p$ )	$(1 - p)^{x - 1} p$	$1/ p$	$(1 - p) / p^{2}$	Trials until first success
Poisson( $λ$ )	$λ^{x} e^{- λ} / x!$	$λ$	$λ$	Rare events, count data
Uniform( $a, b$ )	$1/ (b - a)$	$(a + b) /2$	$(b - a)^{2} /12$	No info, bounded
Normal( $μ, σ^{2}$ )	$\frac{1}{σ \sqrt{2 π}} e^{- (x - μ)^{2} / (2 σ^{2})}$	$μ$	$σ^{2}$	CLT, continuous
Exponential( $λ$ )	$λ e^{- λ x}$	$1/ λ$	$1/ λ^{2}$	Time to event, memoryless
Gamma( $k, θ$ )	$\propto x^{k - 1} e^{- x / θ}$	$k θ$	$k θ^{2}$	Sum of exponentials
Beta( $α, β$ )	$\propto x^{α - 1} (1 - x)^{β - 1}$	$α / (α + β)$	$\frac{α β}{(α + β)^{2} (α + β + 1)}$	Probability of probability

Key relationships

Sum of $n$ iid Bernoulli( $p$ ) → Binomial( $n, p$ ).
Limit of Binomial( $n, p$ ) as $n \to \infty$ with $n p = λ$ fixed → Poisson( $λ$ ).
Sum of independent Poissons → Poisson with summed rate.
Sum of $k$ iid Exponential( $λ$ ) → Gamma( $k, 1/ λ$ ).
$χ_{k}^{2}$ = sum of $k$ squared standard normals.
t-distribution: ratio of a standard normal to $\sqrt{χ_{k}^{2} / k}$ , with numerator and denominator independent.
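The Binomial-to-Poisson limit in the list above is easy to check numerically. The sketch below assumes an illustrative rate $λ = 3$ and compares the two PMFs on $x = 0, \dots, 10$ as $n$ grows with $p = λ / n$; the helper names are hypothetical.

```python
import math

def binom_pmf(x, n, p):
    """Binomial(n, p) probability of exactly x successes."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson(lam) probability of exactly x events."""
    return lam**x * math.exp(-lam) / math.factorial(x)

lam = 3.0  # illustrative rate, not from the text

def max_gap(n, xs=range(11)):
    """Largest pointwise PMF difference between Binomial(n, lam/n) and Poisson(lam)."""
    return max(abs(binom_pmf(x, n, lam / n) - poisson_pmf(x, lam)) for x in xs)

gaps = {n: max_gap(n) for n in (10, 100, 1000, 10000)}
for n, g in gaps.items():
    print(n, g)
```

The gap shrinks roughly in proportion to $1 / n$, which is why the Poisson approximation is reasonable whenever $n$ is large and $p$ is small.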

4. The Gaussian — workhorse distribution

PDF: $f (x) = \frac{1}{σ \sqrt{2 π}} e^{- (x - μ)^{2} / (2 σ^{2})}$ .

Multivariate Gaussian

$f (x) = \frac{1}{(2 π)^{d /2} |Σ|^{1/2}} \exp \left( - \frac{1}{2} (x - μ)^{⊤} Σ^{- 1} (x - μ) \right)$

Properties to memorize

Affine transformations: if $X \sim N (μ, Σ)$ , then $A X + b \sim N (A μ + b, A Σ A^{⊤})$ .
Marginals are Gaussian: any marginal of a multivariate Gaussian is Gaussian.
Conditionals are Gaussian: $X ∣ Y$ where $(X, Y)$ is jointly Gaussian is Gaussian, with mean and variance computable in closed form.
Sum of independent Gaussians is Gaussian.
Uncorrelated jointly Gaussian = independent. (Special property — does NOT hold in general.)

Conditioning formula

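For $\begin{pmatrix} X_{1} \\ X_{2} \end{pmatrix} \sim N \left( \begin{pmatrix} μ_{1} \\ μ_{2} \end{pmatrix}, \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix} \right)$ :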

$X_{1} ∣ X_{2} = x_{2} \sim N (μ_{1} + Σ_{12} Σ_{22}^{- 1} (x_{2} - μ_{2}), Σ_{11} - Σ_{12} Σ_{22}^{- 1} Σ_{21})$

This is the foundation of Gaussian processes, Kalman filters, Bayesian linear regression, and many other methods.
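For intuition, the conditioning formula above reduces to two lines of arithmetic when $X_{1}$ and $X_{2}$ are scalars. The sketch below covers only that bivariate case; the function name `condition_bivariate` and the example numbers are illustrative assumptions, and the general case would replace the division by a linear solve against $Σ_{22}$.

```python
def condition_bivariate(mu1, mu2, s11, s12, s22, x2):
    """Mean and variance of X1 | X2 = x2 for a bivariate Gaussian (scalar blocks)."""
    gain = s12 / s22                      # Sigma_12 Sigma_22^{-1}
    cond_mean = mu1 + gain * (x2 - mu2)   # shift the prior mean toward the observation
    cond_var = s11 - gain * s12           # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    return cond_mean, cond_var

# Illustrative numbers: unit variances, correlation 0.8, observe x2 = 2.
m, v = condition_bivariate(mu1=0.0, mu2=0.0, s11=1.0, s12=0.8, s22=1.0, x2=2.0)
print(m, v)   # mean is 1.6, variance is about 0.36

# With zero covariance the observation is uninformative: the marginal is returned.
m0, v0 = condition_bivariate(mu1=1.0, mu2=0.0, s11=2.0, s12=0.0, s22=1.0, x2=5.0)
print(m0, v0)
```

Note that the conditional variance does not depend on the observed value `x2`, only on the covariance structure; this is the property Kalman filters and Gaussian processes rely on.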

5. Convergence and limit theorems

Law of large numbers (LLN)

For iid $X_{i}$ with finite mean $μ$ :

Weak LLN: $\bar{X}_{n} \xrightarrow{p} μ$ (convergence in probability).
Strong LLN: $\bar{X}_{n} \xrightarrow{a.s.} μ$ (almost-sure convergence).

The empirical mean converges to the true mean. This is why Monte Carlo estimation works.
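A small simulation shows the effect. The sketch assumes draws from Uniform(0, 1), whose true mean is 0.5, and a fixed seed for reproducibility; both choices are illustrative.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def mean_error(n):
    """|sample mean - true mean| for n draws from Uniform(0, 1), whose mean is 0.5."""
    total = sum(random.random() for _ in range(n))
    return abs(total / n - 0.5)

errors = {n: mean_error(n) for n in (100, 10_000, 200_000)}
for n, err in errors.items():
    print(n, err)
```

The error typically shrinks like $1 / \sqrt{n}$, so each extra digit of Monte Carlo accuracy costs roughly 100 times more samples. A single run is not guaranteed to be monotone in $n$.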

Central limit theorem (CLT)

For iid $X_{i}$ with mean $μ$ and finite variance $σ^{2}$ :

$\sqrt{n} (\bar{X}_{n} - μ) \xrightarrow{d} N (0, σ^{2})$

The sample mean is approximately Gaussian for large $n$ , regardless of underlying distribution. This is why so many statistical tests assume normality of the sample mean.

Caveats:

Need finite variance — fails for heavy-tailed distributions like Cauchy.
Convergence rate depends on third moment; very skewed distributions need larger $n$ .
For small samples with $σ$ estimated from the data, use the $t$ -distribution instead of the normal for inference (exact when the data themselves are Gaussian).
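The CLT can be checked empirically on a skewed distribution. The sketch below assumes Exponential(1) draws (mean and standard deviation both 1), standardizes the sample mean as $\sqrt{n} (\bar{X}_{n} - μ) / σ$, and measures how often it lands in $[-1.96, 1.96]$; the sample sizes and seed are arbitrary illustrative choices.

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible

def standardized_mean(n, rate=1.0):
    """sqrt(n) * (sample mean - mu) / sigma for n draws from Exponential(rate)."""
    mu = sigma = 1.0 / rate   # Exponential(rate) has mean and std both equal to 1/rate
    xbar = sum(random.expovariate(rate) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

n, trials = 100, 4000
zs = [standardized_mean(n) for _ in range(trials)]

# If the standardized mean is close to N(0, 1), about 95% of values fall in [-1.96, 1.96].
coverage = sum(abs(z) <= 1.96 for z in zs) / trials
print(coverage)
```

Rerunning with a much smaller `n` (say 5) shows the skew of the exponential leaking into the distribution of the mean, which is the third-moment caveat in practice.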

6. Bayes' theorem — key applications

Naive Bayes classifier

$P (C ∣ x) \propto P (x ∣ C) P (C) = P (C) \prod_{j} P (x_{j} ∣ C)$

The "naive" assumption: features independent given class. Surprisingly competitive baseline.
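As a minimal sketch of the scoring rule above, the code below evaluates a Bernoulli naive Bayes model in log space. The class names, priors, and per-feature likelihoods are hypothetical numbers chosen for illustration; a real model would estimate them from data.

```python
import math

# Hypothetical class priors and per-feature likelihoods P(x_j = 1 | C).
priors = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": [0.8, 0.1, 0.7],
    "ham": [0.2, 0.5, 0.3],
}

def log_score(x, c):
    """log P(C) + sum_j log P(x_j | C) for a binary feature vector x."""
    s = math.log(priors[c])
    for xj, pj in zip(x, likelihood[c]):
        s += math.log(pj if xj == 1 else 1.0 - pj)
    return s

def posterior(x):
    """Normalize the per-class scores into P(C | x)."""
    logs = {c: log_score(x, c) for c in priors}
    m = max(logs.values())   # subtract the max before exponentiating, for stability
    unnorm = {c: math.exp(v - m) for c, v in logs.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

post = posterior([1, 0, 1])
print(post)
```

Working with log-probabilities avoids underflow when the product runs over many features, which is the usual reason implementations sum logs instead of multiplying probabilities.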

Medical testing (the canonical interview question)

Disease prevalence $P (D) = 0.01$ . Test sensitivity $P (+ ∣ D) = 0.95$ , specificity $P (- ∣ D^{c}) = 0.95$ .

$P (D ∣ +) = \frac{P ( + ∣ D ) P ( D )}{P ( + )} = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.16$

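Even with a test that is 95% sensitive and 95% specific, only about 16% of positives have the disease, because the 1% of true cases is swamped by false positives from the 99% who are healthy. This is the base rate fallacy, and the reason rare-event detection is hard in ML.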
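The calculation generalizes to any prevalence, sensitivity, and specificity. The helper below is a small sketch (the function name is hypothetical) that reproduces the example above and then reruns it with an assumed 20% prevalence to show how strongly the base rate drives the answer.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(D | +) via Bayes' theorem, with P(+) expanded by the law of total probability."""
    true_pos = sensitivity * prevalence                    # P(+ | D) P(D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)

base = posterior_positive(0.01, 0.95, 0.95)     # the example above, about 0.16
common = posterior_positive(0.20, 0.95, 0.95)   # hypothetical: 20% prevalence
print(round(base, 3), round(common, 3))
```

Same test, very different posterior: the only thing that changed is the prior.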

Bayesian update

Prior $p (θ)$ + likelihood $p (x ∣ θ)$ → posterior $p (θ ∣ x) \propto p (x ∣ θ) p (θ)$ . Sequential data: posterior becomes prior for next observation.
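One concrete instance of sequential updating is the Beta-Bernoulli pair, where a Beta prior on a coin's bias stays Beta after each flip. This conjugate model is brought in here as an illustration and is not spelled out in the text; the data and the uniform Beta(1, 1) prior are assumptions.

```python
from fractions import Fraction

def update(alpha, beta, x):
    """One Bayesian update of a Beta(alpha, beta) prior after a Bernoulli outcome x."""
    return (alpha + 1, beta) if x == 1 else (alpha, beta + 1)

data = [1, 0, 1, 1, 0, 1]   # illustrative coin flips (1 = success)
a, b = 1, 1                 # Beta(1, 1) is the uniform prior

# Sequential: each posterior becomes the prior for the next observation.
for x in data:
    a, b = update(a, b, x)

# Batch: process all the data at once; the result should be identical.
batch = (1 + sum(data), 1 + len(data) - sum(data))

posterior_mean = Fraction(a, a + b)
print((a, b), batch, posterior_mean)
```

The sequential and batch posteriors agree because the likelihood factorizes over observations, so the order of the flips does not matter.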

7. Joint, marginal, conditional

For two RVs $X, Y$ with joint distribution:

Joint PMF/PDF: $p_{X, Y} (x, y)$ .
Marginal: $p_{X} (x) = \sum_{y} p_{X, Y} (x, y)$ or $\int p_{X, Y} (x, y) d y$ .
Conditional: $p_{X ∣ Y} (x ∣ y) = p_{X, Y} (x, y) / p_{Y} (y)$ .

Independence: $p_{X, Y} (x, y) = p_{X} (x) p_{Y} (y)$ .

Conditional independence: $X ⊥ Y ∣ Z$ iff $p (x, y ∣ z) = p (x ∣ z) p (y ∣ z)$ . Different from unconditional independence.
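The difference is easiest to see by enumeration. The sketch below assumes two independent fair coins $X, Y$ and a common effect $Z = X \oplus Y$ (an illustrative construction): $X$ and $Y$ are independent marginally, but not once $Z$ is observed.

```python
from fractions import Fraction
from itertools import product

# X, Y: independent fair coins (the "causes"); Z = X xor Y (the "common effect").
joint = {(x, y, x ^ y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

def p(**fixed):
    """Probability that the named variables (x, y, z) take the given values."""
    idx = {"x": 0, "y": 1, "z": 2}
    return sum(
        pr for outcome, pr in joint.items()
        if all(outcome[idx[k]] == v for k, v in fixed.items())
    )

# Marginal independence: P(X=1, Y=1) == P(X=1) P(Y=1).
marginally_independent = p(x=1, y=1) == p(x=1) * p(y=1)

# Conditional on Z = 0: compare P(X=1, Y=1 | Z=0) with P(X=1 | Z=0) P(Y=1 | Z=0).
pz = p(z=0)
lhs = p(x=1, y=1, z=0) / pz
rhs = (p(x=1, z=0) / pz) * (p(y=1, z=0) / pz)
conditionally_independent = lhs == rhs

print(marginally_independent, conditionally_independent)
```

Knowing $Z = 0$ forces $X = Y$, so learning one coin reveals the other. The reverse pattern (dependent marginally, independent given $Z$) arises when $Z$ is a common cause.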

8. Common interview gotchas

Question	Common wrong answer	Right answer
Cov = 0 implies independence?	Yes	Only for jointly Gaussian; in general no
CLT requires iid?	Yes (strict)	Identical distribution can be relaxed (Lindeberg conditions); finite variance critical
$E [X Y] = E [X] E [Y]$ implies independence?	Yes	Only that they're uncorrelated
Variance of a sum = sum of variances?	Yes (always)	Only if the terms are uncorrelated (e.g., independent) and the variances exist; otherwise include the $2 Cov$ terms
Memoryless property: geometric and exponential only?	No	Yes: geometric (discrete) and exponential (continuous) are the only memoryless distributions
Bayes' theorem requires prior to be informative?	Yes	No — uniform prior is fine; Bayes is about belief update
If $X, Y$ are jointly Gaussian, $X - Y$ is Gaussian?	Maybe	Yes — affine transformations of Gaussians are Gaussian

9. Eight most-asked interview questions

Walk through Bayes' theorem with the medical-testing example. (Lock down the base-rate-fallacy intuition.)
Derive CLT informally. (Sum of zero-mean RVs scaled by $1/ \sqrt{n}$ converges to Gaussian; characteristic function argument.)
State the law of total expectation and law of total variance. (Tower property; bias-variance decomposition uses this.)
What's the difference between uncorrelated and independent? (Cov = 0 vs joint = product of marginals; Gaussian is the special case where they coincide.)
How do you sample from a non-uniform distribution? (Inverse CDF, rejection sampling, MCMC; understand each.)
Compute the marginal of a 2D Gaussian. (Marginalize one variable; result is Gaussian with the corresponding marginal mean and variance.)
Explain conditional independence. (Different from independence; canonical example: two independent causes of a common effect become dependent once the effect is observed.)
What does Poisson approximate? (Binomial with $n$ large, $p$ small, $n p = λ$ fixed; rare events.)
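For the sampling question in the list above, inverse-CDF sampling is the simplest method to demonstrate. The sketch below assumes an Exponential target with an illustrative rate of 2, for which the CDF $F (x) = 1 - e^{- λ x}$ inverts in closed form; rejection sampling and MCMC are needed when no such inverse is available.

```python
import math
import random

random.seed(2)  # fixed seed so the run is reproducible

def sample_exponential(rate):
    """Inverse-CDF sampling: draw u ~ Uniform[0, 1) and solve u = 1 - exp(-rate * x)."""
    u = random.random()                # u lies in [0, 1)
    return -math.log(1.0 - u) / rate   # 1 - u lies in (0, 1], so the log is finite

rate = 2.0   # illustrative rate; the true mean is 1 / rate = 0.5
draws = [sample_exponential(rate) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)
```

Using `1.0 - u` rather than `u` inside the logarithm avoids `log(0)`, since `random.random()` can return exactly 0 but never 1.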

10. Drill plan

For each common distribution: PMF/PDF, mean, variance, generating story. 1 minute each.
Bayes' problem: medical test → recompute given different prevalence/sensitivity. Until automatic.
Derive the Gaussian conditioning formula by manipulating the joint density (complete the square in $x_{1}$ ).
Practice 5 problems where you apply the law of total variance.
Compute Var(sum) and Var(mean) for iid and non-iid cases.

11. Further reading

Casella & Berger, Statistical Inference, ch. 1–4.
Wasserman, All of Statistics, ch. 1–4.
Pitman, Probability — friendly introduction.
3blue1brown, "But what is the Central Limit Theorem?" — beautiful visual intuition.
