Topic 18: Distribution Classification
🔥 For interviews, read these first:
- DISTRIBUTIONS_DEEP_DIVE.md: frontier-lab deep dive covering which distribution to use for which data type, exponential family unification, GLMs and canonical links, heavy-tailed distributions, and common pitfalls.
- INTERVIEW_GRILL.md: 60 active-recall questions.
What You'll Learn
This topic teaches you:
- Given a new value and samples from two distributions, decide which distribution more plausibly generated it
- Likelihood ratio test
- Maximum likelihood estimation
- Simple classification methods
Why We Need This
Interview Importance
- Common question: "Given data, which distribution?"
- Statistical reasoning: shows you can reason with likelihoods, priors, and stated assumptions
- Problem-solving: turns a vague question into a concrete decision rule
Real-World Application
- Data analysis: Understand data distribution
- Model selection: Choose right distribution
- Anomaly detection: Identify outliers
Core Intuition
This is a probabilistic classification problem.
You have:
- two candidate data-generating sources
- one new observation
You want to know which source makes that observation more plausible.
That is why likelihood or posterior comparison is the right mental model, not just mean comparison.
Why Means Are Not Enough
Two distributions can have:
- the same mean
- different variances
- different tail behavior
A point can be closer to one mean but still more likely under another density.
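A small numeric sketch of this point, assuming two Gaussians with hypothetical parameters (a narrow N(0, 1) and a wide N(3, 5²)). The numbers are chosen for illustration only and do not come from the notes above.

```python
from scipy import stats

# Hypothetical parameters, chosen only to illustrate the point.
narrow = stats.norm(loc=0.0, scale=1.0)
wide = stats.norm(loc=3.0, scale=5.0)

x = -3.0  # 3 units from the narrow mean, 6 units from the wide mean

# Mean comparison says "narrow"...
closer_to_narrow_mean = abs(x - narrow.mean()) < abs(x - wide.mean())

# ...but density comparison says "wide", because x sits 3 standard
# deviations out for the narrow Gaussian and only 1.2 for the wide one.
more_likely_under_wide = wide.logpdf(x) > narrow.logpdf(x)

print(closer_to_narrow_mean, more_likely_under_wide)
```

Both flags come out true here, which is the case where a nearest-mean rule and a density rule disagree.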
Problem Statement
Question: You have 2 lists from 2 distributions. You get a new number. How do you determine which distribution it belongs to?
Short Answer: Use a generative classifier.
That means:
- estimate each distribution from its samples
- compute how likely the new value is under each distribution
- include class priors if one distribution is more common
- choose the distribution with the larger posterior score
In symbols:
choose dist1 if p(x | dist1) * P(dist1) > p(x | dist2) * P(dist2)
This is the cleanest interview answer because it shows:
- probabilistic reasoning
- awareness of assumptions
- understanding of priors
Technical Details Interviewers Often Want
Likelihood vs Posterior
If priors are equal, likelihood comparison is enough.
If priors differ, posterior comparison is more correct.
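A minimal sketch of how priors can flip the decision, assuming two unit-variance Gaussians with hypothetical means 0 and 2. The helper name `log_posterior_scores` is introduced here for illustration.

```python
import numpy as np
from scipy import stats


def log_posterior_scores(x, params, priors):
    """Unnormalized log-posterior per class: log p(x | k) + log P(k)."""
    return np.array([
        stats.norm.logpdf(x, mu, sigma) + np.log(prior)
        for (mu, sigma), prior in zip(params, priors)
    ])


params = [(0.0, 1.0), (2.0, 1.0)]  # hypothetical (mean, std) per distribution
x = 1.2                            # slightly closer to the second mean

equal = log_posterior_scores(x, params, priors=[0.5, 0.5])
skewed = log_posterior_scores(x, params, priors=[0.9, 0.1])

pick_equal = int(np.argmax(equal)) + 1    # likelihood alone decides
pick_skewed = int(np.argmax(skewed)) + 1  # the 9:1 prior outweighs the likelihood
print(pick_equal, pick_skewed)
```

With equal priors the second distribution wins on likelihood. With a 9:1 prior in favor of the first, the posterior score favors the first instead.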
Parametric vs Nonparametric
Parametric:
- assume a family like Gaussian
- estimate a few parameters
Nonparametric:
- estimate the density more flexibly with KDE or related methods
How to Answer This in an Interview
Say it in this order:
Case 1: If I Am Comfortable Assuming a Distribution Family
If I assume both arrays come from Gaussians:
- fit mean and variance for each array
- compute Gaussian density of the new value under each fitted distribution
- if priors are equal, choose the larger likelihood
- if priors differ, choose the larger posterior score
Case 2: If I Do Not Want a Parametric Assumption
If I do not want to assume Gaussian:
- estimate density nonparametrically with KDE
- or use a simple nearest-neighbor density intuition in 1D
- then compare the estimated densities
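The nonparametric route can be sketched with `scipy.stats.gaussian_kde`, assuming its default bandwidth rule is acceptable and each array has at least two distinct values. The function name `classify_by_kde` is hypothetical, and ties are sent to dist1 arbitrarily.

```python
import numpy as np
from scipy.stats import gaussian_kde


def classify_by_kde(new_value: float,
                    dist1_samples: np.ndarray,
                    dist2_samples: np.ndarray) -> int:
    """Return 1 or 2 depending on which KDE gives the higher density."""
    # Fit one kernel density estimate per sample array (default bandwidth).
    kde1 = gaussian_kde(dist1_samples)
    kde2 = gaussian_kde(dist2_samples)

    # Evaluating a KDE returns an array; take the single density value.
    density1 = kde1(new_value)[0]
    density2 = kde2(new_value)[0]

    return 1 if density1 >= density2 else 2
```

The KDE bandwidth plays the role that the fitted variance plays in the Gaussian case, so with small samples the answer can be sensitive to it.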
Case 3: If the Distributions Overlap Heavily
Then classification may be ambiguous.
In that case, report:
- predicted class
- posterior/confidence
- the fact that the point lies in an overlapping region
This is a stronger answer than pretending every point can be classified confidently.
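One way to report all three items is sketched below, assuming Gaussian fits. The function name and the `ambiguity_threshold` cutoff of 0.75 are hypothetical choices for illustration, not values from these notes.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp


def classify_with_confidence(new_value, dist1_samples, dist2_samples,
                             prior1=0.5, ambiguity_threshold=0.75):
    """Return predicted label, its posterior, and an overlap flag."""
    # Fit a Gaussian (MLE) to each sample array.
    params = [stats.norm.fit(s) for s in (dist1_samples, dist2_samples)]
    log_priors = np.log([prior1, 1.0 - prior1])

    # Unnormalized log-posteriors, then normalize in log space.
    log_joint = np.array([
        stats.norm.logpdf(new_value, mu, sigma) for mu, sigma in params
    ]) + log_priors
    posterior = np.exp(log_joint - logsumexp(log_joint))

    confidence = float(posterior.max())
    return {
        "label": int(np.argmax(posterior)) + 1,
        "posterior": confidence,
        # Flag points where neither class is clearly favored.
        "ambiguous": confidence < ambiguity_threshold,
    }
```

A point midway between two heavily overlapping distributions should come back with a posterior near 0.5 and the flag set, which is the honest answer in that region.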
Industry-Standard Boilerplate Code
Likelihood Ratio Test
"""
Likelihood Ratio Test: Compare likelihoods
"""
import numpy as np
from scipy import stats


def classify_by_likelihood(new_value: float,
                           dist1_samples: np.ndarray,
                           dist2_samples: np.ndarray) -> int:
    """
    Classify new value using the likelihood ratio (Gaussian assumption)
    Returns: 1 if from dist1, 2 if from dist2 (ties go to dist2)
    """
    # Estimate parameters from samples (np.std defaults to ddof=0, the Gaussian MLE)
    mu1, sigma1 = np.mean(dist1_samples), np.std(dist1_samples)
    mu2, sigma2 = np.mean(dist2_samples), np.std(dist2_samples)

    # Compute log-likelihoods; logs avoid both densities underflowing to 0.0 in the tails
    log_likelihood1 = stats.norm.logpdf(new_value, mu1, sigma1)
    log_likelihood2 = stats.norm.logpdf(new_value, mu2, sigma2)

    # A positive log-likelihood ratio favors dist1
    return 1 if log_likelihood1 - log_likelihood2 > 0 else 2
Maximum Likelihood Classification
"""
Maximum Likelihood: Choose distribution with max likelihood
"""
import numpy as np
from scipy import stats


def classify_by_mle(new_value: float,
                    dist1_samples: np.ndarray,
                    dist2_samples: np.ndarray) -> int:
    """
    Classify using maximum likelihood estimation
    Returns: 1 if from dist1, 2 if from dist2 (ties go to dist2)
    """
    # Fit distributions to samples; norm.fit returns the MLE (loc, scale)
    params1 = stats.norm.fit(dist1_samples)
    params2 = stats.norm.fit(dist2_samples)

    # Compute log-likelihoods of the new value under each fitted Gaussian
    log_likelihood1 = stats.norm.logpdf(new_value, *params1)
    log_likelihood2 = stats.norm.logpdf(new_value, *params2)

    # Return distribution with higher log-likelihood
    return 1 if log_likelihood1 > log_likelihood2 else 2
Bayesian Classification
"""
Bayesian: Use prior probabilities
"""
import numpy as np
from scipy import stats


def classify_bayesian(new_value: float,
                      dist1_samples: np.ndarray,
                      dist2_samples: np.ndarray,
                      prior1: float = 0.5,
                      prior2: float = 0.5) -> int:
    """
    Classify using Bayesian approach
    P(dist|value) ∝ P(value|dist) × P(dist)
    """
    # Estimate parameters
    mu1, sigma1 = np.mean(dist1_samples), np.std(dist1_samples)
    mu2, sigma2 = np.mean(dist2_samples), np.std(dist2_samples)

    # Unnormalized log-posteriors: log-likelihood + log-prior
    log_joint1 = stats.norm.logpdf(new_value, mu1, sigma1) + np.log(prior1)
    log_joint2 = stats.norm.logpdf(new_value, mu2, sigma2) + np.log(prior2)

    # Normalize in log space so a far-tail value cannot cause division by zero
    log_total = np.logaddexp(log_joint1, log_joint2)
    posterior1 = np.exp(log_joint1 - log_total)
    posterior2 = np.exp(log_joint2 - log_total)

    return 1 if posterior1 > posterior2 else 2
Theory
Likelihood Ratio
- Compare p(x | dist1) vs p(x | dist2)
- Choose distribution with higher likelihood
- Equivalent to the Bayesian rule when the priors are equal
Maximum Likelihood
- Fit distributions to samples
- Compute log-likelihood for new value
- Choose distribution with higher log-likelihood
Bayesian Approach
- Use prior probabilities
- Compute posterior: P(dist | x) ∝ p(x | dist) × P(dist)
- Shifts the decision toward the more common distribution; reduces to likelihood comparison when priors are equal
Important Interview Follow-Ups
What assumptions are hidden here?
You are assuming:
- the training arrays are representative samples
- the fitted family is reasonable if you use a parametric model
- the new value is generated from one of those two candidate distributions
What if both distributions have the same mean?
Then variance still matters.
Example:
- one distribution may be narrow and concentrated
- the other may be wide and diffuse
A point near the shared mean may be more likely under the narrow one. A point far from the mean may be more likely under the wide one.
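The point where the decision flips can be worked out directly. The derivation below is a sketch that assumes both distributions are Gaussian with a shared mean \mu, standard deviations \sigma_1 < \sigma_2, and equal priors; these symbols are introduced here for illustration.

```latex
\begin{aligned}
\log p(x \mid \text{dist1}) &> \log p(x \mid \text{dist2}) \\
-\log \sigma_1 - \frac{(x-\mu)^2}{2\sigma_1^2}
  &> -\log \sigma_2 - \frac{(x-\mu)^2}{2\sigma_2^2} \\
(x-\mu)^2 &< \frac{2\,\sigma_1^2 \sigma_2^2}{\sigma_2^2 - \sigma_1^2}\,
  \log\frac{\sigma_2}{\sigma_1}
\end{aligned}
```

So the narrow distribution wins inside a band around the mean and the wide one wins outside it. With \sigma_1 = 1 and \sigma_2 = 2, the band is roughly |x - \mu| < 1.36.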
What if you only have a few samples?
Then parameter estimates are noisy.
Good answer:
- say confidence should be lower
- consider Bayesian priors or bootstrap uncertainty
- avoid overconfident claims
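Bootstrap uncertainty can be sketched as follows, assuming Gaussian fits and equal priors. The function name, `n_boot`, and the choice to skip zero-variance resamples are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats


def bootstrap_vote_fraction(new_value, dist1_samples, dist2_samples,
                            n_boot=500, seed=0):
    """Fraction of bootstrap refits that assign new_value to dist1."""
    rng = np.random.default_rng(seed)
    s1 = np.asarray(dist1_samples, dtype=float)
    s2 = np.asarray(dist2_samples, dtype=float)

    votes_for_1, valid = 0, 0
    for _ in range(n_boot):
        # Resample each array with replacement and refit mean and std.
        b1 = rng.choice(s1, size=s1.size, replace=True)
        b2 = rng.choice(s2, size=s2.size, replace=True)
        if b1.std() == 0 or b2.std() == 0:
            continue  # degenerate resample: the Gaussian fit is undefined
        valid += 1
        ll1 = stats.norm.logpdf(new_value, b1.mean(), b1.std(ddof=1))
        ll2 = stats.norm.logpdf(new_value, b2.mean(), b2.std(ddof=1))
        votes_for_1 += int(ll1 > ll2)

    if valid == 0:
        raise ValueError("every resample had zero variance")
    return votes_for_1 / valid
```

A fraction near 0 or 1 means the decision is stable under resampling. A fraction near 0.5 means the few samples available do not support a confident call.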
What if the distributions are unknown?
Then use:
- KDE
- histogram density estimate
- nearest-neighbor density idea
The principle is still the same: compare estimated densities.
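The nearest-neighbor density idea in 1D can be sketched like this, using the standard k-nearest-neighbor estimate k / (n · 2 · r_k), where r_k is the distance from x to its k-th nearest sample. The function names and the default k = 5 are hypothetical choices.

```python
import numpy as np


def knn_density_1d(x, samples, k=5):
    """k-NN density estimate in 1D: k / (n * width of the interval holding k points)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    k = min(k, n)
    # Distance from x to its k-th nearest sample.
    r_k = np.sort(np.abs(samples - x))[k - 1]
    if r_k == 0:
        return np.inf  # x coincides with at least k samples
    return k / (n * 2.0 * r_k)


def classify_by_knn_density(new_value, dist1_samples, dist2_samples, k=5):
    """Return 1 or 2 for whichever sample array is denser around new_value."""
    d1 = knn_density_1d(new_value, dist1_samples, k)
    d2 = knn_density_1d(new_value, dist2_samples, k)
    return 1 if d1 >= d2 else 2
```

Dividing by n matters when the two arrays have different sizes: without it, the larger array would look denser everywhere.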
Common Failure Modes
- comparing only means instead of densities
- ignoring class priors
- being overconfident when distributions overlap heavily
- assuming Gaussian without stating the assumption
- forgetting how noisy the answer can be with few samples
What to Practice Saying Out Loud
- Why this is really a Bayes classification problem
- Why density comparison is stronger than mean comparison
- Why overlap and priors both matter
Exercises
- Implement likelihood ratio test
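- Compare the likelihood, MLE, and Bayesian classifiers above on the same data
- Test on non-Gaussian data (for example skewed or heavy-tailed samples) and note where the Gaussian assumption breaks
- Handle edge cases such as ties, zero variance, and very few samples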
- Practice the spoken answer for the exact interview question above
Next Steps
- Topic 19: Advanced clustering
- Topic 20: Multi-turn conversations