Topic 18: Distribution Classification

🔥 For interviews, read these first:

DISTRIBUTIONS_DEEP_DIVE.md — frontier-lab deep dive: which distribution to use for which data type, exponential family unification, GLMs and canonical links, heavy-tailed distributions, common pitfalls.

INTERVIEW_GRILL.md — 60 active-recall questions.

What You'll Learn

This topic teaches you:

Given a number, determine which distribution it came from
Likelihood ratio test
Maximum likelihood estimation
Simple classification methods

Why We Need This

Interview Importance

Common question: "Given data, which distribution?"
Statistical reasoning: Shows you can reason about likelihoods, priors, and modeling assumptions
Problem-solving: Practical ML skill

Real-World Application

Data analysis: Understand data distribution
Model selection: Choose right distribution
Anomaly detection: Identify outliers

Core Intuition

This is a probabilistic classification problem.

You have:

two candidate data-generating sources
one new observation

You want to know which source makes that observation more plausible.

That is why likelihood or posterior comparison is the right mental model, not just mean comparison.

Why Means Are Not Enough

Two distributions can have:

the same mean
different variances
different tail behavior

A point can be closer to one mean but still more likely under another density.
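A small numeric check makes this concrete. The sketch below uses two hypothetical Gaussians (the parameters `mu_a`, `sigma_a`, `mu_b`, `sigma_b` and the point `x` are made up for illustration, not taken from any dataset): one wide, one very narrow.

```python
from scipy import stats

# Hypothetical parameters chosen for illustration only
mu_a, sigma_a = 0.0, 3.0   # wide distribution A
mu_b, sigma_b = 2.0, 0.1   # narrow distribution B
x = 1.5

# Distance to each mean: x is 1.5 from A's mean and 0.5 from B's mean
dist_to_a = abs(x - mu_a)
dist_to_b = abs(x - mu_b)

# Density of x under each Gaussian: x is 0.5 standard deviations
# from A's mean but 5 standard deviations from B's mean
density_a = stats.norm.pdf(x, loc=mu_a, scale=sigma_a)
density_b = stats.norm.pdf(x, loc=mu_b, scale=sigma_b)

closer_mean = "A" if dist_to_a < dist_to_b else "B"
more_likely = "A" if density_a > density_b else "B"
print(f"closer mean: {closer_mean}, higher density: {more_likely}")
```

Here the nearest-mean rule picks B while the density comparison picks A, because the variance determines how far from the mean a point can plausibly fall.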

Problem Statement

Question: You have 2 lists from 2 distributions. You get a new number. How do you determine which distribution it belongs to?

Short Answer: Use a generative classifier.

That means:

estimate each distribution from its samples
compute how likely the new value is under each distribution
include class priors if one distribution is more common
choose the distribution with the larger posterior score

In symbols:

choose dist1 if p(x | dist1) * P(dist1) > p(x | dist2) * P(dist2)

This is the cleanest interview answer because it shows:

probabilistic reasoning
awareness of assumptions
understanding of priors
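For reference, the same decision rule can be written in log form, which is how it is usually computed, together with the normalized posterior obtained from Bayes' rule. The notation follows the rule stated above; nothing new is assumed beyond taking logarithms.

```latex
\[
\text{choose dist1} \iff
\log p(x \mid \text{dist1}) + \log P(\text{dist1})
> \log p(x \mid \text{dist2}) + \log P(\text{dist2})
\]
\[
P(\text{dist1} \mid x) =
\frac{p(x \mid \text{dist1})\, P(\text{dist1})}
     {p(x \mid \text{dist1})\, P(\text{dist1}) + p(x \mid \text{dist2})\, P(\text{dist2})}
\]
```

The denominator is the same for both classes, so it affects the reported confidence but not which class wins.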

Technical Details Interviewers Often Want

Likelihood vs Posterior

If priors are equal, likelihood comparison is enough.

If priors differ, compare posteriors instead: likelihood alone ignores how common each distribution is.
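A minimal sketch of how a prior can flip the decision, assuming two unit-variance Gaussians with hypothetical means 0 and 2 and a hypothetical helper `log_posterior_scores` (none of these values come from the text above):

```python
import numpy as np
from scipy import stats

def log_posterior_scores(x, mu1, sigma1, mu2, sigma2, prior1, prior2):
    """Return unnormalized log-posterior scores for dist1 and dist2."""
    score1 = stats.norm.logpdf(x, mu1, sigma1) + np.log(prior1)
    score2 = stats.norm.logpdf(x, mu2, sigma2) + np.log(prior2)
    return score1, score2

x = 0.8  # slightly closer to dist1's mean

# Equal priors: the decision reduces to a likelihood comparison
equal = log_posterior_scores(x, 0.0, 1.0, 2.0, 1.0, prior1=0.5, prior2=0.5)
pick_equal = 1 if equal[0] > equal[1] else 2

# dist2 is nine times more common: the prior outweighs the likelihood gap
skewed = log_posterior_scores(x, 0.0, 1.0, 2.0, 1.0, prior1=0.1, prior2=0.9)
pick_skewed = 1 if skewed[0] > skewed[1] else 2

print(pick_equal, pick_skewed)
```

With equal priors the point goes to dist1; with the 0.1 / 0.9 priors the same point goes to dist2.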

Parametric vs Nonparametric

Parametric:

assume a family like Gaussian
estimate a few parameters

Nonparametric:

estimate the density more flexibly with KDE or related methods
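The trade-off shows up clearly on data that a single Gaussian cannot describe. The sketch below is illustrative only: it draws a synthetic bimodal sample (modes at -3 and 3, an assumption made for the demo) and compares a fitted Gaussian with `scipy.stats.gaussian_kde` at the empty region between the modes.

```python
import numpy as np
from scipy import stats

# Synthetic bimodal sample: two well-separated clusters (illustrative data)
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(-3.0, 0.5, 500),
    rng.normal(3.0, 0.5, 500),
])

# Parametric: a single Gaussian fitted by maximum likelihood
mu, sigma = stats.norm.fit(samples)
gauss_density = stats.norm.pdf(0.0, mu, sigma)

# Nonparametric: Gaussian KDE with SciPy's default bandwidth rule
kde = stats.gaussian_kde(samples)
kde_density = kde([0.0])[0]

# Almost no samples lie near 0, yet the fitted Gaussian peaks there
print(f"Gaussian fit at 0: {gauss_density:.4f}, KDE at 0: {kde_density:.4f}")
```

The fitted Gaussian places its peak between the two clusters, where there is almost no data, while the KDE assigns that region a much lower density. The cost is that KDE needs more samples and a bandwidth choice.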

How to Answer This in an Interview

Say it in this order:

Case 1: If I Am Comfortable Assuming a Distribution Family

If I assume both arrays come from Gaussians:

fit mean and variance for each array
compute Gaussian density of the new value under each fitted distribution
if priors are equal, choose the larger likelihood
if priors differ, choose the larger posterior score

Case 2: If I Do Not Want a Parametric Assumption

If I do not want to assume Gaussian:

estimate density nonparametrically with KDE
or use a simple nearest-neighbor density intuition in 1D: density is high where the nearest samples are close
then compare the estimated densities
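The nearest-neighbor idea can be sketched as follows. This is a rough illustration, not a tuned estimator: the function names, the choice `k=5`, and the synthetic arrays are all assumptions made for the example. It uses the standard 1D k-NN density estimate, k / (n * 2 * r_k), where r_k is the distance from x to its k-th nearest sample.

```python
import numpy as np

def knn_density_1d(x: float, samples: np.ndarray, k: int = 5) -> float:
    """k-NN density estimate in 1D; requires k <= len(samples)."""
    distances = np.sort(np.abs(np.asarray(samples, dtype=float) - x))
    # Guard against a zero radius when x coincides with repeated samples
    r_k = max(distances[k - 1], 1e-12)
    return k / (len(distances) * 2.0 * r_k)

def classify_by_knn_density(x: float, dist1_samples: np.ndarray,
                            dist2_samples: np.ndarray, k: int = 5) -> int:
    """Return 1 or 2, whichever sample set is denser around x."""
    d1 = knn_density_1d(x, dist1_samples, k)
    d2 = knn_density_1d(x, dist2_samples, k)
    return 1 if d1 > d2 else 2

# Synthetic data for illustration
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(5.0, 1.0, 200)
label_near_a = classify_by_knn_density(0.2, a, b)
label_near_b = classify_by_knn_density(4.9, a, b)
print(label_near_a, label_near_b)
```

A KDE from `scipy.stats.gaussian_kde` would slot into the same comparison; the k-NN version is shown because it needs nothing beyond NumPy and is easy to explain on a whiteboard.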

Case 3: If the Distributions Overlap Heavily

Then classification may be ambiguous.

In that case, report:

predicted class
posterior/confidence
the fact that the point lies in an overlapping region

This is a stronger answer than pretending every point can be classified confidently.
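One way to produce such a report is sketched below, assuming Gaussian fits. The function name `classify_with_confidence`, the dictionary keys, and the `ambiguity_threshold` of 0.75 are hypothetical choices for illustration; the right threshold depends on the cost of a wrong call.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def classify_with_confidence(x, dist1_samples, dist2_samples,
                             prior1=0.5, prior2=0.5,
                             ambiguity_threshold=0.75):
    """Return predicted class, its posterior, and an ambiguity flag."""
    mu1, sigma1 = stats.norm.fit(dist1_samples)
    mu2, sigma2 = stats.norm.fit(dist2_samples)

    # Unnormalized log-posteriors: log-likelihood + log-prior
    log_joint = np.array([
        stats.norm.logpdf(x, mu1, sigma1) + np.log(prior1),
        stats.norm.logpdf(x, mu2, sigma2) + np.log(prior2),
    ])
    # Normalize in log space so far-tail points do not produce 0/0
    posteriors = np.exp(log_joint - logsumexp(log_joint))

    best = int(np.argmax(posteriors))
    return {
        "predicted_class": best + 1,
        "posterior": float(posteriors[best]),
        "ambiguous": bool(posteriors[best] < ambiguity_threshold),
    }

# Heavily overlapping synthetic distributions (illustrative data)
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(1.0, 1.0, 500)
midpoint_report = classify_with_confidence(0.5, a, b)
far_report = classify_with_confidence(-3.0, a, b)
print(midpoint_report)
print(far_report)
```

A point halfway between the two means gets a posterior near 0.5 and is flagged as ambiguous, while a point far out on dist1's side is assigned to dist1 with high confidence.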

Industry-Standard Boilerplate Code

Likelihood Ratio Test

"""
Likelihood Ratio Test: Compare likelihoods
"""
import numpy as np
from scipy import stats

def classify_by_likelihood(new_value: float, 
                           dist1_samples: np.ndarray,
                           dist2_samples: np.ndarray) -> int:
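    """
    Classify new value by comparing Gaussian likelihoods
    (assumes both sample sets are roughly Gaussian)
    
    Returns: 1 if from dist1, 2 if from dist2
    """
    # Estimate parameters from samples (np.std defaults to ddof=0, the MLE)
    mu1, sigma1 = np.mean(dist1_samples), np.std(dist1_samples)
    mu2, sigma2 = np.mean(dist2_samples), np.std(dist2_samples)
    
    # Compare log-densities: raw pdf values underflow to 0.0 far in the
    # tails, which would make the comparison meaningless
    log_likelihood1 = stats.norm.logpdf(new_value, mu1, sigma1)
    log_likelihood2 = stats.norm.logpdf(new_value, mu2, sigma2)
    
    # A positive log-likelihood ratio favors dist1; ties go to dist2
    return 1 if log_likelihood1 > log_likelihood2 else 2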

Maximum Likelihood Classification

"""
Maximum Likelihood: Choose distribution with max likelihood
"""
import numpy as np
from scipy import stats

def classify_by_mle(new_value: float,
                    dist1_samples: np.ndarray,
                    dist2_samples: np.ndarray) -> int:
    """
    Classify using maximum likelihood estimation
    """
    # Fit distributions to samples
    params1 = stats.norm.fit(dist1_samples)
    params2 = stats.norm.fit(dist2_samples)
    
    # Compute log-likelihoods
    log_likelihood1 = stats.norm.logpdf(new_value, *params1)
    log_likelihood2 = stats.norm.logpdf(new_value, *params2)
    
    # Return distribution with higher log-likelihood
    return 1 if log_likelihood1 > log_likelihood2 else 2

Bayesian Classification

"""
Bayesian: Use prior probabilities
"""
import numpy as np
from scipy import stats

def classify_bayesian(new_value: float,
                     dist1_samples: np.ndarray,
                     dist2_samples: np.ndarray,
                     prior1: float = 0.5,
                     prior2: float = 0.5) -> int:
    """
    Classify using Bayesian approach
    
    P(dist|value) ∝ P(value|dist) × P(dist)
    """
    # Estimate parameters (np.std defaults to ddof=0, the MLE)
    mu1, sigma1 = np.mean(dist1_samples), np.std(dist1_samples)
    mu2, sigma2 = np.mean(dist2_samples), np.std(dist2_samples)
    
    # Unnormalized log-posteriors: log-likelihood + log-prior.
    # Log space avoids the 0/0 that occurs when both densities
    # underflow to 0.0 for a value far from both means.
    log_posterior1 = stats.norm.logpdf(new_value, mu1, sigma1) + np.log(prior1)
    log_posterior2 = stats.norm.logpdf(new_value, mu2, sigma2) + np.log(prior2)
    
    # Normalizing divides both scores by the same evidence term,
    # so it is not needed to pick the larger posterior
    return 1 if log_posterior1 > log_posterior2 else 2