Topic 33: Information Theory & Probability Metrics

🔥 For interviews, read these first:

INFORMATION_THEORY_DEEP_DIVE.md — frontier-lab interview deep dive: entropy/cross-entropy/KL/MI, forward vs reverse KL (mean-seeking vs mode-seeking), why MLE = forward KL, MI for InfoNCE/CLIP, KL in VAE/RLHF/distillation, f-divergences, Wasserstein, source coding theorem.

INTERVIEW_GRILL.md — 45 active-recall questions.

What You'll Learn

This topic covers essential information theory and probability concepts:

Entropy
Cross-Entropy
KL Divergence
Mutual Information
Gini Impurity
Jensen-Shannon Divergence
Detailed mathematical explanations
Simple implementations
When to use each

Why We Need This

Interview Importance

Common questions: "Explain KL divergence", "What is entropy?"
Mathematical foundation: Shows deep understanding
Practical knowledge: Used in many ML algorithms

Real-World Application

Loss functions: Cross-entropy for classification
Regularization: KL divergence in RLHF, VAEs
Feature selection: Mutual information
Decision trees: Gini impurity, entropy

Core Intuition

Information theory gives a precise language for talking about:

uncertainty
surprise
compression
distribution mismatch

That is why it appears all over ML.

Entropy

Entropy measures uncertainty in a distribution.

High entropy means:

the outcome is hard to predict

Low entropy means:

the distribution is concentrated
the outcome is more predictable

Cross-Entropy

Cross-entropy tells you how costly it is to encode data from a true distribution using a model distribution.

In ML language:

it is a natural loss for probabilistic prediction

KL Divergence

KL divergence measures how one distribution differs from another.

It is especially useful when you want to penalize mismatch from a reference distribution.

Mutual Information

Mutual information measures how much knowing one variable reduces uncertainty about another.

That is why it is useful for:

feature selection
dependence analysis
representation learning

Technical Details Interviewers Often Want

Why Cross-Entropy Appears Everywhere

Cross-entropy is not an arbitrary loss choice.

It arises naturally from maximum likelihood when predicting a probability distribution.

Why KL Is Not a True Distance Metric

This is a classic follow-up.

KL divergence is:

non-negative
zero when distributions match

But it is not symmetric, so it is not a true metric.

Gini vs Entropy

Both measure impurity.

The important interview answer is:

they behave similarly in many tree settings
Gini is often a bit cheaper to compute
entropy has a more explicit information-theoretic interpretation

Common Failure Modes

treating KL divergence like a symmetric distance
memorizing formulas without knowing what uncertainty or divergence mean
not being able to connect cross-entropy to classification loss
using mutual information language without explaining dependence reduction

Edge Cases and Follow-Up Questions

Why is entropy highest for a uniform distribution?
Why is cross-entropy always at least as large as entropy of the true distribution?
Why is KL divergence asymmetric?
Why are Gini and entropy both useful impurity measures?
Why does mutual information become zero under independence?

What to Practice Saying Out Loud

Why entropy is really about uncertainty
Why cross-entropy is a natural loss for probabilistic models
Why KL divergence is useful even though it is not a true metric

Detailed Theory

1. Entropy

What is Entropy? Entropy measures the uncertainty or randomness in a probability distribution. Higher entropy = more uncertainty.

Mathematical Formulation:

H(X) = -Σ p(x) * log₂(p(x))

Where:
- X: Random variable
- p(x): Probability of outcome x
- Sum over all possible outcomes

Interpretation:

High entropy: Uniform distribution (maximum uncertainty)
Low entropy: Concentrated distribution (low uncertainty)
Zero entropy: Deterministic (one outcome has probability 1)

Example:

Fair coin: H = -0.5log₂(0.5) - 0.5log₂(0.5) = 1 bit (maximum uncertainty)
Biased coin (90% heads): H = -0.9log₂(0.9) - 0.1log₂(0.1) ≈ 0.47 bits (less uncertainty)

Properties:

H(X) ≥ 0 (always non-negative)
Maximum when uniform distribution
Minimum (0) when deterministic

Use Cases:

Decision trees: Choose splits that maximize information gain (reduce entropy)
Compression: Entropy is lower bound on average code length
Feature selection: Features with high entropy are more informative

2. Cross-Entropy

What is Cross-Entropy? Cross-entropy measures the average number of bits needed to encode events from distribution P using a code optimized for distribution Q.

Mathematical Formulation:

H(P, Q) = -Σ p(x) * log(q(x))

Where:
- P: True distribution
- Q: Predicted/approximated distribution
- p(x): True probability
- q(x): Predicted probability

Interpretation:

Measures how different Q is from P
Always ≥ H(P) (entropy of true distribution)
Equal to H(P) when Q = P
Used as loss function in classification

Why it's a good loss function:

Penalizes confident wrong predictions heavily
Encourages calibrated probabilities
Mathematically well-founded

Use Cases:

Classification: Cross-entropy loss (most common)
Language modeling: Next token prediction
Any probabilistic prediction: When you have true vs predicted distributions

3. KL Divergence (Kullback-Leibler Divergence)

What is KL Divergence? KL divergence measures how different two probability distributions are. It's the "distance" between distributions (though not a true metric).

Mathematical Formulation:

KL(P || Q) = Σ p(x) * log(p(x) / q(x))
           = Σ p(x) * log(p(x)) - Σ p(x) * log(q(x))
           = H(P, Q) - H(P)

Where:
- P: True/reference distribution
- Q: Approximated distribution

Interpretation:

KL(P || Q) = 0: Distributions are identical
KL(P || Q) > 0: Distributions are different
Not symmetric: KL(P || Q) ≠ KL(Q || P) in general
Not a metric: Doesn't satisfy triangle inequality

Properties:

KL(P || Q) ≥ 0 (always non-negative)
KL(P || Q) = 0 if and only if P = Q
Asymmetric: KL(P || Q) ≠ KL(Q || P)

Use Cases:

RLHF: KL penalty to keep policy close to reference
VAEs: KL divergence between posterior and prior
Model comparison: Compare different models
Regularization: Prevent overfitting

Why asymmetric?

KL(P || Q): "How surprised are we when we expect Q but get P?"
KL(Q || P): "How surprised are we when we expect P but get Q?"
Different interpretations, different values

4. Mutual Information

What is Mutual Information? Mutual information measures how much information one random variable gives about another. It's the reduction in uncertainty about Y when we know X.

Mathematical Formulation:

I(X; Y) = H(X) - H(X | Y)
        = H(Y) - H(Y | X)
        = H(X) + H(Y) - H(X, Y)

Where:
- H(X): Entropy of X
- H(X | Y): Conditional entropy of X given Y
- H(X, Y): Joint entropy

Interpretation:

I(X; Y) = 0: X and Y are independent (no information shared)
I(X; Y) > 0: X and Y are dependent (share information)
I(X; Y) = H(X): X completely determines Y
Symmetric: I(X; Y) = I(Y; X)

Use Cases:

Feature selection: Select features with high mutual information with target
Information bottleneck: Compress while preserving information
Clustering: Measure cluster quality
Dimensionality reduction: Preserve mutual information

Example:

If X and Y are independent: I(X; Y) = 0
If Y = X: I(X; Y) = H(X) (maximum)
If Y = f(X) (deterministic function): I(X; Y) = H(Y)

5. Gini Impurity

What is Gini Impurity? Gini impurity measures the probability of misclassifying a randomly chosen element if it were labeled according to the distribution of classes.

Mathematical Formulation:

Gini = 1 - Σ p_i²

Where:
- p_i: Proportion of class i
- Sum over all classes

Interpretation:

Gini = 0: Pure node (all same class)
Gini = 1 - 1/k: Maximum impurity for k classes (uniform distribution)
Range: [0, 1-1/k] for k classes
For binary: Range [0, 0.5]

Properties:

Minimum (0) when pure (one class)
Maximum when uniform distribution
Similar to entropy but different formula

Gini vs Entropy:

Gini: Gini = 1 - Σ p_i² (faster to compute)
Entropy: H = -Σ p_i * log(p_i) (more information-theoretic)
In practice: Both work similarly for decision trees
Gini: Slightly faster, more sensitive to class probability changes
Entropy: More theoretically grounded

Use Cases:

Decision trees: CART algorithm uses Gini
Classification: Measure node impurity
Feature selection: Choose splits that minimize Gini

6. Jensen-Shannon Divergence

What is JS Divergence? JS divergence is a symmetric version of KL divergence. It's a true metric (satisfies triangle inequality).

Mathematical Formulation:

JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)

Where:
- M = 0.5 * (P + Q) (average distribution)

Properties:

Symmetric: JS(P || Q) = JS(Q || P)
Bounded: JS(P || Q) ∈ [0, 1] (when using log base 2)
Metric: Satisfies triangle inequality
Smooth: More stable than KL divergence

Use Cases:

GANs: Measure distance between real and generated distributions
Model comparison: When you need symmetric distance
Clustering: Measure cluster separation

Industry-Standard Boilerplate Code

See information_theory.py for complete implementations.

Exercises

Implement all metrics from scratch
Compare Gini vs Entropy
Compute mutual information for feature selection
Use KL divergence for regularization

Perplexity and Entropy Connection

Perplexity is closely related to entropy:

Perplexity = 2^H (where H is cross-entropy in bits)
Perplexity = exp(H) (where H is cross-entropy in nats)
Both measure uncertainty, but in different units
Entropy: bits (information content)
Perplexity: "effective vocabulary size" (more intuitive)

See 03_evaluation_metrics/perplexity_detailed.md for complete explanation!

Next Steps

Use these in decision trees, neural networks, RLHF
Reference when implementing algorithms

ML & LLM Interview Prep — Deep Dives