Information Theory: A Frontier-Lab Interview Deep Dive

Why this exists. Information theory is the language ML uses to talk about loss functions, compression, generalization, and divergences. Strong candidates can move fluidly between cross-entropy as a loss, KL as a divergence, mutual information as a model objective, and the connections among them. This document is the bridge.


1. Entropy: the central quantity

For a discrete distribution $p$ over outcomes $x \in \mathcal{X}$:

$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

Intuition: the average number of bits (or nats) needed to encode an outcome drawn from $p$. Equivalently, the average "surprise" $-\log p(x)$ of an outcome.

Properties

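Max entropy = uniform. $H(p) \le \log |\mathcal{X}|$, with equality iff $p$ is uniform.

Min entropy = deterministic. $H(p) \ge 0$, with equality iff $p$ is a point mass.

Concave in $p$. $H(\lambda p_1 + (1 - \lambda) p_2) \ge \lambda H(p_1) + (1 - \lambda) H(p_2)$: if you average two distributions, the entropy of the mixture is at least the average of their entropies.

Additivity for independent variables. $H(X, Y) = H(X) + H(Y)$ if $X \perp Y$.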
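The four properties above can be checked numerically. The following is a minimal sketch assuming NumPy; the helper name `entropy` and the example distributions are illustrative choices, not part of the original notes.

```python
import numpy as np

def entropy(p, base=np.e):
    """Shannon entropy of a discrete distribution, with 0 * log 0 taken as 0."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]                       # drop zero-probability outcomes
    return float(-np.sum(nz * np.log(nz)) / np.log(base))

uniform = np.full(4, 0.25)
point_mass = np.array([1.0, 0.0, 0.0, 0.0])
skewed = np.array([0.7, 0.1, 0.1, 0.1])

h_uniform = entropy(uniform, base=2)    # maximum for 4 outcomes: log2(4) bits
h_point = entropy(point_mass, base=2)   # minimum: a point mass has no uncertainty

# Concavity: entropy of a mixture vs. the average of the entropies.
mix = 0.5 * uniform + 0.5 * skewed
concavity_gap = entropy(mix) - (0.5 * entropy(uniform) + 0.5 * entropy(skewed))

# Additivity: the joint of two independent variables is the outer product.
joint = np.outer(skewed, uniform)
additivity_gap = entropy(joint) - (entropy(skewed) + entropy(uniform))

print(h_uniform, h_point, concavity_gap, additivity_gap)
```

The concavity gap should be non-negative and the additivity gap should vanish up to floating-point error.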

What "entropy" means in different contexts

  • Statistics. Spread of a distribution.
  • Coding. Lower bound on average code length (Shannon's source coding theorem).
  • Physics. Disorder; thermodynamic entropy.
  • ML. How "uncertain" a model is.

2. Cross-entropy

For two distributions $p$ (true) and $q$ (model):

$$H(p, q) = -\sum_x p(x) \log q(x)$$

Average code length when encoding samples from $p$ using a code optimal for $q$. Bounded below by $H(p)$ (you can't do better than the entropy of the true distribution).

The cross-entropy = entropy + KL identity

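$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$$

$H(p)$ is fixed (it's a property of the data). Minimizing $H(p, q)$ over $q$ is equivalent to minimizing $D_{\mathrm{KL}}(p \parallel q)$. This is why cross-entropy is the standard ML loss: it's KL up to a constant.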
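The identity takes one line to derive by adding and subtracting $\log p(x)$ inside the sum. A worked version, using the same $p$ (true) and $q$ (model) as above:

```latex
\begin{aligned}
H(p, q) &= -\sum_x p(x) \log q(x) \\
        &= -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)} \\
        &= H(p) + D_{\mathrm{KL}}(p \parallel q)
\end{aligned}
```

Because the second term is non-negative, this also gives the lower bound $H(p, q) \ge H(p)$.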

Cross-entropy in deep learning

For one-hot labels ($p$ is a delta on the true class $y$):

$$H(p, q) = -\log q(y)$$

This is exactly the negative log-likelihood. So "cross-entropy loss" and "NLL" are two names for the same loss in the discrete-label case, and minimizing it is maximum likelihood estimation (MLE). Different generative assumptions give different losses (Gaussian → MSE), but for classification, cross-entropy is mandated by maximum likelihood under the categorical distribution.


3. KL divergence

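$$D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

Measures how $q$ differs from $p$ "from $p$'s perspective": the expectation is taken under $p$.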

Properties

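Non-negative. $D_{\mathrm{KL}}(p \parallel q) \ge 0$, with equality iff $p = q$. Direct consequence of Jensen's inequality applied to $-\log$.

Asymmetric. $D_{\mathrm{KL}}(p \parallel q) \ne D_{\mathrm{KL}}(q \parallel p)$ in general. Not a distance.

Not a metric. Triangle inequality fails. Don't think of KL as a distance; it's a divergence.

Coordinate-invariant. Reparameterize $y = f(x)$ for invertible $f$ (applied to both distributions); KL is unchanged. This is a property that differential entropy itself does not have, which is one reason KL is the better-behaved quantity for continuous variables.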
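A small numerical check makes the non-negativity and asymmetry concrete, and also shows the case where KL is infinite. This is a sketch assuming NumPy; the function name `kl` and the example distributions are illustrative.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0                     # terms with p(x) = 0 contribute 0
    if np.any(q[support] == 0):
        return float("inf")             # p puts mass where q has none
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

forward = kl(p, q)                      # D_KL(p || q)
reverse = kl(q, p)                      # D_KL(q || p), generally different
same = kl(p, p)                         # identical distributions
infinite = kl([0.5, 0.5], [1.0, 0.0])   # q assigns 0 to an outcome p can produce
finite = kl([1.0, 0.0], [0.5, 0.5])     # the other direction stays finite

print(forward, reverse, same, infinite, finite)
```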

Forward vs reverse KL: why direction matters

Forward KL, $D_{\mathrm{KL}}(p \parallel q)$ ("mean-seeking"). Penalizes $q$ heavily where $p$ has mass and $q$ doesn't. Encourages $q$ to cover all modes of $p$. If $q$ is restricted to a simpler family (e.g. a unimodal Gaussian fitting a multimodal $p$), forward KL spreads $q$ to cover everything: high entropy, mean-seeking.

Reverse KL, $D_{\mathrm{KL}}(q \parallel p)$ ("mode-seeking"). Penalizes $q$ where $q$ has mass but $p$ doesn't. Encourages $q$ to fit one mode of $p$ well, ignoring others. Low entropy, mode-seeking.

Why this matters for ML:

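  • MLE / cross-entropy training is forward KL: $\min_\theta D_{\mathrm{KL}}(p_{\mathrm{data}} \parallel q_\theta)$. Makes the model cover the data distribution. Models trained this way often produce "average-looking" outputs.
  • Variational inference / RL with KL regularization is often reverse KL: $\min_\theta D_{\mathrm{KL}}(q_\theta \parallel p)$. Makes the model concentrate on a mode.
  • GANs (the original formulation, with an optimal discriminator) approximately minimize Jensen-Shannon divergence, a symmetrized KL in which both $p$ and $q$ are compared against their mixture $m = \frac{1}{2}(p + q)$.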

Frontier-lab interview gotcha: "Why does an MLE-trained model tend to produce average outputs?" Forward-KL is mean-seeking.
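The multimodal-vs-unimodal example can be reproduced with a brute-force fit. The sketch below assumes a target $p$ that is an equal mixture of $N(-3, 1)$ and $N(3, 1)$, a Gaussian family for $q$, and a coarse grid search over $(\mu, \sigma)$; all of these are illustrative choices, and the integrals are approximated by sums on a grid.

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 4001)      # integration grid
dx = x[1] - x[0]

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# Target p: equal mixture of N(-3, 1) and N(3, 1), computed in log space.
log_p = np.logaddexp(log_normal(x, -3.0, 1.0), log_normal(x, 3.0, 1.0)) + np.log(0.5)

def forward_kl(mu, sigma):
    """Grid approximation of D_KL(p || q) for q = N(mu, sigma^2)."""
    log_q = log_normal(x, mu, sigma)
    return np.sum(np.exp(log_p) * (log_p - log_q)) * dx

def reverse_kl(mu, sigma):
    """Grid approximation of D_KL(q || p) for q = N(mu, sigma^2)."""
    log_q = log_normal(x, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

mus = np.linspace(-4.0, 4.0, 33)        # step 0.25
sigmas = np.linspace(0.5, 4.0, 36)      # step 0.1
grid = [(m, s) for m in mus for s in sigmas]

best_forward = min(grid, key=lambda ms: forward_kl(*ms))
best_reverse = min(grid, key=lambda ms: reverse_kl(*ms))
print("forward KL fit (mu, sigma):", best_forward)
print("reverse KL fit (mu, sigma):", best_reverse)
```

Under these assumptions the forward-KL fit should sit between the modes with a large variance (it matches the mixture's mean and variance), while the reverse-KL fit should lock onto one of the two modes with a variance close to that mode's.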


4. Mutual information

$$I(X; Y) = H(X) - H(X \mid Y) = D_{\mathrm{KL}}\big(p(x, y) \parallel p(x)\,p(y)\big)$$

How much knowing $Y$ reduces uncertainty about $X$ (and vice versa).

Properties

  • $I(X; Y) \ge 0$.
  • $I(X; Y) = 0$ iff $X \perp Y$.
  • $I(X; Y) = H(X) + H(Y) - H(X, Y)$.
  • Symmetric: $I(X; Y) = I(Y; X)$.
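These properties are easy to verify on a small joint probability table. The sketch below assumes NumPy and computes mutual information as the KL divergence between the joint and the product of its marginals; the function names and the two example tables are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def mutual_information(joint):
    """I(X; Y) in nats, from a joint table with X on rows and Y on columns."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, m)
    product = px * py                       # joint under independence
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
independent = np.outer([0.3, 0.7], [0.6, 0.4])

mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)

# Entropy form: H(X) + H(Y) - H(X, Y) should agree with the KL form.
mi_entropy_form = (entropy(dependent.sum(axis=1)) + entropy(dependent.sum(axis=0))
                   - entropy(dependent))
print(mi_dep, mi_ind, mi_entropy_form)
```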

Why it matters in ML

  • Information bottleneck: train representations $Z$ that maximize $I(Z; Y)$ (predictive of label) while minimizing $I(Z; X)$ (compressing input). A theoretical framework for understanding "good" representations.
  • Self-supervised learning. Many SSL objectives (InfoNCE, contrastive losses) are lower bounds on mutual information.
  • Disentanglement. Maximizing mutual information between latent dimensions and meaningful factors.

InfoNCE (van den Oord et al. 2018)

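The standard contrastive loss, for one positive among $N$ candidates and a similarity score $s(\cdot, \cdot)$:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\mathbb{E}\left[\log \frac{\exp(s(x, y^+))}{\exp(s(x, y^+)) + \sum_{j=1}^{N-1} \exp(s(x, y_j^-))}\right]$$

where $(x, y^+)$ is the positive (correct) pair and $y_j^-$ are negatives. Minimizing it maximizes a lower bound on mutual information: $I(X; Y) \ge \log N - \mathcal{L}_{\mathrm{InfoNCE}}$. Used in CLIP, MoCo, SimCLR, and modern embedding models.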
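A minimal NumPy sketch of the in-batch form of this loss follows. It assumes cosine similarity scaled by a temperature as the score function and treats the other rows of the batch as negatives; the function name `info_nce`, the temperature value, and the random data are illustrative assumptions, not a reproduction of any particular library's implementation.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.1):
    """In-batch InfoNCE: keys[i] is the positive for queries[i], other rows are negatives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                          # (N, N) scaled cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)     # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))              # positives sit on the diagonal

rng = np.random.default_rng(0)
n, d = 8, 16
z = rng.normal(size=(n, d))

loss_aligned = info_nce(z, z)                               # each query matches its own key
loss_random = info_nce(z, rng.normal(size=(n, d)))          # keys unrelated to queries
loss_constant = info_nce(z, np.ones((n, d)))                # keys carry no information

# log N minus the loss is the quantity that lower-bounds I(X; Y).
mi_bound_aligned = np.log(n) - loss_aligned
print(loss_aligned, loss_random, loss_constant, mi_bound_aligned)
```

When every key is identical the scores are uninformative and the loss equals $\log N$, so the bound $\log N - \mathcal{L}$ is zero; the bound can never exceed $\log N$, which is why large batches matter for contrastive estimates of mutual information.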


5. Conditional and joint entropy

Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$.

Conditional entropy $H(Y \mid X)$ is the average uncertainty about $Y$ given that $X$ is known. Always between 0 and $H(Y)$.

These are useful for decomposing information flow in models. E.g., $H(Y \mid X)$ is the irreducible noise any model must contend with, and therefore a lower bound on the expected cross-entropy loss of any predictor of $Y$ from $X$.


6. KL in machine learning

KL appears in many places.

Maximum likelihood = forward KL minimization

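Already covered. $\arg\min_\theta D_{\mathrm{KL}}(p_{\mathrm{data}} \parallel q_\theta) = \arg\max_\theta \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log q_\theta(x)]$.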

Variational inference / VAE

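The Evidence Lower Bound (ELBO):

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \parallel p(z)\big)$$

The first term is the reconstruction; the second is a KL penalty against the prior. This is why VAEs have a "KL term."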
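In the common setup where the approximate posterior is a diagonal Gaussian and the prior is a standard normal, the KL term has a closed form: $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. Both modeling choices are assumptions made here for illustration (the notes do not fix a particular posterior or prior), and the function name is hypothetical.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# Posterior equal to the prior: the penalty vanishes.
kl_zero = gaussian_kl_to_standard_normal(np.zeros(2), np.zeros(2))

# Shifting the mean by 1 in one dimension costs 0.5 nats.
kl_shifted = gaussian_kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))

# Shrinking the variance also costs something, even with zero mean.
kl_narrow = gaussian_kl_to_standard_normal(np.zeros(2), np.log(np.array([0.1, 0.1])))
print(kl_zero, kl_shifted, kl_narrow)
```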

RLHF / PPO regularization

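The RLHF objective:

$$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}[r(x, y)] - \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid x) \parallel \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

The KL anchor prevents the policy from drifting too far from the reference. The same idea appears in TRPO and in the KL-penalty formulation of PPO, where the anchor is the previous policy iterate.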
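In practice the KL term is usually estimated from the sampled response itself: the summed per-token log-ratio $\log \pi(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)$ is a single-sample estimate of the KL. The sketch below shows that shaping step only; the function name, the argument layout (per-token log-probabilities of one sampled response), and the value of `beta` are illustrative assumptions rather than any specific library's API.

```python
import numpy as np

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Scalar reward minus beta times the sampled sequence-level log-ratio.

    logp_policy / logp_ref: per-token log-probabilities of the SAME sampled
    response under the policy and the frozen reference model.
    """
    log_ratio = np.sum(np.asarray(logp_policy) - np.asarray(logp_ref), axis=-1)
    return reward - beta * log_ratio

# The policy assigns this response higher probability than the reference does,
# so the penalty reduces the reward.
shaped = kl_shaped_reward(
    reward=1.0,
    logp_policy=np.array([-1.0, -1.0]),
    logp_ref=np.array([-2.0, -2.0]),
    beta=0.1,
)
print(shaped)
```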

DPO derivation

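$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

The closed-form solution to the KL-regularized RL objective, which becomes the basis for DPO. See 08_training_techniques/ALIGNMENT_DEEP_DIVE.md.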
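The step from the closed-form solution to the DPO loss can be sketched as follows, following the standard derivation. It starts from the optimal policy $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x) \exp(r(x, y)/\beta)$ with normalizer $Z(x)$, and assumes a Bradley-Terry preference model over a preferred response $y_w$ and a rejected response $y_l$ (notation introduced here for illustration).

```latex
\begin{aligned}
% Invert the optimal policy to express the reward:
r(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
% Bradley-Terry depends only on reward differences, so Z(x) cancels:
P(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
% Substitute a trainable policy \pi_\theta for \pi^* and maximize likelihood:
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\mathbb{E}\left[\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\end{aligned}
```

The intractable partition function never has to be computed, which is what lets the preference data train the policy directly without a separate reward model.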

Knowledge distillation

Train a student model to match a teacher's distribution by minimizing $D_{\mathrm{KL}}(p_{\mathrm{teacher}} \parallel p_{\mathrm{student}})$. The student inherits the teacher's confidence pattern, not just hard predictions.
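A minimal sketch of that loss on raw logits, assuming NumPy. The temperature used to soften both distributions is a common practice in distillation but is an assumption here, as are the function names and example logits.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """Mean over the batch of D_KL(teacher || student) on temperature-softened outputs."""
    t = log_softmax(np.asarray(teacher_logits, dtype=float) / temperature)
    s = log_softmax(np.asarray(student_logits, dtype=float) / temperature)
    return float(np.mean(np.sum(np.exp(t) * (t - s), axis=-1)))

teacher = np.array([[2.0, 0.5, -1.0],
                    [0.0, 3.0, 1.0]])
untrained_student = np.zeros_like(teacher)      # uniform predictions

loss_matched = distillation_kl(teacher, teacher)
loss_untrained = distillation_kl(teacher, untrained_student)
print(loss_matched, loss_untrained)
```

The loss is zero only when the student reproduces the teacher's full distribution, including the relative probabilities of the wrong classes.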


7. Other divergences

KL is one of many.

Jensen-Shannon (JS) divergence

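$$\mathrm{JS}(p, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \parallel m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \parallel m), \qquad m = \tfrac{1}{2}(p + q)$$

Symmetric. Bounded: $0 \le \mathrm{JS}(p, q) \le \log 2$. Square root of JS is a metric.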
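The symmetry and the $\log 2$ bound can be checked directly. A short sketch assuming NumPy, with JS defined through the mixture $m = \frac{1}{2}(p + q)$; function names and example distributions are illustrative.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)       # m > 0 wherever p or q is, so both KL terms are finite
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

js_pq = js(p, q)
js_qp = js(q, p)
js_disjoint = js([1.0, 0.0], [0.0, 1.0])   # no shared support: the maximum, log 2
js_same = js(p, p)
print(js_pq, js_qp, js_disjoint, js_same)
```

Unlike KL, JS stays finite even when the supports do not overlap.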

f-divergences

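General family $D_f(p \parallel q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$ for convex $f$ with $f(1) = 0$. KL: $f(t) = t \log t$. JS, $\chi^2$, total variation are all f-divergences with different $f$.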

Wasserstein distance

A different family entirely (optimal transport). Considers the geometry of the underlying space (not just distribution mass). Used in WGAN, optimal transport, distribution matching.

Total variation

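$$\mathrm{TV}(p, q) = \tfrac{1}{2} \sum_x |p(x) - q(x)| = \sup_A |p(A) - q(A)|$$

The largest difference in probability that $p$ and $q$ assign to any event, which is the best advantage any test can achieve in distinguishing them. Pinsker's inequality, $\mathrm{TV}(p, q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(p \parallel q)}$ (KL in nats), bounds TV by KL.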
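Pinsker's inequality can be spot-checked on random distributions. This is only a numerical sanity check under assumed random draws (Dirichlet samples over 5 outcomes), not a proof; names are illustrative.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * float(np.sum(np.abs(p - q)))

rng = np.random.default_rng(0)
pairs = [(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))) for _ in range(1000)]

# Pinsker: TV(p, q) <= sqrt(KL(p || q) / 2) for every pair.
pinsker_holds = all(
    total_variation(p, q) <= np.sqrt(0.5 * kl(p, q)) + 1e-12 for p, q in pairs
)
print(pinsker_holds)
```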


8. Cross-entropy in detail

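For a softmax classifier with logits $z \in \mathbb{R}^K$ and true class $y$:

$$\mathcal{L}(z, y) = -\log \frac{e^{z_y}}{\sum_{k=1}^{K} e^{z_k}} = -z_y + \log \sum_{k=1}^{K} e^{z_k}$$

The $\log \sum_k e^{z_k}$ term is the log-partition function (also called log-sum-exp). Numerically computed via:

$$\log \sum_k e^{z_k} = m + \log \sum_k e^{z_k - m}, \qquad m = \max_k z_k$$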
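The max-subtraction trick matters as soon as logits get large. The sketch below assumes NumPy and contrasts a naive evaluation with the shifted one; the function name and the example logits are illustrative.

```python
import numpy as np

def logsumexp(z):
    """Stable log(sum(exp(z))): shift by the max so the largest exponent is 0."""
    z = np.asarray(z, dtype=float)
    m = np.max(z)
    return float(m + np.log(np.sum(np.exp(z - m))))

z = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore"):
    naive = float(np.log(np.sum(np.exp(z))))   # exp(1000) overflows float64

stable = logsumexp(z)
print(naive, stable)
```

The naive version overflows to infinity, while the shifted version only ever exponentiates non-positive numbers.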

Gradient w.r.t. logits

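$$\frac{\partial \mathcal{L}}{\partial z_k} = \mathrm{softmax}(z)_k - \mathbb{1}[k = y]$$

This is the famous "predictions minus targets" gradient: softmax probabilities minus the one-hot label. It's the canonical-link gradient for the categorical distribution in GLM theory. Same form as logistic regression's $\sigma(z) - y$, extended to $K$ classes.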
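A finite-difference check is a quick way to confirm the gradient formula. The sketch below assumes NumPy; the function names, the example logits, and the step size are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    """-log softmax(z)[y], written as log-sum-exp minus the true-class logit."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m))) - z[y]

def grad_analytic(z, y):
    """softmax(z) minus the one-hot vector for class y."""
    g = softmax(z)
    g[y] -= 1.0
    return g

def grad_numeric(z, y, eps=1e-6):
    """Central finite differences, one logit at a time."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        step = np.zeros_like(z)
        step[k] = eps
        g[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)
    return g

z = np.array([2.0, -1.0, 0.5])
y = 0
analytic = grad_analytic(z, y)
numeric = grad_numeric(z, y)
print(analytic, numeric)
```

The gradient components sum to zero, reflecting that adding a constant to every logit leaves the loss unchanged.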


9. Perplexity

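$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log q(x_i \mid x_{<i})\right)$$

The inverse of the geometric mean per-token probability, i.e. the exponential of the per-token cross-entropy. Lower perplexity = better model.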

Bounds

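  • Lower bound (in expectation): $\exp(H)$, where $H$ is the true per-token entropy of the data. A perfect LM would have $\mathrm{PPL} = \exp(H)$.
  • Uniform baseline: $|V|$ (vocabulary size), the perplexity of a model that is uniform random. This is not a hard upper bound: a model that is confidently wrong can exceed $|V|$.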
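Perplexity is a one-liner once per-token log-probabilities are available. The sketch below assumes natural-log probabilities and NumPy; the function name, vocabulary size, and example probabilities are illustrative.

```python
import numpy as np

def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) per token."""
    return float(np.exp(-np.mean(token_log_probs)))

vocab_size = 50
n_tokens = 10

# A uniform model assigns 1/|V| to every token, so its perplexity is |V|.
ppl_uniform = perplexity(np.full(n_tokens, np.log(1.0 / vocab_size)))

# A model that assigns probability 0.5 and 0.25 to two tokens:
# geometric mean probability is sqrt(0.125), so PPL = sqrt(8).
ppl_example = perplexity(np.log([0.5, 0.25]))

# A confidently wrong model can be worse than uniform.
ppl_bad = perplexity(np.full(n_tokens, np.log(1e-4)))
print(ppl_uniform, ppl_example, ppl_bad)
```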

Tokenizer dependence

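Perplexity is a per-token quantity, so it depends on tokenization: the same text split by a different tokenizer has a different token count $N$ and therefore a different PPL. Perplexities cannot be directly compared across tokenizers; see 03_evaluation_metrics/EVALUATION_METRICS_DEEP_DIVE.md.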


10. Information bottleneck

A theoretical framework (Tishby et al. 2000) proposing that good representations $Z$ of input $X$ for predicting label $Y$:

  • Maximize $I(Z; Y)$ (predictive of label).
  • Minimize $I(Z; X)$ (compress the input: "throw away irrelevant information").

Empirically, deep networks trained with cross-entropy seem to (approximately) follow this trajectory: early layers compress the input; later layers preserve task-relevant information. Whether IB is the right explanation for deep learning's success is debated.


11. Source coding theorem (Shannon)

The minimum average bits per symbol needed to losslessly encode samples from $p$ is $H(p)$. You cannot compress below entropy.

Practical relevance for ML:

  • Cross-entropy $H(p, q)$ is the average code length if you use a code optimal for $q$ to encode samples from $p$. Always $\ge H(p)$.
  • Minimizing cross-entropy = building a near-optimal compressor for the data.
  • LLMs are essentially lossy compressors of their training data. Better LM → better compression.

A 2023 paper (Deletang et al., "Language Modeling is Compression") makes this explicit: paired with arithmetic coding, large language models can compress text better than gzip.
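The coding interpretation can be made concrete with ideal code lengths of $-\log_2$ probability bits per symbol. The sketch below assumes a dyadic source $p$ (so the ideal lengths are whole numbers) and a mismatched uniform model $q$; both distributions are illustrative.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution
q = np.full(4, 0.25)                      # model: uniform over 4 symbols

len_p_code = -np.log2(p)                  # ideal code lengths for p: 1, 2, 3, 3 bits
len_q_code = -np.log2(q)                  # ideal code lengths for q: 2 bits each

entropy_bits = float(np.sum(p * len_p_code))         # best achievable average length
cross_entropy_bits = float(np.sum(p * len_q_code))   # average length using q's code
kl_bits = cross_entropy_bits - entropy_bits          # wasted bits per symbol

print(entropy_bits, cross_entropy_bits, kl_bits)
```

The gap between the two averages is exactly $D_{\mathrm{KL}}(p \parallel q)$ in bits: the price paid per symbol for compressing with the wrong model.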


12. Common interview gotchas

| Gotcha | Strong answer |
|---|---|
| "Is KL a distance?" | No. Asymmetric, doesn't satisfy triangle inequality. It's a divergence. |
| "Why minimize cross-entropy?" | It's MLE under categorical. Equivalently, it's $D_{\mathrm{KL}}(p \parallel q)$ up to the data entropy constant. |
| "Forward vs reverse KL?" | Forward ($D_{\mathrm{KL}}(p \parallel q)$): mean-seeking; covers $p$. Reverse ($D_{\mathrm{KL}}(q \parallel p)$): mode-seeking; fits one mode. MLE = forward. |
| "What's the KL between identical distributions?" | 0. $D_{\mathrm{KL}}(p \parallel p) = \sum_x p(x) \log 1 = 0$. |
| "Can KL be infinite?" | Yes. $D_{\mathrm{KL}}(p \parallel q) = \infty$ if there's a region where $p(x) > 0$ but $q(x) = 0$. (You're "infinitely surprised" by a sample assigned probability 0.) |
| "What's mutual information?" | KL between joint and product of marginals. Measures statistical dependence. |
| "When are KL and cross-entropy the same?" | They differ by $H(p)$. When $p$ is fixed (i.e., during training, where the data distribution doesn't change), minimizing cross-entropy = minimizing KL. They are numerically equal only when $H(p) = 0$, e.g. one-hot labels. |
| "What's perplexity?" | $\exp(\text{cross-entropy})$. Inverse geometric average per-token probability. Tokenizer-dependent. |

13. The 10 most-asked information theory interview questions

  1. Define entropy. $H(p) = -\sum_x p(x) \log p(x)$. Average surprise / coding length.
  2. Define cross-entropy. $H(p, q) = -\sum_x p(x) \log q(x)$. Coding length using $q$-optimal code on samples from $p$.
  3. Cross-entropy = entropy + KL. $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$. Why minimizing cross-entropy = minimizing KL.
  4. Define KL divergence. $D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$. Non-negative, asymmetric, not a metric.
  5. Forward vs reverse KL. Forward: mean-seeking. Reverse: mode-seeking.
  6. Mutual information. $I(X; Y) = H(X) - H(X \mid Y)$. Statistical dependence.
  7. Why is MLE = cross-entropy? Cross-entropy is the negative log-likelihood under categorical; MLE is $\arg\max_\theta \sum_i \log q_\theta(y_i \mid x_i)$.
  8. Perplexity? $\exp(\text{cross-entropy})$. Tokenizer-dependent.
  9. KL in RLHF? Penalty prevents policy from drifting from reference.
  10. What's the source coding theorem? Average code length $\ge$ entropy. Cross-entropy is the loss because it's compressibility under the model.

14. Drill plan

  1. Whiteboard the $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$ derivation.
  2. Walk through forward vs reverse KL with multimodal-vs-unimodal example.
  3. Show MI = $D_{\mathrm{KL}}\big(p(x, y) \parallel p(x)\,p(y)\big)$.
  4. Connect cross-entropy to MLE under categorical.
  5. Drill INTERVIEW_GRILL.md.

15. Further reading

  • Cover & Thomas, Elements of Information Theory (the textbook).
  • Shannon, "A Mathematical Theory of Communication" (1948) — the founding paper.
  • Tishby et al., "The Information Bottleneck Method" (2000).
  • van den Oord et al., "Representation Learning with Contrastive Predictive Coding" (InfoNCE, 2018).
  • Deletang et al., "Language Modeling is Compression" (2023).