Information Theory Metrics: Use Cases
When to Use Each Metric
Entropy
Use When:
- Building decision trees (ID3, C4.5)
- Measuring uncertainty in distributions
- Information gain calculations
- Feature selection (high entropy = more informative)
Example:
# Decision tree split selection
parent_entropy = entropy(parent_labels)
left_entropy = entropy(left_labels)
right_entropy = entropy(right_labels)
# Information gain
info_gain = parent_entropy - (|left|/|parent| * left_entropy +
|right|/|parent| * right_entropy)
# Choose split with maximum information gain
Cross-Entropy
Use When:
- Classification loss function (most common)
- Language modeling (next token prediction)
- Any probabilistic prediction task
- Comparing true vs predicted distributions
Example:
# Classification
true_labels = [0, 1, 2] # Class indices
pred_probs = model(input) # [batch, n_classes] probabilities
loss = cross_entropy_loss(pred_probs, true_labels)
KL Divergence
Use When:
- RLHF (KL penalty to keep policy close to reference)
- VAEs (KL between posterior and prior)
- Model comparison
- Regularization (prevent overfitting)
- Distribution matching
Example:
# RLHF KL penalty
policy_logprobs = policy_model(input)
reference_logprobs = reference_model(input)
kl_penalty = beta * kl_divergence(policy_logprobs, reference_logprobs)
loss = policy_loss + kl_penalty
Mutual Information
Use When:
- Feature selection (select informative features)
- Information bottleneck (compress while preserving info)
- Clustering evaluation
- Dimensionality reduction
- Understanding feature relationships
Example:
# Feature selection
for feature in features:
mi = mutual_information(feature, target)
if mi > threshold:
selected_features.append(feature)
Gini Impurity
Use When:
- Decision trees (CART algorithm)
- Classification impurity measure
- When you need faster computation than entropy
- Binary classification
Example:
# Decision tree split
gini_parent = gini_impurity(parent_labels)
gini_left = gini_impurity(left_labels)
gini_right = gini_impurity(right_labels)
# Gini gain
gini_gain = gini_parent - (|left|/|parent| * gini_left +
|right|/|parent| * gini_right)
Jensen-Shannon Divergence
Use When:
- GANs (measure distance between real and generated)
- Model comparison (when you need symmetric distance)
- Clustering (when KL is unstable)
- When distributions might not overlap (KL can be infinite)
Example:
# GAN training
real_dist = real_data_distribution
generated_dist = generator_output_distribution
distance = jensen_shannon_divergence(real_dist, generated_dist)
# Minimize distance to make generated data similar to real
Quick Reference Table
| Metric | Primary Use | Key Property |
|---|---|---|
| Entropy | Decision trees, uncertainty | H(X) ≥ 0, max when uniform |
| Cross-Entropy | Classification loss | H(P,Q) ≥ H(P), = H(P) when Q=P |
| KL Divergence | RLHF, VAEs, regularization | Asymmetric, KL(P|Q) ≥ 0 |
| Mutual Information | Feature selection | I(X;Y) = 0 if independent |
| Gini | Decision trees (CART) | Faster than entropy |
| JS Divergence | GANs, symmetric distance | Symmetric, bounded [0,1] |
Common Patterns
Pattern 1: Decision Tree Split Selection
# Use entropy or Gini
def choose_best_split(X, y):
best_gain = -float('inf')
for feature, threshold in candidate_splits:
left, right = split(X, y, feature, threshold)
gain = information_gain(y, left, right) # Uses entropy
# or
gain = gini_gain(y, left, right) # Uses Gini
if gain > best_gain:
best_gain = gain
best_split = (feature, threshold)
Pattern 2: Classification Loss
# Always use cross-entropy
loss = nn.CrossEntropyLoss()(predictions, true_labels)
Pattern 3: RLHF Regularization
# Use KL divergence
kl_penalty = beta * kl_divergence(policy_dist, reference_dist)
total_loss = policy_loss + kl_penalty
Pattern 4: Feature Selection
# Use mutual information
for feature in features:
mi = mutual_information(feature, target)
if mi > threshold:
selected.append(feature)