Information Theory Metrics: Use Cases

When to Use Each Metric

Entropy

Use When:

  • Building decision trees (ID3, C4.5)
  • Measuring uncertainty in distributions
  • Information gain calculations
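  • Feature selection via information gain (a feature is informative when splitting on it lowers the entropy of the target, not when its own entropy is high)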

Example:

# Decision tree split selection
parent_entropy = entropy(parent_labels)
left_entropy = entropy(left_labels)
right_entropy = entropy(right_labels)

# Information gain
n = len(parent_labels)
info_gain = parent_entropy - (len(left_labels) / n * left_entropy +
                              len(right_labels) / n * right_entropy)
# Choose split with maximum information gain
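The split-selection snippet above leaves `entropy` undefined. A minimal sketch, assuming labels are a plain Python list of hashable class values and entropy is measured in bits; `information_gain` and the toy labels are illustrative names, not part of the original:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent_labels, left_labels, right_labels):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted

parent = [0, 0, 1, 1]
perfect_gain = information_gain(parent, [0, 0], [1, 1])  # both children are pure
useless_gain = information_gain(parent, [0, 1], [0, 1])  # children mirror the parent
```

A split that separates the classes completely recovers the full parent entropy, while a split whose children have the same class mix as the parent gains nothing.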

Cross-Entropy

Use When:

  • Classification loss function (most common)
  • Language modeling (next token prediction)
  • Any probabilistic prediction task
  • Comparing true vs predicted distributions

Example:

# Classification
true_labels = [0, 1, 2]  # Class indices, one per example
pred_probs = model(inputs)  # [batch, n_classes] probabilities (each row sums to 1)
loss = cross_entropy_loss(pred_probs, true_labels)  # Mean of -log p(true class)
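One possible implementation of `cross_entropy_loss` for the example above, assuming predictions are already normalized probabilities given as nested Python lists. The `eps` clamp and the toy predictions are assumptions added for illustration:

```python
import math

def cross_entropy_loss(pred_probs, true_labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class.

    pred_probs: list of per-example probability lists (each sums to 1).
    true_labels: list of integer class indices.
    """
    total = 0.0
    for probs, label in zip(pred_probs, true_labels):
        total += -math.log(max(probs[label], eps))  # clamp to avoid log(0)
    return total / len(true_labels)

true_labels = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

loss_confident = cross_entropy_loss(confident, true_labels)  # equals -ln(0.9)
loss_uniform = cross_entropy_loss(uniform, true_labels)      # equals ln(3)
```

The uniform prediction costs ln(3) nats per example, which is the loss a three-class model should beat once it has learned anything.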

KL Divergence

Use When:

  • RLHF (KL penalty to keep policy close to reference)
  • VAEs (KL between posterior and prior)
  • Model comparison
  • Regularization (penalize drift from a prior or reference distribution)
  • Distribution matching

Example:

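# RLHF KL penalty (beta sets how tightly the policy is tied to the reference)
policy_logprobs = policy_model(input)
reference_logprobs = reference_model(input)  # Reference model is kept frozen
kl_penalty = beta * kl_divergence(policy_logprobs, reference_logprobs)  # KL(policy || reference)
loss = policy_loss + kl_penalty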
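A sketch of what `kl_divergence` could look like when both models return log-probabilities over the same discrete outcomes. It assumes every outcome has non-zero probability under both distributions; the probabilities and `beta = 0.1` are made-up values for illustration:

```python
import math

def kl_divergence(p_logprobs, q_logprobs):
    """KL(P || Q) in nats from two lists of log-probabilities over the same outcomes."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(p_logprobs, q_logprobs))

policy_logprobs = [math.log(p) for p in (0.7, 0.2, 0.1)]
reference_logprobs = [math.log(q) for q in (0.5, 0.3, 0.2)]

beta = 0.1  # hypothetical penalty coefficient
kl_forward = kl_divergence(policy_logprobs, reference_logprobs)  # KL(policy || reference)
kl_reverse = kl_divergence(reference_logprobs, policy_logprobs)  # KL(reference || policy)
kl_self = kl_divergence(policy_logprobs, policy_logprobs)        # zero for identical inputs
kl_penalty = beta * kl_forward
```

The two directions give different values, so the argument order matters. The RLHF penalty uses the policy as the first argument.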

Mutual Information

Use When:

  • Feature selection (select informative features)
  • Information bottleneck (compress while preserving info)
  • Clustering evaluation
  • Dimensionality reduction
  • Understanding feature relationships

Example:

# Feature selection: keep features whose MI with the target exceeds a threshold
selected_features = []
for feature in features:
    mi = mutual_information(feature, target)
    if mi > threshold:
        selected_features.append(feature)
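For discrete features, `mutual_information` can be estimated from co-occurrence counts. The sketch below assumes equal-length sequences of hashable values and reports bits; the feature names and the threshold of 0.5 are hypothetical:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

target = [0, 0, 1, 1]
features = {
    "copy_of_target": [0, 0, 1, 1],  # fully determines the target
    "unrelated": [0, 1, 0, 1],       # empirically independent of the target
}
threshold = 0.5  # hypothetical cutoff, in bits
scores = {name: mutual_information(col, target) for name, col in features.items()}
selected_features = [name for name, mi in scores.items() if mi > threshold]
```

Continuous features need binning or a dedicated estimator before this count-based approach applies.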

Gini Impurity

Use When:

  • Decision trees (CART algorithm)
  • Classification impurity measure
  • When you want cheaper computation than entropy (no logarithm needed)
  • Binary or multi-class classification

Example:

# Decision tree split
gini_parent = gini_impurity(parent_labels)
gini_left = gini_impurity(left_labels)
gini_right = gini_impurity(right_labels)

# Gini gain
n = len(parent_labels)
gini_gain = gini_parent - (len(left_labels) / n * gini_left +
                           len(right_labels) / n * gini_right)
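A minimal sketch of `gini_impurity`, assuming labels are a plain list of class values. `gini_gain` is an illustrative helper that wraps the weighted-difference computation shown above:

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 for a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent_labels, left_labels, right_labels):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * gini_impurity(left_labels) \
             + (len(right_labels) / n) * gini_impurity(right_labels)
    return gini_impurity(parent_labels) - weighted

parent = [0, 0, 1, 1]
gini_parent = gini_impurity(parent)             # balanced binary node: 0.5
gain = gini_gain(parent, [0, 0], [1, 1])        # pure children: gain equals parent impurity
```

Unlike entropy, the computation uses only squares and sums, which is where the speed advantage comes from.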

Jensen-Shannon Divergence

Use When:

  • GANs (the original GAN objective implicitly minimizes JS divergence between real and generated data)
  • Model comparison (when you need a symmetric distance)
  • Clustering (when KL is unstable)
  • When distributions might not overlap (KL can be infinite; JS stays finite)

Example:

# GAN training
real_dist = real_data_distribution
generated_dist = generator_output_distribution
distance = jensen_shannon_divergence(real_dist, generated_dist)
# Lower distance means generated data is closer to real.
# In practice a GAN minimizes this implicitly through the discriminator.
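For explicit discrete distributions, `jensen_shannon_divergence` can be written directly from its definition as the average KL divergence to the mixture. This sketch assumes both inputs are probability lists over the same outcomes and uses log base 2; `kl_bits` is a hypothetical helper:

```python
import math

def kl_bits(p, q):
    """KL(P || Q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Average KL divergence of P and Q to their mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

real_dist = [0.5, 0.5, 0.0, 0.0]
generated_dist = [0.0, 0.0, 0.5, 0.5]  # no overlap with real_dist

disjoint = jensen_shannon_divergence(real_dist, generated_dist)  # upper bound: 1 bit
identical = jensen_shannon_divergence(real_dist, real_dist)      # lower bound: 0
```

The mixture is non-zero wherever either input is, so the result stays finite even for non-overlapping distributions, where KL would be infinite.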

Quick Reference Table

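Metric             | Primary Use                 | Key Property
-------------------|-----------------------------|---------------------------------------------
Entropy            | Decision trees, uncertainty | H(X) ≥ 0, max when uniform
Cross-Entropy      | Classification loss         | H(P,Q) ≥ H(P), = H(P) when Q = P
KL Divergence      | RLHF, VAEs, regularization  | Asymmetric, KL(P‖Q) ≥ 0
Mutual Information | Feature selection           | I(X;Y) = 0 if and only if X, Y independent
Gini               | Decision trees (CART)       | Faster than entropy (no logarithm)
JS Divergence      | GANs, symmetric distance    | Symmetric, bounded in [0, 1] with log base 2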
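The key properties in the table follow from the standard definitions for discrete distributions, written out here for reference. The notation (sums over outcomes x, class proportions p_k for Gini) is an assumption, since the table itself does not fix one:

```latex
\begin{align}
H(P) &= -\sum_x P(x)\,\log P(x) \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \sum_x P(x)\,\log\frac{P(x)}{Q(x)} \;\ge\; 0 \\
H(P, Q) &= -\sum_x P(x)\,\log Q(x) \;=\; H(P) + D_{\mathrm{KL}}(P \,\|\, Q) \\
I(X; Y) &= D_{\mathrm{KL}}\!\left(P_{XY} \,\|\, P_X P_Y\right) \\
\mathrm{Gini} &= 1 - \sum_k p_k^2 \\
\mathrm{JSD}(P, Q) &= \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q)
\end{align}
```

The third line explains the cross-entropy row: since KL is non-negative, H(P,Q) can never fall below H(P), and the two are equal exactly when Q = P.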

Common Patterns

Pattern 1: Decision Tree Split Selection

# Use entropy or Gini; pick one criterion and apply it to every candidate split
def choose_best_split(X, y, candidate_splits, criterion="entropy"):
    gain_fn = information_gain if criterion == "entropy" else gini_gain
    best_gain, best_split = -float('inf'), None
    for feature, threshold in candidate_splits:
        left, right = split(X, y, feature, threshold)
        gain = gain_fn(y, left, right)
        if gain > best_gain:
            best_gain = gain
            best_split = (feature, threshold)
    return best_split, best_gain
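A self-contained variant of this pattern, as a sketch under a few assumptions: `X` is a list of numeric rows, candidate thresholds are the observed feature values, a row goes left when its value is at most the threshold, and entropy is the criterion. The toy data is made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def choose_best_split(X, y):
    """Return ((feature_index, threshold), gain) maximizing information gain."""
    best_gain, best_split = -float("inf"), None
    n = len(y)
    parent_entropy = entropy(y)
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [label for row, label in zip(X, y) if row[feature] <= threshold]
            right = [label for row, label in zip(X, y) if row[feature] > threshold]
            if not left or not right:
                continue  # skip splits that leave one side empty
            gain = parent_entropy - (len(left) / n * entropy(left)
                                     + len(right) / n * entropy(right))
            if gain > best_gain:
                best_gain, best_split = gain, (feature, threshold)
    return best_split, best_gain

X = [[1.0, 5.0], [2.0, 3.0], [3.0, 5.0], [4.0, 3.0]]
y = [0, 0, 1, 1]
best_split, best_gain = choose_best_split(X, y)  # feature 0 separates the classes
```

Swapping in Gini only requires replacing `entropy` with an impurity function of the same signature.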

Pattern 2: Classification Loss

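# Cross-entropy is the default choice for classification.
# nn.CrossEntropyLoss expects raw logits (not softmax outputs) and class indices.
import torch.nn as nn
loss = nn.CrossEntropyLoss()(logits, true_labels)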
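PyTorch's `nn.CrossEntropyLoss` takes unnormalized logits because the softmax and the logarithm can then be fused into one numerically stable step. A pure-Python sketch of that idea for a single example, using the log-sum-exp trick; `cross_entropy_from_logits` is a hypothetical name and not the library's implementation:

```python
import math

def cross_entropy_from_logits(logits, label):
    """-log softmax(logits)[label], computed with the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

loss_small = cross_entropy_from_logits([2.0, 1.0, 0.1], 0)
loss_large = cross_entropy_from_logits([1000.0, 0.0, -1000.0], 0)  # naive exp(1000) would overflow
```

Applying softmax first and taking the log afterwards gives the same value on well-scaled inputs but fails on large logits, which is one reason to pass logits straight to the loss.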

Pattern 3: RLHF Regularization

# Use KL divergence
kl_penalty = beta * kl_divergence(policy_dist, reference_dist)
total_loss = policy_loss + kl_penalty

Pattern 4: Feature Selection

# Use mutual information
selected = []
for feature in features:
    mi = mutual_information(feature, target)
    if mi > threshold:
        selected.append(feature)