Topic 3: Evaluation Metrics

🔥 For interviews, read these first:

EVALUATION_METRICS_DEEP_DIVE.md — frontier-lab interview deep dive: classification (precision/recall/F1/AUROC/PR-AUC), regression (MSE/MAE/R²/quantile loss), ranking (MAP/NDCG/MRR), LLM-specific (PPL, pass@k, BLEU, LLM-as-judge biases), calibration (Brier/ECE/temperature scaling), Goodhart's Law and methodology pitfalls.

INTERVIEW_GRILL.md — 50 active-recall questions with strong answers.

What You'll Learn

This topic teaches you to implement all common evaluation metrics from scratch:

Classification metrics (Accuracy, Precision, Recall, F1)
Regression metrics (MSE, MAE, R²)
Ranking metrics (NDCG, MAP)
Theory and when to use each

Why We Need This

Interview Importance

Common question: "Implement precision/recall from scratch"
Understanding: Know what metrics mean
Application: Choose the right metric for the problem

Real-World Application

Model evaluation: Measure model performance
Problem-specific: Different problems need different metrics
Debugging: Understand model weaknesses

Industry Use Cases

1. Classification Metrics

Use Case: Binary/multi-class classification

Spam detection (precision matters: a false positive hides a legitimate email)
Medical diagnosis (recall matters: a false negative is a missed case)
Problems where both error types matter (F1 score)

2. Regression Metrics

Use Case: Continuous value prediction

House prices (MSE, MAE)
Model comparison (R²)

3. Ranking Metrics

Use Case: Search and recommendation systems

Search engines (NDCG)
Recommendations (MAP)

Core Intuition

Metrics are not just for reporting a number after training.

They define what "good" means for the problem.

That is why interviewers care so much about them: if you choose the wrong metric, you can optimize the wrong behavior.

Classification

For classification, different metrics care about different kinds of mistakes.

Accuracy treats all mistakes equally
Precision asks: when I predict positive, how often am I right?
Recall asks: among all actual positives, how many did I recover?
F1 balances precision and recall

Regression

For regression, the main question is how errors are penalized.

MSE punishes large errors more strongly
MAE treats errors linearly
R² measures variance explained relative to predicting the mean
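
The difference between squared and linear penalties is easiest to see on a toy case. The sketch below uses made-up targets and two hypothetical models (`pred_a`, `pred_b`) that are not from any real dataset: one is slightly wrong everywhere, the other is perfect except for a single large miss.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: squares each error before averaging."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: averages the absolute errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy targets (illustrative values only).
y_true = np.array([10.0, 10.0, 10.0, 10.0, 10.0])

# Hypothetical model A: off by 2 on every example.
pred_a = np.array([12.0, 12.0, 12.0, 12.0, 12.0])
# Hypothetical model B: exact on four examples, off by 7 on one.
pred_b = np.array([10.0, 10.0, 10.0, 10.0, 17.0])

mse_a, mae_a = mse(y_true, pred_a), mae(y_true, pred_a)
mse_b, mae_b = mse(y_true, pred_b), mae(y_true, pred_b)

print(f"A: MSE={mse_a:.2f} MAE={mae_a:.2f}")
print(f"B: MSE={mse_b:.2f} MAE={mae_b:.2f}")
```

On this data MSE ranks model A as better while MAE ranks model B as better, because squaring makes the single large miss dominate. Which ranking is "right" depends on how costly a rare large error is in the actual problem.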

Ranking

Ranking metrics care about order, not just set membership.

That is why search and recommendation systems need metrics like NDCG or MAP rather than plain classification accuracy.
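
To make the "order, not set membership" point concrete, here is a minimal sketch of average precision (AP) and its mean over queries (MAP). The function names are illustrative, the relevance labels are binary, and the sketch assumes every relevant item for a query appears somewhere in the ranked list (otherwise the denominator should be the total number of relevant items).

```python
def average_precision(relevance):
    """AP for one ranked list of binary labels (1 = relevant), best-ranked first."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits > 0 else 0.0

def mean_average_precision(ranked_lists):
    """MAP: mean of AP over several queries."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Same two relevant items, different order (toy data).
good_order = [1, 1, 0, 0, 0]
bad_order = [0, 0, 0, 1, 1]

ap_good = average_precision(good_order)
ap_bad = average_precision(bad_order)
map_score = mean_average_precision([good_order, bad_order])
print(ap_good, ap_bad, map_score)
```

Both lists contain the same set of relevant items, so a set-based metric cannot tell them apart, while AP drops sharply when the relevant items are pushed to the bottom.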

Technical Details That Commonly Get Missed

Accuracy Can Be Misleading

If positives are rare, accuracy can look great even for a useless model.

Example:

99% negative data
always predict negative
99% accuracy
terrible recall for the positive class
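
The example above can be reproduced in a few lines. This is a sketch on synthetic labels (10 positives out of 1000, chosen only to give a 99% negative rate), with a "model" that always predicts the negative class.

```python
import numpy as np

# Synthetic labels: 1% positive, 99% negative.
y_true = np.array([1] * 10 + [0] * 990)
# Degenerate classifier: always predict negative.
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_true == y_pred))

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.99 while recall on the positive class is 0.0, which is why accuracy alone should not be reported on heavily imbalanced data.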

Precision vs Recall Trade-Off

You often improve one at the expense of the other by changing the threshold.

That means the metric is not just about the model. It is also about:

threshold choice
business cost
tolerance for false positives vs false negatives
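
A small threshold sweep shows the trade-off directly. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Binarize scores at `threshold`, then compute precision and recall."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Toy labels and model scores (illustrative values only).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

for t in (0.3, 0.5, 0.85):
    p, r = precision_recall_at(y_true, scores, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold moves precision up and recall down. Recall can never increase as the threshold rises, but precision is not guaranteed to increase at every step, so the whole curve is worth inspecting before fixing a threshold.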

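R² Edge Case

R² = 1 - SS_res / SS_tot

Important edge case:

if SS_tot = 0, the target has no variance
then the ratio divides by zero and R² is undefined; an implementation has to pick a convention (the r2_score code later in this topic returns 0.0)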
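
One possible way to make the edge case explicit is to return NaN instead of silently picking a number. The sketch below is an alternative convention, not the one used by the boilerplate `r2_score` in this topic (which returns 0.0), and `r2_score_safe` is a hypothetical name. It also shows that R² can be negative when predictions are worse than predicting the mean.

```python
import numpy as np

def r2_score_safe(y_true, y_pred):
    """R² that returns NaN when the target is constant (SS_tot = 0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # A constant target has max == min; this avoids relying on
    # floating-point SS_tot being exactly zero.
    if np.max(y_true) == np.min(y_true):
        return float("nan")
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

constant_case = r2_score_safe([5.0, 5.0, 5.0], [5.0, 5.0, 5.0])
perfect_case = r2_score_safe([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
reversed_case = r2_score_safe([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(constant_case, perfect_case, reversed_case)
```

Returning NaN forces the caller to notice the degenerate evaluation set; returning 0.0 or 1.0 is also defensible as long as the convention is documented.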

Ranking Metrics Need Position Sensitivity

NDCG is useful because relevant items near the top matter more than relevant items buried lower in the list.

That is usually what you want in retrieval and recommendation.
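
A minimal NDCG sketch makes the position discount visible. It assumes the linear-gain form, rel_i / log2(i + 1) with ranks starting at 1; another common variant uses (2^rel - 1) as the gain. The function names and the relevance grades are illustrative.

```python
import numpy as np

def dcg(relevance, k=None):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    rel = np.asarray(relevance, dtype=float)[:k]
    # Rank i (1-based) is discounted by log2(i + 1): 1, 1.585, 2, ...
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg(relevance, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevance, reverse=True), k)
    return dcg(relevance, k) / ideal if ideal > 0 else 0.0

# Same graded relevance labels, different positions (toy data).
relevant_on_top = [3, 2, 0, 0]
relevant_buried = [0, 0, 2, 3]

ndcg_top = ndcg(relevant_on_top)
ndcg_buried = ndcg(relevant_buried)
print(f"top={ndcg_top:.3f} buried={ndcg_buried:.3f}")
```

The first list is already in ideal order and scores 1.0; the second contains the same items but scores roughly 0.54 because the relevant results sit at ranks 3 and 4.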

Common Failure Modes

using accuracy for heavy class imbalance
reporting F1 without stating the decision threshold
comparing regression metrics across differently scaled targets without context
using perplexity or loss as if it directly captured downstream usefulness
forgetting confidence intervals for small evaluation sets
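
For the last point, a percentile bootstrap is one simple way to attach an interval to a metric on a small evaluation set. The sketch below is a minimal version under the assumption that examples are independent; `bootstrap_ci` and its parameters are hypothetical names, and the labels are toy data.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        # Resample example indices with replacement.
        idx = rng.integers(0, n, size=n)
        stats[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Toy evaluation set with 20 examples and 3 errors.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1])

point = accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy={point:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only 20 examples the interval is wide, which is the point: a difference of a few accuracy points between two models on a set this small is usually not meaningful.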

Edge Cases and Follow-Up Questions

What if the positive class is only 0.1%?
What if false negatives are much more costly than false positives?
Why can precision rise when recall falls?
Why might two models have similar accuracy but very different usefulness?
Why is NDCG better than accuracy for search ranking?

What to Practice Saying Out Loud

Why metric choice is really objective choice
Why threshold matters for classification metrics
Why MSE and MAE can disagree about which model is better
Why ranking metrics need position sensitivity

Industry-Standard Boilerplate Code

Classification Metrics (NumPy)

"""
Classification Metrics from Scratch
Interview question: "Implement precision, recall, F1"
"""
import numpy as np

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Accuracy: (TP + TN) / (TP + TN + FP + FN)"""
    return np.mean(y_true == y_pred)

def precision(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Precision: TP / (TP + FP)"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Recall: TP / (TP + FN)"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 Score: 2 * (Precision * Recall) / (Precision + Recall)"""
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec) if (prec + rec) > 0 else 0.0

def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Confusion matrix: cm[i, j] counts samples of true class i predicted as class j (classes in sorted order)."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    n_classes = len(classes)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    
    for i, true_class in enumerate(classes):
        for j, pred_class in enumerate(classes):
            cm[i, j] = np.sum((y_true == true_class) & (y_pred == pred_class))
    
    return cm

Regression Metrics (NumPy)

"""
Regression Metrics from Scratch
"""
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Squared Error"""
    return np.mean((y_true - y_pred)**2)

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error"""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root Mean Squared Error"""
    return np.sqrt(mse(y_true, y_pred))

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R² Score: 1 - (SS_res / SS_tot); returns 0.0 by convention when SS_tot is 0 (constant target)"""
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - (ss_res / ss_tot) if ss_tot > 0 else 0.0

Theory

When to Use Which Metric

Classification:

Accuracy: Balanced classes
Precision: When false positives are costly
Recall: When false negatives are costly
F1: Balance between precision and recall

Regression:

MSE: Penalizes large errors more
MAE: Penalizes errors linearly, so it is more robust to outliers than MSE
R²: Proportion of variance explained

Exercises

Implement multi-class metrics
Implement weighted metrics
Calculate metrics from confusion matrix
Compare different metrics

Perplexity: Detailed Guide

New Comprehensive Content:

perplexity_detailed.md: Complete theoretical guide
- What is perplexity and intuitive understanding
- Mathematical formulations
- Connection to entropy and cross-entropy
- Interpretation and typical values
- Computing perplexity step-by-step
- Perplexity variants (word, character, byte-level)
- Perplexity in practice (training, evaluation, comparison)
- Limitations and best practices
- Related concepts (entropy, KL divergence, bits per token)
- Applications
perplexity_code.py: Complete implementations
- Basic perplexity computation
- Perplexity from logits
- Language model perplexity
- Per-token perplexity
- Character-level perplexity
- Bits per token
- Normalized perplexity
- Model comparison utilities

Key Concepts:

Perplexity = exp(average negative log-likelihood)
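Lower perplexity = better fit to the evaluation text (comparable only across models that share a tokenizer and test set)
Typical values: often quoted as 10-50 for good language models, but the number depends heavily on tokenizer, vocabulary size, and dataset
Connection: PP = 2^H when the cross-entropy H is measured in bits (log base 2); PP = e^H when H is in nats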
BPT = log₂(PP) (bits per token)
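
The relationships above can be checked numerically. This is a minimal sketch, assuming the input is a list of natural-log probabilities that a model assigned to each observed token; the function names are illustrative and are not taken from perplexity_code.py.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

def bits_per_token(token_log_probs):
    """Average negative log-likelihood converted from nats to bits."""
    avg_nll_nats = -sum(token_log_probs) / len(token_log_probs)
    return avg_nll_nats / math.log(2)

# Toy case: the model assigns probability 0.25 to every observed token,
# i.e. it is as uncertain as a uniform choice among 4 tokens.
log_probs = [math.log(0.25)] * 8

pp = perplexity(log_probs)
bpt = bits_per_token(log_probs)
print(f"perplexity={pp:.2f} bits_per_token={bpt:.2f}")
```

A model that is uniformly unsure among 4 options has perplexity 4 and 2 bits per token, and 2 raised to the bits per token recovers the perplexity.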

Next Steps

Topic 4: Transformers
Topic 5: Attention mechanisms

ML & LLM Interview Prep — Deep Dives