Topic 23: Clustering Evaluation

🔥 For interviews, read these first:

CLUSTERING_EVALUATION_DEEP_DIVE.md — frontier-lab deep dive: internal metrics (silhouette, Davies-Bouldin, Calinski-Harabasz, Dunn), external metrics (ARI, NMI, V-measure, purity, pairwise F), choosing $K$ (elbow, silhouette, gap statistic, stability), bootstrap stability validation, common pitfalls.

INTERVIEW_GRILL.md — 45 active-recall questions.

What You'll Learn

This topic teaches you how to evaluate clustering:

Silhouette Score
Inertia (Within-cluster sum of squares)
Adjusted Rand Index (ARI)
Normalized Mutual Information (NMI)
Simple implementations

Why We Need This

Interview Importance

Common question: "How do you evaluate clustering?"
Practical knowledge: Need to measure quality
Problem-solving: Choose right metric

Real-World Application

Model selection: Choose best clustering
Parameter tuning: Find optimal k
Quality assessment: Measure cluster quality

Industry Use Cases

1. Silhouette Score

Use Case: General clustering evaluation

Works without ground truth
Measures cohesion and separation
Range: -1 to 1 (higher is better)

2. ARI (Adjusted Rand Index)

Use Case: When ground truth available

Compares to true labels
Adjusted for chance
Range: -1 to 1 (higher is better)

3. NMI (Normalized Mutual Information)

Use Case: Information-theoretic measure

Measures shared information
Normalized to [0, 1]
Higher is better

Core Intuition

Clustering is unsupervised, which makes evaluation harder than standard classification.

The main question is:

what does "good clustering" even mean?

Different metrics answer different versions of that question.

Silhouette Score

Silhouette measures two things at once:

how close a point is to its own cluster
how far it is from the nearest competing cluster

That makes it useful when you do not have labels.

Inertia

Inertia measures compactness inside clusters.

It is useful, but it has an important limitation:

inertia almost always decreases as you increase k

So it cannot be interpreted alone without comparing model complexity.

ARI and NMI

When ground-truth labels exist, ARI and NMI compare the clustering to a reference labeling.

That makes them external evaluation metrics rather than purely intrinsic clustering scores.

Technical Details Interviewers Often Want

Why Silhouette Is Not Always Enough

A clustering can have reasonable silhouette and still be wrong for the business or scientific task.

Why?

clusters may be geometrically neat but semantically useless
some data structures are not well captured by distance-based cohesion/separation

Why Inertia Needs Context

Since inertia usually decreases with more clusters, a lower inertia does not automatically mean a better clustering.

This is why the elbow method is heuristic rather than a theorem.

External vs Internal Metrics

This is a useful interview distinction:

internal metrics use only the clustering and geometry
external metrics compare against ground-truth labels

Common Failure Modes

treating inertia as a standalone model-selection criterion
interpreting a high clustering metric as proof of business usefulness
comparing clusterings without checking whether the metric matches the use case
forgetting that some metrics need ground truth and some do not

Edge Cases and Follow-Up Questions

Why does inertia almost always improve when k increases?
Why can silhouette be misleading for irregular cluster shapes?
When should you use ARI or NMI instead of silhouette?
Why is cluster evaluation more ambiguous than classification evaluation?
Why is the elbow method only a heuristic?

What to Practice Saying Out Loud

The difference between intrinsic and extrinsic clustering metrics
Why lower inertia is not automatically better
Why clustering quality depends on the task definition

Industry-Standard Boilerplate Code

Silhouette Score

"""
Silhouette Score: Measure clustering quality
"""
import numpy as np

def silhouette_score(X: np.ndarray, labels: np.ndarray) -> float:
    """
    Silhouette Score
    
    For each point:
    - a(i) = average distance to points in same cluster
    - b(i) = average distance to points in nearest other cluster
    - s(i) = (b(i) - a(i)) / max(a(i), b(i))
    
    Returns: Average silhouette score
    """
    n_samples = X.shape[0]
    unique_labels = np.unique(labels)
    n_clusters = len(unique_labels)
    
    if n_clusters == 1:
        return 0.0
    
    silhouette_scores = []
    
    for i in range(n_samples):
        # Distance to points in same cluster
        same_cluster = X[labels == labels[i]]
        if len(same_cluster) > 1:
            a_i = np.mean([np.linalg.norm(X[i] - x) for x in same_cluster if not np.array_equal(X[i], x)])
        else:
            a_i = 0.0
        
        # Distance to nearest other cluster
        min_b_i = float('inf')
        for label in unique_labels:
            if label != labels[i]:
                other_cluster = X[labels == label]
                b_i = np.mean([np.linalg.norm(X[i] - x) for x in other_cluster])
                min_b_i = min(min_b_i, b_i)
        
        b_i = min_b_i
        
        # Silhouette score for this point
        if max(a_i, b_i) > 0:
            s_i = (b_i - a_i) / max(a_i, b_i)
        else:
            s_i = 0.0
        
        silhouette_scores.append(s_i)
    
    return np.mean(silhouette_scores)

Inertia (WCSS)

"""
Inertia: Within-cluster sum of squares
Lower is better
"""
def inertia(X: np.ndarray, labels: np.ndarray) -> float:
    """
    Inertia = Sum of squared distances to centroids
    """
    unique_labels = np.unique(labels)
    total_inertia = 0.0
    
    for label in unique_labels:
        cluster_points = X[labels == label]
        centroid = np.mean(cluster_points, axis=0)
        
        for point in cluster_points:
            total_inertia += np.linalg.norm(point - centroid)**2
    
    return total_inertia

Adjusted Rand Index

"""
ARI: Adjusted Rand Index
Compares clustering to ground truth
"""
def adjusted_rand_index(labels_true: np.ndarray, 
                       labels_pred: np.ndarray) -> float:
    """
    ARI: Measures agreement between two clusterings
    
    Adjusted for chance (random clustering gets ~0)
    Range: -1 to 1 (1 = perfect match)
    """
    # Contingency table
    n_samples = len(labels_true)
    unique_true = np.unique(labels_true)
    unique_pred = np.unique(labels_pred)
    
    contingency = np.zeros((len(unique_true), len(unique_pred)))
    for i, true_label in enumerate(unique_true):
        for j, pred_label in enumerate(unique_pred):
            contingency[i, j] = np.sum((labels_true == true_label) & 
                                     (labels_pred == pred_label))
    
    # Sum over contingency table
    sum_combinations = np.sum([np.sum(contingency[i]) * (np.sum(contingency[i]) - 1) / 2
                              for i in range(len(unique_true))])
    
    sum_pred = np.sum([np.sum(contingency[:, j]) * (np.sum(contingency[:, j]) - 1) / 2
                      for j in range(len(unique_pred))])
    
    sum_true = np.sum([contingency[i, j] * (contingency[i, j] - 1) / 2
                      for i in range(len(unique_true))
                      for j in range(len(unique_pred))])
    
    # ARI formula
    n_choose_2 = n_samples * (n_samples - 1) / 2
    expected_index = sum_combinations * sum_pred / n_choose_2
    max_index = (sum_combinations + sum_pred) / 2
    
    if max_index - expected_index == 0:
        return 0.0
    
    ari = (sum_true - expected_index) / (max_index - expected_index)
    return ari

Theory

When to Use Which Metric

Metric	Ground Truth Needed	Use Case
Silhouette	No	General evaluation
Inertia	No	Compare different k
ARI	Yes	Compare to true labels
NMI	Yes	Information-theoretic

Choosing k (Number of Clusters)

Elbow Method: Plot inertia vs k, find elbow
Silhouette: Choose k with highest silhouette
Domain knowledge: Use business requirements

Exercises

Implement all metrics
Evaluate different clusterings
Find optimal k
Compare metrics

Next Steps

Topic 24: Linear algebra Q&A
Topic 25: Final review

ML & LLM Interview Prep — Deep Dives