Clustering Evaluation — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

Clustering evaluation is uniquely hard because there's often no ground truth. This deep dive covers internal metrics (don't need labels), external metrics (require labels), choosing , stability analysis, and the practical principle that downstream task performance trumps any clustering metric.


1. Internal evaluation metrics

Use only the data and the cluster assignments — no labels needed.

Silhouette score

For each point :

  • = mean intra-cluster distance.
  • = mean distance to nearest other cluster.
  • .

Range . Mean over all points = silhouette score.

Interpretation:

  • : well-separated clusters.
  • : ambiguous, points on cluster boundaries.
  • : misclassified (closer to other cluster than own).

When: convex, well-separated clusters; meaningful Euclidean distances.

Davies-Bouldin index

where = average distance of cluster- points to centroid .

Lower = better. Penalizes clusters that are spread out and close to others.

Calinski-Harabasz (variance ratio)

where = between-cluster scatter matrix, = within-cluster scatter. Higher = better.

Like F-statistic for clustering. Strong when clusters are roughly equal-sized and globular.

Dunn index

Ratio of minimum inter-cluster distance to maximum intra-cluster diameter. Higher = better. Sensitive to outliers.

Limitations of internal metrics

  • Reward compactness + separation, but those don't always match what the task needs.
  • Globular bias: K-means + silhouette favor spherical clusters even when data has elongated shape.
  • Can't tell you the "right" number of clusters in any absolute sense.

2. External evaluation metrics (with ground truth labels)

When you have ground-truth class labels, you can compare cluster assignments to them.

Adjusted Rand Index (ARI)

Rand Index: fraction of pairs (i, j) that are clustered consistently (both same cluster + same class, or both different):

ARI corrects for chance agreement:

Range . 1 = perfect; 0 = chance; negative = worse than chance.

Normalized Mutual Information (NMI)

Mutual information between clustering and labels :

Normalized:

Range . Symmetric. Doesn't penalize having more or fewer clusters than classes.

V-measure

Harmonic mean of homogeneity and completeness:

  • Homogeneity: each cluster contains only one class. .
  • Completeness: each class is contained in one cluster. .
  • V-measure: .

Purity

For each cluster, count majority label. Sum / . Simple but biased toward many small clusters.

Pairwise F-measure

Compute precision/recall over pairs (do they belong to the same cluster, same class).


3. Choosing the number of clusters

A famously underdetermined problem.

Elbow method

Plot WCSS (within-cluster sum of squares) vs . Look for "elbow" where adding clusters stops helping much.

Issues: subjective; often no clear elbow; for very different cluster sizes can mislead.

Silhouette method

Compute silhouette for various . Pick with maximum.

More principled than elbow. Doesn't always give a clear answer.

Gap statistic (Tibshirani et al. 2001)

Compare WCSS to WCSS expected under uniform reference distribution. Pick where gap is largest.

Statistically grounded. Computationally expensive (need many reference samplings).

Stability-based

Run clustering on bootstrap subsamples; compare assignments. Stable → consistent clusters across samples.

Information criteria

For mixture models (GMM): BIC, AIC.

Practical answer

  • Start with domain knowledge if available.
  • Try multiple ; visualize.
  • Validate downstream task — clustering isn't an end, it's a means.

4. Stability analysis

Beyond just picking : are clusters meaningful or just an artifact of the algorithm + initialization?

Bootstrap stability

  • Resample data; rerun clustering.
  • Measure consistency: ARI between bootstrap clustering and original.
  • Stable clusters: high ARI across resamples.

Initialization stability

  • Re-run K-means with different seeds.
  • Variance in resulting clusters → solution depends on init.

Visualization

  • Project to 2D (PCA, t-SNE, UMAP).
  • Visually inspect cluster structure.

If clusters wildly different across runs / bootstraps, your "clusters" may be noise.


5. Common pitfalls

Comparing across algorithms with different

Internal metrics depend on . Different algorithms returning different can't be fairly compared.

Using internal metric to choose for the wrong algorithm

Silhouette favors compact, well-separated clusters. K-means produces those by construction. Selecting via silhouette + K-means is partly tautological.

Ignoring outliers

Some clustering methods (DBSCAN) explicitly mark outliers; others (K-means) absorb them. Affects all metrics.

Forgetting evaluation has hyperparameters

"Cluster quality" metric can favor specific cluster shapes. Match metric to expected cluster geometry.

Not validating downstream

If clustering is for a downstream task (segmentation, anomaly detection, recommendation), evaluate via that task's success metric — not clustering metrics.


6. Common interview gotchas

QuestionCommon wrong answerRight answer
Silhouette range? — negative when cluster assignment is wrong
ARI vs NMI?SameARI: pair-based, corrects for chance. NMI: information-theoretic, doesn't penalize K mismatch
How to choose ?ElbowMultiple methods; ultimately downstream-task validation
Internal metric guarantees clustering quality?YesNo — favors specific geometries; matches algorithm bias
Purity seems high — done?YesTrivially high with many small clusters; check completeness too
ARI = 0.5 — good?YesDepends; far from random (0) but far from perfect (1); context matters
Run K-means once and trust?SureInit-sensitive; use k-means++ + multiple runs + best-of

7. Eight most-asked interview questions

  1. What metrics evaluate clustering without labels? (Silhouette, Davies-Bouldin, Calinski-Harabasz, Dunn.)
  2. What metrics need labels? (ARI, NMI, V-measure, purity, pairwise F.)
  3. Why is choosing hard? (Underdetermined; no objective best; methods give different answers.)
  4. Walk me through silhouette. ( vs ; range ; meaningful for compact globular clusters.)
  5. ARI vs NMI? (ARI pair-based with chance correction; NMI info-theoretic; different intuitions.)
  6. Why does internal metric favor K-means style clusters? (Both reward compactness + separation; tautological.)
  7. How would you sanity-check a clustering result? (Visualization, bootstrap stability, downstream task validation.)
  8. When clustering doesn't match labels, what's wrong? (Could be: labels noisy, clustering uses different similarity, labels don't reflect natural clusters.)

8. Drill plan

  • For each internal metric, recite formula + when it works.
  • For each external metric, recite formula + interpretation.
  • Recite 3 methods to choose + their failure modes.
  • Sketch how bootstrap stability validates a clustering.
  • Practice 2 cases: customer segmentation, image clustering — describe full evaluation strategy.

9. Further reading

  • Halkidi, Batistakis, Vazirgiannis (2001), On Clustering Validation Techniques.
  • Vinh, Epps, Bailey (2010), Information Theoretic Measures for Clusterings Comparison. — NMI variants.
  • Tibshirani, Walther, Hastie (2001), Estimating the number of clusters in a data set via the gap statistic.
  • Hubert & Arabie (1985), Comparing partitions — original ARI paper.