Clustering: A Frontier-Lab Interview Deep Dive
Why this exists. Clustering is the canonical unsupervised learning task. Interviewers probe: K-means assumptions and failure modes, why DBSCAN handles non-convex shapes, how GMM relates to K-means, the EM algorithm, evaluation challenges. Strong candidates can derive K-means as coordinate descent on a quadratic objective and explain when each method fits.
1. The clustering taxonomy
| Method | Approach | Strength | Weakness |
|---|---|---|---|
| K-means | Partition into spherical clusters minimizing intra-cluster variance | Fast, scalable, simple | Requires known; assumes spherical clusters |
| Gaussian Mixture (GMM) | Soft K-means with covariance per cluster | Captures elliptical clusters; soft assignments | Requires known; sensitive to init |
| Hierarchical | Agglomerative tree of merges | No need to specify a priori; produces dendrogram | – memory/compute |
| DBSCAN | Density-based: clusters = dense regions | Handles arbitrary shapes; finds outliers | Sensitive to ; struggles with varying density |
| Spectral | Cluster via eigendecomposition of similarity graph | Handles non-convex shapes; theoretically principled | eigendecomposition |
| HDBSCAN | Hierarchical density-based | DBSCAN without tuning; varying density | Complex implementation |
There's no universally best clustering method. The right choice depends on cluster shape, density, scale, and whether is known.
2. K-means
The workhorse. Most-asked clustering algorithm in interviews.
The algorithm
Given clusters and data :
- Initialize centroids (e.g., k-means++).
- Assignment step: each point joins the nearest centroid: .
- Update step: each centroid moves to the mean of its assigned points: .
- Repeat 2–3 until assignments don't change.
The objective
K-means minimizes within-cluster sum of squares (WCSS):
Why it converges
Both steps decrease the objective:
- Assignment step: reassigning to nearest centroid can only decrease per-point distances (or keep equal).
- Update step: setting centroid to the mean is the closed-form optimum given the assignments.
Since the objective is bounded below and decreases monotonically, K-means converges (to a local minimum, not necessarily global).
K-means as coordinate descent
K-means is coordinate descent on the WCSS objective: alternately optimize over (assignments) holding fixed, and over holding fixed. Both subproblems have closed-form solutions. This is why it converges and why it's so fast.
Initialization: k-means++
Random initialization can produce bad local minima. k-means++ (Arthur & Vassilvitskii 2007) initializes centroids spread out:
- Pick first centroid uniformly at random.
- For each subsequent centroid, pick a point with probability proportional to — far from existing centroids.
Provides -approximation guarantees and dramatically improves convergence empirically. Default in sklearn.
Choosing K
- Elbow method: plot WCSS vs ; find the "elbow" where additional clusters give diminishing returns. Ad-hoc.
- Silhouette score: average of where = mean distance to own cluster, = mean distance to nearest other cluster. Range . Higher = better clustering.
- Gap statistic: compare WCSS to expected WCSS under a reference null distribution. More principled.
- Domain knowledge: often the best answer.
K-means failure modes
- Non-spherical clusters: K-means uses Euclidean distance, prefers spherical clusters. Fails on elongated, curved, or nested clusters.
- Different cluster sizes: K-means tends to balance cluster sizes (centroid is "pulled" toward more data).
- Different cluster densities: high-density clusters dominate; low-density ones may be split.
- Outliers: pull centroids toward them.
- Local minima: bad initialization → wrong clustering. Mitigate with k-means++ and multiple restarts.
Mini-batch K-means
For large data: sample mini-batches; update centroids incrementally. Trades some quality for scalability. Used for clustering millions of samples.
3. Gaussian Mixture Models (GMM)
K-means with covariance. Each cluster is a Gaussian; data is a weighted mixture.
The model
Parameters: mixture weights , means , covariances . Soft assignments: each point belongs partially to each cluster.
EM algorithm for GMM
E-step: compute posterior responsibilities (soft assignments):
M-step: update parameters using the responsibilities as weights:
Iterate until convergence. EM monotonically increases the log-likelihood.
Why EM? Why not just MLE?
The MLE for a mixture has no closed form (the log-sum is intractable). EM is a tractable alternative that monotonically increases a lower bound on the log-likelihood (the ELBO).
K-means as a degenerate GMM
If for all , mixing weights are equal, and , GMM's soft assignments become hard (the closest cluster gets ), and EM reduces to K-means. So K-means is GMM with shared spherical covariance, equal mixing weights, and hard assignments.
Covariance choices
- Spherical: — like K-means with per-cluster scale.
- Diagonal: — axis-aligned ellipses.
- Full: arbitrary — full ellipsoidal clusters. Most expressive; needs most data per cluster to estimate reliably.
When GMM beats K-means
- Elliptical (non-spherical) clusters.
- Soft assignments are useful (uncertainty quantification).
- Probabilistic interpretation needed.
Failure modes
- Singular covariances: a cluster with very few points can shrink to near-zero, blowing up likelihood. Fix: regularization, minimum eigenvalue constraints.
- Local minima: like K-means, EM converges to a local optimum.
4. DBSCAN
Density-Based Spatial Clustering of Applications with Noise. Ester et al. 1996.
The idea
Clusters are dense regions of points; sparse regions are noise. Two parameters:
- : radius for neighborhood.
min_samples: minimum points in -neighborhood for a "core" point.
Definitions
- Core point: has
min_samplesneighbors within . - Border point: not a core point, but in the -neighborhood of one.
- Noise: neither core nor border.
- Density-connected: chain of core points within of each other.
A cluster = maximal set of density-connected points.
The algorithm
For each unvisited point:
- If core: start a new cluster; add all density-connected points (BFS/DFS).
- If border: assign to a neighboring cluster (or noise if no core neighbors).
- If noise: leave unassigned.
Strengths
- Arbitrary shapes: can find non-convex clusters (concentric circles, S-curves, etc.).
- Noise detection: outliers explicitly identified.
- No need to specify : discovers number of clusters from data.
Weaknesses
- Sensitive to : too small → many noise points; too large → clusters merge.
- Varying density: a single doesn't fit clusters with different densities.
- High dimensions: distances become uniform; becomes meaningless. Curse of dimensionality.
Choosing
K-distance plot: for each point, compute distance to its -th nearest neighbor; sort; plot. The "knee" is a good . With min_samples = k.
HDBSCAN
Hierarchical DBSCAN. Removes the parameter by computing cluster stability across all density levels. Better for varying-density data. Slower but more robust.
5. Hierarchical clustering
Build a tree of clusters by merging or splitting.
Agglomerative (bottom-up)
- Start with each point as its own cluster.
- Merge the two closest clusters.
- Repeat until one cluster remains.
Result: a dendrogram. Cut at any height to get a clustering with that many clusters.
Linkage criteria
How to measure distance between clusters:
- Single linkage: min distance between any pair. Produces "chaining" — long, thin clusters.
- Complete linkage: max distance between any pair. Produces compact, spherical clusters.
- Average linkage: mean distance between pairs. Compromise.
- Ward's linkage: minimize within-cluster variance increase. Most common; produces well-separated clusters.
Pros
- No need to specify in advance — examine the dendrogram.
- Hierarchy is interpretable.
- Deterministic (given linkage and distance).
Cons
- memory (distance matrix), naive algorithm. Limits to .
- Greedy: early bad merges propagate.
- Sensitive to noise.
Divisive (top-down)
Less common. Start with one cluster; recursively split.
6. Spectral clustering
Cluster using the eigenstructure of a similarity graph.
The recipe
- Build similarity graph (e.g., Gaussian kernel of distances).
- Compute graph Laplacian (or normalized).
- Eigendecompose ; take bottom eigenvectors.
- Cluster the eigenvectors (typically with K-means).
Why it works
The eigenvectors of correspond to "smooth" functions on the graph. The first eigenvectors approximately indicate cluster membership. Especially good for non-convex shapes.
Pros
- Handles non-convex clusters (where K-means fails).
- Theoretically grounded (graph Laplacian theory).
Cons
- eigendecomposition. Hard at scale.
- Choice of similarity function and number of nearest neighbors matters.
7. Evaluation of clustering
Hard, because there's no ground truth.
Internal metrics
Use only the data and the clustering, no labels.
Silhouette coefficient: . Range . Higher = better separation.
Davies-Bouldin index: average of cluster-pair similarities. Lower = better.
Calinski-Harabasz index: ratio of between-cluster to within-cluster variance. Higher = better.
External metrics
Require ground-truth labels (when available).
Adjusted Rand Index (ARI): counts pairs that are in the same/different clusters in both predictions and labels, adjusted for chance. Range .
Normalized Mutual Information (NMI): . Information-theoretic; .
V-measure: harmonic mean of homogeneity (each cluster contains samples of one class) and completeness (each class is in one cluster).
Why this is hard
Clustering is task-dependent: the "right" clustering depends on what you'll do with it. Internal metrics measure compactness/separation but may not align with downstream utility. Best practice: evaluate on a downstream task, not just clustering metrics.
8. The curse of dimensionality
In high-dimensional spaces, all pairwise distances become similar. Clustering relies on distance-based grouping, so high-dim data is hard.
Symptoms
- All clusters look "equally far" from any query.
- Density (DBSCAN) becomes meaningless.
- K-means converges to weird, near-uniform partitions.
Mitigations
- Dimensionality reduction first: PCA, UMAP, or autoencoder embeddings. Then cluster in the reduced space.
- Domain-specific kernels: cosine for text, perceptual distances for images.
- Use the right method: GMM on PCA-reduced embeddings is a strong default.
9. Common interview gotchas
| Gotcha | Strong answer |
|---|---|
| "Why does K-means converge?" | Both steps decrease the WCSS; objective is bounded below; coordinate descent on a quadratic. Converges to a local min. |
| "K-means vs GMM?" | K-means: hard assignments, spherical clusters. GMM: soft assignments via EM, elliptical clusters. K-means is a degenerate GMM (shared spherical , ). |
| "Why use k-means++?" | Spread initial centroids; -approximation; avoids bad local minima from random init. |
| "How do you choose ?" | Elbow on WCSS, silhouette, gap statistic, or domain knowledge. There's no universal answer. |
| "When does DBSCAN beat K-means?" | Non-convex/arbitrary-shape clusters, when noise/outliers must be detected, when is unknown. |
| "DBSCAN's main weakness?" | Sensitive to . Varying-density clusters need different — single value can't fit both. HDBSCAN fixes this. |
| "Why does spectral clustering work on non-convex shapes?" | Operates on graph eigenstructure, not Euclidean distance. Captures connectivity rather than centroid distance. |
| "How do you evaluate clustering without labels?" | Silhouette, Davies-Bouldin, Calinski-Harabasz. None is a perfect substitute for downstream task evaluation. |
| "Curse of dimensionality?" | High-dim distances are uniform; clustering breaks down. Reduce dimensionality first or use domain-aware similarity. |
10. The 8 most-asked clustering interview questions
- Walk me through K-means. Initialize → assign points to nearest centroid → update centroid to cluster mean → repeat. Coordinate descent on WCSS.
- Why does K-means converge? Both steps decrease objective; bounded below; coordinate descent.
- K-means vs GMM? Hard vs soft assignments; spherical vs elliptical clusters; K-means is a degenerate GMM.
- Walk through EM for GMM. E-step: posterior responsibilities. M-step: weighted MLE updates of . Iterate.
- DBSCAN — how does it work? Core points (dense neighborhoods) + density-connected expansion. Discovers clusters and noise.
- Spectral clustering? Eigendecompose graph Laplacian; bottom eigenvectors are cluster indicators; cluster the eigenvectors.
- How do you choose ? Elbow method, silhouette, gap statistic, or domain knowledge.
- How do you evaluate clustering? Internal (silhouette, DB, CH) without labels; external (ARI, NMI) with ground truth; downstream task is best.
11. Drill plan
- Master K-means as coordinate descent on WCSS.
- Walk through GMM EM end-to-end.
- Know DBSCAN's core/border/noise classification.
- Know spectral clustering's graph Laplacian basis.
- Drill
INTERVIEW_GRILL.md.
12. Further reading
- Lloyd, "Least Squares Quantization in PCM" (K-means, 1957/1982).
- Arthur & Vassilvitskii, "k-means++: The Advantages of Careful Seeding" (2007).
- Dempster, Laird, Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977).
- Ester et al., "DBSCAN" (1996).
- von Luxburg, "A Tutorial on Spectral Clustering" (2007).
- McInnes & Healy, "HDBSCAN" (2017).