Kernel Methods: A Frontier-Lab Interview Deep Dive
Why this exists. Kernels are still asked in classical ML interviews, especially for SVM-related questions. The kernel trick is one of the most beautiful results in classical ML and a common interview probe ("why does the dual formulation enable kernels?"). Understanding kernels also clarifies what attention is doing — the attention is essentially a learned kernel.
1. The kernel trick in one sentence
A kernel function computes an inner product in a (possibly infinite-dimensional) feature space without ever computing the features explicitly.
That sentence is the entire content of kernel methods. Once you have it, everything else follows.
2. Why kernels exist
The motivation
Linear methods (linear regression, logistic regression, linear SVM) can only learn linear decision boundaries. Real data is often non-linear. Two options:
- Hand-engineer features. Add , etc. Then apply a linear method.
- Use kernels. Implicitly map in some high-dimensional space, then apply the linear method there.
Kernels become useful when the feature mapping is high-dimensional or infinite-dimensional — explicit computation of is intractable, but might be cheap.
The trick
Many algorithms can be written so they only ever access the data through inner products . Replace with and you have a non-linear version of the algorithm operating in -space — without ever computing .
This is the kernel trick.
3. Common kernels
Linear kernel
Just the dot product. No mapping. Equivalent to no kernel.
Polynomial kernel
Implicitly maps to a feature space containing all monomials up to degree . For : includes products like .
RBF (Gaussian) kernel
The most popular kernel. Implicitly maps to an infinite-dimensional feature space. controls bandwidth: small → smooth, large → wiggly.
Sigmoid kernel
Inspired by neural networks. Not always positive semi-definite (Mercer's condition might fail) but used in practice.
Cosine kernel
Normalized inner product. Used heavily in NLP/IR.
4. Mercer's theorem
For a function to be a valid kernel (i.e., correspond to some inner product in some feature space):
- Symmetric: .
- Positive semi-definite: for any finite set , the kernel matrix has all eigenvalues .
Equivalently: there exists a feature map such that .
This is Mercer's theorem. It's why we can use kernels without ever computing — the PSD condition guarantees an implicit feature space exists.
5. Kernel SVM: the canonical application
Linear SVM optimizes:
The dual formulation (using Lagrange multipliers ):
subject to , . Decision function:
The dual only accesses data through inner products. Replace with :
Boom — non-linear SVM in the kernel feature space, computed without ever materializing .
Support vectors
Most end up at 0 in the dual. The non-zero correspond to support vectors — points on or inside the margin. Decision function only depends on support vectors:
Cost: per prediction.
6. RKHS (Reproducing Kernel Hilbert Space)
The advanced framing.
For any valid kernel, there exists a Hilbert space of functions where:
- Each kernel evaluation is itself a function in .
- Inner product in is given by the kernel: — the reproducing property.
Functions learned by kernel methods (SVM, kernel ridge, GP) live in — they're linear combinations of kernel evaluations.
Why it matters: RKHS provides a unified theoretical framework. The "regularization with " view explains why SVM with RBF kernel doesn't overfit despite operating in infinite dimensions.
7. Other kernel methods
Beyond SVM, the kernel trick applies to many algorithms:
Kernel ridge regression
Closed-form. Just like linear ridge but in kernel space.
Gaussian processes
Bayesian extension. . Posterior given data is also Gaussian, with mean and variance . Provides uncertainty estimates.
Kernel PCA
PCA in feature space via the kernel trick. Useful for non-linear dimensionality reduction.
Kernel k-means
K-means in feature space. Equivalent to spectral clustering for some kernels.
8. Why kernels lost to deep learning
Kernel methods dominated 1995–2010. Then deep learning won. Why?
Computational scaling
Kernel methods are typically memory (full kernel matrix) and to training. For , infeasible. Deep learning scales linearly in data size.
Representation learning
Kernels are fixed: you choose RBF, polynomial, etc. Deep learning learns the representation end-to-end. For images, text, audio, the right "feature space" is task-specific, and deep learning discovers it.
Hyperparameter tuning
Kernel choice + bandwidth + regularization is hard to tune. Deep learning hyperparameters are easier to navigate at scale.
Where kernels still win
- Small data ().
- Tabular data where SVM with RBF beats logistic regression but you don't have the data for deep learning.
- Bayesian uncertainty (Gaussian processes).
- Theoretical analysis (NTK).
9. The Neural Tangent Kernel (NTK) — connection to deep learning
Jacot et al. 2018. In the infinite-width limit of a deep neural network with appropriate scaling, the network's training behavior is exactly described by a kernel — the NTK.
i.e., inner product of the gradients of the network's output with respect to its parameters, in the limit of infinite width.
Why it matters
- Provides a theoretical framework for understanding what NNs are doing.
- The NTK is fixed at initialization — doesn't change during training (in the infinite-width limit). Training is equivalent to kernel ridge regression with the NTK.
- Explains why over-parameterized NNs generalize: under NTK dynamics, gradient descent finds the minimum-norm solution in .
Limitations
The NTK theory describes wide networks at initialization. Real NNs at modest width or after substantial training don't behave purely as NTK — feature learning happens. So NTK is a useful theoretical lens but doesn't fully explain deep learning's success.
10. Connection to attention
The attention mechanism is essentially a learned kernel.
Attention computes:
Compare to kernel ridge regression's prediction:
Attention is similar — query attends to keys via a kernel-like similarity (dot product), then weighted-averages values.
The key difference: attention's "kernel" is learned via . Classical kernels are fixed. This is why attention is so powerful — it learns the right similarity per task.
This connection is increasingly invoked in research (e.g., Tsai et al., "Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel"). Frontier-lab interview-relevant.
11. Common interview gotchas
| Gotcha | Strong answer |
|---|---|
| "What's the kernel trick?" | Replace with in any algorithm that only accesses data via inner products. Operates in implicit high-dim feature space without computing it. |
| "What kernels are valid?" | Mercer's theorem: symmetric and positive semi-definite. PSD ⟹ implicit feature space exists. |
| "Why is RBF infinite-dimensional?" | Taylor-expand ; the polynomial expansion has infinitely many terms, each corresponding to a feature dimension. |
| "Why does SVM work with the kernel trick?" | The dual formulation only uses inner products. Replace with kernel; non-linear SVM. |
| "What are support vectors?" | Training points with in the dual — points on or inside the margin. Decision function depends only on them. |
| "Why did kernels lose to deep learning?" | scaling; fixed kernels can't learn task-specific representations; deep learning learns features. |
| "Is attention a kernel?" | Yes, conceptually. Attention is a learned kernel via . The connection unifies classical kernels and modern transformers. |
| "When are kernels still useful?" | Small data, Bayesian uncertainty (GPs), tabular tasks below NN scale, theoretical analysis (NTK). |
12. The 8 most-asked kernel interview questions
- What's the kernel trick? Replace inner products with kernel evaluations to operate in implicit feature space.
- Mercer's theorem? Symmetric + PSD ⟹ valid kernel.
- RBF kernel? . Infinite-dimensional implicit feature space. Most popular.
- Why does kernel SVM use the dual? Dual formulation accesses data only via inner products → kernel trick applies.
- Support vectors? Non-zero in the dual; decision function depends only on these.
- Why kernels lost to deep learning? scaling, fixed kernels, no representation learning.
- What's the NTK? Infinite-width NN behaves as kernel ridge regression with the NTK. Theoretical bridge.
- Connection to attention? Attention is a learned kernel — query-key dot product is a similarity function the model learns.
13. Drill plan
- State the kernel trick precisely.
- Memorize RBF kernel + Mercer's conditions.
- Walk through SVM dual derivation.
- Cite the NTK connection at sketchy level.
- Drill
INTERVIEW_GRILL.md.
14. Further reading
- Schölkopf & Smola, Learning with Kernels (2002) — the textbook.
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 12.
- Rasmussen & Williams, Gaussian Processes for Machine Learning (2006).
- Jacot et al., "Neural Tangent Kernel" (2018).
- Tsai et al., "Transformer Dissection: Attention as a Kernel" (2019).