Kernel Functions: Detailed Examples
Example 1: Linearly Separable Data
Problem: Two classes that can be separated by a straight line.
Data:
- Class 0: Points with x₁ + x₂ < 0
- Class 1: Points with x₁ + x₂ > 0
Which kernel?
- Linear kernel: Perfect! Data is linearly separable
- RBF kernel: Works but unnecessary (overkill)
- Polynomial kernel: Works but unnecessary
Why linear works: The true decision boundary is the line x₁ + x₂ = 0. A linear kernel searches over exactly this family of boundaries (straight lines), so it can recover the line directly.
Code:
# Linear kernel is enough here
from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(X, y) # Learns a line close to x₁ + x₂ = 0
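To make this concrete, here is a minimal runnable sketch. The synthetic data (uniform points in a square, with a small gap left around the line) is an assumption made for illustration, not part of the original example.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data (assumed for illustration): points in [-1, 1]^2,
# labelled by which side of the line x1 + x2 = 0 they fall on.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
X = X[np.abs(X.sum(axis=1)) > 0.1]      # leave a small gap around the line
y = (X.sum(axis=1) > 0).astype(int)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

w = svm.coef_[0]         # normal vector of the learned line
b = svm.intercept_[0]    # offset of the learned line
train_acc = svm.score(X, y)
print("weights:", w, "intercept:", b, "training accuracy:", train_acc)
```

If the fit behaves as expected, the two weights come out positive and of similar size, which corresponds to a boundary parallel to x₁ + x₂ = 0.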
Example 2: Circular Boundary (Concentric Circles)
Problem: Two classes arranged in concentric circles:
- Class 0: Inner circle (radius < 2)
- Class 1: Outer circle (radius > 2)
Data:
- Points where x₁² + x₂² < 4 → Class 0
- Points where x₁² + x₂² > 4 → Class 1
Which kernel?
- Linear kernel: Fails! Can't separate circles with a line
- RBF kernel: Perfect! Can create circular boundaries
- Polynomial kernel (degree=2): Works! Can handle quadratic boundaries
Why RBF works: The RBF kernel measures similarity by distance, so each point is most similar to its nearby neighbors. Inner-circle points cluster around the origin and outer-circle points form a ring around them, so the classifier can place a closed boundary in the gap between the two (near radius 2).
Visual:
         Class 1 (outer)
      • • • • • • • • •
    •                   •
  •     •  Class 0  •     •
  •     •  (inner)  •     •
    •                   •
      • • • • • • • • •
         Class 1 (outer)
Code:
# RBF kernel works
svm = SVC(kernel='rbf', gamma=1.0)
svm.fit(X, y) # Learns a closed, roughly circular boundary
# Polynomial kernel (degree=2) also works
svm = SVC(kernel='poly', degree=2)
svm.fit(X, y) # x₁² + x₂² = 4 is a quadratic boundary
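The comparison can be checked on synthetic data. The sketch below assumes an inner disc with radius up to 1.5 and an outer ring with radius 2.5 to 4 (so there is a gap around radius 2); the sampling helper `ring` and the use of the default `gamma="scale"` are choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def ring(n, r_min, r_max):
    """Sample n points with radius in [r_min, r_max] and a uniform angle."""
    r = rng.uniform(r_min, r_max, size=n)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# Class 0: inner disc, Class 1: outer ring (assumed radii, see lead-in)
X = np.vstack([ring(200, 0.0, 1.5), ring(200, 2.5, 4.0)])
y = np.array([0] * 200 + [1] * 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "linear": SVC(kernel="linear"),
    "rbf": SVC(kernel="rbf", gamma="scale"),
    "poly2": SVC(kernel="poly", degree=2),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # held-out accuracy
print(scores)
```

The linear kernel should land well below the other two, since no straight line separates a disc from the ring around it.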
Example 3: XOR Problem
Problem: Classic XOR problem:
- (0, 0) → Class 0
- (0, 1) → Class 1
- (1, 0) → Class 1
- (1, 1) → Class 0
Which kernel?
- Linear kernel: Fails! XOR is not linearly separable
- RBF kernel: Works! Can create complex boundaries
- Polynomial kernel (degree=2): Works! Can handle XOR
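Why linear fails: No single line can separate the classes. Any line that puts (0, 1) and (1, 0) on one side also puts (0, 0) or (1, 1) on that side.
Why RBF/Polynomial work: They can create non-linear boundaries that separate the classes. A degree-2 polynomial kernel, for example, implicitly adds the product feature x₁x₂, and in that expanded space XOR becomes linearly separable.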
Code:
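# Linear fails
svm_linear = SVC(kernel='linear')
svm_linear.fit(X, y) # Fits, but misclassifies at least one of the four points
# RBF works
svm_rbf = SVC(kernel='rbf', gamma=1.0)
svm_rbf.fit(X, y) # Works!
# Polynomial works (coef0=1 keeps the lower-order terms; larger C fits the four points exactly)
svm_poly = SVC(kernel='poly', degree=2, coef0=1.0, C=10.0)
svm_poly.fit(X, y) # Works!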
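A self-contained version that measures the outcome on the four XOR points is sketched below. The values `C=10.0`, `gamma=1.0`, and `coef0=1.0` are assumptions chosen so that such a tiny dataset can be fit exactly; they are not tuned recommendations.

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

models = {
    "linear": SVC(kernel="linear", C=10.0),
    "rbf": SVC(kernel="rbf", gamma=1.0, C=10.0),
    "poly2": SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0),
}

# Training accuracy on the four points for each kernel
train_acc = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
print(train_acc)
```

Training accuracy is used here only because the dataset is the entire problem; on real data you would evaluate on held-out samples.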
Example 4: Text Classification
Problem: Classify documents (spam/not spam) using TF-IDF features.
Data:
- High-dimensional (thousands of features)
- Sparse (most features are 0)
- Often linearly separable in high dimensions
Which kernel?
- Linear kernel: Usually best! High-dimensional data is often linearly separable
- RBF kernel: Can work but often overfits
- Polynomial kernel: Rarely needed
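Why linear works: With thousands of features and comparatively few documents, the classes are often (nearly) linearly separable, so the extra flexibility of a non-linear kernel adds little and mainly raises the risk of overfitting. The linear kernel is also fast on sparse inputs.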
Code:
# Linear kernel is best for text
svm = SVC(kernel='linear', C=1.0)
svm.fit(tfidf_features, labels)
# For large corpora, sklearn.svm.LinearSVC is a faster linear alternative
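An end-to-end sketch of this setup follows. The eight-document corpus is invented purely for illustration, and chaining `TfidfVectorizer` with `SVC` through `make_pipeline` is one reasonable arrangement rather than the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus, invented for illustration only (1 = spam, 0 = not spam)
docs = [
    "win free prize now",
    "free money win",
    "claim your free prize",
    "win money now",
    "meeting schedule for monday",
    "project report attached",
    "lunch meeting tomorrow",
    "monday project schedule",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
model.fit(docs, labels)

# The TF-IDF step produces a SciPy sparse matrix: one row per document
features = model.named_steps["tfidfvectorizer"].transform(docs)
print("shape:", features.shape, "non-zeros:", features.nnz)

new_docs = ["free prize money", "project meeting monday"]
pred = model.predict(new_docs)
print(pred)
```

Keeping the vectorizer inside the pipeline means new documents go through the same vocabulary and IDF weights that were learned during fitting.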
Example 5: Image Classification (Small Images)
Problem: Classify small images (e.g., 32x32 grayscale pixels = 1024 features).
Data:
- High-dimensional but structured
- Non-linear patterns (edges, textures)
Which kernel?
- Linear kernel: Might work if features are good
- RBF kernel: Usually better (captures non-linear patterns)
- Polynomial kernel: Can work but RBF usually better
Why RBF often better: Images have complex non-linear patterns. RBF kernel can capture these better than linear.
Code:
# RBF kernel for images
svm = SVC(kernel='rbf', gamma=0.001, C=10.0)
svm.fit(image_features, labels)
# Tune gamma: too high = overfitting, too low = underfitting
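As a runnable stand-in, the sketch below uses scikit-learn's bundled digits dataset. Note the assumption: those images are 8x8 (64 features) rather than 32x32, and the raw pixel values range from 0 to 16, which is the scale that `gamma=0.001` suits here.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                                # 8x8 grayscale digit images
X = digits.images.reshape(len(digits.images), -1)     # flatten each image to 64 features
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rbf = SVC(kernel="rbf", gamma=0.001, C=10.0).fit(X_train, y_train)
linear = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

rbf_acc = rbf.score(X_test, y_test)
linear_acc = linear.score(X_test, y_test)
print("rbf:", rbf_acc, "linear:", linear_acc)
```

On a dataset this clean the linear kernel is also competitive, so the gap between the two is worth measuring rather than assuming.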
Example 6: High-Dimensional Sparse Data
Problem: Data with many features (e.g., 10,000 features) but most are 0 (sparse).
Which kernel?
- Linear kernel: Best choice! Sparse high-dimensional data is often linearly separable
- RBF kernel: Can be slow and overfit
- Polynomial kernel: Usually not needed
Why linear:
- Fast computation (sparse dot products are efficient)
- High-dimensional spaces often allow linear separation
- Less prone to overfitting
Code:
# Linear kernel for sparse high-dimensional data
svm = SVC(kernel='linear', C=1.0)
svm.fit(sparse_features, labels)
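The sketch below shows `SVC` fitting directly on a SciPy sparse matrix. The data is synthetic: a random matrix with about 1% non-zeros, and labels generated from a hidden linear rule (`w_true`, a hypothetical weight vector), so a linear boundary exists by construction.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features = 300, 10_000

# Random sparse matrix with about 1% non-zero entries (synthetic, for illustration)
X = sparse.random(n_samples, n_features, density=0.01, format="csr", random_state=0)

# Labels from a hidden linear rule, so the classes are linearly separable
w_true = rng.normal(size=n_features)
y = (X @ w_true > 0).astype(int)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)                       # sparse input is accepted as-is
train_acc = svm.score(X, y)
print("stored values:", X.nnz, "of", n_samples * n_features)
print("training accuracy:", train_acc)
```

With far more features than samples, fitting the training set is easy; held-out accuracy is the number that matters in practice.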
Parameter Tuning Examples
RBF Kernel: Gamma Selection
Low Gamma (γ = 0.001):
- Wide kernel (large influence radius)
- Simpler boundaries
- Risk: Underfitting
- Use when: Data has smooth, simple patterns
Medium Gamma (γ = 0.1):
- Moderate kernel width
- Balanced complexity
- Good starting point
- Use when: Not sure, start here
High Gamma (γ = 10.0):
- Narrow kernel (small influence radius)
- Complex boundaries
- Risk: Overfitting
- Use when: Data has very complex patterns
How to choose:
- Start with γ = 1 / (n_features × variance of X), which is what scikit-learn's gamma='scale' default computes
- Try grid search: [0.001, 0.01, 0.1, 1.0, 10.0]
- Use cross-validation to select best
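The selection procedure above can be sketched with `GridSearchCV`. The two-moons dataset and the fixed `C=1.0` are assumptions for illustration; in practice C is usually searched together with gamma.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# The gamma="scale" heuristic: 1 / (n_features * X.var()).
# After standardization the overall variance is 1, so it reduces to 1 / n_features.
X_scaled = StandardScaler().fit_transform(X)
gamma_scale = 1.0 / (X_scaled.shape[1] * X_scaled.var())

# Grid search over gamma with 5-fold cross-validation
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
param_grid = {"svc__gamma": [0.001, 0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_gamma = search.best_params_["svc__gamma"]
best_score = search.best_score_
print("heuristic gamma:", gamma_scale)
print("best gamma:", best_gamma, "cv accuracy:", best_score)
```

Putting the scaler inside the pipeline keeps the scaling statistics from leaking across cross-validation folds.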
Polynomial Kernel: Degree Selection
Degree = 1:
- Same as linear kernel (exactly so when coef0 = 0, up to a rescaling by gamma)
- Use when: Data is linear
Degree = 2:
- Quadratic features
- Most common
- Use when: Moderate non-linearity
Degree = 3:
- Cubic features
- More complex
- Risk: Overfitting
- Use when: Strong non-linearity
Degree > 3:
- Rarely used
- High overfitting risk
- Avoid unless necessary
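Two quick checks of these points, as a sketch on synthetic data from `make_classification` (an assumption for illustration): first, that a degree-1 polynomial kernel with `gamma=1.0` and `coef0=0.0` gives the same predictions as the linear kernel, and second, how cross-validated accuracy varies with degree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Degree 1 with gamma=1 and coef0=0 computes the same kernel as 'linear'
linear = SVC(kernel="linear", C=1.0).fit(X, y)
poly1 = SVC(kernel="poly", degree=1, gamma=1.0, coef0=0.0, C=1.0).fit(X, y)
same_predictions = np.array_equal(linear.predict(X), poly1.predict(X))
print("degree 1 matches linear:", same_predictions)

# Compare degrees with 5-fold cross-validation
cv_scores = {}
for degree in (1, 2, 3, 4):
    model = SVC(kernel="poly", degree=degree, gamma="scale", coef0=1.0)
    cv_scores[degree] = cross_val_score(model, X, y, cv=5).mean()
print(cv_scores)
```

Which degree scores best depends on the dataset, which is why the comparison is left to cross-validation rather than fixed in advance.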
Decision Tree for Kernel Selection
Start
  ↓
Is data linearly separable?
├─ Yes → Use Linear Kernel
│
└─ No → Try RBF Kernel
        ↓
        Does it overfit?
        ├─ No → Use RBF Kernel
        │
        └─ Yes → Try Polynomial Kernel (degree=2)
                 ↓
                 Does it work?
                 ├─ Yes → Use Polynomial Kernel
                 │
                 └─ No → Tune RBF parameters or use different model
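One way to turn this flow into code is sketched below. The helper names `evaluate` and `choose_kernel` and the thresholds `good_enough` and `max_gap` are hypothetical: cross-validated accuracy stands in for "linearly separable?" and the train/validation gap stands in for "does it overfit?".

```python
from sklearn.datasets import make_blobs, make_circles
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(kernel, X, y, **svc_params):
    """Return (mean train accuracy, mean validation accuracy) over 5 folds."""
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, **svc_params))
    result = cross_validate(model, X, y, cv=5, return_train_score=True)
    return result["train_score"].mean(), result["test_score"].mean()

def choose_kernel(X, y, good_enough=0.95, max_gap=0.05):
    """Walk the flowchart; both thresholds are illustrative assumptions."""
    _, linear_val = evaluate("linear", X, y)
    if linear_val >= good_enough:            # proxy for "linearly separable?"
        return "linear"
    rbf_train, rbf_val = evaluate("rbf", X, y)
    if rbf_train - rbf_val <= max_gap:       # proxy for "does it overfit?"
        return "rbf"
    _, poly_val = evaluate("poly", X, y, degree=2, coef0=1.0)
    if poly_val >= good_enough:              # proxy for "does it work?"
        return "poly"
    return "tune RBF or try another model"

# Concentric circles: a linear boundary cannot separate them
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)
circles_choice = choose_kernel(X_circles, y_circles)

# Two well-separated blobs: a linear boundary is enough
X_blobs, y_blobs = make_blobs(
    n_samples=200, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)
blobs_choice = choose_kernel(X_blobs, y_blobs)
print(circles_choice, blobs_choice)
```

The thresholds would need adjusting for noisy data, where even a well-chosen kernel may not reach 95% accuracy.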
Common Mistakes
Mistake 1: Always using RBF
- Linear kernel is often better for high-dimensional data
- Always try linear first
Mistake 2: Not scaling features
- SVM is sensitive to feature scales
- Always scale before using kernels (especially RBF)
Mistake 3: Gamma too high
- Causes overfitting
- Start with lower gamma, increase if needed
Mistake 4: Ignoring linear kernel
- Linear kernel is fast and interpretable
- Don't skip it!
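Mistake 2 is easy to demonstrate. In the sketch below, one feature of a two-moons dataset is multiplied by 1000 to mimic mismatched units (an artificial setup chosen for illustration), and the same RBF SVM is evaluated with and without `StandardScaler`.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
X_bad = X.copy()
X_bad[:, 0] *= 1000.0    # pretend the first feature is recorded in much larger units

unscaled = SVC(kernel="rbf")
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 5-fold cross-validated accuracy for each variant
unscaled_acc = cross_val_score(unscaled, X_bad, y, cv=5).mean()
scaled_acc = cross_val_score(scaled, X_bad, y, cv=5).mean()
print("without scaling:", unscaled_acc)
print("with scaling:   ", scaled_acc)
```

Without scaling, distances are dominated by the large feature, so the kernel effectively ignores the other one.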
Summary Table
| Kernel | Use When | Parameters | Pros | Cons |
|---|---|---|---|---|
| Linear | Linearly separable or high-dimensional data | C | Fast, interpretable | Can't model non-linear boundaries |
| Polynomial | Polynomial relationships | C, degree, gamma, coef0 | Captures feature interactions up to the chosen degree | Can overfit, degree needs tuning |
| RBF | Non-linear data (common default) | C, gamma | Very flexible, only one kernel parameter (gamma) | Can overfit, less interpretable |
| Sigmoid | Rarely | C, gamma, coef0 | Neural-network-like | Unstable, rarely best |
Key Takeaways
- Start with linear: It's fast and often works
- Use RBF for non-linear: Most common choice
- Scale features: Critical for SVM
- Tune parameters: Gamma and C matter a lot
- Avoid sigmoid: RBF is almost always better