Topic 11: Regularization

🔥 For interviews, read these first:

REGULARIZATION_DEEP_DIVE.md — frontier-lab interview deep dive: bias-variance trade-off, L1/L2 geometry and Bayesian priors, dropout (3 stories), early stopping ≈ L2, MixUp/CutMix, label smoothing, SAM, implicit regularization of SGD, why modern LLMs use no dropout.

INTERVIEW_GRILL.md — 50 active-recall questions.

What You'll Learn

This topic teaches you regularization techniques:

L1 regularization (Lasso)
L2 regularization (Ridge)
Dropout
Early stopping
Theory and implementations

Why We Need This

Interview Importance

Common question: "Explain L1 vs L2 regularization"
Understanding: Prevent overfitting
Trade-offs: Bias-variance tradeoff

Real-World Application

Overfitting: Flexible models can overfit when left unregularized
Generalization: Well-tuned regularization often improves generalization
Feature selection: L1 can select features

Industry Use Cases

1. L2 Regularization

Use Case: General-purpose default

Prevents large weights
Improves generalization
Default in many frameworks

2. L1 Regularization

Use Case: Feature selection

Sparse models
Feature selection
Interpretability

3. Dropout

Use Case: Neural networks

Prevents co-adaptation
Improves generalization
Common in many deep networks, though modern LLMs often train without it

Core Intuition

Regularization is about controlling how the model fits the data, not just adding a penalty term mechanically.

The deeper idea is:

models can fit patterns that are real
models can also fit noise, shortcuts, or accidental correlations

Regularization pushes learning toward more stable solutions.

L2 Regularization

L2 discourages very large weights.

Intuition:

if a solution needs extreme parameter values to fit the data, it may be too brittle
smaller weights usually correspond to smoother functions
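A minimal sketch of this shrinkage, assuming a linear model fit with the closed-form ridge solution on synthetic data. The helper name `ridge_fit`, the data, and the lambda values are illustrative assumptions, not part of the original material.

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, lambda_reg: float) -> np.ndarray:
    """Minimize ||Xw - y||^2 + lambda * ||w||^2 in closed form."""
    n_features = X.shape[1]
    A = X.T @ X + lambda_reg * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic regression problem (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

# The weight norm shrinks as the penalty strength grows
lambdas = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>6}: ||w|| = {norm:.3f}")
```

The printed norms should decrease as lambda increases, which is the "discourages very large weights" effect in numeric form.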

L1 Regularization

L1 encourages sparsity.

That makes it useful when:

many features may be irrelevant
interpretability matters
you want the model to rely on a smaller subset of features

Dropout

Dropout randomly removes activations during training.

The core intuition is:

the network should not rely too heavily on any one hidden pathway
multiple redundant, more robust pathways are encouraged

Early Stopping

Early stopping is also regularization.

It works because:

later optimization steps may fit noise more aggressively
stopping at the right point can reduce overfitting
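One common way to pick the stopping point is a patience rule on validation loss. The sketch below assumes you already have a per-epoch list of validation losses; the function name `early_stopping_epoch` and the `patience` and `min_delta` parameters are hypothetical names chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) under a simple patience rule.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran to the last recorded epoch
    return best_epoch, len(val_losses) - 1

# Validation loss falls, then starts rising as the model fits noise
val_losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.66, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

In practice you would restore the weights saved at `best_epoch` rather than keep the weights from `stop_epoch`.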

Technical Details Interviewers Often Want

L1 vs L2 Difference

This is a very common question.

L1 can drive weights exactly to zero
L2 usually shrinks weights continuously but not exactly to zero

That is why L1 is associated with feature selection.
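One way to see the difference is a one-dimensional problem, assuming the data loss is 0.5 * (w - z)^2, where z is the unregularized optimum. Under that assumption the L1-penalized minimizer is the soft-thresholding rule and the L2-penalized minimizer is a rescaling. The function names below are illustrative.

```python
import numpy as np

def l1_shrink(z: np.ndarray, lambda_reg: float) -> np.ndarray:
    """argmin_w 0.5 * (w - z)^2 + lambda * |w|  (soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lambda_reg, 0.0)

def l2_shrink(z: np.ndarray, lambda_reg: float) -> np.ndarray:
    """argmin_w 0.5 * (w - z)^2 + lambda * w^2  (uniform rescaling)."""
    return z / (1.0 + 2.0 * lambda_reg)

z = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
w_l1 = l1_shrink(z, 0.5)  # entries with |z| <= 0.5 become exactly zero
w_l2 = l2_shrink(z, 0.5)  # every entry is halved; nonzero entries stay nonzero

print("L1:", w_l1)
print("L2:", w_l2)
```

L1 subtracts a fixed amount and clips at zero, so small coefficients vanish. L2 multiplies by a factor below one, so coefficients shrink but only reach zero if they started there.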

Why Dropout Uses Scaling

During training, units are dropped randomly.

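To keep the expected activation magnitude consistent, most implementations use inverted dropout: surviving activations are scaled by 1 / (1 - p) during training, where p is the drop probability, so that inference can use the activations unchanged. Without this (or an equivalent test-time scaling), train-time and test-time behavior would not match.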
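A quick numerical check of that claim, assuming a drop probability of 0.5 and a constant input of ones; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dropout_rate = 0.5
x = np.ones(100_000)

# Keep each unit with probability 1 - dropout_rate
mask = rng.random(x.shape) >= dropout_rate

unscaled = x * mask                        # mask only, no rescaling
scaled = x * mask / (1.0 - dropout_rate)   # inverted dropout

mean_unscaled = float(unscaled.mean())  # close to 1 - dropout_rate
mean_scaled = float(scaled.mean())      # close to the original mean of 1.0
print(mean_unscaled, mean_scaled)
```

Without rescaling, the next layer would see inputs about half as large during training as at inference. With rescaling, the expected input matches in both modes.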

Regularization Is an Inductive Bias

This is a stronger interview answer than "it prevents overfitting."

Regularization says:

prefer simpler or more stable explanations
prefer smaller weights
prefer less co-adaptation
prefer solutions that transfer better

Common Failure Modes

too much regularization causing underfitting
using dropout mechanically where it does not help much
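confusing L2 regularization with weight decay: the two coincide for plain SGD but differ in adaptive optimizers such as Adam, which is why AdamW decouples the decay from the gradient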
claiming regularization always improves test performance
forgetting that data augmentation is also a form of regularization
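The L2 versus weight decay point can be checked with a single update step. This is a simplified sketch, not a full optimizer: `adam_step` below is a hypothetical helper that performs only the first Adam step from zero-initialized moments, and `decay` plays the role of 2 * lambda for a penalty of lambda * sum(w^2).

```python
import numpy as np

def sgd_step(w, grad, lr):
    return w - lr * grad

def adam_step(w, grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step from zero moments, with bias correction."""
    m_hat = ((1 - beta1) * grad) / (1 - beta1)
    v_hat = ((1 - beta2) * grad**2) / (1 - beta2)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.array([1.0, -2.0, 0.5])
data_grad = np.array([0.1, 0.3, -0.2])  # assumed gradient of the data loss
lr, decay = 0.1, 0.01

# SGD: penalty folded into the gradient vs. decay applied directly to w
sgd_l2 = sgd_step(w, data_grad + decay * w, lr)
sgd_decoupled = sgd_step(w, data_grad, lr) - lr * decay * w

# Adam: the penalty gradient is rescaled by the adaptive denominator,
# so it no longer acts like a plain multiplicative shrink of w
adam_l2 = adam_step(w, data_grad + decay * w, lr)
adam_decoupled = adam_step(w, data_grad, lr) - lr * decay * w

print(np.allclose(sgd_l2, sgd_decoupled))    # the two SGD variants agree
print(np.allclose(adam_l2, adam_decoupled))  # the two Adam variants do not
```

The decoupled form here mirrors the idea behind AdamW; real implementations differ in details such as the order of operations and learning-rate scheduling.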

Edge Cases and Follow-Up Questions

Why can L1 produce sparse solutions?
Why is L2 often the default regularizer?
Why can too much regularization hurt performance?
Why is early stopping considered regularization?
Why may dropout help in some networks more than others?

What to Practice Saying Out Loud

Why regularization is really about inductive bias
The conceptual difference between L1, L2, dropout, and early stopping
Why preventing overfitting is not the same as blindly increasing regularization

Industry-Standard Boilerplate Code

L1 Regularization (Lasso)

"""
L1 Regularization (Lasso)
Adds lambda * sum(|weights|) to the loss
Promotes sparsity (many weights are driven to 0)
"""
import numpy as np

def l1_regularization_loss(weights: np.ndarray, lambda_reg: float) -> float:
    """
    L1 Regularization: lambda * sum(|w|)
    
    Effect: Pushes many weights to 0 (sparsity); exact zeros typically need a
            proximal (soft-thresholding) update, not plain subgradient steps
    Use: Feature selection, interpretability
    """
    return float(lambda_reg * np.sum(np.abs(weights)))

def l1_gradient(weights: np.ndarray, lambda_reg: float) -> np.ndarray:
    """Subgradient of the L1 penalty: lambda * sign(w), taking 0 at w == 0"""
    return lambda_reg * np.sign(weights)

L2 Regularization (Ridge)

"""
L2 Regularization (Ridge)
Adds lambda * sum(weights^2) to the loss
Prevents large weights
"""
import numpy as np

def l2_regularization_loss(weights: np.ndarray, lambda_reg: float) -> float:
    """
    L2 Regularization: lambda * sum(w^2)
    
    Effect: Shrinks weights toward 0
    Use: Most common, improves generalization
    """
    return float(lambda_reg * np.sum(weights**2))

def l2_gradient(weights: np.ndarray, lambda_reg: float) -> np.ndarray:
    """Gradient of the L2 penalty: d/dw [lambda * sum(w^2)] = 2 * lambda * w"""
    return 2 * lambda_reg * weights

Dropout

"""
Dropout from Scratch
Randomly set some activations to 0 during training
"""
import numpy as np

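def dropout(x: np.ndarray, dropout_rate: float, training: bool = True) -> np.ndarray:
    """
    Inverted dropout: randomly zero out activations and rescale the survivors
    
    Args:
        x: Input activations
        dropout_rate: Probability of dropping a unit (0.0 <= rate < 1.0)
        training: If False, return x unchanged (scaling already happened in training)
    """
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    
    # Inference: identity, because training-time outputs were already rescaled
    if not training or dropout_rate == 0.0:
        return x
    
    keep_prob = 1.0 - dropout_rate
    
    # Binary mask: 1 keeps a unit (probability keep_prob), 0 drops it
    mask = np.random.binomial(1, keep_prob, x.shape)
    
    # Scale kept units by 1 / keep_prob so the expected activation matches x
    return x * mask / keep_prob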

Theory

L1 vs L2

Aspect	L1 (Lasso)	L2 (Ridge)
Penalty	λ Σ|w|	λ Σ w²
Effect	Sparsity (weights → 0)	Shrinkage (weights → small)
Use	Feature selection	Generalization
Gradient	Constant magnitude: λ·sign(w)	Linear in w: 2λw

Bias-Variance Tradeoff

No regularization: Low bias, high variance (overfitting)
Too much regularization: High bias, low variance (underfitting)
Right amount: Balance
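The trade-off can be made concrete by sweeping the penalty strength on a small, noisy problem. The sketch below assumes a degree-9 polynomial fit with closed-form ridge on synthetic sine data. All names and values are illustrative, and for simplicity the intercept column is penalized too.

```python
import numpy as np

def poly_features(x: np.ndarray, degree: int) -> np.ndarray:
    """Columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X: np.ndarray, y: np.ndarray, lambda_reg: float) -> np.ndarray:
    """Minimize ||Xw - y||^2 + lambda * ||w||^2 in closed form."""
    A = X.T @ X + lambda_reg * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mse(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    return float(np.mean((X @ w - y) ** 2))

def target(x: np.ndarray) -> np.ndarray:
    return np.sin(3 * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=20)   # few points: easy to overfit
x_test = rng.uniform(-1, 1, size=200)
y_train = target(x_train) + 0.2 * rng.normal(size=x_train.shape)
y_test = target(x_test) + 0.2 * rng.normal(size=x_test.shape)

X_train = poly_features(x_train, 9)
X_test = poly_features(x_test, 9)

results = []
for lam in (1e-6, 1e-3, 1e-1, 10.0):
    w = ridge_fit(X_train, y_train, lam)
    results.append((lam, mse(X_train, y_train, w), mse(X_test, y_test, w)))

for lam, train_err, test_err in results:
    print(f"lambda={lam:<8g} train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Training error can only rise as lambda grows, which is the bias side of the trade-off. Test error depends on the data and the seed, so look for the lambda where it is lowest rather than expecting a fixed pattern.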

Exercises

Implement L1/L2 in linear regression
Compare with/without regularization
Implement dropout
Tune regularization strength

Next Steps

Topic 12: Comprehensive theory
Topic 13: Interview Q&A
