Topic 2: Gradient Descent & Learning Rate

Files in this folder

File	Purpose
`README.md`	Conceptual overview (this file) — read this first.
`LEARNING_RATE_DEEP_DIVE.md`	The core interview deep-dive on learning rate: when it works, when it fails, schedules, scaling rules, edge of stability, AdamW. Most important file in this folder.
`INTERVIEW_GRILL.md`	60 active-recall interview questions with strong answers. Drill before interviews.

What you'll learn

The mathematics that decides whether gradient descent converges, oscillates, or diverges.
Why mini-batch dominates batch and stochastic GD in practice — and how to defend that answer rigorously.
How learning rate, batch size, gradient noise, and generalization are linked through a single quantity ( $η / B$ ).
The standard schedules (warmup, cosine, linear) and why each phase exists.
How to read training curves and gradient norms to debug an unstable run.
The frontier-lab vocabulary: edge of stability, critical batch size, gradient noise scale, muP.

If you can answer the 60 grill questions in INTERVIEW_GRILL.md cleanly, you are above the bar for an applied scientist screen on this topic.

Why this topic matters in interviews

Almost every modern training recipe — for vision, NLP, RL, diffusion, and LLMs — is some variant of mini-batch gradient descent with an adaptive optimizer and a learning-rate schedule. Interviewers use questions in this area to probe:

Do you understand optimization or recite slogans? "Adam works better" is a slogan. Knowing that Adam builds a diagonal preconditioner from the running second moment of the gradient $\hat{v}$ (a cheap, rough proxy for curvature rather than the true Hessian diagonal), and that it can over-rescale dimensions whose $\hat{v}$ is dominated by noise: that's understanding.
Can you debug? Given a loss curve and a gradient norm, can you diagnose whether the LR is too high, too low, or whether the issue lives elsewhere?
Do you know how it scales? From a 1B-parameter run to a 70B-parameter run, what changes? If you don't know what muP is, you'll struggle.
Do you know modern subtleties? Edge of stability, AdamW vs. Adam+L2, critical batch size, linear scaling rule and its limits — these are the topics that separate top candidates.

The deep-dive file goes section by section through these topics with the right level of math and the right honesty about what's settled and what isn't.

Core intuition: gradient descent in one paragraph

You have a loss $L (θ)$ and you want to find $θ$ that makes it small. The gradient $\nabla L (θ)$ points in the direction of steepest increase; subtract a small multiple of it from $θ$ and you decrease the loss. Repeat. The "small multiple" is the learning rate $η$ . The reason this isn't trivial: the loss surface in real deep learning is non-convex, ill-conditioned (curvature varies massively across directions), and stochastic (we use mini-batch estimates of $\nabla L$ ). Every interesting question in this folder follows from one of these three properties.
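The loop described above can be sketched in a few lines. This is a minimal illustration on a toy convex loss, not a deep-learning setup: the target point `[3, -1]`, the function names, and the step count are all assumptions chosen for the example.

```python
import numpy as np

TARGET = np.array([3.0, -1.0])  # illustrative minimizer, not from the text

def loss(theta):
    # Toy convex loss: squared distance to TARGET.
    return float(np.sum((theta - TARGET) ** 2))

def grad(theta):
    # Analytic gradient of the toy loss.
    return 2.0 * (theta - TARGET)

def gradient_descent(theta0, eta=0.1, steps=100):
    """Repeat: subtract eta times the gradient from theta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

theta_final = gradient_descent([0.0, 0.0])
print(theta_final, loss(theta_final))
```

On this toy problem each step shrinks the distance to the minimizer by a constant factor; the three properties listed above (non-convexity, ill-conditioning, stochasticity) are exactly what this sketch leaves out.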

The three regimes

Batch gradient descent

Computes $\nabla L$ over the entire dataset before each update. Stable, expensive, rarely used at scale. Only an option for small datasets or when exact gradients are essential (rare).

Stochastic gradient descent (SGD, single sample)

Computes $\nabla L$ from one sample per step. Cheap per step, very noisy, fast to start learning. The noise has an underappreciated benefit: it acts as implicit regularization that is widely believed to bias SGD toward flat minima. But the variance is too high, and single-sample steps use parallel hardware too poorly, for most practical use.

Mini-batch gradient descent

The default. Batch size $B$ (typically 32–8192) trades per-step cost against gradient noise: the variance of the gradient estimate scales as $σ^{2}/B$, where $σ^{2}$ is the per-sample gradient variance. The right $B$ depends on hardware (memory, parallelism) and on the gradient noise scale, which sets the critical batch size beyond which doubling $B$ stops paying off. See LEARNING_RATE_DEEP_DIVE.md §6 for critical batch size.
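The $σ^{2}/B$ scaling is easy to check numerically. The sketch below uses synthetic scalar "per-sample gradients" (true gradient 1.0 plus unit-variance Gaussian noise, an assumption made purely for illustration) and measures the variance of the mini-batch mean for several batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample gradients: true gradient 1.0 plus unit-variance noise.
per_sample_grads = 1.0 + rng.standard_normal(200_000)

def minibatch_grad_variance(grads, batch_size):
    """Empirical variance of the mean gradient over disjoint mini-batches."""
    n_batches = len(grads) // batch_size
    batches = grads[: n_batches * batch_size].reshape(n_batches, batch_size)
    return float(batches.mean(axis=1).var())

variances = {B: minibatch_grad_variance(per_sample_grads, B) for B in (1, 4, 16, 64)}
for B, var in variances.items():
    # var * B should stay close to sigma^2 = 1 if variance scales as sigma^2 / B.
    print(f"B={B:3d}  var={var:.4f}  var*B={var * B:.3f}")
```

If the product `var * B` stays roughly constant across batch sizes, the estimate's variance is falling as $1/B$, as claimed.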

Why the learning rate is the master hyperparameter

For a quadratic loss with Hessian $H$, GD converges only if $0 < η < 2/λ_{\max}(H)$. Above that, you diverge in the sharpest direction. At or below $1/λ_{\max}(H)$, every direction converges without oscillating, but the flattest directions crawl. The optimal rate is $2/(λ_{\max} + λ_{\min})$, and convergence speed depends on the condition number $κ = λ_{\max}/λ_{\min}$: at the optimal rate the error contracts by a factor of $(κ - 1)/(κ + 1)$ per step.
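These thresholds can be seen directly on a two-dimensional quadratic. The sketch below assumes a diagonal Hessian with eigenvalues 1 and 100 (an illustrative choice, so $κ = 100$) and compares three learning rates: a conservative $1/λ_{\max}$, the optimal $2/(λ_{\max} + λ_{\min})$, and one just past the $2/λ_{\max}$ stability bound.

```python
import numpy as np

# Illustrative quadratic L(theta) = 0.5 * theta^T H theta with diagonal H.
eigvals = np.array([1.0, 100.0])   # lambda_min = 1, lambda_max = 100

def run_gd(eta, steps=200):
    """Return ||theta|| after `steps` GD updates starting from theta = [1, 1]."""
    theta = np.ones(2)
    for _ in range(steps):
        theta = theta - eta * eigvals * theta   # gradient is H @ theta
    return float(np.linalg.norm(theta))

lam_max, lam_min = eigvals.max(), eigvals.min()
norm_safe = run_gd(1.0 / lam_max)                 # converges, but slowly in the flat direction
norm_opt = run_gd(2.0 / (lam_max + lam_min))      # balances flat and sharp directions
norm_unstable = run_gd(2.1 / lam_max)             # past the bound: sharp direction blows up

print(f"safe={norm_safe:.4f}  opt={norm_opt:.4f}  unstable={norm_unstable:.3e}")
```

The conservative rate kills the sharp direction immediately but leaves the flat one barely moved; the rate past the bound grows without limit in the sharp direction.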

In real deep networks:

$λ_{\max}(H)$ varies by orders of magnitude across layers.
$H$ itself changes during training.
We don't compute $H$ ; we approximate.

This is why a single global $η$ is always a compromise, and why every modern optimizer is some attempt to recover per-direction step sizes. Adam sets per-parameter step sizes via $1/\sqrt{\hat{v}_{t}}$. AdamW separates weight decay from preconditioning. LARS/LAMB and muP scale per layer. See LEARNING_RATE_DEEP_DIVE.md §1, §10, §14 for the full story.

Common failure modes (with diagnostic signatures)

What you see	Likely cause	First thing to try
NaN at step 1–5	LR way too high, or fp16 overflow	Lower $η$ 10x; check forward-pass magnitudes
NaN at step 100–500	Warmup too short / peak LR too high	Extend warmup; lower peak $η$
Loss flat, gradients healthy	LR too low	LR finder; raise $η$
Loss flat, gradients vanishing	Stuck at saddle/critical point	Warm restart, perturbation
Oscillation with growing amplitude	Past stability boundary	Lower $η$ ; clip gradients
Occasional spike, recovery	Edge of stability — often fine	Add gradient clipping at norm 1.0
Fine-tuning destroys pretrained capability	LR too high for transfer	Reduce 10–100x
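Several rows of the table point to gradient clipping as a first fix. A minimal sketch of clipping by global norm follows; the function name and the `eps` guard are assumptions for this example, and in PyTorch you would normally reach for `torch.nn.utils.clip_grad_norm_` instead.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping,
    which is worth logging as a diagnostic in its own right.
    """
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (global_norm + eps))
    return [g * scale for g in grads], global_norm

# Two hypothetical "layers" whose joint gradient norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Clipping by the joint norm preserves the direction of the overall update; clipping each tensor separately would not.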

The single most useful debugging quantity is the per-layer update-to-weight ratio $\|η \cdot \text{update}\| / \|θ\|$. Healthy training has this around $10^{-3}$ per layer. See LEARNING_RATE_DEEP_DIVE.md §3.
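One way to compute this ratio is to snapshot the parameters before and after an optimizer step and compare norms per layer. The sketch below assumes parameters are held in a dict of NumPy arrays; the layer names and values are hypothetical.

```python
import numpy as np

def update_to_weight_ratios(params_before, params_after, eps=1e-12):
    """Per-layer ||theta_after - theta_before|| / ||theta_before||."""
    ratios = {}
    for name, before in params_before.items():
        delta = params_after[name] - before
        ratios[name] = float(np.linalg.norm(delta) / (np.linalg.norm(before) + eps))
    return ratios

# Hypothetical snapshot of two layers taken around a single optimizer step.
before = {"layer1.w": np.full((4, 4), 1.0), "layer2.w": np.full((4,), 2.0)}
after = {
    "layer1.w": before["layer1.w"] - 1e-3,   # small, healthy-looking update
    "layer2.w": before["layer2.w"] - 0.2,    # suspiciously large update
}
ratios = update_to_weight_ratios(before, after)
for name, r in ratios.items():
    print(f"{name}: {r:.1e}")
```

Here the first layer sits near the $10^{-3}$ rule of thumb while the second is two orders of magnitude above it, which would be the layer to inspect first.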

Reference implementations (from scratch)

The implementations below are minimal but correct. Use them as the code you would whiteboard in an interview when asked "implement SGD" or "implement Adam." For real training you'd use torch.optim.SGD or torch.optim.AdamW.

Mini-batch SGD with momentum

import numpy as np

class SGDMomentum:
    """
    Mini-batch SGD with classical momentum (Polyak).
    Update:
        v_{t+1} = beta * v_t + g_t
        theta_{t+1} = theta_t - eta * v_{t+1}
    Notes:
      - beta=0.9 is standard; higher beta = more inertia.
      - Nesterov variant uses gradient at theta - eta*beta*v_t (lookahead); often slightly better.
    """
    def __init__(self, params_shape, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = np.zeros(params_shape)

    def step(self, params, grad):
        self.v = self.momentum * self.v + grad
        return params - self.lr * self.v

In math form:

$v_{t + 1} = β v_{t} + g_{t}, θ_{t + 1} = θ_{t} - η v_{t + 1}$
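The Nesterov variant mentioned in the docstring can be sketched the same way. Because it needs the gradient at the lookahead point, this version takes a gradient function rather than a precomputed gradient; the class name and the `grad_fn` argument are assumptions for this example, not part of the reference implementation above.

```python
import numpy as np

class NesterovSGD:
    """SGD with Nesterov momentum; grad_fn is a caller-supplied gradient function."""
    def __init__(self, params_shape, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = np.zeros(params_shape)

    def step(self, params, grad_fn):
        # Evaluate the gradient at the lookahead point theta - eta * beta * v_t.
        lookahead = params - self.lr * self.momentum * self.v
        grad = grad_fn(lookahead)
        self.v = self.momentum * self.v + grad
        return params - self.lr * self.v

# Toy usage: minimize 0.5 * ||theta||^2, whose gradient is theta itself.
opt = NesterovSGD(params_shape=(2,), lr=0.1, momentum=0.9)
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = opt.step(theta, grad_fn=lambda p: p)
print(theta)
```

Frameworks usually implement an algebraically rearranged form that reuses the gradient at the current parameters, so the code in a real optimizer will look different from this direct transcription.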

Adam (correct, with bias correction)

import numpy as np

class Adam:
    """
    Adam optimizer (Kingma & Ba, 2014).
    See math below.
    Defaults: beta1=0.9, beta2=0.999, eps=1e-8.
    """
    def __init__(self, params_shape, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(params_shape)
        self.v = np.zeros(params_shape)
        self.t = 0

    def step(self, params, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * (grad ** 2)
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

In math form:

$\begin{aligned}
m_{t} &= β_{1} m_{t-1} + (1 - β_{1}) g_{t} \\
v_{t} &= β_{2} v_{t-1} + (1 - β_{2}) g_{t}^{2} \\
\hat{m}_{t} &= \frac{m_{t}}{1 - β_{1}^{t}} \quad \text{(bias correction)} \\
\hat{v}_{t} &= \frac{v_{t}}{1 - β_{2}^{t}} \quad \text{(bias correction)} \\
θ_{t+1} &= θ_{t} - η \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + ε}
\end{aligned}$
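To see what bias correction buys you, it helps to trace the first few steps by hand. The sketch below feeds a constant gradient of 0.5 (an arbitrary illustrative value) through the moment updates with the default betas and prints raw versus corrected estimates.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
g = 0.5                      # constant gradient, chosen for illustration
m = v = 0.0
history = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)     # undoes the zero-initialization bias of m
    v_hat = v / (1 - beta2 ** t)     # undoes the zero-initialization bias of v
    history.append((t, m, m_hat, v, v_hat))
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}  v={v:.6f}  v_hat={v_hat:.6f}")

# Step direction magnitude at t = 1, with and without correction.
uncorrected = history[0][1] / np.sqrt(history[0][3])
corrected = history[0][2] / np.sqrt(history[0][4])
print(uncorrected, corrected)
```

With a constant gradient the corrected estimates recover $g$ and $g^{2}$ exactly from the first step. Without correction, $v$ is shrunk far more than $m$ (because $β_{2}$ is closer to 1), so the ratio $m/\sqrt{v}$ at $t = 1$ is about 3.16 instead of 1: early steps would be too large, not too small.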

AdamW (decoupled weight decay)

class AdamW(Adam):
    """
    AdamW (Loshchilov & Hutter, 2019).
    Identical to Adam, plus a decoupled weight decay term added directly to theta.
    Why decoupled: in plain Adam, L2 regularization (lambda * theta added to gradient)
    is divided by sqrt(v_hat), weakening regularization where gradient variance is high.
    Decoupled decay applies a uniform fractional shrinkage, recovering the
    intended regularization behavior across all parameters.
    """
    def __init__(self, params_shape, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
        super().__init__(params_shape, lr, beta1, beta2, eps)
        self.wd = weight_decay

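    def step(self, params, grad):
        # Compute the decay from theta_t (the pre-update params), so the result is
        # theta_t - eta * adam_step - eta * lambda * theta_t rather than a decay
        # applied to the already-updated parameters.
        decay = self.lr * self.wd * params
        return super().step(params, grad) - decay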

In math form:

$θ_{t+1} = θ_{t} - η \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + ε} - η λ θ_{t}$
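The difference between Adam with L2 and AdamW shows up most starkly on the very first step, where the bias-corrected moments reduce to $\hat{m} = g$ and $\hat{v} = g^{2}$. The sketch below is a deliberately extreme single-step comparison with made-up gradients and hyperparameters; over many noisy steps the effect is a per-parameter reweighting of the decay rather than a complete cancellation.

```python
import numpy as np

def adam_first_step(theta, grad, lr=1e-3, eps=1e-8):
    # At t = 1 the bias-corrected moments reduce to m_hat = g and v_hat = g^2.
    return theta - lr * grad / (np.sqrt(grad ** 2) + eps)

theta = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])     # hypothetical: one quiet and one loud coordinate
lr, wd = 1e-3, 0.1

no_decay = adam_first_step(theta, grad, lr)
adam_l2 = adam_first_step(theta, grad + wd * theta, lr)       # L2 folded into the gradient
adamw = adam_first_step(theta, grad, lr) - lr * wd * theta    # decoupled decay

shrink_l2 = no_decay - adam_l2      # extra shrinkage attributable to the L2 term
shrink_adamw = no_decay - adamw     # extra shrinkage attributable to decoupled decay
print("Adam + L2:", shrink_l2)
print("AdamW:    ", shrink_adamw)
```

With L2 folded into the gradient, the normalization by $\sqrt{\hat{v}}$ absorbs the penalty almost entirely, so neither coordinate is meaningfully shrunk. The decoupled version shrinks both coordinates by the same $η λ θ$.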

Linear warmup + cosine decay schedule

import math

def warmup_cosine_lr(step, warmup_steps, total_steps, peak_lr, min_lr_frac=0.1):
    """
    Linear warmup over `warmup_steps`, then cosine decay to `min_lr_frac * peak_lr`.
    Returns the LR at the given step.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
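    # Clamp so the LR stays at min_lr (instead of rising again) once step exceeds total_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))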
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_frac
    return min_lr + (peak_lr - min_lr) * cosine

In math form:

$η(t) = \begin{cases} η_{\max} \cdot \frac{t}{W} & t \leq W \\ η_{\min} + \frac{1}{2} (η_{\max} - η_{\min}) \left(1 + \cos\left(π \cdot \frac{t - W}{T - W}\right)\right) & W < t \leq T \end{cases}$

What to practice saying out loud

Before any interview involving training:

"Mini-batch GD is the practical default because batch size $B$ controls a tradeoff between gradient variance ( $σ^{2} / B$ ) and per-step cost; below the gradient noise scale, larger batches help, above it the returns diminish sharply."
"The learning rate must satisfy $η < 2/λ_{\max}(H)$ for convergence on a quadratic. For deep networks, $λ_{\max}$ varies across layers and during training, which is why we need adaptive optimizers, schedules, and warmup."
"AdamW differs from Adam with L2 because Adam's preconditioning weakens L2 wherever $\hat{v}$ is large; AdamW decouples the decay so it's uniform across parameters."
"We use linear warmup because Adam's variance estimates are noisy and residual streams uncalibrated near initialization; we use cosine decay because it empirically tends to match or beat step decay and avoids sudden shocks to the optimizer."
"The implicit regularization scale is $η / B$ ; that's why scaling batch size requires scaling LR, and why very large batches lose the generalization benefit of SGD noise."

These five sentences, said cleanly, get you 70% of the way through any LR-related interview.
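The $η / B$ talking point translates into a one-line heuristic for adjusting the learning rate when the batch size changes. The helper below is a sketch: the function name is hypothetical, and the square-root rule is included only as a commonly cited alternative (often suggested for adaptive optimizers) that is not covered in this file. Both rules are starting points to validate empirically, and both are expected to break down at very large batch sizes.

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Heuristic LR adjustment when changing batch size.

    rule="linear": keep eta / B fixed (the linear scaling rule).
    rule="sqrt":   scale eta with sqrt(B), a more conservative alternative.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

lr_linear = scale_lr(0.1, base_batch=256, new_batch=1024)
lr_sqrt = scale_lr(0.1, base_batch=256, new_batch=1024, rule="sqrt")
print(lr_linear, lr_sqrt)
```

Under the linear rule, quadrupling the batch quadruples the learning rate, which leaves $η / B$ unchanged.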

What the interviewer may ask next

(Each is fully answered in INTERVIEW_GRILL.md.)

Walk me through Adam with bias correction.
Why does AdamW exist?
What's the linear scaling rule and when does it break?
What's edge of stability?
How would you transfer LR from a small to a large model? (muP)
Loss spikes occasionally — what do you do?
Why is fine-tuning LR much smaller than pretraining LR?
What's the gradient noise scale?

If any of these aren't crisp for you, that's the next thing to drill.

Cross-references

10_optimizers/ — focused tour of optimizer algorithms (deeper SGD/Momentum/Adam/AdamW/Lion comparisons).
11_regularization/ — weight decay vs. L2, dropout, label smoothing.
48_optimization_and_matrix_calculus/ — gradients, Hessians, conditioning.
62_frontier_training_playbook/ — production training recipes.

Next steps

Read LEARNING_RATE_DEEP_DIVE.md from start to finish.
Drill INTERVIEW_GRILL.md until you can answer 40+ of 60 cold.
Move on to 10_optimizers/ for the per-optimizer comparisons.

ML & LLM Interview Prep — Deep Dives