Scaling Laws — Deep Dive

Distilled from Stanford CS336 (Tatsu Hashimoto's Basic Scaling Laws lecture, 2025) + cross-referenced against the canonical literature (Hestness 2017, Kaplan 2020, Hoffmann 2022 Chinchilla, Rosenfeld, Pearce & Song, the Porian et al. resolution paper (called "Yair" below, after senior author Yair Carmon), Epoch AI's Chinchilla method-3 reanalysis).

The point: scaling laws turn frontier model engineering from "spend $10M and pray" into "fit a curve at small scale and extrapolate." They're not a free pass — they're an engineered measurement instrument that requires careful execution. This chapter walks the math, the historical lineage, the modern practice, and the canonical Kaplan-vs-Chinchilla cautionary tale.

Pair with 04_transformers/MODERN_LLM_ARCHITECTURE_CHOICES.md (the architecture choices scaling laws are used to justify), 52_statistical_learning_theory/ (the generalization-bound lineage), 62_frontier_training_playbook/ (production-scale recipes).

The mental model — why scaling laws exist
Historical lineage — scaling laws are 30+ years old
The math — why power laws are natural
Data scaling laws (the cleanest case)
Data mixture scaling
Data repetition scaling (the 4-epoch rule)
Scale-dependent phenomena (data filtering)
Architecture scaling (LSTM vs Transformer, etc.)
Optimizer scaling (SGD vs Adam)
Hyperparameter scaling (aspect ratio, layers, head dim)
The parameter-counting footgun (Kaplan's exclusion)
MoE scaling
Critical batch size
Learning rate scaling and μP
Upstream vs downstream transfer
Joint scaling laws (Kaplan & Rosenfeld functional forms)
The Kaplan-vs-Chinchilla saga
Why they disagreed (Yair, Pearce-Song)
The Chinchilla method-3 mystery (Epoch AI resolution)
Isoflops — the workhorse research protocol
The "overtraining for serving" reality
Pitfalls and senior signals
Interview grill — 70 questions
References

1. The Mental Model

The motivating scenario

Your wealthy friend hands you 10,000 B200s for a month. Build a great open-source LLM. You have an infra team. You have pretraining data. Now you have to choose: architecture, optimizer, batch size, learning rate, depth, width, vocab, data mix.

Naïve approach: do multiple full training runs and tune. Wasteful and infeasible — each run costs millions.

Scaling-law approach: do all your optimization at small scale, fit predictive curves, extrapolate to the big run. This works only if small-scale → large-scale is connected by simple regularities. The amazing empirical finding of the last decade is that it is, often with stunning precision.

Scaling laws are simultaneously:

A paradigm. "We believe in the scaling laws" — at frontier labs, almost a creed.
An engineered measurement instrument. Not magic; requires careful execution. Tatsu's mantra: predictability across scales is engineered, not automatic.

Three functional-form patterns to recognize

Quantity scaled	Y-axis	Form on log-log	Interpretation
Training data D	log test loss	linear (slope ≈ −0.05 to −0.1)	power-law decay: `loss ≈ const · D^(-α)`
Model size N	log test loss	linear	same: `loss ≈ const · N^(-α)`
Compute C	log test loss	linear	compute-optimal frontier
Data + Model jointly	log loss surface	bilinear	joint scaling law
Downstream task accuracy	accuracy vs log compute	sigmoid	emergence-flavored
Capability vs date	task vs date	linear (upper envelope)	forecasting trends

Hook

Scaling laws are simple predictive rules — usually power laws — that let you extrapolate small-scale behavior to large-scale behavior. They are how modern LLM engineering is done. They require careful setup; sloppy execution gives misleading conclusions.

2. Historical Lineage

Scaling laws are not new. The neural language modeling era didn't invent them.

Cortes, Vapnik et al. (1993, Bell Labs). Asked: "Training classifiers on huge datasets is expensive. Can we fit on subsets, fit a curve, extrapolate?" → literally a data scaling law in 1993.

Banko & Brill (NLP, 2001). "Scaling to very very large corpora for natural language disambiguation." Showed that for many NLP tasks, more data beats algorithm choice — log-linear improvement in performance.

Kolachina et al. (2012). Machine translation BLEU vs data size. Compared candidate learning-curve families, including three- and four-parameter power laws ("pow3", "pow4"), the same functional forms we still fit today.

Hestness et al. (Baidu, 2017). Deep Learning Scaling is Predictable, Empirically. Studied data scaling for speech recognition, machine translation, character-level LM, image classification — all showed power-law data scaling. Talked about emergence (because accuracy is discontinuous), compute scaling, systems-as-accuracy. Most things we discuss today were known in 2017 if you'd been paying attention.

Kaplan et al. (OpenAI, 2020). Scaling Laws for Neural Language Models. The canonical modern reference. Power-law scaling for compute, data, parameters; joint scaling-law functional form.

Hoffmann et al. (DeepMind, 2022) Chinchilla. Training Compute-Optimal Large Language Models. Showed Kaplan was wrong by a factor of ~3-4; established the 20:1 token:parameter ratio.

Resolution papers (2023–2024). Porian et al., Resolving Discrepancies in Compute-Optimal Scaling of Language Models (the "Yair" paper below, after senior author Yair Carmon). Pearce & Song. Epoch AI's Chinchilla method-3 reanalysis. We'll walk through these in §17–§19.

The lesson: scaling laws as an empirical paradigm are 30+ years old. The neural-LLM-era contribution was scaling them across many orders of magnitude and using them to make multi-million-dollar engineering decisions.

3. The Math — Why Power Laws Are Natural

Mean estimation (the simplest scaling law)

You have n Gaussian samples; estimate the mean μ̂. Error:

$$\mathbb{E}\left[(\hat{\mu} - \mu)^{2}\right] = \frac{\sigma^{2}}{n}.$$

Take logs: log(error) = log(σ²) − log(n). Linear on a log-log plot, slope −1. This is a scaling law.

In general, anything of the form error = C · n^(−α) + ε_∞ plotted on log-log gives a line with slope −α (after subtracting the asymptote ε_∞).

For classical parametric estimation (mean, regression), α = 1. Slope minus one. This is the classical statistics rate.
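
The slope of −1 can be checked numerically. The following is a minimal sketch, assuming unit-variance Gaussian samples and a small Monte Carlo budget; the function names, sample sizes, and trial count are illustrative choices, not something taken from the lecture.

```python
import math
import random

def mean_squared_error_of_mean(n, trials, rng, mu=0.0, sigma=1.0):
    """Monte Carlo estimate of E[(mu_hat - mu)^2] for the mean of n Gaussian samples."""
    total = 0.0
    for _ in range(trials):
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (mu_hat - mu) ** 2
    return total / trials

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

rng = random.Random(0)               # fixed seed so the run is repeatable
sample_sizes = [10, 40, 160, 640]    # assumed ladder of n values
errors = [mean_squared_error_of_mean(n, trials=400, rng=rng) for n in sample_sizes]
slope = loglog_slope(sample_sizes, errors)
print(f"fitted log-log slope: {slope:.2f} (theory: -1)")
```

The fitted slope should land close to −1, with a small deviation that comes from Monte Carlo noise rather than from the scaling law itself.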

Non-parametric estimation (more flexible models)

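Estimate an arbitrary smooth function of a D-dimensional input (here D is the input dimension, not the dataset size used in later sections). Cut the input space into boxes of side h: there are ~h^(−D) boxes, each box gets ~n / n_boxes samples, and the per-box estimation error is ~ 1/√(samples_per_box). Smaller boxes track the function more closely but leave fewer samples per box; balancing the two gives, for large D, a total error rate of roughly

$$\text{error} \sim n^{-1/D}.$$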

Slope on log-log plot is −1/D. Non-parametric rate is much slower than parametric 1/n.

Where neural language models sit

Empirical neural scaling-law exponents are typically −0.05 to −0.1 — way slower than −1, more like −1/D with D ≈ 10-20. This suggests:

Neural networks behave more like non-parametric regressors.
The "intrinsic dimension" of the learning problem is on the order of 10s.

Some theorists (Bahri et al.) argue this is literal: scaling-law exponents directly read off intrinsic dimension. The evidence is debatable, but the framing is useful.

Hook

Power-law scaling is the natural form for empirical risk decay. Parametric problems give slope −1; non-parametric gives slope −1/D. Neural LM scaling exponents (−0.05 to −0.1) are non-parametric-like, suggesting an effective intrinsic dimension of ~10–20.

4. Data Scaling Laws

The setup

Fix model architecture (much larger than data). Fix optimizer / schedule. Vary D (data size). Plot log test loss vs log D. Get a line.

The empirical fact (Kaplan et al., 2020)

log(loss) = log(C) − α · log(D)

with α ≈ 0.05–0.1 for language models. Slope is shallow.

Implication

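With α ≈ 0.05–0.1, a 10× increase in data cuts the reducible loss by only about 11–21% (10^(−α) ≈ 0.89–0.79), and halving it takes a factor of 2^(1/α): roughly 1,000× at α = 0.1 and far more at α = 0.05. This is why the field pushes toward trillions of tokens.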

The "model bigger than data" caveat

Data scaling laws assume you're in the power-law regime — model is big enough that you haven't hit the irreducible loss floor. Rule of thumb: model should be ~10× bigger than would fit the data, OR you must explicitly fit and subtract the asymptote.

If you're in the asymptote regime (data ≫ model capacity), more data doesn't help — you've saturated the model class.
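
To make the "fit and subtract the asymptote" option concrete, here is a minimal sketch on synthetic data. It assumes the form loss = C · D^(−α) + floor and uses a simple grid scan over candidate floors; the constants, the grid, and the function names are hypothetical, and a real fit would use measured losses and a proper optimizer.

```python
import math

def fit_loglog(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept, sum of squared residuals)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sxx
    intercept = my - slope * mx
    resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(lx, ly))
    return slope, intercept, resid

def fit_power_law_with_floor(tokens, losses, floor_candidates):
    """Scan candidate irreducible losses; keep the one whose log-log fit is straightest."""
    best = None
    for floor in floor_candidates:
        if floor >= min(losses):
            continue  # reducible loss must stay positive to take its log
        slope, intercept, resid = fit_loglog(tokens, [l - floor for l in losses])
        if best is None or resid < best[3]:
            best = (floor, -slope, math.exp(intercept), resid)
    return best[:3]  # (floor, alpha, C)

# Synthetic "measurements" from assumed ground-truth constants.
TRUE_C, TRUE_ALPHA, TRUE_FLOOR = 5.0, 0.08, 1.7
tokens = [10 ** k for k in range(6, 11)]
losses = [TRUE_C * d ** (-TRUE_ALPHA) + TRUE_FLOOR for d in tokens]

floor, alpha, c = fit_power_law_with_floor(tokens, losses, [i * 0.05 for i in range(41)])
naive_slope, _, _ = fit_loglog(tokens, losses)  # ignores the floor entirely
print(f"with floor: alpha={alpha:.3f}, floor={floor:.2f}; naive slope={naive_slope:.3f}")
```

On this synthetic data the naive fit reports a much shallower slope than the true α, which is the distortion the caveat above warns about.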

5. Data Mixture Scaling

The question

You have multiple data sources (e.g., news + Wikipedia). What mix maximizes performance?

The classical insight

For data scaling laws, slopes are usually determined by the model class, not the distribution. The intercept changes with mix; the slope mostly doesn't.

→ The best mix at small scale is also the best mix at large scale (if slopes don't change).

The practical recipe

Data Mixing Laws (paper). Train small models on small data with various mixes. Fit a function of (mix → loss). Extrapolate to predict optimal mix at production compute.

The empirical reality (DataDecide and others). Just train a bunch of small models, pick the best mix, scale up. Often no scaling law needed — best small mix = best large mix because slopes are similar.

Hook

"Slopes don't change with the mix; only intercepts do. So the best small-scale mix is the best large-scale mix. You can fit a scaling law or just sweep at small scale — both work."

6. Data Repetition Scaling (the 4-epoch rule)

The question

If compute is growing faster than data, how many times can you repeat data before it stops helping?

The empirical finding (Muennighoff et al. 2023, "Scaling Data-Constrained Language Models")

Up to ~4 epochs, repeating data is essentially free: you get the same scaling law as with fresh data. Past 4 epochs, the realized loss falls increasingly short of (is higher than) the fresh-data projection.

There's a modified functional form that quantifies the degradation. Repetition has diminishing returns; the marginal value of an extra epoch shrinks.
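
One way to picture such a form is an exponentially saturating "effective data" term. The sketch below is an assumption-laden illustration in that spirit, not the paper's fitted law: the function name is hypothetical and the saturation constant `r_star = 15.0` is an assumed value chosen only to show the shape.

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Effective fresh-data-equivalent tokens after `epochs` passes over `unique_tokens`.

    Assumed form: U * (1 + r_star * (1 - exp(-repeats / r_star))), with repeats = epochs - 1.
    For few repeats this is ~epochs * U; for many repeats it saturates at U * (1 + r_star).
    """
    repeats = epochs - 1
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

for epochs in (1, 4, 16, 64):
    seen = float(epochs)                    # tokens processed, in units of the unique set
    effective = effective_tokens(1.0, epochs)
    print(f"{epochs:>2} epochs: each repeated token is worth {effective / seen:.2f} of a fresh one")
```

Under this assumed constant, four epochs retain most of the value of fresh data while dozens of epochs do not, which matches the qualitative shape of the 4-epoch rule.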

The "infinite compute" extreme

Recent work (Liu, Hashimoto et al.) asks: with infinite compute, what's the best you can do with a fixed dataset?

Can't just repeat indefinitely (diminishing returns).
Can't grow model arbitrarily on fixed data (saturates).
Reach for ensembles, regularization, etc.
The slopes of the scaling laws barely change under these interventions; only the intercepts.

→ General lesson: "interventions change the intercept; the slope is determined by the data + model class."

7. Scale-Dependent Phenomena (Data Filtering)

The dynamic nature of "data quality"

Data filtering decisions are not static. They depend on your compute budget.

Low compute: filter aggressively, keep only the highest-quality stuff. You can't afford to train on noise.
High compute: loosen filters, accept lower quality. You'd rather train on more diverse low-quality data than repeat high-quality data N times.

Implication

Concepts that feel static — "data quality," "the right filter" — are actually dynamic across scale. Optimal filters are not fixed; they shift with scale. Engineering at scale requires re-tuning these decisions, not copying them from smaller runs.

8. Architecture Scaling

The brute-force question

Are transformers really better than LSTMs? Brute-force answer: train both at GPT-3 scale and compare. Multi-million-dollar question.

The scaling-law answer

Train both architectures at small scales across a compute range. Plot loss vs compute on log-log axes. Compare slopes and intercepts.

If LSTM has worse intercept AND/OR worse slope → don't pick LSTM.

If LSTM has same slope but worse intercept → it's a fixed gap; LSTM is dominated for this objective.

If LSTM has better slope (rare) → at sufficiently large compute, LSTM will eventually win — interesting.

Why every architecture paper has this plot

Mamba paper. Gated DeltaNet paper. Every architecture-improvement paper since 2020. The plot:

X-axis: log compute or log params.
Y-axis: log validation loss.
Lines: vanilla transformer baseline + the proposed architecture.

If the proposed architecture's line is below the baseline's at all compute levels in the studied range, the case is made. If the slope is worse, the case is broken — even if the intercept is better, scaling will eventually overturn the result.
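
The "worse slope eventually loses" argument can be made quantitative. This is a minimal sketch assuming both curves are pure power laws, loss = a · C^(−α), with no asymptote; the coefficients are hypothetical and the compute units are arbitrary.

```python
import math

def power_law_loss(compute, a, alpha):
    """Loss predicted by a pure power law a * C**(-alpha)."""
    return a * compute ** (-alpha)

def crossover_compute(a_base, alpha_base, a_new, alpha_new):
    """Compute at which the two power laws intersect, or None if the slopes are equal."""
    if math.isclose(alpha_base, alpha_new):
        return None  # parallel lines on log-log axes: a fixed gap, no crossover
    return (a_new / a_base) ** (1.0 / (alpha_new - alpha_base))

# Hypothetical fits: the new architecture has a better intercept but a worse slope.
BASE = (10.0, 0.050)
NEW = (9.0, 0.045)

c_star = crossover_compute(*BASE, *NEW)
print(f"new architecture wins below C ~ {c_star:.3g}, baseline wins above it")
```

A crossover far beyond any budget you will ever run is harmless; one inside your planned range is a reason to reject the proposal. Because this extrapolates fitted lines, it is only as trustworthy as the fits themselves.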

The Narang et al. 2020 study (T5 architectures)

A scaling study across many T5 architecture variants:

GLU vs non-GLU: GLU consistently better across scales. (Validates §3 of Modern LLM Architecture Choices.)
Performer (efficient attention): worse scaling — don't use.
Switch Transformer (MoE): good scaling.
Mixture of softmax: good scaling (though dropped from frontier for other reasons).

These small-scale comparisons captured the architecture decisions we ship in production today.

Hook

"Architecture papers prove themselves with scaling-law plots. Better intercept + same-or-better slope = adopt. Worse slope = discard. Frontier architecture decisions are made on small-compute scaling studies, not full runs."

9. Optimizer Scaling

Same procedure: SGD vs Adam — train across a compute range, plot scaling laws.

Empirical finding (Hestness et al., others): Adam has a better intercept than SGD. Same slope. Adam wins at all compute levels.

This recurs across many architecture / optimizer comparisons: slopes are stubbornly similar; intercepts are what move. Even huge interventions (SGD → Adam) usually leave the slope alone.

This is one of the deeper mysteries of empirical neural scaling.

10. Hyperparameter Scaling — Scale-Invariant Quantities

Number of layers

Tiny number of layers (1–2) → terrible scaling. Past that, more layers → smaller intercept (better) at every compute level.

But: number of layers is NOT scale-invariant. Bigger models want more layers in absolute terms.

Aspect ratio (`d_model / n_layers`) — the scale-invariant cousin

Plot terminal loss vs aspect ratio at multiple model sizes. The optimum is roughly the same — around d_model / n_layers ≈ 100.

This is what you actually want from a hyperparameter for scaling: the optimal value doesn't shift much with scale, so you can fit at small scale and reuse at large.

Head dimension

Similar story: roughly invariant across scale.

The general principle

When designing your scaling strategy:

Identify scale-invariant quantities (aspect ratio, head dim ratios, learning rate ratios).
Tune these at small scale and freeze them.
Scale up only the absolute sizes (parameters, data, compute).
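
As a sketch of what "freeze the ratio, scale the absolute size" looks like in practice, the helper below picks a depth and width for a target parameter count at a fixed aspect ratio. It assumes the common approximation N ≈ 12 · n_layers · d_model² for non-embedding parameters (which itself assumes a 4× feed-forward width); the function name is hypothetical.

```python
def dense_config_for(target_params, aspect_ratio=100):
    """Return (n_layers, d_model, approx_params) with d_model = aspect_ratio * n_layers.

    Uses N ~ 12 * n_layers * d_model**2 = 12 * aspect_ratio**2 * n_layers**3,
    an approximation for non-embedding parameters.
    """
    n_layers = max(1, round((target_params / (12 * aspect_ratio ** 2)) ** (1.0 / 3.0)))
    d_model = aspect_ratio * n_layers
    approx_params = 12 * n_layers * d_model ** 2
    return n_layers, d_model, approx_params

for target in (1e8, 1e9, 1e10):
    n_layers, d_model, approx = dense_config_for(target)
    print(f"target {target:.0e}: {n_layers} layers x d_model {d_model} ~ {approx:.2e} params")
```

A real config would also round d_model to a multiple of the head dimension and of whatever the hardware prefers; that detail is left out here.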

11. The Parameter-Counting Footgun (Kaplan's exclusion)

This is the headline cautionary tale. Scaling laws are sensitive to what you put on the x-axis.

What Kaplan did

When plotting depth-related scaling laws, the curves with embedding parameters included looked "funky." Kaplan excluded:

Token embeddings (vocab × d_model).
Final softmax projection (d_model × vocab).

Justification: "These don't do computation."

What this broke

Excluding embeddings systematically shifts the parameter axis. At small model sizes, embeddings are a huge fraction of total parameters. Excluding them makes small models look "smaller" than they really are.

This shifts the scaling law and changes the predicted compute-optimal model size. It accounts for a large share of the 3-4× Kaplan-vs-Chinchilla gap; §18 walks through the remaining contributors.
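
To see how uneven the distortion is, compare the embedding share of a tiny model and a very large one. This is a rough sketch: it assumes non-embedding parameters ≈ 12 · n_layers · d_model², counts a single vocab × d_model embedding matrix (as with tied input/output embeddings), and uses hypothetical configs with a GPT-2-sized vocabulary.

```python
def param_breakdown(n_layers, d_model, vocab_size):
    """Approximate (non_embedding, embedding) parameter counts for a dense transformer."""
    non_embedding = 12 * n_layers * d_model ** 2   # attention + MLP blocks, assumed 4x MLP width
    embedding = vocab_size * d_model               # one tied embedding matrix (assumption)
    return non_embedding, embedding

VOCAB = 50257  # assumed vocabulary size
configs = {"tiny": (4, 256), "large": (96, 12288)}  # hypothetical (n_layers, d_model)

embedding_share = {}
for name, (n_layers, d_model) in configs.items():
    non_emb, emb = param_breakdown(n_layers, d_model, VOCAB)
    embedding_share[name] = emb / (non_emb + emb)
    print(f"{name}: embeddings are {embedding_share[name]:.1%} of all parameters")
```

Dropping embeddings removes most of the tiny model's parameters but almost none of the large model's, so the x-axis is compressed unevenly, and the small-model end of the curve moves the most.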

Why this matters

Scaling laws aren't magic. Predictability across scales is engineered. You must pick the right x-axis, set hyperparameters correctly across scales, and avoid systematic biases like Kaplan's exclusion. Otherwise you get a scaling law that misleads.

12. MoE Scaling

The new variable

In dense models, "parameters" = "active parameters." In MoE, total params and active params are decoupled. What's the right x-axis?

The Apple/MIT analysis

For a fixed compute budget, you can ask: how should I trade total params (sparsity) vs active params?

Fix active params. Add more total params (more experts, most of them inactive for any given token). Loss decreases. "Inactive" parameters still help.
Fix total params. Increase sparsity (fewer active per token). Compute drops; quality drops.
Sweet spot: as compute grows, models want higher sparsity (more total, less active).

The functional form fits a clean joint scaling law over (active params, total params, compute).

Implication

Modern frontier MoE training (DeepSeek-V3 with 671B total / 37B active, Mixtral, etc.) lives in this 3D scaling regime. Designing the MoE config = picking a point on this surface.

13. Critical Batch Size

The motivating question

Big batch = good (more parallelism, data-parallel-friendly). But how big is too big?

The two regimes

Noise-limited (small batches). Each extra example reduces gradient variance proportionally. Doubling the batch ≈ doubling the effective gradient quality. Perfect scaling.
Bias-limited (large batches). You've reduced gradient noise below the bias floor (the gap between local descent direction and global minimum direction). Adding more examples doesn't help. Diminishing returns.

The critical batch size (CBS)

The crossover point. Defined operationally as the batch size where the marginal value of an additional example equals the marginal cost (in compute).

How to estimate it (OpenAI's procedure)

Pick a target loss L*.
For each batch size B, train and record (steps_to_target, examples_to_target). They satisfy examples = steps × B.
Fit the relationship:

$$\frac{1}{S / S_{\min}} + \frac{1}{E / E_{\min}} = 1.$$

Critical batch size:

$$B_{\text{crit}} = E_{\min} / S_{\min}.$$

Balances steps and examples: at B = B_crit the run needs 2 · S_min steps and 2 · E_min examples, twice each minimum, so it is not badly wasteful in either.
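
Combining the fitted relationship with examples = steps × B gives closed forms for both costs: S = S_min · (1 + B_crit / B) and E = E_min · (1 + B / B_crit). The sketch below evaluates them; the values of `s_min` and `e_min` are hypothetical and would normally come from the fit.

```python
def steps_and_examples(batch_size, s_min, e_min):
    """Steps and examples needed to hit the target loss at a given batch size.

    Derived from S_min/S + E_min/E = 1 together with E = S * batch_size.
    """
    b_crit = e_min / s_min
    steps = s_min * (1.0 + b_crit / batch_size)
    examples = e_min * (1.0 + batch_size / b_crit)
    return steps, examples

S_MIN, E_MIN = 10_000, 5_000_000   # hypothetical fitted minima
B_CRIT = E_MIN / S_MIN             # 500 in this example

for batch in (50, 500, 5_000):
    steps, examples = steps_and_examples(batch, S_MIN, E_MIN)
    print(f"B={batch:>5}: steps={steps / S_MIN:.1f} x S_min, examples={examples / E_MIN:.1f} x E_min")
```

Well below B_crit you pay many extra steps for almost no saving in examples; well above it you burn examples for almost no saving in steps.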

Why this is in the scaling lecture

The critical batch size scales with target loss / compute. As your run gets bigger (lower target loss), CBS grows as a power law:

$$B_{\text{crit}} \propto \text{loss}^{-\beta}.$$

→ Big training runs can use huge batch sizes. That's good, because data parallelism wants big batches.

The intuition: closer to the minimum, gradient variance matters more relative to bias, so variance reduction (= bigger batch) is more valuable.

Hook

"Critical batch size = the largest batch where you're still getting near-perfect parallelization gains. It grows as loss decreases (per a power law), so big training runs can use enormous batches — convenient for data parallelism."

14. Learning Rate Scaling and μP

The empirical fact

As models get wider, optimal learning rate shrinks — roughly as 1/width for standard parameterizations.

Two strategies

Strategy 1: Fit a learning-rate scaling law.

Sweep LR at multiple model sizes.
Find optimal LR at each size.
Fit optimal_LR = const · width^(−γ).
Extrapolate to your big run.

Used by many production runs.
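
Strategy 1 can be sketched in a few lines. The sweep results below are made-up numbers that happen to follow an exact 1/width trend; real sweeps are noisy, so in practice you would want several seeds per point and a check on how far you are extrapolating.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = const * x**(-gamma) by least squares in log-log space; returns (const, gamma)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    return math.exp(my - slope * mx), -slope

# Hypothetical sweep outcome: best learning rate found at each width.
widths = [256, 512, 1024, 2048]
best_lrs = [6.0e-3, 3.0e-3, 1.5e-3, 7.5e-4]

const, gamma = fit_power_law(widths, best_lrs)
target_width = 8192
predicted_lr = const * target_width ** (-gamma)
print(f"gamma={gamma:.2f}, predicted optimal LR at width {target_width}: {predicted_lr:.2e}")
```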

Strategy 2: Reparameterize so optimal LR is invariant. (μP — "Maximal Update Parameterization", Yang et al.)

Rescale initialization sizes and per-parameter learning rates so that the optimal LR is the same across all model sizes.
Then sweep LR at small scale, find optimum, use that LR at large scale.

μP advantages:

Eliminates the LR-scaling-law fit.
Theoretically motivated.

μP disadvantages:

Touches initialization and optimizer scaling for every parameter group — annoying to implement correctly.
Reports of mixed success (some labs report great results; others struggle).

Both strategies have shipped frontier models. Strategy 1 is more common; μP is gaining ground.

(Detailed coverage in 02_gradient_descent/LEARNING_RATE_DEEP_DIVE.md and the advanced scaling lecture's μP section.)

15. Upstream vs Downstream Transfer

The seductive story

"My pretraining loss looks great → my model will be great."

The reality

Upstream perplexity vs downstream task accuracy is far less correlated than you'd think. From the Narang et al. T5 study: their best perplexity model (NL-12) was not the best downstream model (NL-32 XL was, despite worse perplexity).

Why scaling laws live on the upstream side

Perplexity is clean, regular, predictable, low-variance. Singletons (no replicates) are usually fine because the second-decimal-place noise is tiny.
Downstream metrics are jagged, noisy, sometimes discontinuous. Sigmoidal emergence patterns. Hard to fit clean scaling laws.

The senior engineering practice

Establish scaling regularity on the upstream metric (perplexity, log loss, BPB).
Establish a (less strict) belief about transfer to downstream — usually monotone, often non-trivial.
Validate transfer separately with downstream eval, ideally on a few model sizes.

Don't conflate "I have a beautiful upstream scaling law" with "downstream will follow."

The post-training people's complaint

Tatsu's anecdote: "Pretraining people hand you a model and say 'perplexity is good, your problem now.' But the problem started in pretraining."

→ The senior takeaway: don't just optimize perplexity. Validate downstream.

16. Joint Scaling Laws (Kaplan & Rosenfeld functional forms)

The compute-allocation question

Given a fixed compute budget, do you spend it on:

More data (bigger D)?
A bigger model (bigger N)?

Need a joint scaling law loss(N, D).

Two competing functional forms

Rosenfeld (simple):

$$L(N, D) = \frac{a}{N^{\alpha}} + \frac{b}{D^{\beta}} + L_{\infty}.$$

Sum of two inverse power-law terms plus an irreducible-loss asymptote.

Kaplan (similar idea, slightly more elaborate).

Why the limits make sense

D → ∞: data term vanishes; you become model-size-bound. L → a/N^α + L_∞ — pure model scaling.
N → ∞: model term vanishes; you become data-bound. L → b/D^β + L_∞ — pure data scaling.

Always sanity-check a joint scaling law by taking limits.

Empirical reality

Rosenfeld and others showed that fitting on a small (N, D) corner extrapolates accurately to much higher N and D. This is the entire premise of compute-optimal scaling.

Compute-optimal trade-off

Given compute C ≈ 6 · N · D (training FLOPs are roughly linear in the product of parameters and tokens), minimize loss subject to the compute constraint. Standard non-linear optimization → recipe for (N*, D*) as a function of compute.
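
For the Rosenfeld-style form this optimization has a closed-form answer. Substituting D = C / (6N) and setting the derivative with respect to N to zero gives N* = G · (C/6)^(β/(α+β)) and D* = (C/6) / N*, with G = (αa / (βb))^(1/(α+β)). The sketch below implements that under the assumed constraint C = 6 · N · D; the coefficient values are illustrative placeholders, not a fit to any real runs.

```python
def joint_loss(n, d, a, alpha, b, beta, l_inf):
    """Rosenfeld-style joint scaling law: a/N^alpha + b/D^beta + L_inf."""
    return a / n ** alpha + b / d ** beta + l_inf

def compute_optimal(c_flops, a, alpha, b, beta):
    """Closed-form minimiser of a/N^alpha + b/D^beta subject to 6*N*D = c_flops."""
    g = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta))
    n_opt = g * (c_flops / 6.0) ** (beta / (alpha + beta))
    d_opt = (c_flops / 6.0) / n_opt
    return n_opt, d_opt

# Illustrative coefficients (assumed for this example only).
COEF = dict(a=406.4, alpha=0.34, b=410.7, beta=0.28)
L_INF = 1.69

n_opt, d_opt = compute_optimal(1e21, **COEF)
n_big, _ = compute_optimal(1e23, **COEF)
model_exponent = COEF["beta"] / (COEF["alpha"] + COEF["beta"])
print(f"C=1e21: N*={n_opt:.3g}, D*={d_opt:.3g}, tokens/param={d_opt / n_opt:.1f}")
print(f"N* grows as C^{model_exponent:.2f}, D* as C^{1 - model_exponent:.2f}")
```

Note that the allocation exponents depend only on α and β, while a and b set the constant in front. That is why small shifts in the fitted exponents change the recommended model size so much at large compute.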

17. The Kaplan-vs-Chinchilla Saga

Kaplan (2020) prescription

Solving the joint scaling-law optimization, Kaplan got:

$$N^{*} \propto C^{0.73}, \quad D^{*} \propto C^{0.27}.$$

→ As compute grows, train much bigger models, with relatively little extra data. Tokens-per-parameter shrinks.

The GPT-3 era

This Kaplan recipe drove the era of giant dense models: 175B GPT-3, MT-NLG 530B, hundreds of billions to trillions of dense parameters. Tokens-per-parameter ratios fell to ~2 or lower (GPT-3: roughly 300B tokens for 175B parameters).

Chinchilla (DeepMind, Hoffmann et al., 2022)

Did their own joint scaling-law fit using three different methods. Got:

$$N^{*} \propto C^{0.5}, \quad D^{*} \propto C^{0.5}.$$

→ Train models smaller than people thought; train them on more data. Tokens-per-parameter constant, around 20.

The famous "Chinchilla 20:1 ratio."

For a fixed compute budget, the Chinchilla recipe says: train a 67B-ish model, not a 280B model. DeepMind tested this directly: Chinchilla (70B parameters, ~1.4T tokens) outperformed Gopher (280B parameters, ~300B tokens) at roughly the same training compute.

Empirically: Chinchilla was right.

Why the disagreement matters

These were two reasonable papers, by reasonable researchers, both fitting joint scaling laws. They disagreed by a factor of 3-4× on optimal model size. What happened?

18. Why Kaplan and Chinchilla Disagreed

The Porian et al. ("Yair") resolution paper

Resolving Discrepancies in Compute Optimal Scaling of Language Models. They walk through the gap step by step:

Replicate Kaplan settings exactly → get Kaplan's prediction.
Change parameter counting (include all parameters including embeddings and final softmax). Curve shifts.
Fix learning-rate warmup for small models (Kaplan's small models weren't fully converged because warmup was too long relative to total training).
Tune optimizer per model size (Kaplan held one batch size fixed; suboptimal for small models).

Cumulative effect of these "minor" decisions: exactly Chinchilla's prediction. The gap is a sequence of small calibration errors compounding.

The lesson

Tatsu's framing: scaling laws are lower bounds. They tell you: "if I scale up this recipe, the result will be at least this good." If your recipe is misspecified at small scale (bad warmup, bad batch size, wrong parameter counting), the scaling law you fit will mislead.

→ Get the small-scale recipe right first. Match learning-rate warmup, batch size, parameter counting — everything that scales — to what you'd actually do at large scale. Otherwise your scaling law is fitting an artifact.

The Pearce & Song complementary analysis

Showed (without training new models) that Kaplan's lower compute scale + the non-linearity from non-embedding-parameter-counting is sufficient to produce the Kaplan-vs-Chinchilla gap. They simulated Kaplan-style training curves from the Chinchilla functional form and reproduced the disagreement.

→ Two complementary explanations: (1) Yair's "small calibration errors compound," (2) Pearce-Song's "low-compute regime + parameter-counting nonlinearity."

Both probably true.

19. The Chinchilla Method-3 Mystery (Epoch AI Resolution)

The three Chinchilla methods

The Chinchilla paper fit scaling laws three ways:

Method 1: Lower-envelope. Take the bottom of training curves (lowest loss at each compute level). Fit a line. → 67B optimal model.

Method 2: Isoflops. Pick fixed compute budgets. Sweep N/D trade-off at each. Find the minimum. Fit the minima. → 63B optimal.

Method 3: Joint functional-form fit. Fit Rosenfeld-style L(N, D). Solve. → 0.46/0.54 split (different from methods 1+2's 0.5/0.5).

Methods 1 and 2 agree → 20:1 ratio. Method 3 disagrees — implies tokens-per-parameter grows with compute, not constant.

The Epoch AI reanalysis

Couldn't get raw data or code. Extracted data points from plot images in the paper. Refit Method 3.

Discovery: the Chinchilla paper's Method-3 fit was suboptimal — didn't actually minimize the fitting loss. With proper curve-fitting (better optimization, possibly different priors), Method 3 produces almost exactly the same 0.5/0.5 / 20:1 prediction as Methods 1 and 2.

The Chinchilla authors were more right than they realized. All three methods agree once you do Method 3 correctly.

The deeper lesson

Even canonical, peer-reviewed, well-cited scaling-law papers can have fitting bugs that change conclusions materially. Be skeptical of curve fits, especially in 3D. Replicate when you can.

20. Isoflops — The Workhorse Research Protocol

Why isoflops won

Of the three Chinchilla methods, isoflops is the most robust and easiest to execute in practice:

Pick a flop budget C_0. Or several: C_1, C_2, C_3, ... in a geometric ladder.
For each C_i, sweep (N, D) pairs that all satisfy 6 · N · D ≈ C_i (so D is fixed by N once the budget is set).
Train each pair. Record final loss.
Plot loss vs log N (with D implicitly varying). Get a U-shape per C_i.
Fit a quadratic in log N. Take the minimum. That's (N*_i, D*_i, L*_i) for compute C_i.
Plot (C_i, N*_i) and fit a power law. Same for (C_i, D*_i).

Done. Robust, parsimonious, doesn't require fitting a 3D surface.
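
The steps above can be sketched end to end on synthetic data. Instead of training models, this example draws "final losses" from an assumed Rosenfeld-style surface with illustrative coefficients, so the recovered exponent can be compared with the known answer β/(α+β). The budget ladder, the sweep grid centered on a 20:1 guess, and all function names are assumptions made for the illustration.

```python
import math

A, ALPHA, B, BETA, L_INF = 406.4, 0.34, 410.7, 0.28, 1.69  # illustrative surface only

def synthetic_loss(n, d):
    """Stand-in for 'train a model with N params on D tokens and read off the final loss'."""
    return A / n ** ALPHA + B / d ** BETA + L_INF

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def parabola_vertex(xs, ys):
    """Least-squares fit y = c0 + c1*t + c2*t^2 with t = x - mean(x); return x at the minimum."""
    mx = sum(xs) / len(xs)
    ts = [x - mx for x in xs]
    s = [sum(t ** k for t in ts) for k in range(5)]                   # power sums, s[0] = count
    r = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]   # right-hand side
    m = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = det3(m)
    # Cramer's rule for the linear (c1) and quadratic (c2) coefficients.
    c1 = det3([[m[i][0], r[i], m[i][2]] for i in range(3)]) / d
    c2 = det3([[m[i][0], m[i][1], r[i]] for i in range(3)]) / d
    return mx - c1 / (2.0 * c2)

def line_slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)

budgets = [1e19, 1e20, 1e21, 1e22]          # geometric ladder of FLOP budgets
log_c, log_n_opt = [], []
for c in budgets:
    n_guess = math.sqrt(c / 120.0)          # rough centre: D ~ 20 N with C ~ 6 N D
    sizes = [n_guess * 2.0 ** k for k in range(-3, 4)]
    losses = [synthetic_loss(n, c / (6.0 * n)) for n in sizes]   # D set by the budget
    log_n_opt.append(parabola_vertex([math.log(n) for n in sizes], losses))
    log_c.append(math.log(c))

n_exponent = line_slope(log_c, log_n_opt)
print(f"fitted N* ~ C^{n_exponent:.3f}; surface implies C^{BETA / (ALPHA + BETA):.3f}")
```

The parabola is only an approximation to the true U-shape, so the recovered exponent is close to the analytic value rather than identical. With real, noisy runs, the width of the sweep and the number of points per budget matter a good deal more than they do here.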

Where isoflops shows up

Chinchilla method 2.
The MoE scaling study (active vs total params at fixed compute).
Diffusion model scaling studies.
Architecture-vs-architecture comparisons.

If you're stuck on a scaling-law decision, default to isoflops. It's the hammer that fits most nails.

21. The "Overtraining for Serving" Reality

Why Chinchilla 20:1 is not what production wants

Chinchilla 20:1 is training-compute-optimal. But in production, the cost split is roughly:

~20% on training.
~80% on R&D and serving.

Inference cost dominates over the model's lifetime. The relevant optimization is performance per parameter (small models that are good).

Overtraining

Train a smaller model than Chinchilla recommends, but on far more tokens. You sacrifice a little training-compute efficiency in exchange for a model that is much cheaper to serve.

Modern recipes:

Llama 2 7B: 286:1 tokens/param (vs Chinchilla 20:1).
Llama 3 8B: 1,875:1.
Qwen 2.5 7B: comparable or higher.

→ Modern frontier models are massively overtrained vs Chinchilla. Not because Chinchilla is wrong — because the optimization target shifted from "training-compute-optimal" to "serving-cost-optimal."
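
The ratios above are simple arithmetic on training-token counts. The sketch below reproduces them, assuming roughly 2T training tokens for Llama 2 7B and roughly 15T for Llama 3 8B with nominal parameter counts of 7B and 8B; these inputs are approximations consistent with the ratios quoted above.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20

def tokens_per_param(train_tokens, n_params):
    """Training tokens seen per model parameter."""
    return train_tokens / n_params

# (training tokens, parameters): approximate figures, assumed for illustration.
models = {
    "Llama 2 7B": (2_000_000_000_000, 7_000_000_000),
    "Llama 3 8B": (15_000_000_000_000, 8_000_000_000),
}

ratios = {name: tokens_per_param(t, p) for name, (t, p) in models.items()}
for name, ratio in ratios.items():
    overtrain = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:,.0f} tokens/param, about {overtrain:.0f}x the Chinchilla ratio")
```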

The lesson

Chinchilla's 20:1 is a research number — the point at which you minimize the FLOPs needed to reach a given loss. It is not what you want if you'll serve the model at scale. Pick a smaller model than 20:1 suggests; train it on more tokens; pay slightly more in training to save much more in serving.

Why Chinchilla is still important

Even though we don't follow the 20:1 ratio, the Chinchilla saga teaches:

How to fit joint scaling laws.
How small calibration errors compound.
The isoflops protocol.
Why upstream-vs-downstream matters.
The methodology, not the recipe.

22. Pitfalls and Senior Signals

Pitfalls

Compute scale too small. Hard to distinguish polynomial from exponential scaling — Taylor approximations look linear at any zoom level. Fit on at least 3–4 orders of magnitude in compute if you can.
Bad parameter counting (Kaplan footgun). Include all parameters consistently. Embeddings, final softmax, biases (if present), LayerNorm scales — all of it.
Hyperparameters not properly scaled across runs. If LR warmup, batch size, or μP-adjustment isn't right at small scale, your scaling law fits an artifact.
Method-fitting bugs. Even well-cited papers (Chinchilla method 3) have them. Replicate or sanity-check.
Conflating upstream and downstream. Scaling laws live on perplexity. Validate downstream separately.
Ignoring variance. Most scaling-law plots use singletons. Usually fine for perplexity (very low variance) but not for LR / batch-size / hyperparameter scaling laws (which can have huge variance). Replicate when stakes matter.
Slope vs intercept confusion. Most interventions change the intercept, not the slope. Don't claim a "huge improvement" if all you've done is shift the intercept down — that's a constant-factor speedup, not a fundamental scaling change.

Senior signals

You think in slope-vs-intercept terms. Most interventions move the intercept. Slope changes are rare and important.
You separate "what to scale" from "what to keep constant." Aspect ratio constant, parameters scale. LR-schedule shape constant, peak LR scales.
You name isoflops as the default protocol. And you can execute it on a whiteboard.
You know the Kaplan-vs-Chinchilla saga. Including the resolution.
You know about overtraining. And why production deviates from Chinchilla 20:1.
You distinguish upstream from downstream. Establish regularity on perplexity; validate transfer separately.
You can derive the simple-mean scaling law (σ²/n) and explain why neural exponents are smaller (non-parametric-like).
You don't oversell your scaling law. Acknowledge what it doesn't tell you (downstream, emergence, OOD generalization).

23. Interview Grill — 70 questions

Foundations (Q1–10)

Why do scaling laws exist? Why not just train at large scale and tune?
What's a power-law scaling law on a log-log plot?
State the simplest scaling law (mean estimation). What's the slope?
Why do parametric estimators give slope −1?
Why do non-parametric estimators give slope −1/D?
Typical neural-LM scaling exponent? What does it suggest about effective dimension?
Trace the historical lineage from Cortes 1993 → Banko-Brill → Hestness 2017 → Kaplan 2020.
What's the Hestness 2017 contribution that's underappreciated?
When does a power-law approximation break (asymptote)?
Why is "predictability across scales engineered, not automatic"?

Data scaling (Q11–18)

Sketch the data-scaling-law plot.
Why must the model be larger than the data for a clean data scaling law?
What's the modern slope for language modeling on data?
How would you fit a data-mixture scaling law?
Why does "best small-scale mix = best large-scale mix" often hold?
State the 4-epoch repetition rule.
Why does optimal data filtering depend on compute?
Why does the slope rarely change with intervention?

Architecture / optimizer / hyperparameter scaling (Q19–28)

How would you compare LSTM vs Transformer using scaling laws?
What does a worse slope mean for an alternative architecture?
Cite an architecture intervention that consistently wins on scaling-law plots (per Narang 2020).
Adam vs SGD scaling laws — same slope or different?
What's a "scale-invariant quantity"? Why does it matter for scaling strategy?
What's the canonical aspect ratio (d_model/n_layers)?
Why is number of layers not scale-invariant?
Why might a head dim be scale-invariant?
What's the Kaplan parameter-counting footgun?
How do non-embedding parameter exclusions distort scaling laws?

MoE scaling (Q29–32)

What's new about MoE scaling vs dense scaling?
Trade-off: total params vs active params at fixed compute?
As compute grows, do MoE models want more or less sparsity?
Why do "inactive" parameters still help reduce loss?

Critical batch size (Q33–40)

Define noise-limited and bias-limited regimes.
What's the critical batch size?
State the OpenAI estimation procedure.
What's the formula B_crit = E_min / S_min saying?
How does CBS scale with target loss?
Why does CBS grow as compute grows?
Why is CBS in the "scaling laws" lecture?
How is CBS related to data parallelism?
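
For the B_crit = E_min / S_min question above, a small sketch of the steps-versus-examples trade-off from McCandlish et al., (S/S_min − 1)(E/E_min − 1) = 1. The fitted values s_min and e_min below are hypothetical placeholders.

```python
def steps_and_examples(batch_size, s_min, e_min):
    """Steps S and examples E needed to reach a fixed target loss at a given batch size."""
    b_crit = e_min / s_min                          # critical batch size
    steps = s_min * (1.0 + b_crit / batch_size)     # approaches s_min as the batch grows
    examples = e_min * (1.0 + batch_size / b_crit)  # approaches e_min as the batch shrinks
    return b_crit, steps, examples

s_min, e_min = 10_000, 20_000_000  # hypothetical fitted values
b_crit, steps, examples = steps_and_examples(batch_size=2_000, s_min=s_min, e_min=e_min)
print(b_crit, steps, examples)
```

Training exactly at the critical batch size costs twice the minimum number of steps and twice the minimum number of examples. Batches far below it waste wall-clock time, and batches far above it waste data and compute.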

Learning rate scaling (Q41–46)

How does optimal LR scale with width (default rule of thumb)?
What's μP?
Compare "fit an LR scaling law" vs "use μP."
Why is μP harder to implement?
Which strategy do production runs use?
How does LR interact with batch size?

Upstream vs downstream (Q47–50)

Why are scaling laws cleaner on perplexity than on accuracy?
State the Narang 2020 observation about NL-12 vs NL-32 XL.
How do you validate transfer from upstream to downstream?
Why do post-training engineers complain about pretraining people?

Joint scaling and Chinchilla (Q51–62)

Sketch Rosenfeld's joint scaling law form.
State Kaplan's compute-optimal allocation.
State Chinchilla's compute-optimal allocation.
What's the famous Chinchilla 20:1 ratio?
Walk through Chinchilla method 1 (lower envelope).
Walk through Chinchilla method 2 (isoflops).
Walk through Chinchilla method 3 (joint fit).
Why did Kaplan and Chinchilla disagree (Yair's three reasons)?
What's Pearson & Song's complementary explanation?
What was the Chinchilla method-3 mystery?
How did Epoch AI resolve it?
State the deeper lesson from the saga.
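
For the method-3 questions above, a sketch of how a joint fit L(N, D) = E + A/N^α + B/D^β turns into a compute-optimal allocation under the usual C ≈ 6ND approximation. The default constants are the fit published in Hoffmann et al.; the Epoch AI reanalysis argues they should be different, so treat them as placeholders rather than ground truth.

```python
def chinchilla_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss surface L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

def compute_optimal(flops, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Closed-form minimizer of A/N^alpha + B/D^beta subject to flops = 6 * N * D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (flops / 6.0) ** (beta / (alpha + beta))
    d_opt = (flops / 6.0) / n_opt  # tokens implied by the compute constraint
    return n_opt, d_opt

n_opt, d_opt = compute_optimal(1e21)
print(f"N_opt ~ {n_opt:.3g} params, D_opt ~ {d_opt:.3g} tokens")
```

The exponents β/(α+β) and α/(α+β) are what the saga is about. Small changes in the fitted α and β move the allocation a long way once the budget is extrapolated over several orders of magnitude.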

Modern practice (Q63–70)

What's "overtraining" and why do production models do it?
What is the token-per-parameter ratio for Llama 2 7B, and how does it compare with Chinchilla's 20:1?
Why isn't Chinchilla 20:1 what serving-cost-minimizing labs want?
Why is isoflops the default protocol?
Walk through an isoflops sweep end-to-end.
What goes wrong when the compute scale of the sweep is too small?
Why are most scaling-law data points singletons (no replicates)?
State the "scaling laws are lower bounds" framing in one sentence.
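
For the end-to-end isoflops question above, a minimal sketch of the protocol: at each compute budget, sweep model size with tokens set by C ≈ 6ND, fit a parabola to loss versus log N, take its minimum, then fit the exponent of N_opt against C. The `synthetic_loss` function is a made-up stand-in for real training runs, and centering the sweep on sqrt(C/6) is a convenience that only works because this synthetic surface is symmetric.

```python
import numpy as np

def isoflop_minimum(n_params, losses):
    """Fit a parabola to loss vs log(N) at one compute budget; return the argmin N."""
    a, b, _ = np.polyfit(np.log(n_params), losses, deg=2)
    if a <= 0:
        raise ValueError("fit is not convex; widen the sweep around the minimum")
    return float(np.exp(-b / (2.0 * a)))

def fit_exponent(budgets, n_opts):
    """Slope of log(N_opt) vs log(C), i.e. the exponent a in N_opt ~ C^a."""
    slope, _ = np.polyfit(np.log(budgets), np.log(n_opts), deg=1)
    return float(slope)

def synthetic_loss(n, d):
    """Hypothetical loss surface standing in for actual training runs."""
    return 1.7 + 400.0 / n ** 0.3 + 400.0 / d ** 0.3

budgets = np.array([1e18, 1e19, 1e20])
n_opts = []
for c in budgets:
    n_grid = np.sqrt(c / 6.0) * np.logspace(-0.5, 0.5, 9)  # model sizes at fixed compute
    d_grid = c / (6.0 * n_grid)                            # tokens implied by C = 6 N D
    n_opts.append(isoflop_minimum(n_grid, synthetic_loss(n_grid, d_grid)))

exponent = fit_exponent(budgets, np.array(n_opts))
print(f"N_opt scales as C^{exponent:.2f}")
```

With real runs each point on the grid is a separate training job, so the convexity check matters: a sweep that never brackets the minimum produces a parabola whose vertex is an extrapolation, not a measurement.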

24. References

Cortes, Vapnik et al. 1993. Learning curves: Asymptotic values and rate of convergence.
Banko, Brill 2001. Scaling to very very large corpora for natural language disambiguation.
Collobert et al. 2011. Natural language processing (almost) from scratch.
Hestness et al. (Baidu) 2017. Deep learning scaling is predictable, empirically. arXiv:1712.00409.
Kaplan et al. (OpenAI) 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361.
Hoffmann et al. (DeepMind) 2022. Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
Rosenfeld et al. 2019. A Constructive Prediction of the Generalization Error Across Scales. arXiv:1909.12673.
Muennighoff et al. 2023. Scaling Data-Constrained Language Models. arXiv:2305.16264.
Narang et al. 2021. Do Transformer Modifications Transfer Across Implementations and Applications? arXiv:2102.11972. (Cited in this chapter as Narang 2020.)
Porian, Wortsman, Jitsev, Schmidt, Carmon 2024. Resolving Discrepancies in Compute-Optimal Scaling of Language Models. arXiv:2406.19146. (The "Yair" paper in this chapter, after Yair Carmon.)
Pearce, Song 2024. Reconciling Kaplan and Chinchilla Scaling Laws. (Cited in this chapter as "Pearson & Song".)
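Besiroglu et al. (Epoch AI) 2024. Chinchilla Scaling: A Replication Attempt (the method-3 reanalysis). arXiv:2404.10102.
nostalgebraist 2022. chinchilla's wild implications (LessWrong post).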
Yang et al. 2022. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (μP). arXiv:2203.03466.
McCandlish et al. (OpenAI) 2018. An Empirical Model of Large-Batch Training (critical batch size). arXiv:1812.06162.
Bahri et al. 2021. Explaining Neural Scaling Laws. arXiv:2102.06701.
Liu, Hashimoto et al. Pretraining Under Infinite Compute.

Cross-references in this repo

04_transformers/MODERN_LLM_ARCHITECTURE_CHOICES.md — what architecture choices scaling laws are used to justify.
02_gradient_descent/LEARNING_RATE_DEEP_DIVE.md — LR-scaling math.
52_statistical_learning_theory/ — generalization bounds.
62_frontier_training_playbook/ — production-scale recipes.
66_frontier_alignment_rl/REASONING_MODELS_DEEP_DIVE.md — test-time-compute as a third scaling axis.

How to use this chapter

Read straight through once — the historical → math → practical arc lands best in order.
Memorize §10 (scale-invariant quantities), §11 (Kaplan footgun), §17 (Chinchilla saga), §20 (isoflops), §21 (overtraining).
Be able to derive the simple-mean scaling law on a whiteboard in 60 seconds.
Be able to execute an isoflops protocol end-to-end on a whiteboard.
Be able to explain Kaplan-vs-Chinchilla in 90 seconds.
Drill the §23 grill until 60+/70 cold.

Single sentence to remember

Scaling laws are power-law-shaped predictive rules for how loss decays with data / model / compute; you fit them on a small-scale corner and extrapolate; the slope is determined by the model class and rarely moves; intercepts move with most interventions; the canonical lesson — Kaplan-vs-Chinchilla — is that small calibration errors (parameter counting, LR warmup, batch size) at small scale compound into large prediction gaps at large scale, so the recipe you fit must match the recipe you'll deploy.
