Scaling Laws — Deep Dive
Distilled from Stanford CS336 (Tatsu Hashimoto's Basic Scaling Laws lecture, 2025) and cross-referenced against the canonical literature (Hestness 2017, Kaplan 2020, Hoffmann 2022 Chinchilla, Rosenfeld, Pearce & Song, Porian et al. (the "Yair" resolution paper), Epoch AI's Chinchilla method-3 reanalysis).
The point: scaling laws turn frontier model engineering from "spend $10M and pray" into "fit a curve at small scale and extrapolate." They're not a free pass — they're an engineered measurement instrument that requires careful execution. This chapter walks the math, the historical lineage, the modern practice, and the canonical Kaplan-vs-Chinchilla cautionary tale.
Pair with `04_transformers/MODERN_LLM_ARCHITECTURE_CHOICES.md` (the architecture choices scaling laws are used to justify), `52_statistical_learning_theory/` (the generalization-bound lineage), `62_frontier_training_playbook/` (production-scale recipes).
Table of contents
- The mental model — why scaling laws exist
- Historical lineage — scaling laws are 30+ years old
- The math — why power laws are natural
- Data scaling laws (the cleanest case)
- Data mixture scaling
- Data repetition scaling (the 4-epoch rule)
- Scale-dependent phenomena (data filtering)
- Architecture scaling (LSTM vs Transformer, etc.)
- Optimizer scaling (SGD vs Adam)
- Hyperparameter scaling (aspect ratio, layers, head dim)
- The parameter-counting footgun (Kaplan's exclusion)
- MoE scaling
- Critical batch size
- Learning rate scaling and μP
- Upstream vs downstream transfer
- Joint scaling laws (Kaplan & Rosenfeld functional forms)
- The Kaplan-vs-Chinchilla saga
- Why they disagreed (Porian et al., Pearce & Song)
- The Chinchilla method-3 mystery (Epoch AI resolution)
- Isoflops — the workhorse research protocol
- The "overtraining for serving" reality
- Pitfalls and senior signals
- Interview grill — 70 questions
- References
1. The Mental Model
The motivating scenario
Your wealthy friend hands you 10,000 B200s for a month. Build a great open-source LLM. You have an infra team. You have pretraining data. Now you have to choose: architecture, optimizer, batch size, learning rate, depth, width, vocab, data mix.
Naïve approach: do multiple full training runs and tune. Wasteful and infeasible — each run costs millions.
Scaling-law approach: do all your optimization at small scale, fit predictive curves, extrapolate to the big run. This works only if small-scale and large-scale behavior are connected by simple regularities. The amazing empirical finding of the last decade is that they are, often with stunning precision.
Scaling laws are simultaneously:
- A paradigm. "We believe in the scaling laws" — at frontier labs, almost a creed.
- An engineered measurement instrument. Not magic; requires careful execution. Tatsu's mantra: predictability across scales is engineered, not automatic.
Three functional-form patterns to recognize
| Quantity scaled | Y-axis | Form on log-log | Interpretation |
|---|---|---|---|
| Training data D | log test loss | linear (slope ≈ −0.05 to −0.1) | power-law decay: loss ≈ const · D^(-α) |
| Model size N | log test loss | linear | same: loss ≈ const · N^(-α) |
| Compute C | log test loss | linear | compute-optimal frontier |
| Data + Model jointly | log loss surface | bilinear | joint scaling law |
| Downstream task accuracy | accuracy vs log compute | sigmoid | emergence-flavored |
| Capability vs date | task vs date | linear (upper envelope) | forecasting trends |
Hook
Scaling laws are simple predictive rules — usually power laws — that let you extrapolate small-scale behavior to large-scale behavior. They are how modern LLM engineering is done. They require careful setup; sloppy execution gives misleading conclusions.
2. Historical Lineage
Scaling laws are not new. The neural language modeling era didn't invent them.
Cortes, Vapnik et al. (1993, Bell Labs). Asked: "Training classifiers on huge datasets is expensive. Can we train on subsets, fit a curve, and extrapolate?" That is literally a data scaling law in 1993.
Banko & Brill (NLP, 2001). "Scaling to very very large corpora for natural language disambiguation." Showed that for many NLP tasks, more data beats algorithm choice — log-linear improvement in performance.
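Kolachina et al. (2012). Prediction of Learning Curves in Machine Translation. Modeled BLEU vs data size and compared candidate functional forms, including the three- and four-parameter power-law families (Pow3, Pow4) that closely resemble the forms we still fit today.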
Hestness et al. (Baidu, 2017). Deep Learning Scaling is Predictable, Empirically. Studied data scaling for speech recognition, machine translation, character-level LM, image classification — all showed power-law data scaling. Talked about emergence (because accuracy is discontinuous), compute scaling, systems-as-accuracy. Most things we discuss today were known in 2017 if you'd been paying attention.
Kaplan et al. (OpenAI, 2020). Scaling Laws for Neural Language Models. The canonical modern reference. Power-law scaling for compute, data, parameters; joint scaling-law functional form.
Hoffmann et al. (DeepMind, 2022) Chinchilla. Training Compute-Optimal Large Language Models. Showed Kaplan was wrong by a factor of ~3-4; established the 20:1 token:parameter ratio.
Resolution papers (2023–2024). Porian et al. (the "Yair" paper, after senior author Yair Carmon), Resolving Discrepancies in Compute-Optimal Scaling of Language Models. Pearce & Song. Epoch AI's Chinchilla method-3 reanalysis. We'll walk through these in §17–§19.
The lesson: scaling laws as an empirical paradigm are 30+ years old. The neural-LLM-era contribution was scaling them across many orders of magnitude and using them to make multi-million-dollar engineering decisions.
3. The Math — Why Power Laws Are Natural
Mean estimation (the simplest scaling law)
You have n Gaussian samples; estimate the mean μ̂. Error:
$$
\mathbb{E}\big[(\hat{\mu} - \mu)^2\big] = \frac{\sigma^2}{n}
$$
Take logs: log(error) = log(σ²) − log(n). Linear on a log-log plot, slope −1. This is a scaling law.
In general, anything of the form error = C · n^(−α) + ε_∞ plotted on log-log gives a line with slope −α (after subtracting the asymptote ε_∞).
For classical parametric estimation (mean, regression), α = 1. Slope minus one. This is the classical statistics rate.
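The slope of −1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `mean_estimation_slope`, the sample sizes, and the trial count are illustrative choices, not something from the lecture. It simulates the squared error of the sample mean at several values of n and fits a straight line in log-log space.

```python
import numpy as np

def mean_estimation_slope(sigma=2.0, sizes=(10, 100, 1000, 10000), trials=500, seed=0):
    """Fit the log-log slope of squared error vs. sample size for the sample mean."""
    rng = np.random.default_rng(seed)
    errors = []
    for n in sizes:
        # Draw `trials` independent datasets of size n with true mean 0.
        samples = rng.normal(loc=0.0, scale=sigma, size=(trials, n))
        # Squared error of each sample mean, averaged over the trials.
        errors.append(np.mean(samples.mean(axis=1) ** 2))
    # A power law is a straight line in log-log space; polyfit returns (slope, intercept).
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), deg=1)
    return slope, np.exp(intercept)

slope, const = mean_estimation_slope()
print(f"fitted slope = {slope:.2f}, fitted constant = {const:.2f} (sigma^2 = 4.0)")
```

The fitted slope should land close to −1 and the fitted constant close to σ², up to Monte Carlo noise.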
Non-parametric estimation (more flexible models)
Estimate an arbitrary smooth D-dimensional function. Cut the input space into boxes of side n^(−1/D); each box gets ~n / n_boxes samples; the per-box error is ~ 1/√(samples_per_box). The total error rate is
$$
\text{error} \sim n^{-1/D}
$$
Slope on log-log plot is −1/D. Non-parametric rate is much slower than parametric 1/n.
Where neural language models sit
Empirical neural scaling-law exponents are typically −0.05 to −0.1 — way slower than −1, more like −1/D with D ≈ 10-20. This suggests:
- Neural networks behave more like non-parametric regressors.
- The "intrinsic dimension" of the learning problem is on the order of 10s.
Some theorists (Bahri et al.) argue this is literal: scaling-law exponents directly read off intrinsic dimension. The evidence is debatable, but the framing is useful.
Hook
Power-law scaling is the natural form for empirical risk decay. Parametric problems give slope −1; non-parametric gives slope −1/D. Neural LM scaling exponents (−0.05 to −0.1) are non-parametric-like, suggesting an effective intrinsic dimension of ~10–20.
4. Data Scaling Laws
The setup
Fix model architecture (much larger than data). Fix optimizer / schedule. Vary D (data size). Plot log test loss vs log D. Get a line.
The empirical fact (Kaplan et al., 2020, among others)
log(loss) = log(C) − α · log(D)
with α ≈ 0.05–0.1 for language models. Slope is shallow.
Implication
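With α ≈ 0.1, 10× more data lowers the (reducible) loss by only about 20% (10^(−0.1) ≈ 0.79), and halving it takes roughly 2^(1/α) ≈ 1,000× more data; with α ≈ 0.05 the factor is about 10^6. This slow decay is why the field keeps pushing toward trillions of tokens.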
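The arithmetic behind a shallow slope is easy to reproduce. This small sketch assumes a pure power law, loss ∝ D^(−α), with no irreducible floor; the helper names `data_multiplier` and `loss_ratio_after` are hypothetical.

```python
def data_multiplier(loss_ratio: float, alpha: float) -> float:
    """Factor by which D must grow so that loss (proportional to D**-alpha) shrinks to `loss_ratio` of its value."""
    return loss_ratio ** (-1.0 / alpha)

def loss_ratio_after(data_factor: float, alpha: float) -> float:
    """Fraction of the original loss that remains after multiplying D by `data_factor`."""
    return data_factor ** (-alpha)

for alpha in (0.05, 0.1):
    print(
        f"alpha={alpha}: 10x data leaves {loss_ratio_after(10, alpha):.3f} of the loss; "
        f"halving the loss needs {data_multiplier(0.5, alpha):.3g}x data"
    )
```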
The "model bigger than data" caveat
Data scaling laws assume you're in the power-law regime — model is big enough that you haven't hit the irreducible loss floor. Rule of thumb: model should be ~10× bigger than would fit the data, OR you must explicitly fit and subtract the asymptote.
If you're in the asymptote regime (data ≫ model capacity), more data doesn't help — you've saturated the model class.
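One way to "fit and subtract the asymptote" is sketched below, assuming NumPy and synthetic data. It scans candidate floors, fits a line to log(loss − floor) for each, and keeps the floor with the smallest residual. The function name `fit_power_law_with_floor` and the synthetic constants are hypothetical, and a grid scan is only one of several reasonable fitting strategies.

```python
import numpy as np

def fit_power_law_with_floor(D, loss, n_grid=400):
    """Fit loss ≈ C * D**(-alpha) + floor by scanning candidate floors."""
    D, loss = np.asarray(D, float), np.asarray(loss, float)
    log_D = np.log(D)
    best = None
    # The floor must stay below the smallest observed loss so the log is defined.
    for floor in np.linspace(0.0, loss.min() * 0.999, n_grid):
        y = np.log(loss - floor)
        slope, intercept = np.polyfit(log_D, y, deg=1)
        resid = np.sum((y - (slope * log_D + intercept)) ** 2)
        if best is None or resid < best[0]:
            best = (resid, -slope, np.exp(intercept), floor)
    _, alpha, C, floor = best
    return alpha, C, floor

# Synthetic data (assumption): true alpha = 0.1, true floor = 1.7.
D = np.logspace(6, 10, 9)
loss = 20.0 * D ** -0.1 + 1.7
alpha, C, floor = fit_power_law_with_floor(D, loss)

# Ignoring the floor flattens the apparent slope.
naive_alpha = -np.polyfit(np.log(D), np.log(loss), deg=1)[0]
print(f"alpha with floor: {alpha:.3f}, floor: {floor:.2f}, naive alpha: {naive_alpha:.3f}")
```

The naive fit underestimates α here, which is the failure mode the caveat above warns about.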
5. Data Mixture Scaling
The question
You have multiple data sources (e.g., news + Wikipedia). What mix maximizes performance?
The classical insight
For data scaling laws, slopes are usually determined by the model class, not the distribution. The intercept changes with mix; the slope mostly doesn't.
→ The best mix at small scale is also the best mix at large scale (if slopes don't change).
The practical recipe
Data Mixing Laws (paper). Train small models on small data with various mixes. Fit a function of (mix → loss). Extrapolate to predict optimal mix at production compute.
The empirical reality (DataDecide and others). Just train a bunch of small models, pick the best mix, scale up. Often no scaling law needed — best small mix = best large mix because slopes are similar.
Hook
"Slopes don't change with the mix; only intercepts do. So the best small-scale mix is the best large-scale mix. You can fit a scaling law or just sweep at small scale — both work."
6. Data Repetition Scaling (the 4-epoch rule)
The question
If compute is growing faster than data, how many times can you repeat data before it stops helping?
The empirical finding (Muennighoff et al. 2023, "Scaling Data-Constrained Language Models")
Up to ~4 epochs, repeating data is essentially free: you get nearly the same scaling law as with fresh data. Past 4 epochs, the realized loss increasingly falls short of the fresh-data projection (the realized curve sits above the projected line).
There's a modified functional form that quantifies the degradation. Repetition has diminishing returns; the marginal value of an extra epoch shrinks.
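As a rough illustration of such a form, the sketch below assumes an exponentially saturating "effective data" model in the spirit of the data-constrained scaling paper: each repeat adds less than the previous one, governed by a half-life-like constant. The function name `effective_tokens` and the default `r_star=15.0` are assumptions for illustration, not values confirmed by this chapter.

```python
import math

def effective_tokens(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective fresh-token equivalent of training `epochs` passes over `unique_tokens`.

    Assumed form: D_eff = U * (1 + r_star * (1 - exp(-repeats / r_star))),
    where repeats = epochs - 1. r_star is a hypothetical decay constant.
    """
    repeats = epochs - 1.0
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

unique = 1e9
for epochs in (1, 2, 4, 10, 40):
    seen = unique * epochs
    eff = effective_tokens(unique, epochs)
    print(f"{epochs:>2} epochs: {eff / seen:.2f} of the tokens seen count as 'fresh'")
```

Under these assumptions, four epochs retain most of the value of fresh data, while tens of epochs retain well under half.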
The "infinite compute" extreme
Recent work (Liu, Hashimoto et al.) asks: with infinite compute, what's the best you can do with a fixed dataset?
- Can't just repeat indefinitely (diminishing returns).
- Can't grow model arbitrarily on fixed data (saturates).
- Reach for ensembles, regularization, etc.
- The slopes of the scaling laws barely change under these interventions; only the intercepts.
→ General lesson: "interventions change the intercept; the slope is determined by the data + model class."
7. Scale-Dependent Phenomena (Data Filtering)
The dynamic nature of "data quality"
Data filtering decisions are not static. They depend on your compute budget.
- Low compute: filter aggressively, keep only the highest-quality stuff. You can't afford to train on noise.
- High compute: loosen filters, accept lower quality. You'd rather train on more diverse low-quality data than repeat high-quality data N times.
Implication
Concepts that feel static — "data quality," "the right filter" — are actually dynamic across scale. Optimal filters are not fixed; they shift with scale. Engineering at scale requires re-tuning these decisions, not copying them from smaller runs.
8. Architecture Scaling
The brute-force question
Are transformers really better than LSTMs? Brute-force answer: train both at GPT-3 scale and compare. Multi-million-dollar question.
The scaling-law answer
Train both architectures at small scales across a compute range. Plot loss vs compute on log-log axes. Compare slopes and intercepts.
If LSTM has worse intercept AND/OR worse slope → don't pick LSTM.
If LSTM has same slope but worse intercept → it's a fixed gap; LSTM is dominated for this objective.
If LSTM has better slope (rare) → at sufficiently large compute, LSTM will eventually win — interesting.
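The slope-and-intercept comparison can be sketched in a few lines. This assumes NumPy and purely synthetic loss curves; `fit_loglog_line`, `crossover_compute`, and the constants for the two hypothetical architectures are illustrative only.

```python
import numpy as np

def fit_loglog_line(compute, loss):
    """Return (slope, intercept) of log10(loss) vs log10(compute)."""
    slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
    return slope, intercept

def crossover_compute(line_a, line_b):
    """Compute at which two log-log lines intersect, or None if they are parallel."""
    (sa, ia), (sb, ib) = line_a, line_b
    if np.isclose(sa, sb):
        return None  # same slope: a fixed gap, no crossover
    return 10 ** ((ib - ia) / (sa - sb))

# Synthetic small-scale sweeps (assumption): the candidate starts worse but has a steeper slope.
compute = np.logspace(17, 20, 6)
baseline_loss = 10.0 * compute ** -0.030
candidate_loss = 20.0 * compute ** -0.045

baseline = fit_loglog_line(compute, baseline_loss)
candidate = fit_loglog_line(compute, candidate_loss)
crossover = crossover_compute(baseline, candidate)
print(f"baseline slope {baseline[0]:.3f}, candidate slope {candidate[0]:.3f}")
print(f"candidate overtakes baseline near C = {crossover:.3g} FLOPs")
```

If the crossover lies far beyond any budget you could afford, the better slope is of academic interest only; if it lies inside your range, it changes the decision.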
Why every architecture paper has this plot
Mamba paper. Gated DeltaNet paper. Every architecture-improvement paper since 2020. The plot:
- X-axis: log compute or log params.
- Y-axis: log validation loss.
- Lines: vanilla transformer baseline + the proposed architecture.
If the proposed architecture's line is below the baseline's at all compute levels in the studied range, the case is made. If the slope is worse, the case is broken — even if the intercept is better, scaling will eventually overturn the result.
The Tay et al. 2022 study (T5 architectures)
A scaling study (Scaling Laws vs Model Architectures) across many T5 architecture variants:
- GLU vs non-GLU: GLU consistently better across scales. (Validates §3 of Modern LLM Architecture Choices.)
- Performer (efficient attention): worse scaling — don't use.
- Switch Transformer (MoE): good scaling.
- Mixture of softmax: good scaling (though dropped from frontier for other reasons).
These small-scale comparisons captured the architecture decisions we ship in production today.
Hook
"Architecture papers prove themselves with scaling-law plots. Better intercept + same-or-better slope = adopt. Worse slope = discard. Frontier architecture decisions are made on small-compute scaling studies, not full runs."
9. Optimizer Scaling
Same procedure: SGD vs Adam — train across a compute range, plot scaling laws.
Empirical finding (Hestness et al., others): Adam has a better intercept than SGD. Same slope. Adam wins at all compute levels.
This recurs across many architecture / optimizer comparisons: slopes are stubbornly similar; intercepts are what move. Even huge interventions (SGD → Adam) usually leave the slope alone.
This is one of the deeper mysteries of empirical neural scaling.
10. Hyperparameter Scaling — Scale-Invariant Quantities
Number of layers
Tiny number of layers (1–2) → terrible scaling. Past that, more layers → smaller intercept (better) at every compute level.
But: number of layers is NOT scale-invariant. Bigger models want more layers in absolute terms.
Aspect ratio (d_model / n_layers) — the scale-invariant cousin
Plot terminal loss vs aspect ratio at multiple model sizes. The optimum is roughly the same — around d_model / n_layers ≈ 100.
This is what you actually want from a hyperparameter for scaling: the optimal value doesn't shift much with scale, so you can fit at small scale and reuse at large.
Head dimension
Similar story: roughly invariant across scale.
The general principle
When designing your scaling strategy:
- Identify scale-invariant quantities (aspect ratio, head dim ratios, learning rate ratios).
- Tune these at small scale and freeze them.
- Scale up only the absolute sizes (parameters, data, compute).
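The principle above can be turned into a small config generator. This sketch assumes the common approximation that non-embedding parameters are about 12 · n_layers · d_model² (which presumes a 4× MLP expansion), holds the aspect ratio near 100, and fixes a hypothetical head dimension of 128. The function name `config_for_params` is made up for illustration.

```python
def config_for_params(target_params: float, aspect_ratio: float = 100.0, head_dim: int = 128) -> dict:
    """Pick (d_model, n_layers, n_heads) for a target size while holding aspect ratio and head_dim fixed.

    Assumes non-embedding params ≈ 12 * n_layers * d_model**2 and aspect_ratio = d_model / n_layers,
    so target_params ≈ 12 * d_model**3 / aspect_ratio.
    """
    d_raw = (target_params * aspect_ratio / 12.0) ** (1.0 / 3.0)
    # Round d_model to a multiple of head_dim so the heads divide evenly.
    d_model = max(head_dim, round(d_raw / head_dim) * head_dim)
    n_layers = max(1, round(d_model / aspect_ratio))
    return {
        "d_model": d_model,
        "n_layers": n_layers,
        "n_heads": d_model // head_dim,
        "approx_params": 12 * n_layers * d_model ** 2,
    }

for target in (1e8, 1e9, 7e9):
    print(f"{target:.0e}: {config_for_params(target)}")
```

Only the absolute sizes change between rows; the ratio d_model / n_layers and the head dimension stay fixed, which is the point of tuning scale-invariant quantities once.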
11. The Parameter-Counting Footgun (Kaplan's exclusion)
This is the headline cautionary tale. Scaling laws are sensitive to what you put on the x-axis.
What Kaplan did
When plotting depth-related scaling laws, the curves with embedding parameters included looked "funky." Kaplan excluded:
- Token embeddings (vocab × d_model).
- Final softmax projection (d_model × vocab).
Justification: "These don't do computation."
What this broke
Excluding embeddings systematically shifts the parameter axis. At small model sizes, embeddings are a huge fraction of total parameters. Excluding them makes small models look "smaller" than they really are.
This shifts the scaling law and changes the predicted compute-optimal model size by a factor of 3-4×, the same order as the Kaplan-vs-Chinchilla gap (§18 covers the other contributing factors).
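To see how large the distortion is at small scale, the sketch below counts parameters with and without embeddings. It assumes the approximation of 12 · n_layers · d_model² non-embedding parameters, untied input and output embeddings, and a hypothetical vocabulary of 50,257 tokens; `param_counts` is an illustrative helper.

```python
def param_counts(d_model: int, n_layers: int, vocab_size: int, tied_embeddings: bool = False) -> dict:
    """Approximate parameter counts, split into non-embedding and embedding parts."""
    non_embedding = 12 * n_layers * d_model ** 2          # attention + MLP blocks (approximation)
    copies = 1 if tied_embeddings else 2                  # input embedding and output projection
    embedding = copies * vocab_size * d_model
    total = non_embedding + embedding
    return {
        "non_embedding": non_embedding,
        "embedding": embedding,
        "total": total,
        "embedding_fraction": embedding / total,
    }

for name, d_model, n_layers in [("tiny", 256, 4), ("small", 768, 12), ("large", 8192, 80)]:
    counts = param_counts(d_model, n_layers, vocab_size=50257)
    print(f"{name}: total={counts['total']:.3g}, embedding share={counts['embedding_fraction']:.1%}")
```

The embedding share is dominant for the tiny config and negligible for the large one, so dropping embeddings bends the x-axis non-uniformly across the sweep.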
Why this matters
Scaling laws aren't magic. Predictability across scales is engineered. You must pick the right x-axis, set hyperparameters correctly across scales, and avoid systematic biases like Kaplan's exclusion. Otherwise you get a scaling law that misleads.
12. MoE Scaling
The new variable
In dense models, "parameters" = "active parameters." In MoE, total params and active params are decoupled. What's the right x-axis?
The Apple/MIT analysis
For a fixed compute budget, you can ask: how should I trade total params (sparsity) vs active params?
- Fix active params. Add more empty (inactive) total params (more experts). Loss decreases. "Inactive" parameters still help.
- Fix total params. Increase sparsity (fewer active per token). Compute drops; quality drops.
- Sweet spot: as compute grows, models want higher sparsity (more total, less active).
The functional form fits a clean joint scaling law over (active params, total params, compute).
Implication
Modern frontier MoE training (DeepSeek-V3 with 671B total / 37B active, Mixtral, etc.) lives in this 3D scaling regime. Designing the MoE config = picking a point on this surface.
13. Critical Batch Size
The motivating question
Big batch = good (more parallelism, data-parallel-friendly). But how big is too big?
The two regimes
- Noise-limited (small batches). Each extra example reduces gradient variance proportionally. Doubling the batch ≈ doubling the effective gradient quality. Perfect scaling.
- Bias-limited (large batches). You've reduced gradient noise below the bias floor (the gap between local descent direction and global minimum direction). Adding more examples doesn't help. Diminishing returns.
The critical batch size (CBS)
The crossover point. Defined operationally as the batch size where the marginal value of an additional example equals the marginal cost (in compute).
How to estimate it (OpenAI's procedure)
- Pick a target loss `L*`.
- For each batch size `B`, train and record `(steps_to_target, examples_to_target)`. They satisfy `examples = steps × B`.
- Fit the relationship:
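$$
\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1,
\qquad B_{\text{crit}} = \frac{E_{\min}}{S_{\min}}
$$
Here S and E are the steps and examples needed to reach the target loss, and S_min and E_min are their fitted minima (approached at very large and very small batch sizes, respectively). Training at B_crit: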
Balances steps and examples — slightly over both minima but not wasteful in either.
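The fit can be done with ordinary least squares. This sketch assumes the hyperbolic steps/examples trade-off (S/S_min − 1)(E/E_min − 1) = 1, which rearranges to S_min · (1/S) + E_min · (1/E) = 1 and is therefore linear in the two unknowns. It uses NumPy and synthetic measurements; `fit_critical_batch` is a hypothetical helper name.

```python
import numpy as np

def fit_critical_batch(steps, examples):
    """Fit S_min and E_min from (steps, examples)-to-target pairs; return (S_min, E_min, B_crit)."""
    steps, examples = np.asarray(steps, float), np.asarray(examples, float)
    # Rearranged trade-off: S_min * (1/S) + E_min * (1/E) = 1.
    A = np.column_stack([1.0 / steps, 1.0 / examples])
    (s_min, e_min), *_ = np.linalg.lstsq(A, np.ones(len(steps)), rcond=None)
    return s_min, e_min, e_min / s_min

# Synthetic runs (assumption): true S_min = 1000 steps, E_min = 1e6 examples, so B_crit = 1000.
batch_sizes = np.array([64, 256, 1024, 4096, 16384], dtype=float)
true_s_min, true_b_crit = 1000.0, 1000.0
steps_to_target = true_s_min * (1.0 + true_b_crit / batch_sizes)
examples_to_target = steps_to_target * batch_sizes     # examples = steps × B

s_min, e_min, b_crit = fit_critical_batch(steps_to_target, examples_to_target)
print(f"S_min = {s_min:.0f}, E_min = {e_min:.3g}, B_crit = {b_crit:.0f}")
```

With real runs the measurements are noisy, so it is worth covering batch sizes well below and well above the suspected critical value.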
Why this is in the scaling lecture
The critical batch size scales with target loss / compute. As your run gets bigger (lower target loss), CBS grows as a power law:
$$
B_{\text{crit}}(L) \propto L^{-1/\alpha_B}, \qquad \alpha_B > 0
$$
→ Big training runs can use huge batch sizes. That's good, because data parallelism wants big batches.
The intuition: closer to the minimum, gradient variance matters more relative to bias, so variance reduction (= bigger batch) is more valuable.
Hook
"Critical batch size = the largest batch where you're still getting near-perfect parallelization gains. It grows as loss decreases (per a power law), so big training runs can use enormous batches — convenient for data parallelism."
14. Learning Rate Scaling and μP
The empirical fact
As models get wider, optimal learning rate shrinks — roughly as 1/width for standard parameterizations.
Two strategies
Strategy 1: Fit a learning-rate scaling law.
- Sweep LR at multiple model sizes.
- Find optimal LR at each size.
- Fit `optimal_LR = const · width^(−γ)`.
- Extrapolate to your big run.
Used by many production runs.
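Strategy 1 reduces to a straight-line fit in log-log space. The sketch below assumes NumPy and made-up sweep results in which the best LR happens to fall as 1/width; `fit_lr_law` and `predict_lr` are hypothetical names, and in practice each `best_lr` would be the argmin of a real LR sweep at that width.

```python
import numpy as np

def fit_lr_law(widths, best_lrs):
    """Fit optimal_LR = const * width**(-gamma); return (const, gamma)."""
    slope, intercept = np.polyfit(np.log(widths), np.log(best_lrs), deg=1)
    return np.exp(intercept), -slope

def predict_lr(width, const, gamma):
    """Extrapolate the fitted law to a new width."""
    return const * width ** (-gamma)

# Hypothetical sweep results (assumption): best LR found at each small width.
widths = np.array([256, 512, 1024, 2048], dtype=float)
best_lrs = 0.8 / widths

const, gamma = fit_lr_law(widths, best_lrs)
print(f"gamma = {gamma:.2f}, predicted LR at width 8192 = {predict_lr(8192, const, gamma):.2e}")
```

Because LR optima are noisy, replicating the sweep at each width is worth the cost before trusting the extrapolation.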
Strategy 2: Reparameterize so optimal LR is invariant. (μP — "Maximal Update Parameterization", Yang et al.)
- Rescale initialization sizes and per-parameter learning rates so that the optimal LR is the same across all model sizes.
- Then sweep LR at small scale, find optimum, use that LR at large scale.
μP advantages:
- Eliminates the LR-scaling-law fit.
- Theoretically motivated.
μP disadvantages:
- Touches initialization and optimizer scaling for every parameter group — annoying to implement correctly.
- Reports of mixed success (some labs report great results; others struggle).
Both strategies have shipped frontier models. Strategy 1 is more common; μP is gaining ground.
(Detailed coverage in 02_gradient_descent/LEARNING_RATE_DEEP_DIVE.md and the advanced scaling lecture's μP section.)
15. Upstream vs Downstream Transfer
The seductive story
"My pretraining loss looks great → my model will be great."
The reality
Upstream perplexity and downstream task accuracy are far less correlated than you'd think. From the Tay et al. (2021) Scale Efficiently study of T5 models: their best perplexity model (NL-12) was not the best downstream model (NL-32 XL was, despite worse perplexity).
Why scaling laws live on the upstream side
- Perplexity is clean, regular, predictable, low-variance. Singletons (no replicates) are usually fine because the second-decimal-place noise is tiny.
- Downstream metrics are jagged, noisy, sometimes discontinuous. Sigmoidal emergence patterns. Hard to fit clean scaling laws.
The senior engineering practice
- Establish scaling regularity on the upstream metric (perplexity, log loss, BPB).
- Establish a (less strict) belief about transfer to downstream — usually monotone, often non-trivial.
- Validate transfer separately with downstream eval, ideally on a few model sizes.
Don't conflate "I have a beautiful upstream scaling law" with "downstream will follow."
The post-training people's complaint
Tatsu's anecdote: "Pretraining people hand you a model and say 'perplexity is good, your problem now.' But the problem started in pretraining."
→ The senior takeaway: don't just optimize perplexity. Validate downstream.
16. Joint Scaling Laws (Kaplan & Rosenfeld functional forms)
The compute-allocation question
Given a fixed compute budget, do you spend it on:
- More data (bigger D)?
- A bigger model (bigger N)?
Need a joint scaling law loss(N, D).
Two competing functional forms
Rosenfeld (simple):
$$
L(N, D) = \frac{a}{N^{\alpha}} + \frac{b}{D^{\beta}} + L_{\infty}
$$
Sum of two inverse power-law terms plus an irreducible-loss asymptote.
Kaplan (similar idea, slightly more elaborate).
Why the limits make sense
- `D → ∞`: data term vanishes; you become model-size-bound. `L → a/N^α + L_∞`, i.e. pure model scaling.
- `N → ∞`: model term vanishes; you become data-bound. `L → b/D^β + L_∞`, i.e. pure data scaling.
Always sanity-check a joint scaling law by taking limits.
Empirical reality
Rosenfeld and others showed that fitting on a small (N, D) corner extrapolates accurately to much higher N and D. This is the entire premise of compute-optimal scaling.
Compute-optimal trade-off
Given compute = constant · N · D (linear in product, roughly), minimize loss subject to compute constraint. Standard non-linear optimization → recipe for (N*, D*) as a function of compute.
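For the simple form L(N, D) = a/N^α + b/D^β + L_∞ the constrained minimum has a closed form. The sketch below assumes the constraint C = k · N · D with k = 6 (a common approximation for dense transformers) and uses NumPy. Substituting D = C/(kN) and setting the derivative to zero gives N* = G · (C/k)^(β/(α+β)) with G = (αa/(βb))^(1/(α+β)). The constants in `fit` are assumed here to match those reported for Chinchilla's parametric fit and should be treated as illustrative.

```python
import numpy as np

def joint_loss(N, D, a, alpha, b, beta, L_inf):
    """Joint law: two inverse power-law terms plus an irreducible floor."""
    return a / N ** alpha + b / D ** beta + L_inf

def compute_optimal(C, a, alpha, b, beta, k=6.0):
    """Closed-form minimizer of a/N^alpha + b/D^beta subject to C = k * N * D."""
    budget = C / k                                   # the product N * D
    G = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta))
    N_opt = G * budget ** (beta / (alpha + beta))
    D_opt = budget / N_opt
    return N_opt, D_opt

# Illustrative constants (assumption, see the note above).
fit = dict(a=406.4, alpha=0.34, b=410.7, beta=0.28)
N_opt, D_opt = compute_optimal(1e23, **fit)
print(f"N* = {N_opt:.3g}, D* = {D_opt:.3g}, tokens/param = {D_opt / N_opt:.1f}")

# Limit check: with effectively infinite data only the model term and the floor remain.
N = 1e9
assert np.isclose(joint_loss(N, 1e60, L_inf=1.69, **fit), fit["a"] / N ** fit["alpha"] + 1.69)
```

The exponents β/(α+β) and α/(α+β) are what determine how N* and D* grow with compute, so small errors in the fitted α and β translate directly into different allocation advice.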
17. The Kaplan-vs-Chinchilla Saga
Kaplan (2020) prescription
Solving the joint scaling-law optimization, Kaplan got:
$$
N^{*} \propto C^{0.73}, \qquad D^{*} \propto C^{0.27}
$$
→ As compute grows, train much bigger models, with relatively little extra data. Tokens-per-parameter shrinks.
The GPT-3 era
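This Kaplan recipe drove the era of giant dense models: GPT-3 (175B), MT-NLG (530B), and other models with hundreds of billions of dense parameters. Tokens-per-parameter ratios were tiny by today's standards: GPT-3 saw about 300B tokens, which is under 2 tokens per parameter.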
Chinchilla (DeepMind, Hoffmann et al., 2022)
Did their own joint scaling-law fit using three different methods. Got:
$$
N^{*} \propto C^{0.5}, \qquad D^{*} \propto C^{0.5}
$$
→ Train models smaller than people thought; train them on more data. Tokens-per-parameter constant, around 20.
The famous "Chinchilla 20:1 ratio."
For a fixed compute budget, the Chinchilla recipe says: train a 67B-ish model, not a 280B model. The 67B model trained Chinchilla-optimal will outperform the 280B trained Kaplan-optimal at the same compute.
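The 20:1 rule turns into a two-line sizing calculation. This sketch assumes C ≈ 6 · N · D and a fixed ratio D = 20 · N, so N = sqrt(C / 120). The function name `chinchilla_allocation` is hypothetical, and the example budget of 5.76e23 FLOPs is assumed here to be roughly the budget of the 280B-parameter run being compared against.

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0, k: float = 6.0):
    """Split a FLOP budget into (params, tokens) under C = k * N * D and D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (k * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n_params, n_tokens = chinchilla_allocation(5.76e23)   # assumed example budget
print(f"params = {n_params:.3g}, tokens = {n_tokens:.3g}")
```

Under these assumptions the result lands in the high tens of billions of parameters trained on over a trillion tokens, consistent with the "67B-ish, not 280B" statement above.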
Empirically: Chinchilla was right.
Why the disagreement matters
These were two reasonable papers, by reasonable researchers, both fitting joint scaling laws. They disagreed by a factor of 3-4× on optimal model size. What happened?
18. Why Kaplan and Chinchilla Disagreed
The Porian et al. ("Yair") resolution paper
Resolving Discrepancies in Compute-Optimal Scaling of Language Models (Porian et al., 2024; Yair Carmon is the senior author). They walk through the gap step by step:
- Replicate Kaplan settings exactly → get Kaplan's prediction.
- Change parameter counting (include all parameters including embeddings and final softmax). Curve shifts.
- Fix learning-rate warmup for small models (Kaplan's small models weren't fully converged because warmup was too long relative to total training).
- Tune optimizer per model size (Kaplan held one batch size fixed; suboptimal for small models).
Cumulative effect of these "minor" decisions: exactly Chinchilla's prediction. The gap is a sequence of small calibration errors compounding.
The lesson
Tatsu's framing: scaling laws are lower bounds. They tell you: "if I scale up this recipe, the result will be at least this good." If your recipe is misspecified at small scale (bad warmup, bad batch size, wrong parameter counting), the scaling law you fit will mislead.
→ Get the small-scale recipe right first. Match learning-rate warmup, batch size, parameter counting — everything that scales — to what you'd actually do at large scale. Otherwise your scaling law is fitting an artifact.
The Pearce & Song complementary analysis
Showed (without training new models) that Kaplan's lower compute scale plus the non-linearity introduced by counting only non-embedding parameters is sufficient to produce the Kaplan-vs-Chinchilla gap. They simulated Kaplan-style training curves from the Chinchilla functional form and reproduced the disagreement.
→ Two complementary explanations: (1) Porian et al.'s "small calibration errors compound," (2) Pearce & Song's "low-compute regime + parameter-counting nonlinearity."
Both probably true.
19. The Chinchilla Method-3 Mystery (Epoch AI Resolution)
The three Chinchilla methods
The Chinchilla paper fit scaling laws three ways:
Method 1: Lower-envelope. Take the bottom of training curves (lowest loss at each compute level). Fit a line. → 67B optimal model.
Method 2: Isoflops. Pick fixed compute budgets. Sweep N/D trade-off at each. Find the minimum. Fit the minima. → 63B optimal.
Method 3: Joint functional-form fit. Fit Rosenfeld-style L(N, D). Solve. → 0.46/0.54 split (different from methods 1+2's 0.5/0.5).
Methods 1 and 2 agree → 20:1 ratio. Method 3 disagrees — implies tokens-per-parameter grows with compute, not constant.
The Epoch AI reanalysis
Couldn't get raw data or code. Extracted data points from plot images in the paper. Refit Method 3.
Discovery: the Chinchilla paper's Method-3 fit was suboptimal — didn't actually minimize the fitting loss. With proper curve-fitting (better optimization, possibly different priors), Method 3 produces almost exactly the same 0.5/0.5 / 20:1 prediction as Methods 1 and 2.
The Chinchilla authors were more right than they realized. All three methods agree once you do Method 3 correctly.
The deeper lesson
Even canonical, peer-reviewed, well-cited scaling-law papers can have fitting bugs that change conclusions materially. Be skeptical of curve fits, especially in 3D. Replicate when you can.
20. Isoflops — The Workhorse Research Protocol
Why isoflops won
Of the three Chinchilla methods, isoflops is the most robust and easiest to execute in practice:
- Pick a flop budget `C_0`. Or several: `C_1, C_2, C_3, ...` in a geometric ladder.
- For each `C_i`, sweep `(N, D)` pairs that all satisfy `N · D ≈ C_i`.
- Train each pair. Record final loss.
- Plot loss vs `N` (with `D` implicitly varying). Get a U-shape per `C_i`.
- Fit a quadratic. Take the minimum. That's `(N*_i, D*_i, L*_i)` for compute `C_i`.
- Plot `(C_i, N*_i)` and fit a power law. Same for `(C_i, D*_i)`.
Done. Robust, parsimonious, doesn't require fitting a 3D surface.
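The protocol above can be sketched end to end on synthetic data. This example assumes NumPy, the constraint C = 6 · N · D, and a stand-in function `synthetic_loss` in place of real training runs (its constants are illustrative). The helper names are hypothetical. For each budget it sweeps N, fits a parabola in log10 N around the best point, and then fits power laws to the minima.

```python
import numpy as np

def synthetic_loss(N, D, a=406.4, alpha=0.34, b=410.7, beta=0.28, L_inf=1.69):
    """Stand-in for 'train the (N, D) pair and record final loss' (assumed constants)."""
    return a / N ** alpha + b / D ** beta + L_inf

def isoflop_minimum(C, k=6.0, n_points=25):
    """Sweep N at fixed compute, fit a quadratic in log10(N) near the best point, return (N*, D*)."""
    center = 0.5 * np.log10(C / k)                    # N = D, used only to place the sweep
    logN = np.linspace(center - 2.0, center + 1.0, n_points)
    N = 10 ** logN
    loss = synthetic_loss(N, C / (k * N))
    i = int(np.argmin(loss))
    lo, hi = max(0, i - 3), min(n_points, i + 4)      # local window around the empirical minimum
    x = logN[lo:hi] - logN[i]                         # center x for a well-conditioned fit
    c2, c1, _ = np.polyfit(x, loss[lo:hi], deg=2)
    N_star = 10 ** (logN[i] - c1 / (2.0 * c2))        # vertex of the parabola
    return N_star, C / (k * N_star)

budgets = np.logspace(19, 22, 4)                      # geometric ladder of FLOP budgets
optima = np.array([isoflop_minimum(C) for C in budgets])
n_exp = np.polyfit(np.log10(budgets), np.log10(optima[:, 0]), deg=1)[0]
d_exp = np.polyfit(np.log10(budgets), np.log10(optima[:, 1]), deg=1)[0]
print(f"N* grows as C^{n_exp:.2f}, D* grows as C^{d_exp:.2f}")
```

With real runs, each call to `synthetic_loss` is a training job, so the sweep width and number of points per budget become the main cost knobs.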
Where isoflops shows up
- Chinchilla method 2.
- The MoE scaling study (active vs total params at fixed compute).
- Diffusion model scaling studies.
- Architecture-vs-architecture comparisons.
If you're stuck on a scaling-law decision, default to isoflops. It's the hammer that fits most nails.
21. The "Overtraining for Serving" Reality
Why Chinchilla 20:1 is not what production wants
Chinchilla 20:1 is training-compute-optimal. But in production, the cost split is roughly:
- ~20% on training.
- ~80% on R&D and serving.
Inference cost dominates over the model's lifetime. The relevant optimization is performance per parameter (small models that are good).
Overtraining
Train a smaller model than Chinchilla recommends, but on far more tokens. You sacrifice a little training-compute efficiency in exchange for a model that is much cheaper to serve.
Modern recipes:
- Llama 2 7B: 286:1 tokens/param (vs Chinchilla 20:1).
- Llama 3 8B: 1,875:1.
- Qwen 2.5 7B: comparable or higher.
→ Modern frontier models are massively overtrained vs Chinchilla. Not because Chinchilla is wrong — because the optimization target shifted from "training-compute-optimal" to "serving-cost-optimal."
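The ratios above, and the reason serving cost changes the calculus, are simple arithmetic. This sketch assumes training tokens of 2T and 15T for the two Llama models (the values implied by the ratios in the list), training cost of about 6 · N · D FLOPs, and inference cost of about 2 · N FLOPs per token. The 70B comparison model and the 10T served tokens are hypothetical, and the comparison covers cost only, not model quality.

```python
def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens per model parameter."""
    return tokens / params

def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    """Training (about 6*N*D) plus inference (about 2*N per served token), in FLOPs."""
    return 6.0 * params * train_tokens + 2.0 * params * served_tokens

print(f"Llama 2 7B: {tokens_per_param(2e12, 7e9):.0f} tokens/param")
print(f"Llama 3 8B: {tokens_per_param(15e12, 8e9):.0f} tokens/param")

served = 1e13  # hypothetical lifetime serving volume, in tokens
large_total = lifetime_flops(70e9, 1.4e12, served)   # hypothetical 20:1 model
small_total = lifetime_flops(8e9, 15e12, served)     # heavily overtrained small model
print(f"lifetime FLOPs: 70B at 20:1 = {large_total:.3g}, 8B overtrained = {small_total:.3g}")
```

Under these assumptions the small model costs more to train but far less over its lifetime once serving volume is large.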
The lesson
Chinchilla's 20:1 is a research number — the point at which you minimize the FLOPs needed to reach a given loss. It is not what you want if you'll serve the model at scale. Pick a smaller model than 20:1 suggests; train it on more tokens; pay slightly more in training to save much more in serving.
Why Chinchilla is still important
Even though we don't follow the 20:1 ratio, the Chinchilla saga teaches:
- How to fit joint scaling laws.
- How small calibration errors compound.
- The isoflops protocol.
- Why upstream-vs-downstream matters.
- The methodology, not the recipe.
22. Pitfalls and Senior Signals
Pitfalls
- Compute scale too small. Hard to distinguish polynomial from exponential scaling — Taylor approximations look linear at any zoom level. Fit on at least 3–4 orders of magnitude in compute if you can.
- Bad parameter counting (Kaplan footgun). Include all parameters consistently. Embeddings, final softmax, biases (if present), LayerNorm scales — all of it.
- Hyperparameters not properly scaled across runs. If LR warmup, batch size, or μP-adjustment isn't right at small scale, your scaling law fits an artifact.
- Method-fitting bugs. Even well-cited papers (Chinchilla method 3) have them. Replicate or sanity-check.
- Conflating upstream and downstream. Scaling laws live on perplexity. Validate downstream separately.
- Ignoring variance. Most scaling-law plots use singletons. Usually fine for perplexity (very low variance) but not for LR / batch-size / hyperparameter scaling laws (which can have huge variance). Replicate when stakes matter.
- Slope vs intercept confusion. Most interventions change the intercept, not the slope. Don't claim a "huge improvement" if all you've done is shift the intercept down — that's a constant-factor speedup, not a fundamental scaling change.
Senior signals
- You think in slope-vs-intercept terms. Most interventions move the intercept. Slope changes are rare and important.
- You separate "what to scale" from "what to keep constant." Aspect ratio constant, parameters scale. LR-schedule shape constant, peak LR scales.
- You name isoflops as the default protocol. And you can execute it on a whiteboard.
- You know the Kaplan-vs-Chinchilla saga. Including the resolution.
- You know about overtraining. And why production deviates from Chinchilla 20:1.
- You distinguish upstream from downstream. Establish regularity on perplexity; validate transfer separately.
- You can derive the simple-mean scaling law (`σ²/n`) and explain why neural exponents are smaller (non-parametric-like).
- You don't oversell your scaling law. Acknowledge what it doesn't tell you (downstream, emergence, OOD generalization).
23. Interview Grill — 70 questions
Foundations (Q1–10)
- Why do scaling laws exist? Why not just train at large scale and tune?
- What's a power-law scaling law on a log-log plot?
- State the simplest scaling law (mean estimation). What's the slope?
- Why do parametric estimators give slope −1?
- Why do non-parametric estimators give slope −1/D?
- Typical neural-LM scaling exponent? What does it suggest about effective dimension?
- Trace the historical lineage from Cortes 1993 → Banko-Brill → Hestness 2017 → Kaplan 2020.
- What's the Hestness 2017 contribution that's underappreciated?
- When does a power-law approximation break (asymptote)?
- Why is "predictability across scales engineered, not automatic"?
Data scaling (Q11–18)
- Sketch the data-scaling-law plot.
- Why must the model be larger than the data for a clean data scaling law?
- What's the modern slope for language modeling on data?
- How would you fit a data-mixture scaling law?
- Why does "best small-scale mix = best large-scale mix" often hold?
- State the 4-epoch repetition rule.
- Why does optimal data filtering depend on compute?
- Why does the slope rarely change with intervention?
Architecture / optimizer / hyperparameter scaling (Q19–28)
- How would you compare LSTM vs Transformer using scaling laws?
- What does a worse slope mean for an alternative architecture?
- Cite an architecture intervention that consistently wins on scaling-law plots (per the T5 architecture study in §8).
- Adam vs SGD scaling laws — same slope or different?
- What's a "scale-invariant quantity"? Why does it matter for scaling strategy?
- What's the canonical aspect ratio (`d_model/n_layers`)?
- Why is number of layers not scale-invariant?
- Why might a head dim be scale-invariant?
- What's the Kaplan parameter-counting footgun?
- How do non-embedding parameter exclusions distort scaling laws?
MoE scaling (Q29–32)
- What's new about MoE scaling vs dense scaling?
- Trade-off: total params vs active params at fixed compute?
- As compute grows, do MoE models want more or less sparsity?
- Why do "inactive" parameters still help reduce loss?
Critical batch size (Q33–40)
- Define noise-limited and bias-limited regimes.
- What's the critical batch size?
- State the OpenAI estimation procedure.
- What's the formula B_crit = E_min / S_min saying?
- How does CBS scale with target loss?
- Why does CBS grow as compute grows?
- Why is CBS in the "scaling laws" lecture?
- How is CBS related to data parallelism?
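To make the B_crit = E_min / S_min question concrete, here is a minimal sketch of the steps-versus-examples trade-off from McCandlish et al., where a run at batch size B needs roughly S = S_min (1 + B_crit/B) steps and E = E_min (1 + B/B_crit) examples to reach a fixed target loss. The values of `S_MIN` and `E_MIN` are hypothetical, and the function names are invented for illustration.

```python
def critical_batch_size(e_min, s_min):
    """B_crit = E_min / S_min: the batch size where both costs are 2x their minimum."""
    return e_min / s_min

def training_cost(batch_size, e_min, s_min):
    """Steps and examples needed to hit the target loss at a given batch size."""
    b_crit = critical_batch_size(e_min, s_min)
    steps = s_min * (1.0 + b_crit / batch_size)     # small batches cost many steps
    examples = e_min * (1.0 + batch_size / b_crit)  # large batches waste examples
    return steps, examples

S_MIN = 10_000      # hypothetical: fewest optimizer steps (infinite-batch limit)
E_MIN = 5_000_000   # hypothetical: fewest examples (tiny-batch limit)

b_crit = critical_batch_size(E_MIN, S_MIN)
for batch in (50, 500, 5000):
    steps, examples = training_cost(batch, E_MIN, S_MIN)
    print(f"B={batch:>5}  steps={steps:>9.0f}  examples={examples:>11.0f}")
```

Below B_crit the run is roughly example-efficient but needs many sequential steps; above it, extra data parallelism buys little wall-clock time per example spent.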
Learning rate scaling (Q41–46)
- How does optimal LR scale with width (default rule of thumb)?
- What's μP?
- Compare "fit an LR scaling law" vs "use μP."
- Why is μP harder to implement?
- Which strategy do production runs use?
- How does LR interact with batch size?
Upstream vs downstream (Q47–50)
- Why are scaling laws cleaner on perplexity than on accuracy?
- State the Narang 2020 observation about NL-12 vs NL-32 XL.
- How do you validate transfer from upstream to downstream?
- Why do post-training engineers complain about pretraining people?
Joint scaling and Chinchilla (Q51–62)
- Sketch Rosenfeld's joint scaling law form.
- State Kaplan's compute-optimal allocation.
- State Chinchilla's compute-optimal allocation.
- What's the famous Chinchilla 20:1 ratio?
- Walk through Chinchilla method 1 (lower envelope).
- Walk through Chinchilla method 2 (isoflops).
- Walk through Chinchilla method 3 (joint fit).
- Why did Kaplan and Chinchilla disagree (Yair's three reasons)?
- What's Pearson & Song's complementary explanation?
- What was the Chinchilla method-3 mystery?
- How did Epoch AI resolve it?
- State the deeper lesson from the saga.
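The 20:1 question reduces to one line of arithmetic once a compute model is fixed. The sketch below assumes the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) and a fixed tokens-per-parameter ratio; `compute_optimal_split` is a hypothetical helper, and the budget is approximately the Gopher/Chinchilla training budget reported by Hoffmann et al.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens), assuming C ~= 6*N*D and D = r*N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = 5.76e23  # FLOPs; assumed budget for illustration
n_params, n_tokens = compute_optimal_split(budget)
print(f"params ~ {n_params:.3g}, tokens ~ {n_tokens:.3g}")
```

With both N and D growing as the square root of compute, this is the "scale parameters and tokens equally" allocation; a Kaplan-style allocation would instead put most of each extra FLOP into parameters.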
Modern practice (Q63–70)
- What's "overtraining" and why do production models do it?
- Token-per-parameter ratio for Llama 2 7B vs Chinchilla 20?
- Why isn't Chinchilla 20:1 what serving-cost-minimizing labs want?
- Why is isoflops the default protocol?
- Walk through an isoflops sweep end-to-end.
- Compute scale too small — what goes wrong?
- Why are most scaling-law data points singletons (no replicates)?
- State the "scaling laws are lower bounds" framing in one sentence.
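For the overtraining questions, the token-per-parameter arithmetic is worth doing once by hand. The sketch below uses nominal public figures as assumptions (about 2T training tokens and a nominal 7B parameters for Llama 2 7B) and compares the result with the Chinchilla ratio of 20; the helper name is made up for illustration.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0

def tokens_per_param(n_tokens, n_params):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

# Assumed nominal figures: ~2e12 tokens, ~7e9 parameters.
llama2_7b_ratio = tokens_per_param(2.0e12, 7.0e9)
overtraining_factor = llama2_7b_ratio / CHINCHILLA_TOKENS_PER_PARAM
print(f"tokens/param ~ {llama2_7b_ratio:.0f}, "
      f"{overtraining_factor:.1f}x the Chinchilla ratio")
```

The gap is deliberate: a smaller model trained past the compute-optimal point costs more to train per unit of loss but is cheaper on every inference call afterwards.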
24. References
- Cortes, Vapnik et al. 1993. Learning curves: Asymptotic values and rate of convergence.
- Banko, Brill 2001. Scaling to very very large corpora for natural language disambiguation.
- Collobert et al. 2011. Natural language processing (almost) from scratch.
- Hestness et al. (Baidu) 2017. Deep learning scaling is predictable, empirically. arXiv:1712.00409.
- Kaplan et al. (OpenAI) 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361.
- Hoffmann et al. (DeepMind) 2022. Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
- Rosenfeld et al. 2019. A Constructive Prediction of the Generalization Error Across Scales. arXiv:1909.12673.
- Muennighoff et al. 2023. Scaling Data-Constrained Language Models. arXiv:2305.16264.
- Narang et al. 2021 (cited in this chapter as Narang 2020). Do Transformer Modifications Transfer Across Implementations and Applications? arXiv:2102.11972.
- Porian et al. 2024 (referred to in this chapter as "Yair", after co-author Yair Carmon). Resolving Discrepancies in Compute-Optimal Scaling of Language Models. arXiv:2406.19146.
- Pearce, Song 2024 (referred to in this chapter as Pearson & Song). Reconciling Kaplan and Chinchilla Scaling Laws.
- Epoch AI. Chinchilla's wild implications / method-3 reanalysis.
- Yang et al. 2022. Tensor Programs V (μP).
- McCandlish et al. (OpenAI) 2018. An Empirical Model of Large-Batch Training (critical batch size). arXiv:1812.06162.
- Bahri et al. Explaining Neural Scaling Laws.
- Liu, Hashimoto et al. Pretraining Under Infinite Compute.
Cross-references in this repo
- 04_transformers/MODERN_LLM_ARCHITECTURE_CHOICES.md: what architecture choices scaling laws are used to justify.
- 02_gradient_descent/LEARNING_RATE_DEEP_DIVE.md: LR-scaling math.
- 52_statistical_learning_theory/: generalization bounds.
- 62_frontier_training_playbook/: production-scale recipes.
- 66_frontier_alignment_rl/REASONING_MODELS_DEEP_DIVE.md: test-time compute as a third scaling axis.
How to use this chapter
- Read straight through once — the historical → math → practical arc lands best in order.
- Memorize §10 (scale-invariant quantities), §11 (Kaplan footgun), §17 (Chinchilla saga), §20 (isoflops), §21 (overtraining).
- Be able to derive the simple-mean scaling law on a whiteboard in 60 seconds.
- Be able to execute an isoflops protocol end-to-end on a whiteboard.
- Be able to explain Kaplan-vs-Chinchilla in 90 seconds.
- Drill the §23 grill until 60+/70 cold.
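As a rehearsal aid for the isoflops whiteboard exercise, the sketch below runs the protocol on a synthetic loss surface rather than real training runs. Assumptions are labeled in the comments: the loss has the form L(N, D) = E + A/N^α + B/D^β with Hoffmann et al.'s reported parametric-fit constants used purely as a stand-in, compute follows C ≈ 6·N·D, and a grid argmin replaces the per-budget parabola fit a real sweep would use. All function names are hypothetical.

```python
import math

# Stand-in loss surface; constants are illustrative, not a recommendation.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def synthetic_loss(n_params, n_tokens):
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def isoflop_optimum(flops, grid_points=400):
    """Sweep model size at fixed compute; return the loss-minimizing N."""
    best_n, best_loss = None, float("inf")
    for i in range(grid_points):
        n_params = 10 ** (7 + 5 * i / (grid_points - 1))  # 1e7 .. 1e12 params
        n_tokens = flops / (6 * n_params)                  # assumes C ~= 6*N*D
        loss = synthetic_loss(n_params, n_tokens)
        if loss < best_loss:
            best_n, best_loss = n_params, loss
    return best_n

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) regressed on log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

budgets = [1e18, 1e19, 1e20, 1e21, 1e22]         # one isoflop curve per budget
optima = [isoflop_optimum(c) for c in budgets]   # minimum of each curve
exponent = loglog_slope(budgets, optima)         # N_opt ~ C^exponent
print(f"fitted exponent for N_opt vs C: {exponent:.3f}")
```

For this functional form the fitted exponent should sit close to β/(α+β); on real runs the minima come from noisy training losses, which is why the sweep range and the fit around each minimum need care.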
Single sentence to remember
Scaling laws are power-law-shaped predictive rules for how loss decays with data / model / compute; you fit them on a small-scale corner and extrapolate; the slope is determined by the model class and rarely moves; intercepts move with most interventions; the canonical lesson — Kaplan-vs-Chinchilla — is that small calibration errors (parameter counting, LR warmup, batch size) at small scale compound into large prediction gaps at large scale, so the recipe you fit must match the recipe you'll deploy.