Frontier Training Deep Dive

This file is intentionally more descriptive than the chapter README.

The README is there to orient you. This note is here to help you think like someone planning, running, and debugging a serious large-model training program.

1. Frontier Training Is Mostly a Methodology Problem

Many interview answers about frontier models fail because they center on one architectural novelty.

That is rarely how strong labs think.

In practice, model quality emerges from a stack:

architecture
tokenizer
optimizer and schedule
data mixture
stability tricks
context-length curriculum
mid-training
post-training
evaluation discipline
infrastructure reliability

If one of those is weak, it can dominate the final result.

That is why a good research answer often sounds less glamorous than people expect. It focuses on:

fair comparisons
strong baselines
low-noise ablations
early eval discipline
stability and throughput

This is also why so many "method X beat method Y" claims fall apart under scrutiny. The variable that got the credit is often not the only variable that changed.

2. Start With the Product Goal, Then Work Backward

Strong labs do not start with:

"What architecture do I want to try?"

They start with:

"What capability mix do I need, and what constraints matter most?"

Examples:

If the goal is a low-latency assistant, serving cost matters more than a tiny benchmark gain.
If the goal is strong code and math, the data mixture and post-training stack must reflect that.
If long context is a headline feature, context scaling must be built into both training and evaluation instead of treated as a late patch.

This is why good interview answers begin with the objective.

You are showing that architecture is downstream of product and evaluation goals.

3. Why Baselines Matter So Much

In large-model training, a weak baseline is expensive in two ways.

First, it wastes compute directly.

Second, it destroys your ability to interpret experiments.

If your baseline is unstable or poorly tuned, then an improvement from a new method may mean:

the new method is genuinely better
the new method is simply more forgiving
the baseline was accidentally handicapped

That is why experienced teams often default to something boring but robust:

dense model
standard optimizer
known scheduler
proven attention implementation
already-debugged data pipeline

This is not lack of ambition.

It is a way of buying interpretability.

4. Architecture Decisions Are Really Cost-Shape Decisions

Architecture questions are often asked as if they were purely modeling questions.

They are not.

Each architecture decision changes a different cost surface.

Dense vs MoE

Dense models are simpler to reason about:

all parameters participate
no routing pathologies
no expert imbalance
easier distributed implementation

MoE changes the equation:

much larger total parameter count becomes possible
active compute per token can stay lower than total capacity suggests
but routing quality and systems behavior become central

The trap is to say:

"MoE is better because it gives more parameters for the same compute."

That misses the hard part.

MoE only works well when:

routing is stable
experts are utilized sensibly
load balancing is maintained
communication overhead is acceptable

So the honest answer is:

"MoE can give better capacity-efficiency trade-offs, but it introduces new optimization and systems failure modes that dense models avoid."

MHA vs MQA vs GQA

This is one of the most interview-relevant design trade-offs because it connects architecture to serving.

Full multi-head attention keeps separate key and value heads for each query head. That is expressive but expensive at inference because KV cache scales with the number of KV heads.

MQA goes to the other extreme: all query heads share one key and one value head. That is cheap, but often loses quality because all heads have to read from the same compressed memory view.

GQA is the compromise:

fewer KV heads than MHA
more head specialization than MQA
materially smaller KV cache than MHA

Why do interviewers like this question?

Because a good answer should connect:

model quality
KV-cache size
inference bandwidth
serving cost

Positional Encoding and Long Context

A weak answer says:

"Use RoPE for long context."

A stronger answer says:

"Long context requires a coherent story about positional encoding, training distribution, and scaling schedule. RoPE is common, but context extension also depends on how you scale frequencies, what context lengths the model sees during training, and whether the eval suite actually requires long-context use."

That is the difference between naming a tool and describing a recipe.

5. Why Document Masking Matters More at Larger Scale

Packed training sequences are efficient because they reduce padding waste.

But if multiple documents are packed into one sequence and you use plain causal masking, later documents can attend to earlier unrelated documents.

That can blur boundaries and create a training objective that does not match the intended data structure.

At small scale, some teams may tolerate this.

At larger scale and especially for long context, the cost becomes more visible because:

the model has more capacity to exploit accidental cross-document patterns
longer contexts magnify the consequences of bad masking

So document masking is not just a cleanliness preference.

It is a way of making the attention pattern more faithful to the data-generating structure.

6. Why Stability Tricks Need Mechanistic Explanations

Interviewers do not just want the name of a stabilization trick.

They want to know whether you understand what failure mode it targets.

z-loss

z-loss penalizes excessively large logit scale through the log partition term.

Mechanically, it discourages logits from drifting to very large magnitudes.

The important insight is that this is a loss-level intervention: it changes the objective by adding a regularization term.

Logit Softcapping

Softcapping instead acts in the forward pass by smoothly compressing large logits with a bounded transformation like soft_cap * tanh(logits / soft_cap).

Why is this attractive?

Because it bounds activation magnitude without the hard non-differentiability of clipping.

Why is it not a free win?

Because changing logits this way can interact with kernel assumptions and may affect gradient behavior near unstable regions.

These methods try to prevent attention scores from becoming too extreme before softmax.

The interviewer-level insight is:

"Attention instability can arise before the final softmax. So some methods act earlier in the pipeline than output-level logit stabilization."

That sentence already sounds much stronger than a list of trick names.

7. Optimizer Choice Is About Dynamics, Not Brand Names

AdamW remains the default in many serious training setups because it is predictable and operationally well understood.

When teams explore alternatives, the right question is not:

"Is optimizer B more advanced than AdamW?"

It is:

"What optimization geometry does it exploit, when does that help, and what infrastructure cost does it impose?"

For matrix-aware or second-order-inspired optimizers, the theoretical appeal can be real. But deployment at scale may require:

all-to-all communication
tensor packing and padding tricks
more careful gradient handling

So a strong answer is careful:

"A more sophisticated optimizer may improve sample efficiency, but if it complicates distributed execution or creates new scaling pathologies, the total program may still lose."

8. Data Mixture Often Dominates Architecture Tweaks

This is one of the most important interview truths.

At fixed compute, changing the data mixture can reshape behavior more dramatically than a modest architecture tweak.

Why:

data decides what behaviors are seen
data quality affects gradient usefulness
data schedule shapes late-training priorities

This means a believable training story should include:

deduplication
contamination checks
domain mixture choices
late-stage injection of high-quality data
reasoning or code emphasis when relevant

If your answer on frontier training barely mentions data, it is incomplete.

9. Multi-Stage Training Is a Behavior-Shaping Tool

A multi-stage schedule is not just a convenience.

It is a way of telling the model what should matter most near the end of optimization.

Late-stage high-quality or domain-specific data matters because the end of training often has outsized influence on the final behavior.

That is why many recipes include:

broad pretraining first
later high-quality STEM or code injection
context-length extension stages
mid-training if initial SFT reveals domain gaps

This is also why "just train longer on the same mixture" is often not the best answer.

The order of data can matter, not just the total count.

10. Mid-Training Is Not Always the Right Move

Mid-training is attractive when SFT reveals that the base model lacks a core capability such as:

coding fluency
math or reasoning priors
domain-specific terminology

But if the goal is shallow surface behavior, style, or dialogue tone, compute may be better spent in post-training.

This is a good example of research taste:

you do not automatically escalate to a more expensive intervention if a cheaper stage can solve the problem.

11. Post-Training Is Where Many Capabilities Become Visible

A common mistake is to evaluate a final model as if all gains came from pretraining.

In reality, post-training often determines:

instruction following
tool use
refusal style
reasoning format
preference behavior

So a serious answer about a model’s capabilities should ask:

what was the SFT data?
what preference optimization or RL stage followed?
what reward or critique signal was used?
how was output length controlled?

This is especially important for reasoning models because reward hacking and excessive output length can masquerade as progress.

12. Why Reward Hacking and Length Hacking Keep Appearing

When you optimize a reward, the model will search for the easiest way to increase that reward.

If longer outputs correlate with higher reward, the model may simply learn to talk more.

If a judge reward is easy to flatter, the model may learn persuasive but low-quality behavior.

That is why strong post-training answers mention:

verifiable rewards when possible
length penalties or constraints
calibration between reward and real task success
alternative methods like online DPO or distillation when RL is too unstable

This is one of the clearest places where research maturity shows up.

13. Many Training Failures Are Operational, Not Theoretical

A romanticized view of frontier training imagines the hard part is always the math.

Often it is not.

Common failures include:

dataloader pathologies
storage stalls
throughput collapse over long runs
checkpoint issues
seed inconsistency across tensor-parallel setups
silently bad data shards

This matters for interviews because a senior answer should not pretend the only failures are conceptual.

You should sound like someone who knows that large training runs break in mundane ways.

14. How to Sound Strong in a Training-Methodology Interview

A good answer usually follows this pattern:

State the goal and constraint.
Pick a conservative baseline.
Explain the biggest trade-offs first.
Describe what you would ablate and in what order.
Mention the main stability and infrastructure risks.
End with the strongest conclusion you would be willing to claim.

That structure works because it combines:

theory
practical engineering
scientific discipline

15. Questions You Should Be Able to Answer in Full Sentences

Try answering these without bullets:

Why might a dense model still be the right choice even if MoE looks more efficient on paper?
Why does GQA help not only training feasibility but also serving cost?
Why can long context require a training curriculum instead of a one-shot context jump?
Why is late-stage high-quality data often so important?
What makes an architecture ablation believable rather than confounded?
Why can a training failure be caused by the dataloader even when the loss curve is the visible symptom?

ML & LLM Interview Prep — Deep Dives