Integrated Interview Synthesis — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

This is a meta-document. The repo's 50+ deep dives cover individual topics; this one is about the cross-topic synthesis that separates senior candidates from junior ones. Frontier interviews increasingly ask questions that span multiple areas — "design an LLM system that handles X" requires synthesizing inference, alignment, evaluation, A/B testing, system design, and prompting fluency in one answer.

1. The five archetype questions

Most cross-topic interview questions reduce to one of five archetypes. Knowing which you're being asked sets the answer's structure.

A. "Design X" (system design)

"Design Spotify recommendations." "Build a fraud detector for credit cards." "Design a chatbot."

Use the 6-step framework from 29_system_design_for_ml/: clarify → frame → data → features+model → serving → monitoring.

B. "Train X" (training methodology)

"Train a 70B model." "Improve our model's math abilities." "Build a code assistant."

Use frontier training playbook: scaling laws → architecture → data mixture → training stages → evaluation. See 62_frontier_training_playbook/.

C. "Why does X work?" (research / theory)

"Why does scale help?" "Why does Adam beat SGD here?" "Why is RLHF important?"

Use mechanistic answer: reduce to first principles. Often involves bias-variance, optimization geometry, capacity arguments, scaling laws.

D. "Debug X" (debugging methodology)

"Loss is spiking — what's wrong?" "Model regressed in production — investigate." "Eval looks fine but users complain — what's the issue?"

Use debugging tree: data → model → evaluation → distribution shift → cost asymmetry. Common failures by frequency, not exotic-ness.

E. "Trade off X vs Y" (judgment)

"Bigger model vs more data." "Online vs offline training." "Latency vs accuracy."

Use trade-off framework: list axes; identify which the business cares about; explain the curve.

2. Cross-topic synthesis questions

Questions that show up at frontier labs:

"Build an LLM-powered customer support agent"

Frame: agent (LLM + tools); chat interface; multi-turn.
Components: system prompt, RAG over docs, tool use (account lookup, ticket creation), safety filtering.
Evaluation: automated (faithfulness, refusal correctness) + human review.
Deployment: latency budget; failover; monitoring.
Cross-topic: prompting (07), RAG (39), inference (06), agents, A/B testing (30), evaluation (49).

"Improve our ranker by 1% NDCG"

Frame: ranking optimization; specific metric.
Levers: features (richer signals), model (DLRM upgrade, transformer ranker), training (more data, hard negatives), loss (listwise vs pairwise).
Validation: A/B test for online lift; offline NDCG ablation.
Cross-topic: recommendations (22), evaluation (49), A/B testing (30), ranking (in case studies 28).

"Reduce hallucination in our Q&A system"

Frame: factual accuracy + faithfulness.
Levers: stronger RAG, calibration, refusal training, tool use, self-consistency.
Evaluation: faithfulness vs source, factual accuracy on held-out, refusal rate.
Cross-topic: LLM problems (07), alignment (08), RAG (39), evaluation (49).

"Train a model to do X better"

Frame: capability gap; targeted improvement.
Levers: SFT data, RLHF reward, mid-training, prompting.
Validation: capability-specific eval; broader regression check.
Cross-topic: training playbook (62), alignment (08), data curation, evaluation (49).

"Why is our offline metric not matching online?"

Frame: distribution shift; counterfactual; selection bias.
Causes: position bias (search/rec), counterfactual issue (offline data from old policy), long-term effects, novelty effects.
Cross-topic: A/B testing (30), evaluation (49), recommendations (22).

3. The "first principles" answer pattern

Strong synthesis answers follow a pattern:

State the goal clearly: what are we optimizing? What's the user-facing outcome?
Identify the constraint(s): latency, cost, data, scale.
Apply the dominant principle: scaling laws, bias-variance, cost asymmetry — whatever's most relevant.
Recommend a baseline: what's the simplest thing that could work?
Iterate up: how would you improve from there, in priority order?
Discuss what could fail: 2-3 failure modes; mitigation for each.
State the strongest conclusion you'd defend: not "all of the above" but "I'd start here, because..."

This earns more points than a comprehensive list. Interviewers value judgment over coverage.

4. Topics that bridge multiple areas

Some topics show up everywhere. Owning these unlocks cross-topic answers:

Cross-entropy / KL divergence

Pre-training loss (LM losses, 43).
RLHF objective (alignment, 08).
Knowledge distillation (LoRA, 25).
Variational inference (information theory, 33).
Clustering (Bregman divergences, 19).

Bias-variance trade-off

Classical ML (advanced theory, 27).
Deep learning generalization (SLT, 52).
Estimator design (statistical inference, 47).
Regularization choice (regularization, 11).

Embeddings

Retrieval (RAG, 39).
Recommendations (two-tower, 22).
Multimodal (CLIP, 38).
Tokenization (15).
Search ranking (BM25 + dense, 36).

Attention

Transformer architecture (04, 05).
Long context (LLM problems, 07).
KV cache + serving (paged attention, 63).
Position encoding (14).

Data curation

Pre-training (frontier training, 62).
SFT / RLHF (08).
RAG corpora (39).
Anomaly detection (32).

When you can connect these threads in your answer, you sound like someone who has worked across the stack.

Discriminative model.
MLE = cross-entropy.
Linear decision boundary.
For tabular: GBDT often beats DL.
Calibration matters for cost-weighted decisions.

Optimization

SGD with momentum is robust default.
Adam handles bad conditioning approximately.
LR is the most important hyperparameter.
Warmup + cosine decay is standard.

Generalization

Classical: bias-variance.
Modern: double descent.
Implicit regularization of SGD finds flat minima.
Real-world generalization needs distribution-shift defense.

Inference

KV cache critical.
Quantization for memory.
Speculative decoding for throughput.
PagedAttention + continuous batching for serving.

Alignment

SFT for format.
RLHF for capability + preference.
DPO simpler alternative.
Reward hacking is the eternal threat.

Evaluation

Online ground truth via A/B test.
Offline can mislead (position bias, counterfactuals).
Calibration matters separately from accuracy.
Multiple metrics + uncertainty.

Systems

6-step design framework.
Two-stage retrieval + ranking.
Latency budget = where time actually goes.
Always have a fallback.

7. Common interview gotchas

Question	Common wrong answer	Right answer
What's the right answer?	Listing options	Pick one, justify, mention alternative for context
How would you improve X?	"Try harder model"	Identify bottleneck first; then targeted improvement
What if X fails?	"It won't"	Always have failure mode + mitigation
Tradeoff between X and Y?	Pick one	Identify what business prioritizes; explain the curve
Why is this hard?	"It's complex"	Specific reason: bias-variance, cost asymmetry, distribution shift
What's the metric?	"Accuracy"	Business metric → ML proxy → offline → calibration
Latency / cost / accuracy — pick?	Pick one	Two of three usually; depends on use case

8. Eight cross-topic questions you should be ready for

Design an LLM-powered Q&A system. (RAG + prompting + agent + safety + evaluation + serving.)
Why does scale work? (Scaling laws + bias-variance + over-parameterization + implicit reg.)
What's the most important thing in training large models? (Methodology + data > architecture; ablation rigor.)
Reduce model latency to half. (Quantization + smaller model + caching + batching + speculative decoding.)
Improve a metric without retraining. (Inference tricks: prompting, sampling, post-processing, calibration, retrieval.)
You see a regression in production. Walk me through your investigation. (Data + model + infra + drift + evaluation; rollback + diagnose.)
Online and offline metrics disagree. What do you check? (Position bias, counterfactual, novelty, distribution shift, label time leakage.)
What's the next frontier in LLMs? (Reasoning, agents, multimodal, efficiency, alignment robustness — name 2-3 with substance.)

9. Drill plan

Practice 5-minute answers for the eight cross-topic questions above.
For each, identify which 4-5 deep dives in the repo are relevant.
Pick one deep dive per day; spend 15 min on the grill questions; identify cross-references to other topics.
Time yourself: aim for 30-45 min total prep per major mock interview.

10. Further reading

This deep dive is a meta-document. The "further reading" is the rest of the repo:

For first-principles: SLT (52), info theory (33), optimization (48).
For systems: ML system design (29), large-scale LLM (61), paged attention (63).
For methodology: frontier training (62), generalization (49), A/B testing (30).
For practice: mock interviews (57), blind drills (59), case studies (28).

If you can't connect across these in an interview, drill the bridges (cross-entropy, embeddings, attention, bias-variance, data curation) until they're second nature.

ML & LLM Interview Prep — Deep Dives