Descriptive Interview Narratives

This file is for a specific skill:

turning knowledge into spoken, descriptive, interviewer-friendly answers.

Many candidates know the facts but still sound weak because their answers feel like disconnected notes. These examples are written in the style you should aim to speak.

1. "Why Does GQA Help?"

Weak answer:

"GQA shares keys and values across groups, so it is more efficient."

Stronger answer:

"Grouped-query attention is mainly an efficiency-quality compromise. Full multi-head attention keeps separate key and value heads for every query head, which makes the KV cache large and expensive to move during serving. Multi-query attention goes to the other extreme and shares one KV set across all heads, which is cheap but can lose quality because every head reads from the same compressed memory. GQA sits in the middle: it reduces KV-cache size materially while preserving more specialization than MQA. So the reason it helps is not just architectural elegance. It changes serving cost and memory bandwidth in a favorable way while usually keeping much of the quality of full MHA."

Why this answer is stronger:

it explains the trade-off
it connects architecture to serving
it shows why the middle-ground design exists

2. "How Would You Train a Frontier-Style Model?"

Weak answer:

"I would use a transformer with RoPE, GQA, good data, and then do SFT and RL."

Stronger answer:

"I would start by locking the product goal and evaluation protocol early, because otherwise architecture decisions become ungrounded. Then I would choose a conservative baseline recipe with known failure modes, usually something dense with GQA and a standard optimizer unless there is a strong reason to absorb the extra complexity of MoE or a more exotic optimizer. From there, I would de-risk the program by running small reliable ablations on only a few variables at a time: attention setup, positional encoding strategy, optimizer choice, and the learning-rate schedule. In parallel, I would make the data pipeline a first-class part of the project, with deduplication, contamination checks, and a planned multi-stage mixture rather than one static soup of data. Finally, I would treat post-training as part of the training story instead of an afterthought, because instruction following, reasoning style, and tool use are often determined there."

Why this answer is stronger:

it sounds like a plan, not a list
it treats evals and data as central
it shows methodology and scientific discipline

3. "Why Can a Better Metric Still Be Misleading?"

Weak answer:

"Maybe the benchmark is bad."

Stronger answer:

"A metric can improve for the wrong reason. For example, a model can get better perplexity because it predicts local token patterns more accurately, but that does not guarantee better factuality, instruction following, or user preference. Likewise, a retrieval system can improve recall@k while the final answer quality stays flat if the generator still ignores the evidence. So when I see a gain, I first ask what behavior the metric really tracks, then what behaviors it ignores, and then whether the reported gain is robust across slices and seeds. The issue is usually not that the metric is useless. It is that the metric is narrower than the claim being made."

Why this answer is stronger:

it distinguishes metric from claim
it uses concrete examples
it sounds like research judgment

4. "How Would You Debug a Model That Is Not Learning?"

Weak answer:

"I would lower the learning rate and inspect gradients."

Stronger answer:

"I would debug it in layers instead of guessing. First I would verify the data and labels, because if the target or split logic is wrong, optimizer tuning is irrelevant. Then I would check tensor shapes and the loss definition, especially whether the model is outputting logits or probabilities and whether the loss expects one or the other. After that I would inspect gradient flow: are the gradients zero, NaN, or just never applied because the parameters are frozen or missing from the optimizer? Only after those checks would I start tuning learning rate or clipping, because hyperparameters are often blamed for problems that are actually caused by semantics or numerics."

Why this answer is stronger:

it is procedural
it narrows the search space
it shows debugging maturity

5. "Why Is PagedAttention Useful?"

Weak answer:

"It reduces KV-cache memory."

Stronger answer:

"Plain KV caching avoids recomputing past keys and values, but it also creates a large growing memory object for every active request. If a serving engine allocates that memory naively as large contiguous buffers, you get waste and fragmentation, especially when requests have different lengths and some terminate early. Paged attention fixes the memory-management side of the problem by storing KV cache in fixed-size blocks and using a block table to map logical sequence positions to physical memory blocks. The actual attention semantics stay the same, but memory becomes much easier to reuse and schedule. That is why it improves serving efficiency: not because the model changed, but because the allocator and scheduler can use GPU memory much more effectively."

Why this answer is stronger:

it explains the before and after
it shows the problem plain KV cache leaves unsolved
it distinguishes semantics from memory layout

6. "Why Might MoE Fail Even If It Looks Better on Paper?"

Weak answer:

"Because routing is hard."

Stronger answer:

"Mixture-of-experts can look very attractive because total parameter count becomes much larger without activating all parameters on every token. But that headline hides several failure modes. The router has to send tokens in a useful way, experts need to stay sufficiently balanced so some do not collapse while others overload, and the distributed systems stack has to tolerate the communication and dispatch pattern. So MoE can absolutely be the right answer, but it is not a free parameter-efficiency upgrade. It is a different optimization and systems problem. If the infra is immature or the timeline is tight, a dense model may still be the better decision even if MoE looks superior in theory."

Why this answer is stronger:

it translates theory into operations
it shows why dense baselines still matter
it avoids hype language

7. "What Makes an Ablation Believable?"

Weak answer:

"Change one thing at a time."

Stronger answer:

"Changing one thing at a time is the starting principle, but a believable ablation also requires that the surrounding setup stays comparable in the ways that matter. If one model gets a different optimizer, longer training, better data curation, and a different decoding setup, then isolating the architecture claim becomes almost impossible. I also want low-noise evaluation, stable rankings over time, and metrics that match the intended use case. So the real goal of an ablation is not to produce another table. It is to support a causal statement about what actually drove the improvement."

Why this answer is stronger:

it turns a slogan into a standard
it includes evaluation noise and comparability
it sounds like someone who has seen confounded experiments

8. How to Use These Narratives

Do not memorize these word for word.

Use them to learn the shape of a strong answer:

start with the real problem
explain the mechanism
connect to trade-offs
mention the important limitation

If you can do that consistently, your answers will sound much more complete even when the interviewer keeps changing topic.

ML & LLM Interview Prep — Deep Dives