Topic 51: LLM Research Interview Prep

"The objective determines what conditional distribution the model is learning. In a causal LM, we train the model to predict the next token given all previous tokens."

You should also connect objective to loss:

next-token prediction usually uses cross-entropy
average negative log-likelihood is the key quantity
perplexity is exp(average_nll)
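
The three quantities above can be checked on a toy example. This is a minimal sketch, assuming you already have the probability the model assigned to each observed token as a plain list; the function names are illustrative and not taken from any library.

```python
import math

def average_nll(token_probs):
    """Mean negative log-likelihood (in nats) of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the average NLL."""
    return math.exp(average_nll(token_probs))

# A model that spreads probability evenly over 4 choices at every step
# has a perplexity of about 4: it is "as confused as" a 4-way coin flip.
uniform = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform))
```

With natural logs the cross-entropy loss reported during training is this same average NLL, so perplexity can be recovered from a loss curve by exponentiating it.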

2. Tokenization Trade-Offs

You should be able to explain:

why tokenization exists
why rare words get split
trade-offs between vocabulary size and sequence length
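
The vocabulary-size versus sequence-length trade-off can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the vocabulary sizes and the model width of 4096 are hypothetical numbers, and character-level versus whitespace splitting stand in for the two extremes of a real subword tokenizer.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the embedding table (doubled if the output
    projection is not tied to the input embedding)."""
    table = vocab_size * d_model
    return table if tied else 2 * table

text = "tokenization trade offs"
char_tokens = list(text)     # tiny vocabulary, long sequence
word_tokens = text.split()   # very large vocabulary, short sequence
print(len(char_tokens), len(word_tokens))

# Hypothetical sizes: a 4x larger vocabulary means a 4x larger table.
small = embedding_params(32_000, 4096)
large = embedding_params(128_000, 4096)
print(small, large)
```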

Useful interview answer:

"A larger vocabulary can shorten sequences but increases embedding and softmax cost. A smaller vocabulary improves compositional coverage but makes sequences longer."

3. Scaling and Optimization

You should know the major trade-offs:

more parameters
more data
more compute
longer context
better optimizer and schedule

Useful follow-up reasoning:

longer context improves some tasks but increases memory and latency
optimization can become unstable due to precision, normalization, masking, or bad learning-rate schedules
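
One way to quantify the memory cost of longer context is the size of the key/value cache at inference time, which grows linearly with sequence length. This is a rough sketch under stated assumptions: standard multi-head attention that caches one key and one value vector per layer, per head, per token, with a hypothetical model configuration (32 layers, 32 KV heads, head dimension 128, 2-byte values).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size: keys and values (factor of 2)
    for every layer, head, and cached token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

GIB = 1024 ** 3
short = kv_cache_bytes(32, 32, 128, seq_len=4096)
long = kv_cache_bytes(32, 32, 128, seq_len=32768)
print(short / GIB, long / GIB)  # 2.0 and 16.0 GiB for this hypothetical config
```

Real systems reduce this with techniques such as grouped-query attention or cache quantization, so treat the numbers as an upper-bound style estimate rather than a measurement.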

4. Evaluation

For LLMs, evaluation rarely reduces to one headline number.

You should be able to separate:

intrinsic metrics: perplexity, next-token loss
task metrics: exact match, F1, pass@k, recall@k
human or preference metrics: win rate, pairwise preference
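
Two of the task metrics above are short enough to write from memory. The sketch below assumes a simplified normalization (lowercasing and whitespace splitting only); common QA benchmarks also strip punctuation and articles, which is omitted here to keep the example small.

```python
from collections import Counter

def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris "))       # 1.0
print(token_f1("the cat sat", "the cat"))   # precision 2/3, recall 1
```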

Good research answer:

"I would pair a training-objective metric like perplexity with task-level metrics that reflect the actual use case, then slice results by prompt type and failure mode."

5. Ablations

If a result improves, you should ask:

was compute constant?
was data constant?
was decoding constant?
is the gain concentrated in one slice?
does the gain survive multiple seeds?

This is how you sound like a researcher instead of just a benchmark collector.
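
The multi-seed question can be answered with a few lines of code. This is a minimal sketch, assuming the baseline and the variant were each trained with the same list of seeds so that scores can be paired; the scores below are made-up numbers for illustration only.

```python
import statistics

def seed_summary(baseline_scores, variant_scores):
    """Summarize per-seed improvements of a variant over a baseline."""
    if len(baseline_scores) != len(variant_scores):
        raise ValueError("scores must be paired by seed")
    deltas = [v - b for b, v in zip(baseline_scores, variant_scores)]
    return {
        "mean_delta": statistics.mean(deltas),
        "stdev_delta": statistics.stdev(deltas) if len(deltas) > 1 else 0.0,
        "all_positive": all(d > 0 for d in deltas),
    }

# Illustrative scores for three seeds (not real results).
summary = seed_summary([0.70, 0.72, 0.71], [0.73, 0.71, 0.74])
print(summary)
```

If the mean gain is comparable to the spread across seeds, or the sign flips for some seeds as it does here, the honest conclusion is that the gain is not yet established.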

6. Hallucination and Grounding

You should be able to discuss:

hallucination vs uncertainty
retrieval failures
stale knowledge
citation faithfulness
instruction-following failures

A strong answer often sounds like:

"I would first classify the failure: retrieval miss, context selection issue, model ignoring context, or generation unsupported by evidence. The fix depends on which stage failed."

7. Alignment and Preference Optimization

Know the high-level roles of:

supervised fine-tuning (SFT)
reward models
PPO
DPO

Important interview point:

Better preference optimization does not automatically mean better factuality or robustness. Alignment changes behavior, but evaluation must still check truthfulness, calibration, safety, and task success.

8. Paper Discussion

In paper discussion rounds, you should be able to answer:

What problem is the paper solving?
What is the key idea?
What assumptions are hidden?
What evidence is missing?
What ablation would you add?
What would break in production?

If you can do that consistently, you will sound much stronger than someone who only summarizes the abstract.

Technical Details Interviewers Often Want

Perplexity Is Not End-Task Quality

Perplexity is an intrinsic language-modeling metric. It measures how much probability the model assigns to the observed tokens, averaged per token in log space.

That makes it useful for optimization monitoring and controlled model comparison.

But it does not automatically tell you:

whether answers are factual
whether instructions are followed well
whether reasoning is robust
whether retrieval is used correctly

A good answer is:

"Perplexity is useful for next-token prediction quality, but it is not a complete proxy for downstream task success or human preference."

Ablation Logic

In research interviews, "run an ablation" is too vague.

You should be able to say what the ablation is isolating.

Good ablation questions include:

does the gain come from the architecture or just more compute?
does the gain persist if we hold data fixed?
does the gain survive the same decoding setup?
is one component necessary or only helpful in combination?

The point of an ablation is to test causal attribution, not to generate extra tables.
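
A simple delta table makes the attribution question explicit. The sketch below assumes each variant removes or changes exactly one thing relative to a reference run; the variant names and scores are hypothetical and exist only to show the shape of the table.

```python
def ablation_table(results, reference="full"):
    """Return (variant, score, delta vs reference) rows, largest drop first."""
    ref_score = results[reference]
    rows = [
        (name, score, score - ref_score)
        for name, score in results.items()
        if name != reference
    ]
    return sorted(rows, key=lambda row: row[2])

# Hypothetical scores, for illustration only.
results = {
    "full": 0.742,
    "no_component_a": 0.701,
    "no_component_b": 0.739,
    "baseline_with_matched_compute": 0.735,
}
rows = ablation_table(results)
for name, score, delta in rows:
    print(f"{name:32s} {score:.3f} {delta:+.3f}")
```

In this made-up table, most of the gain disappears once the baseline gets matched compute, which is exactly the kind of finding an ablation is meant to surface.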

Retrieval Failure Decomposition

For RAG-style questions, separate the pipeline into stages:

query formation
retrieval recall
reranking or context selection
grounding during generation
citation faithfulness

Different failures imply different fixes.

If retrieval recall is poor, prompt engineering the generator will not solve the core issue.

If retrieval is good but the model ignores context, then the bottleneck is downstream.
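
The stage-by-stage decomposition above can be turned into a first-pass triage function. This is a sketch under strong assumptions: it presumes you have already labeled each failing example with four boolean checks (by hand or with a separate tool), and the function and argument names are hypothetical.

```python
def classify_rag_failure(gold_retrieved, gold_in_context,
                         answer_grounded, citations_faithful):
    """Return the earliest pipeline stage that failed for one example."""
    if not gold_retrieved:
        return "retrieval miss"
    if not gold_in_context:
        return "context selection issue"
    if not answer_grounded:
        return "model ignored context"
    if not citations_faithful:
        return "unfaithful citation"
    return "no failure detected"

# The right document was retrieved and shown, but the answer ignored it.
print(classify_rag_failure(True, True, False, False))
```

Checking stages in pipeline order matters: a retrieval miss makes every later check uninformative, so the earliest failing stage is the one to fix first.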

Alignment Objectives and Trade-Offs

You should be able to articulate that SFT, PPO, and DPO optimize different signals.

For example:

SFT imitates demonstrations
PPO optimizes a learned reward through policy updates
DPO converts preference comparisons into a direct optimization objective

The follow-up point that often gets missed is that better preference optimization can still degrade calibration, truthfulness, or robustness if the reward signal is narrow or biased.
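
To make "a direct optimization objective" concrete, here is the DPO loss for a single preference pair, written on plain floats. It is a sketch, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; `beta=0.1` is an illustrative default, not a recommendation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: no preference margin, loss = log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy raised the chosen response and lowered the rejected one: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss only sees the preference pairs, so nothing in it rewards truthfulness or calibration unless the preference data itself encodes them.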

Scaling Discussion

When discussing scaling, be precise about the axis:

model size
data size
compute budget
context length
inference-time budget

The strongest answers explicitly say which resource is the bottleneck and which trade-off is being made.

Common Failure Modes

1. Overclaiming from a Single Metric

This is extremely common in LLM discussions.

A model can improve on perplexity or win rate while getting worse on factuality, latency, calibration, or safety.

2. Confounded Ablations

An experiment changes multiple things at once:

model size
data mix
training duration
decoding settings

Then the gain is attributed to one idea without enough evidence.

3. Retrieval Evaluation Mismatch

People often report retrieval metrics like recall@k, but the user problem depends on answer quality, context use, and citation faithfulness.

High recall@k does not guarantee grounded final answers.
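
Retrieval metrics are easy to compute, which is part of why they get over-reported. A minimal sketch of recall@k and reciprocal rank follows, assuming document IDs are hashable and the set of relevant documents is known for each query; the IDs are placeholders.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=2))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```

Mean reciprocal rank is the average of `reciprocal_rank` over queries. Neither number says anything about whether the generator actually used the retrieved evidence.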

4. Benchmark Contamination

If evaluation data leaks into training, the apparent gain can be misleading.

A research-scientist answer should always consider contamination, weak train/test splitting, or template overlap as alternative explanations for a gain.
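
A crude but useful first check for leakage is n-gram overlap between an evaluation example and training text. The sketch below assumes whitespace tokenization and exact matching, so it will miss paraphrased or reformatted leaks; the default `n=8` is an arbitrary illustrative choice.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(eval_text, train_text, n=8):
    """Fraction of the eval example's n-grams that also occur in training text."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_text, n)) / len(eval_grams)

eval_example = "what is the capital of france"
train_doc = "quiz: what is the capital of france answer paris"
print(overlap_fraction(eval_example, train_doc, n=3))  # 1.0, fully contained
```

A high fraction is a reason to inspect the example, not proof of contamination, and a low fraction does not rule out template overlap.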

5. Reward Hacking

If a reward model or preference signal is narrow, the optimized model may learn to exploit the reward rather than improve the real task.

That is one of the central reasons alignment metrics need complementary evaluation.

Edge Cases and Follow-Up Questions

What if perplexity improves but users prefer the old model?

Then the optimization objective and user utility are misaligned.

Check instruction following, verbosity, calibration, refusal behavior, and output style instead of assuming the lower-perplexity model is better.

What if retrieval recall is high but answers are still hallucinated?

Then the system may be failing at:

selecting the right retrieved chunk
grounding generation on the evidence
resolving contradictions across documents

This is exactly why retrieval metrics alone are not enough.

What if win rate improves but factuality drops?

That may mean the model became more fluent or more persuasive without becoming more truthful.
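
Win rate itself is a one-liner, which is a reminder of how little it encodes. The sketch below assumes each comparison is labeled "win", "loss", or "tie" from the new model's point of view, and counts a tie as half a win; that tie-handling is one common convention, not the only one.

```python
def win_rate(outcomes):
    """Share of pairwise comparisons won, with ties counted as half a win."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    wins = outcomes.count("win")
    ties = outcomes.count("tie")
    return (wins + 0.5 * ties) / len(outcomes)

print(win_rate(["win", "win", "loss", "tie"]))  # 0.625
```

Nothing in this number records why a judge preferred one answer, so it should be reported next to a separate factuality check on the same prompts.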

What if a new architecture wins only on one prompt slice?

Then the strongest valid conclusion is narrow.

Do not generalize to "better overall" without broader slices and variance estimates.
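
Slicing is straightforward once each evaluation record carries a slice label. This is a minimal sketch, assuming per-example correctness (0 or 1) is available for both models on the same prompts; the slice names and records are invented for illustration.

```python
from collections import defaultdict

def per_slice_delta(records):
    """records: iterable of (slice_name, baseline_correct, variant_correct).
    Returns the example count and accuracy delta for each slice."""
    totals = defaultdict(lambda: [0, 0, 0])  # count, baseline hits, variant hits
    for name, baseline_ok, variant_ok in records:
        entry = totals[name]
        entry[0] += 1
        entry[1] += int(baseline_ok)
        entry[2] += int(variant_ok)
    return {
        name: {"n": n, "delta": (v - b) / n}
        for name, (n, b, v) in totals.items()
    }

# Illustrative records: the variant helps on "math" and hurts on "chat".
records = [("math", 0, 1), ("math", 0, 1), ("chat", 1, 1), ("chat", 1, 0)]
report = per_slice_delta(records)
print(report)
```

Reporting `n` next to each delta matters: a large swing on a slice with a handful of examples is weak evidence either way.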

What if a paper shows strong gains but uses a much larger inference budget?

Then part of the gain may come from test-time compute rather than better underlying model quality.

In interviews, this is a strong place to ask whether comparisons were compute-matched.
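
pass@k is a convenient way to see how much a sampling budget alone can move a score. The sketch below uses the combinatorial estimator commonly used for code-generation benchmarks, assuming `n` samples were drawn per problem and `c` of them passed; the numbers in the example are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Same model, same 2-in-10 per-sample success rate, different budgets.
print(pass_at_k(10, 2, 1))
print(pass_at_k(10, 2, 5))
```

Here the score rises from about 0.2 to about 0.78 without any change to the model, which is why a comparison at different values of k is not a comparison of model quality.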

Boilerplate Code

See llm_eval_and_ablation.py for short, interview-style implementations of:

Negative log-likelihood
Perplexity
Exact match
Token-level F1
pass@k
Retrieval recall@k
Mean reciprocal rank
Pairwise win rate
Simple ablation delta tables

These are small enough to implement in an interview and useful enough to support experiment reasoning.

The goal of these helpers is not to replace a full evaluation pipeline. The goal is to make sure you can define metrics clearly, compute them correctly on small examples, and explain what each metric does and does not capture.

Paper Discussion Template

When asked to discuss a paper, use this structure:

Problem
Main idea
Why it might work
Main assumptions
Missing ablations
Failure modes
What experiment you would run next

That structure is often more valuable than a long summary.

What to Practice Saying Out Loud

What does perplexity measure, and what does it not measure?
Why can better preference win rate fail to improve factual accuracy?
How would you diagnose whether RAG failure is retrieval-side or generation-side?
What ablations would you require before believing a new architecture claim?
If a paper shows average gain, what slices would you ask for next?
What kinds of leakage can make an LLM benchmark result look better than it is?
If a model is better on average but much worse on rare but critical prompts, how would you report that?
What conclusion is justified by the current evidence, and what conclusion is still too strong?

Suggested Use

Use this chapter after the earlier theory and coding topics.

The intended order is:

Topic 47: inference
Topic 48: optimization
Topic 49: evaluation and generalization
Topic 50: coding patterns
Topic 51: LLM-specific research reasoning

ML & LLM Interview Prep — Deep Dives