Mock Interview Loops

Each loop is designed to feel like a real technical interview segment.

Use a timer. Do not look things up while answering.

Loop 1: Theory + Follow-Ups

Prompt

Explain why logistic regression uses cross-entropy instead of MSE.

Expected strong answer

You should connect:

Bernoulli likelihood
sigmoid output as probability
MLE leading to cross-entropy
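better gradient behavior than MSE for classification: the cross-entropy gradient with respect to the logit is p - y, while the MSE gradient carries an extra p(1 - p) factor that vanishes when the sigmoid saturates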

Follow-ups

Derive the gradient with respect to the logits.
Why does the gradient simplify to p - y?
When might MSE still appear in classification work?
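
The p - y follow-up can be checked numerically. The sketch below is illustrative only: it assumes a single sigmoid output, writes cross-entropy directly in terms of the logit z, and compares the analytic gradient with a central finite difference. The helper names (bce_from_logits, bce_grad, mse_grad, numeric_grad) are made up for this example, and the MSE variant assumes the loss 0.5 * (p - y)^2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(z, y):
    # With p = sigmoid(z):
    # -[y log p + (1 - y) log(1 - p)] = log(1 + exp(z)) - y * z
    return np.logaddexp(0.0, z) - y * z

def bce_grad(z, y):
    # Analytic gradient of cross-entropy with respect to the logit.
    return sigmoid(z) - y

def mse_grad(z, y):
    # Gradient of 0.5 * (p - y)^2 with respect to the logit.
    # The extra p * (1 - p) factor comes from the sigmoid derivative.
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

def numeric_grad(f, z, y, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (f(z + eps, y) - f(z - eps, y)) / (2.0 * eps)

z = np.array([-3.0, -0.5, 0.0, 2.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
max_err = np.max(np.abs(numeric_grad(bce_from_logits, z, y) - bce_grad(z, y)))
```

For a confidently wrong prediction such as z = -10 with y = 1, bce_grad stays close to -1 while mse_grad is close to zero, which is the saturation argument in concrete form.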

Loop 2: Probability / Statistics

Prompt

You have two arrays from two distributions and a new scalar value. How do you decide which source it most likely came from?

Expected strong answer

You should cover:

likelihood comparison
priors
Gaussian plug-in classification if assumptions are acceptable
KDE fallback if distribution family is unknown
confidence / ambiguity if overlap is high
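
The likelihood-plus-prior comparison above can be sketched as follows. This is a minimal example under the assumption that each source is roughly Gaussian, with mean and standard deviation estimated from the samples. The function names and the prior_a parameter are illustrative, not a fixed API.

```python
import numpy as np

def gaussian_log_pdf(x, mean, std):
    # Log density of N(mean, std^2) evaluated at x.
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def posterior_a(x, a, b, prior_a=0.5):
    """Posterior probability that x came from source A (Gaussian plug-in)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Plug in sample estimates; ddof=1 gives the unbiased variance estimate.
    log_a = gaussian_log_pdf(x, a.mean(), a.std(ddof=1)) + np.log(prior_a)
    log_b = gaussian_log_pdf(x, b.mean(), b.std(ddof=1)) + np.log(1.0 - prior_a)
    # Normalize in log space to avoid underflow far out in the tails.
    return float(np.exp(log_a - np.logaddexp(log_a, log_b)))

# Synthetic samples for illustration only.
rng = np.random.default_rng(0)
samples_a = rng.normal(0.0, 1.0, size=200)
samples_b = rng.normal(5.0, 1.0, size=200)
p_near_a = posterior_a(0.0, samples_a, samples_b)
p_near_b = posterior_a(5.0, samples_a, samples_b)
```

A posterior near 0.5 signals that the value sits in the overlap region. With only a few samples the plug-in estimates are noisy, so the posterior itself should be treated with caution.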

Follow-ups

What if both distributions have the same mean?
What if one class is much more common?
What if you only have a few samples?

Loop 3: Coding

Prompt

Implement masked softmax for attention.

Expectations

You should:

clarify mask convention
write a stable softmax
use the correct axis
mention complexity
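
One possible NumPy answer, as a sketch rather than a reference solution. It assumes the convention that True in the mask means "keep this position", and it chooses to return all zeros for a row with nothing to attend to. Both choices are assumptions you would confirm with the interviewer.

```python
import numpy as np

def masked_softmax(scores, mask, axis=-1):
    """Softmax over `axis`, giving zero weight where mask is False.

    Assumed convention: mask == True means "keep".
    Rows with no kept positions return all zeros instead of NaN.
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.broadcast_to(np.asarray(mask, dtype=bool), scores.shape)
    # Masked positions become -inf so that exp() maps them to exactly 0.
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row max for numerical stability.
    row_max = np.max(masked, axis=axis, keepdims=True)
    # A fully masked row has max = -inf; (-inf) - (-inf) would be NaN.
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    exp = np.exp(masked - row_max)
    denom = np.sum(exp, axis=axis, keepdims=True)
    # Guard the fully masked case, where the denominator is 0.
    return exp / np.where(denom == 0.0, 1.0, denom)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0],
                   [0.0, 0.0, 0.0]])
mask = np.array([[True, True, False],
                 [True, True, True],
                 [False, False, False]])
weights = masked_softmax(scores, mask)
```

The work is linear in the number of score entries, so an n x n attention score matrix costs O(n^2) time and memory.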

Follow-ups

How would you make it causal?
What bug would produce NaNs here?
What shape errors are common?
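
For the causal follow-up, a minimal sketch assuming square scores of shape (..., n, n) with queries on the second-to-last axis and keys on the last. The function names are illustrative.

```python
import numpy as np

def causal_mask(n):
    # True where query i may attend to key j, that is j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def causal_attention_weights(scores):
    """Row-wise softmax over keys with future positions masked out."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[-1]
    masked = np.where(causal_mask(n), scores, -np.inf)
    # Every row keeps at least its diagonal entry, so the row max is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.zeros((3, 3)))
```

Common shape mistakes this layout exposes: taking the softmax over the query axis instead of the key axis, and building the mask with np.triu instead of np.tril so that the past is hidden rather than the future.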

Loop 4: Debugging

Prompt

A training loop suddenly starts returning NaN losses after a few hundred steps. Walk through your debugging plan.

Expected strong answer

You should cover:

inspect data and labels
check learning rate and schedule
inspect activation/gradient ranges
check numerically risky operations: log of zero, exp overflow, division by zero, normalization with near-zero variance
clip gradients if needed
isolate the exact step where instability begins
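
The last item can be sketched as a small harness that stops at the first step where any tracked value is non-finite. Everything here is hypothetical scaffolding: step_fn stands in for a real training step that returns named values, and toy_step is a contrived example that hits log(0) on purpose.

```python
import numpy as np

def first_nonfinite(named_values):
    """Return the name of the first value containing NaN or inf, else None."""
    for name, value in named_values.items():
        if not np.all(np.isfinite(value)):
            return name
    return None

def run_with_checks(step_fn, num_steps):
    """Run step_fn(step) -> dict of name -> value; stop at first bad step."""
    for step in range(num_steps):
        bad = first_nonfinite(step_fn(step))
        if bad is not None:
            return step, bad
    return None

def toy_step(step):
    # Contrived stand-in for a training step: the input to log() shrinks
    # by 0.25 per step and reaches exactly 0 at step 4.
    x = 1.0 - 0.25 * step
    with np.errstate(divide="ignore"):
        loss = np.log(x)
    return {"activation": np.array([x]), "loss": np.array([loss])}

result = run_with_checks(toy_step, num_steps=10)
```

In a real loop the same check would be applied to the loss, gradient norms, and selected activations, with the batch saved at the failing step so the failure can be replayed.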

Follow-ups

What if the issue only appears in mixed precision?
What if training loss is fine but validation loss is NaN?
What if this only happens on one GPU rank?

Loop 5: Research Judgment

Prompt

A new method improves perplexity but hurts exact match on downstream QA. How do you reason about that?

Expected strong answer

You should discuss:

training objective vs downstream metric mismatch
calibration and decoding effects
domain mismatch
answer-format sensitivity
slice analysis and error analysis
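
Answer-format sensitivity is easy to show with a toy exact-match scorer. The normalization below (lowercasing, stripping punctuation and English articles, collapsing whitespace) is one common style for extractive QA scoring and is an assumption here, not the metric of any specific benchmark.

```python
import re
import string

def normalize(text):
    # Lowercase, drop punctuation, drop English articles, collapse spaces.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold, normalized=True):
    if normalized:
        return normalize(pred) == normalize(gold)
    return pred == gold

gold = "Eiffel Tower"
strict = exact_match("The Eiffel Tower.", gold, normalized=False)
lenient = exact_match("The Eiffel Tower.", gold)
verbose = exact_match("It is the Eiffel Tower", gold)
```

A model that becomes more fluent can lower perplexity while producing longer answers like the third example, which still fail exact match even after normalization.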

Follow-ups

What ablations would you run next?
What if the gain only appears on one seed?
What if retrieval quality improved at the same time?

Loop 6: Large-Scale Systems

Prompt

How would you fit a larger LLM training run when you are running out of memory?

Expected strong answer

You should discuss:

smaller micro-batch size with gradient accumulation to preserve the effective batch size
mixed precision
activation checkpointing
optimizer state sharding
FSDP / ZeRO intuition
sequence length trade-offs

Follow-ups

What do you lose with checkpointing?
Why does Adam consume so much memory?
How does longer context affect memory?
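
Back-of-envelope arithmetic helps with the Adam and context-length follow-ups. The sketch below assumes a common mixed-precision setup: 2-byte weights and gradients, plus fp32 master weights and two fp32 Adam moments. It also assumes naive attention that materializes the full score matrix, which fused attention kernels avoid. Function names and defaults are illustrative.

```python
def adam_mixed_precision_bytes(num_params, shards=1):
    """Rough per-device bytes for weights, gradients, and Adam state.

    Assumed layout per parameter:
      2 (half-precision weight) + 2 (half-precision gradient)
      + 4 (fp32 master weight) + 4 (Adam first moment) + 4 (Adam second moment)
    Everything is assumed to be sharded evenly across `shards` devices.
    Activations are not included.
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return num_params * bytes_per_param / shards

def attention_score_bytes(batch, heads, seq_len, bytes_per_el=2):
    """Bytes for one layer's attention score matrix if fully materialized."""
    return batch * heads * seq_len * seq_len * bytes_per_el

unsharded = adam_mixed_precision_bytes(7_000_000_000)
sharded = adam_mixed_precision_bytes(7_000_000_000, shards=8)
scores_4k = attention_score_bytes(batch=1, heads=32, seq_len=4096)
scores_8k = attention_score_bytes(batch=1, heads=32, seq_len=8192)
```

Under these assumptions a 7-billion-parameter model needs about 112 GB before any activations, of which 12 of the 16 bytes per parameter are fp32 optimizer-side state. Doubling the sequence length quadruples the materialized score memory.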

Loop 7: Paper Critique

Prompt

A paper claims a strong improvement on one benchmark. What do you need to see before you believe it?

Expected strong answer

You should ask for:

strong baseline
matched data and compute budgets across the compared methods
multiple seeds
ablations
slice metrics
failure cases

Follow-ups

What if the benchmark is saturated?
What if the paper uses a proprietary internal dataset?
What if the improvement is only 0.2 points?
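
For the 0.2-point follow-up, one quick check is to compare the gap with seed-to-seed noise. The scores below are invented purely for illustration, and the summary is a rough standard-error comparison, not a full significance test.

```python
import numpy as np

def seed_gap_summary(baseline_scores, method_scores):
    """Return (mean gap, standard error of the gap, gap / standard error)."""
    base = np.asarray(baseline_scores, dtype=float)
    meth = np.asarray(method_scores, dtype=float)
    gap = meth.mean() - base.mean()
    # Standard error of a difference of two independent sample means.
    se = np.sqrt(meth.var(ddof=1) / len(meth) + base.var(ddof=1) / len(base))
    return gap, se, gap / se

# Hypothetical per-seed benchmark scores (made up for this example).
baseline = [71.0, 71.4, 70.8, 71.3, 70.9]
method = [71.2, 71.5, 71.1, 71.6, 71.0]
gap, se, ratio = seed_gap_summary(baseline, method)
```

When the ratio is well below 2, as with these invented numbers, the gap is hard to distinguish from seed variance, and more seeds or a paired comparison would be the natural next request.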

Loop 8: End-to-End Mixed Loop

Prompt

Design and defend a small RAG experiment for factual QA.

Expected strong answer

You should cover:

baseline retriever/generator
chunking choice
retrieval metrics and answer metrics
ablations
failure taxonomy
confidence and evaluation slices

Follow-ups

How do you know whether failure is retrieval-side or generation-side?
Why might better Recall@10 not improve final answers?
What would you optimize first under latency constraints?
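
The retrieval-versus-generation follow-up can be made concrete with a coarse attribution rule. This is a simplified heuristic under stated assumptions: each question has known gold passage ids, answer correctness has already been judged, and a wrong answer counts as a retrieval failure only when no gold passage appears in the top k. All names are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant passages that appear in the top-k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

def attribute_failure(retrieved_ids, relevant_ids, answer_correct, k=10):
    """Coarse label: 'correct', 'retrieval', or 'generation'."""
    if answer_correct:
        return "correct"
    if recall_at_k(retrieved_ids, relevant_ids, k) == 0.0:
        # No gold evidence reached the generator.
        return "retrieval"
    # Evidence was retrieved, but the final answer was still wrong.
    return "generation"

labels = [
    attribute_failure(["d3", "d7", "d1"], ["d1"], answer_correct=True, k=3),
    attribute_failure(["d3", "d7", "d9"], ["d1"], answer_correct=False, k=3),
    attribute_failure(["d3", "d1", "d9"], ["d1"], answer_correct=False, k=3),
]
```

The third case is why higher Recall@10 may not improve final answers: the gold passage is present, but the generator ignores it, is distracted by other chunks, or formats the answer badly.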
