Topic 53: ML Debugging and Mock Coding
🔥 For interviews, read these first:
- ML_DEBUGGING_DEEP_DIVE.md: frontier-lab deep dive covering the 8-layer debugging tree, loss-curve interpretation, sanity checks (overfit one batch, tiny dataset), NaN debugging (FP16, log-of-zero, anomaly detection), leakage detection, gradient checking, and distribution-shift investigation.
- INTERVIEW_GRILL.md: 50 active-recall questions.
What You'll Learn
This topic is for the part of interviews where something is broken and you need to reason quickly.
You will learn:
- how to debug training loops
- how to check shapes and masks
- how to catch unstable softmax or log operations
- how to reason about exploding or vanishing gradients
- how to debug evaluation bugs and leakage
- how to structure timed coding answers
Why This Matters
Interview coding is rarely just:
"Write clean code from scratch."
Very often it is:
- "This model is not learning. What would you check?"
- "This attention code gives NaNs. Why?"
- "This metric looks too good. What is suspicious?"
That is debugging, not just implementation.
Core Intuition
Debugging is mostly the process of shrinking the space of possible mistakes.
Weak answers jump randomly between hypotheses.
Strong answers eliminate categories of failure in a disciplined order.
Most ML bugs fall into a few buckets:
- the data is wrong
- the target is wrong
- the objective is wrong
- the shapes are wrong
- the numerics are unstable
- the optimization step is not doing what you think
That means a good debugging interview answer is not a long list of guesses.
It is a short ordered procedure that rules out the highest-probability failures first.
Debugging Mindset
Use this order:
1. Check the data.
2. Check the target.
3. Check tensor shapes.
4. Check loss definition.
5. Check scale and numerical stability.
6. Check whether parameters are actually updating.
This sequence avoids random guessing.
Technical Details Interviewers Often Want
Data and Label Checks
Before touching the model, verify that the data pipeline is sane:
- input values are in the expected range
- labels correspond to the right examples
- train and evaluation splits are truly separate
- preprocessing at evaluation time matches training-time preprocessing
An interviewer often wants to see that you do not blame the optimizer before checking whether the target itself is corrupted.
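The data checks above can be sketched as a small report run before any training. This is a minimal illustration, not a complete validation suite: the helper name `data_sanity_report` and its arguments are hypothetical, and NumPy arrays stand in for whatever the real pipeline produces.

```python
import numpy as np
from collections import Counter

def data_sanity_report(x, y, expected_min, expected_max):
    """Summarize basic facts about inputs x and integer labels y."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y)
    return {
        # one label per example
        "same_length": len(x) == len(y),
        # NaN or inf already present in the raw inputs
        "has_nan_or_inf": bool(not np.isfinite(x).all()),
        # e.g. pixels expected in [0, 1] but still stored as 0..255
        "in_expected_range": bool(
            np.nanmin(x) >= expected_min and np.nanmax(x) <= expected_max
        ),
        # a missing or dominant class shows up here immediately
        "label_counts": dict(Counter(y.tolist())),
    }

# Toy inputs: one feature was never rescaled to [0, 1].
x = np.array([[0.0, 0.5], [1.0, 0.25], [255.0, 0.1]])
y = np.array([0, 1, 1])
report = data_sanity_report(x, y, expected_min=0.0, expected_max=1.0)
```

Here `report["in_expected_range"]` is false because of the unscaled value, which is the kind of finding worth stating before any discussion of the optimizer.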
Loss and Activation Compatibility
This is one of the most common silent bugs.
Examples:
- applying softmax before a loss that already expects logits
- using mean-squared error for classification without good reason
- mismatching binary labels with multiclass output format
The key point is that the code can run without crashing and still learn the wrong thing.
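To make the first example concrete, here is a minimal NumPy sketch of a cross-entropy that expects logits being fed probabilities instead. The function names are illustrative, not a framework API. With three classes, the double-softmax loss cannot drop below about 0.55 even for a perfectly confident correct prediction, so training runs but the loss plateaus.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_from_logits(logits, target):
    """Cross-entropy that applies softmax internally, so it expects raw logits."""
    return float(-np.log(softmax(logits)[target]))

logits = np.array([10.0, 0.0, 0.0])   # confident, correct prediction for class 0

# Correct usage: pass raw logits.
correct_loss = cross_entropy_from_logits(logits, target=0)

# Bug: softmax is applied before a loss that applies it again.
buggy_loss = cross_entropy_from_logits(softmax(logits), target=0)
```

Nothing crashes in the buggy call. The only symptom is a loss that stays far from zero and gradients that are much smaller than they should be.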
Gradient Flow
When a model is not learning, a strong answer is to inspect:
- whether gradients are zero
- whether gradients are NaN
- whether parameters change after optimizer.step()
- whether the intended parameters are actually included in the optimizer
This is more informative than saying "maybe the learning rate is wrong" and stopping there.
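One way to turn these checks into a routine is to snapshot parameters before a step and classify each one afterwards. The sketch below uses plain NumPy dictionaries as a stand-in for a framework's parameter and gradient storage; `diagnose_update` and its status strings are hypothetical names chosen for illustration.

```python
import numpy as np

def diagnose_update(before, after, grads):
    """Report, per parameter name, whether and why an update did not happen."""
    report = {}
    for name in before:
        g = grads.get(name)
        if g is None:
            status = "no gradient"
        elif not np.isfinite(g).all():
            status = "gradient contains NaN or inf"
        elif not np.any(g):
            status = "gradient is exactly zero"
        elif np.array_equal(before[name], after[name]):
            status = "gradient exists but value did not change"
        else:
            status = "updated"
        report[name] = status
    return report

# Toy SGD step in which the bias was accidentally left out of the update,
# similar to a parameter that was never handed to the optimizer.
before = {"weight": np.array([1.0, -2.0]), "bias": np.array([0.5])}
grads = {"weight": np.array([0.1, 0.3]), "bias": np.array([0.2])}
lr = 0.1
after = {
    "weight": before["weight"] - lr * grads["weight"],
    "bias": before["bias"].copy(),
}
report = diagnose_update(before, after, grads)
```

The bias ends up labeled as having a gradient but no change, which points at the optimizer setup rather than at the learning rate.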
Attention Debugging
For attention, four checks solve many issues:
- Q, K, V shapes
- whether the score matrix has the expected shape
- whether the mask broadcasts to the score shape
- whether softmax is applied over the key dimension
If any of those is wrong, the model may still run but produce meaningless attention patterns.
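The four checks can be written directly into a reference implementation. The following is a minimal NumPy sketch of scaled dot-product attention, assuming a (batch, heads, length, dim) layout and a boolean mask that is True where attention is allowed. Both conventions are assumptions made for this example, and the code also assumes every query has at least one allowed key.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention with explicit shape checks.

    q: (batch, heads, q_len, d_k), k: (batch, heads, k_len, d_k),
    v: (batch, heads, k_len, d_v); mask is True where attention is allowed.
    """
    b, h, q_len, d_k = q.shape
    k_len = k.shape[2]
    assert k.shape == (b, h, k_len, d_k), f"bad K shape {k.shape}"
    assert v.shape[:3] == (b, h, k_len), f"bad V shape {v.shape}"

    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)
    assert scores.shape == (b, h, q_len, k_len)

    if mask is not None:
        # broadcast_to raises if the mask cannot broadcast to the score shape
        mask = np.broadcast_to(mask, scores.shape)
        scores = np.where(mask, scores, -np.inf)

    # Softmax over the last axis, which is the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 4, 8))
k = rng.normal(size=(2, 3, 4, 8))
v = rng.normal(size=(2, 3, 4, 5))
causal = np.tril(np.ones((4, 4), dtype=bool))   # query i sees keys 0..i
out, weights = attention(q, k, v, causal)
```

Two quick invariants catch most mistakes: each row of `weights` sums to one, and masked positions receive exactly zero weight.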
Numerical Stability
NaNs usually come from a small set of operations:
- exponentials on large values
- logarithms of zero
- divisions by tiny denominators
- half-precision overflow
- invalid normalization constants
A good answer names the operation class, not just the symptom.
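The exponential case is the one most often asked about. A minimal sketch of the standard fix, subtracting the maximum before exponentiating, is shown here for log-sum-exp. The function names are illustrative.

```python
import numpy as np

def logsumexp_naive(x):
    # exp overflows for large inputs, so the result becomes inf
    return np.log(np.sum(np.exp(x)))

def logsumexp_stable(x):
    # exp(x - m) is at most 1, so nothing overflows; m is added back after the log
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
with np.errstate(over="ignore"):
    naive = logsumexp_naive(x)
stable = logsumexp_stable(x)   # equals 1000 + log(2)
```

The two functions are mathematically identical. Only the order of operations differs, which is why this class of bug is invisible when reading the formula and visible only in the numbers.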
Common Interview Bugs
1. Loss Does Not Decrease
Possible causes:
- learning rate too high or too low
- wrong labels
- no gradient flow
- optimizer not stepping
- output activation mismatched with loss
2. NaNs During Training
Common reasons:
- log(0)
- exp(large_number)
- division by zero
- exploding gradients
- invalid normalization
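When a NaN shows up in the loss, the useful question is where the first non-finite value appeared. A minimal sketch, assuming intermediate values have been recorded in forward-pass order in a dictionary (the helper `first_nonfinite` and the `trace` dictionary are hypothetical):

```python
import numpy as np

def first_nonfinite(named_values):
    """Return the name of the first entry containing NaN or inf, else None."""
    for name, value in named_values.items():
        if not np.isfinite(np.asarray(value, dtype=np.float64)).all():
            return name
    return None

with np.errstate(divide="ignore", invalid="ignore"):
    probs = np.array([1.0, 0.0])
    log_probs = np.log(probs)            # log(0) gives -inf
    loss = -(probs * log_probs).sum()    # 0 * -inf gives NaN

# Dictionaries keep insertion order, so this mirrors the forward pass.
trace = {"probs": probs, "log_probs": log_probs, "loss": loss}
culprit = first_nonfinite(trace)
```

The loss is NaN, but the first bad value is in `log_probs`, so the operation to fix is the logarithm, not the reduction.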
3. Accuracy Is Suspiciously High
Check:
- train/test leakage
- duplicates across splits
- label leakage in features
- preprocessing fit on all data
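Duplicates across splits are cheap to test for. The sketch below uses only the standard library and a deliberately simple normalization (lowercasing and whitespace collapsing). Real data may need a stronger notion of near-duplicate, and the function names are illustrative.

```python
def normalize(text):
    """Cheap canonical form so trivial variants count as duplicates."""
    return " ".join(text.lower().split())

def cross_split_duplicates(train_texts, test_texts):
    """Return test examples whose normalized form also appears in train."""
    seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in seen]

train = ["The cat sat.", "Dogs bark loudly", "rain  falls"]
test = ["the cat sat.", "Birds sing", "Rain falls"]
leaked = cross_split_duplicates(train, test)
```

Two of the three test examples already appear in the training set, so any accuracy measured on this split would overstate generalization.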
4. Attention Is Wrong
Check:
- mask orientation
- broadcasting shape
- scaling by sqrt(d_k)
- softmax axis
Common Failure Modes
1. The Model Appears to Train but the Metric Is Broken
Sometimes the loss goes down because the code optimizes something real, but the evaluation metric is computed incorrectly.
Examples:
- thresholding logits incorrectly
- averaging over padded tokens
- mixing micro and macro averaging unintentionally
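The padded-token case is easy to show with numbers. This is a minimal sketch assuming padded positions are marked with a sentinel label. The value -100 and the helper name `token_accuracy` are choices made for this example.

```python
import numpy as np

PAD = -100  # assumed label id marking padded positions

def token_accuracy(preds, labels, ignore_index=PAD):
    """Accuracy over real tokens only; padded positions are excluded."""
    valid = labels != ignore_index
    return float((preds[valid] == labels[valid]).mean())

labels = np.array([[5, 7, PAD, PAD],
                   [3, PAD, PAD, PAD]])
preds = np.array([[5, 2, 0, 0],
                  [3, 0, 0, 0]])

naive = float((preds == labels).mean())   # padding counted as errors: 2 of 8
masked = token_accuracy(preds, labels)    # real tokens only: 2 of 3
```

The same predictions score 0.25 or about 0.67 depending only on how padding is handled, so the metric can move a lot with batch composition while the model stays the same.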
2. Leakage Hidden in Preprocessing
A classic example is fitting normalization, vocabulary construction, PCA, or imputation on the full dataset before splitting.
This can make evaluation look unrealistically strong.
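For normalization, the fix is to split first, fit statistics on the training rows only, and reuse them at evaluation time. A minimal NumPy sketch with hypothetical helper names and a deterministic toy dataset:

```python
import numpy as np

def fit_standardizer(x):
    """Compute per-feature mean and std from the given rows only."""
    return x.mean(axis=0), x.std(axis=0) + 1e-8

def apply_standardizer(x, mean, std):
    return (x - mean) / std

data = np.arange(30, dtype=np.float64).reshape(10, 3)
train, test = data[:8], data[8:]              # split first

mean, std = fit_standardizer(train)           # fit on train only
train_z = apply_standardizer(train, mean, std)
test_z = apply_standardizer(test, mean, std)  # reuse the train statistics

# The leaky variant, for contrast: statistics computed on all rows.
leaky_mean, _ = fit_standardizer(data)
```

The leaky mean differs from the train-only mean because it has already seen the test rows. The same pattern applies to vocabulary construction, PCA, and imputation.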
3. Training/Eval Mode Bugs
BatchNorm and dropout behave differently in training and evaluation.
If the mode is wrong, metrics can swing dramatically even though the model code itself looks unchanged.
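The dropout half of this can be illustrated without a framework. Below is a minimal inverted-dropout sketch with an explicit `training` flag. The class is a toy stand-in written for this example, not a library layer.

```python
import numpy as np

class Dropout:
    """Minimal inverted dropout with an explicit training flag."""

    def __init__(self, p, seed=0):
        self.p = p
        self.training = True
        self.rng = np.random.default_rng(seed)

    def __call__(self, x):
        if not self.training:
            return x                        # evaluation: identity
        keep = self.rng.random(x.shape) >= self.p
        return x * keep / (1.0 - self.p)    # training: random zeros, rescaled

layer = Dropout(p=0.5)
x = np.ones(1000)

train_out = layer(x)      # a mix of 0.0 and 2.0
layer.training = False
eval_out = layer(x)       # the input, unchanged
```

If evaluation runs with the flag still set to training, every prediction is computed on a randomly corrupted input, which is enough to explain a large metric drop with no change to the weights.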
4. Parameters Not Updating
Common reasons:
- frozen parameters
- missing parameter group in the optimizer
- zero_grad or step used incorrectly
- gradient accumulation logic applied incorrectly
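Gradient accumulation is a frequent source of subtle errors, so it helps to know the invariant: with equal-sized micro-batches and a mean-reduced loss, the accumulated gradient scaled by the number of steps should match the full-batch gradient. The sketch below checks this for a linear model in NumPy, with the framework calls replaced by explicit arithmetic.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean squared error for the linear model x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

full_grad = mse_grad(w, x, y)

accum_steps = 4
accum = np.zeros_like(w)                         # reset once per optimizer step
for xb, yb in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
    accum += mse_grad(w, xb, yb) / accum_steps   # scale each micro-batch
# a single parameter update would use `accum` here

# What the accumulated gradient looks like if the scaling is forgotten.
unscaled = accum * accum_steps
```

Forgetting the scaling multiplies the effective learning rate by the number of accumulation steps, and resetting the accumulator inside the loop silently discards all but the last micro-batch.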
5. Shape Bugs That Broadcast Silently
This is particularly dangerous in NumPy and PyTorch because broadcasting lets the code run and produce outputs with the wrong meaning without raising an exception.
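The classic instance is subtracting a (N,) target from a (N, 1) prediction. A minimal NumPy example with made-up values:

```python
import numpy as np

preds = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (4, 1), e.g. a model output
targets = np.array([1.0, 2.0, 3.0, 4.0])         # shape (4,)

# Bug: (4, 1) minus (4,) broadcasts to (4, 4) without any error.
buggy_diff = preds - targets
buggy_mse = float((buggy_diff ** 2).mean())

# Fix: make the shapes match before subtracting.
fixed_diff = preds.squeeze(-1) - targets         # shape (4,)
fixed_mse = float((fixed_diff ** 2).mean())
```

The predictions are perfect, yet the buggy loss is 2.5 because every prediction is compared against every target. Asserting the shape of the difference before reducing it is a cheap guard.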
Edge Cases and Follow-Up Questions
What if the training loss decreases but validation quality is flat?
That suggests:
- overfitting
- train/eval distribution mismatch
- wrong validation metric
- label leakage in training but not validation
Do not assume optimization is the only issue.
What if NaNs happen only in mixed precision?
Then the architecture itself is unlikely to be the cause, since the same model is stable in full precision.
It is probably one of:
- reduced numeric range
- unstable gradient scaling
- half-precision overflow in activations or logits
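The reduced range is easy to demonstrate. The sketch below uses NumPy's float16 as a stand-in for half precision on an accelerator. The specific logit value is chosen only to cross the float16 limit.

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)    # largest finite float16 value

logit = 12.0
with np.errstate(over="ignore"):
    # exp(12) exceeds the float16 maximum, so the result is inf
    exp16 = np.exp(np.array([logit], dtype=np.float16))
# the same value is comfortably finite in float32
exp32 = np.exp(np.array([logit], dtype=np.float32))
```

A logit of 12 is unremarkable in float32 but already overflows an unshifted exponential in float16, which is why max-subtraction and keeping sensitive reductions in higher precision matter more under mixed precision.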
What if attention code runs but outputs are nonsense?
That often means a semantic shape bug:
- wrong transpose
- wrong softmax axis
- bad mask broadcast
- mixing batch and head dimensions
What if accuracy is high but examples look obviously wrong?
Then inspect the evaluation setup itself:
- label encoding
- thresholding
- class imbalance
- leakage
- duplicate examples
What if only one class is ever predicted?
Possible causes include:
- severe class imbalance
- threshold issue
- collapsed logits
- bad bias initialization
- loss-weighting problem
Timed Coding Strategy
When the problem starts, do this:
1. State assumptions.
2. Write a simple correct version.
3. Mention edge cases.
4. Improve stability or efficiency.
5. State the time and space complexity.
That pattern is reliable under pressure.
Files in This Topic
- debugging_patterns.py: small bug patterns and checks
- mock_questions.md: timed coding and debugging prompts
These files are intentionally small and repeatable. The point is to make your debugging procedure easy to recall in an interview, not to build a large debugging framework.
What to Practice Saying Out Loud
- If loss is flat, what are your first five checks?
- If accuracy is 99.9%, why might that be wrong?
- If attention output is nonsense, which tensor shapes would you inspect first?
- Why does clipping help with exploding gradients?
- Why can the wrong loss/activation pair silently break learning?
- Why can a model look stable in training but fail only at evaluation time?
- What is the difference between a numerical bug and a statistical bug in model performance?
- Which checks would you do before changing the model architecture?