Topic 54: Data Manipulation for ML

What You'll Learn

Some ML interviews still include practical data wrangling.

This topic focuses on the patterns you are most likely to need quickly:

filtering rows
handling missing values
groupby aggregations
joins and merges
feature normalization
one-hot encoding
train/test leakage concerns in preprocessing

Why This Matters

A lot of candidates prepare only model code and then get slowed down by:

basic pandas manipulation
feature engineering tables
aggregations over users, sessions, or time windows

For research and applied interviews, this can matter a lot because messy data is often where real work starts.

Core Intuition

1. Data Manipulation Is Part of Modeling

If your feature table is wrong, the model is wrong.

This means:

row-level correctness matters
time logic matters
grouping logic matters
leakage can happen before training even begins

2. Most Interview Tasks Are Simple but Easy to Mess Up

Typical prompts:

compute per-user statistics
fill missing values
merge labels with features
create normalized columns
aggregate by day or by category

The difficulty is usually not deep math. It is avoiding silent mistakes.

3. Leakage Risk

Very common issue:

compute global mean using all rows
then use it in the training pipeline

That is wrong if the test split influenced the transform.

Correct principle:

fit preprocessing on training data
apply the learned transform to validation/test

Common Failure Modes

1. Wrong Join Semantics

A merge can silently:

drop rows
duplicate rows
misalign labels and features

That can completely change the dataset before modeling starts.

2. Aggregating Over the Wrong Time Window

This is a common source of leakage.

If you compute a user feature using events that occur after the prediction timestamp, the model gets future information it should not have.

3. Filling Missing Values Without Thinking About Meaning

A missing value can mean:

zero
unknown
not applicable
sensor failure

Treating all missing values the same can erase useful signal or create misleading artifacts.

4. Fitting Preprocessing on the Full Dataset

Normalization, imputation, encoding, and PCA should usually be fit on training data only.

Otherwise evaluation is contaminated.

Edge Cases and Follow-Up Questions

What if a left join creates duplicate rows?

That usually means the join key is not unique on the right-hand side.

You should inspect key cardinality before trusting the result.

What if a category appears only at test time?

Then one-hot encoding needs a policy for unseen categories.

A good answer is to mention an "unknown" bucket or a learned fallback.

What if missingness itself is predictive?

Then the missing-value indicator can be useful as its own feature.

What if normalization is done per batch instead of from training statistics?

Then the feature meaning can drift across splits, and evaluation may become inconsistent.

Files in This Topic

data_manipulation.py: compact pandas and NumPy patterns

What to Practice Saying Out Loud

Why can preprocessing cause leakage?
How do you aggregate per entity without duplicating rows incorrectly?
When would you use a left join versus an inner join?
Why should normalization parameters come from training data only?
How do you handle missing values without hiding a signal?

ML & LLM Interview Prep — Deep Dives