Topic 30: A/B Testing & Experimentation

🔥 For interviews, read these first:

AB_TESTING_DEEP_DIVE.md — frontier-lab deep dive: hypothesis tests (z/t/Mann-Whitney/bootstrap), sample size formulas, CUPED, peeking and sequential testing, SUTVA / network effects, SRM check, novelty effects, multiple testing, Bayesian A/B, ML-specific (interleaving, holdback, off-policy / IPS).

INTERVIEW_GRILL.md — 55 active-recall questions.

What You'll Learn

This topic covers A/B testing for ML:

Statistical foundations
Hypothesis testing
Sample size calculation
Multiple testing correction
Interpreting results
Common pitfalls
Bayesian A/B testing

Why We Need This

Interview Importance

Common questions: "How do you A/B test a new model?"
Practical knowledge: Essential for production ML
Statistical rigor: Shows scientific approach

Real-World Application

Model deployment: Test new models before full rollout
Feature testing: Measure how new features change user metrics
Business decisions: Base launch decisions on measured effects

Core Intuition

A/B testing is about causal evidence, not just comparing two averages.

The real question is:

did the intervention cause the observed difference?
is that difference large enough to matter?

Why Randomization Matters

Randomization makes treatment and control comparable in expectation, on both observed and unobserved characteristics.

Without it, observed differences may come from:

selection bias
seasonality
user-segment imbalance
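
A minimal simulation sketch of the selection-bias case, using made-up numbers. The true treatment effect is zero, yet a self-selected rollout shows a large "lift" because engaged users opt in more often. The function name `simulate`, the engagement distribution, and the opt-in probabilities are all hypothetical.

```python
import random
from statistics import mean

def simulate(num_users=20_000, seed=0):
    """Compare self-selected vs randomized assignment when the true effect is zero."""
    rng = random.Random(seed)
    # Hypothetical baseline engagement per user; treatment does not change it.
    engagement = [rng.gauss(10, 2) for _ in range(num_users)]

    # Self-selection: more engaged users are more likely to opt in.
    opted_in = [rng.random() < (0.8 if e > 10 else 0.2) for e in engagement]
    # Randomization: a coin flip that ignores user characteristics.
    randomized = [rng.random() < 0.5 for _ in engagement]

    def diff(assignments):
        treated = [e for e, t in zip(engagement, assignments) if t]
        control = [e for e, t in zip(engagement, assignments) if not t]
        return mean(treated) - mean(control)

    return diff(opted_in), diff(randomized)

biased_diff, randomized_diff = simulate()
print(f"self-selected: {biased_diff:.2f}, randomized: {randomized_diff:.2f}")
```

The self-selected difference reflects who chose the treatment, not what the treatment did. The randomized difference stays near zero, which is the correct answer here.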

Statistical vs Practical Significance

This distinction matters a lot in interviews.

statistical significance asks whether the effect is unlikely under the null
practical significance asks whether the effect is worth acting on
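
A small sketch of the distinction, with made-up counts. With millions of users, a 0.1 percentage point lift is statistically significant but may still fall short of what the business needs. The helper `assess_lift`, the `min_practical_lift` threshold, and the rule "act only if the whole confidence interval clears the threshold" are illustrative assumptions, not a standard.

```python
from math import sqrt
from statistics import NormalDist

def assess_lift(conv_a, n_a, conv_b, n_b, min_practical_lift, alpha=0.05):
    """Return (lift, ci_low, ci_high, statistically_sig, practically_sig)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = p_b - p_a
    # Unpooled standard error for the difference of two proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci_low, ci_high = lift - z_crit * se, lift + z_crit * se
    # Statistically significant: the confidence interval excludes zero.
    statistically_sig = ci_low > 0 or ci_high < 0
    # Hypothetical decision rule: the whole interval must clear the threshold.
    practically_sig = ci_low >= min_practical_lift
    return lift, ci_low, ci_high, statistically_sig, practically_sig

# Made-up data: 10.0% vs 10.1% conversion, 2,000,000 users per arm,
# and a business threshold of 0.5 percentage points.
result = assess_lift(200_000, 2_000_000, 202_000, 2_000_000, min_practical_lift=0.005)
print(result)
```

Here the interval excludes zero, so the effect is statistically significant, but it sits well below the 0.5 point threshold, so under this rule it is not worth acting on.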

Technical Details Interviewers Often Want

Power and Sample Size

An underpowered experiment can miss a real effect.

So "not significant" does not automatically mean "no effect."
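
To make this concrete, here is a rough sketch of the power of a two-sided two-sample z-test with equal group sizes and a known, common standard deviation. The function name and the example numbers are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(effect, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with equal groups."""
    norm = NormalDist()
    se = sigma * sqrt(2 / n_per_group)   # standard error of the difference in means
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se             # true effect measured in standard errors
    # Probability the test statistic lands in either rejection region.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# Same true effect (0.1 standard deviations), two different sample sizes.
small = power_two_sample(effect=0.1, sigma=1.0, n_per_group=200)
large = power_two_sample(effect=0.1, sigma=1.0, n_per_group=1570)
print(f"n=200: power={small:.2f}, n=1570: power={large:.2f}")
```

With 200 users per group the test detects this real effect less than one time in five, so a non-significant result says very little. With 1570 per group, power is about 80%.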

Multiple Metrics

Testing many metrics increases false positives unless you:

predefine a primary metric
correct for multiple testing
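
A short sketch of both halves of the problem. The first function shows how fast the chance of at least one false positive grows when every metric is truly null and the metrics are independent (an assumption; real metrics are usually correlated). The second applies the Holm-Bonferroni step-down procedure, chosen here as one common correction among several.

```python
def family_wise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent null metrics."""
    return 1 - (1 - alpha) ** num_metrics

def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value (k = rank + 1) with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return rejected

print(round(family_wise_error_rate(10), 3))        # about 0.40 for 10 metrics
print(holm_bonferroni([0.001, 0.04, 0.03, 0.20]))  # only the first survives
```

Note that 0.04 and 0.03 would each pass an uncorrected 0.05 threshold but do not survive the correction.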

Peeking

Repeatedly checking results and stopping early inflates false positive risk in fixed-horizon testing.
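
A simulation sketch of why this happens. Each simulated experiment is an A/A test (both arms draw from the same distribution, so every rejection is a false positive). We compare "stop at the first significant look" against "test once at the end". The function name, number of looks, batch size, and known unit variance are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rates(num_experiments=500, looks=10, batch=50, alpha=0.05, seed=0):
    """Simulate A/A tests; return (rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = final_hits = 0
    for _ in range(num_experiments):
        sum_a = sum_b = 0.0
        n = 0
        rejected_at_any_look = False
        for _ in range(looks):
            # Both arms draw from the same N(0, 1), so the null is true.
            sum_a += sum(rng.gauss(0, 1) for _ in range(batch))
            sum_b += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            # z-statistic for the difference in means with known variance 1.
            z = (sum_b / n - sum_a / n) / sqrt(2 / n)
            if abs(z) > z_crit:
                rejected_at_any_look = True
        peeking_hits += rejected_at_any_look
        final_hits += abs(z) > z_crit  # z from the last look only
    return peeking_hits / num_experiments, final_hits / num_experiments

peeking_rate, fixed_rate = false_positive_rates()
print(f"peeking: {peeking_rate:.3f}, single final look: {fixed_rate:.3f}")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out noticeably higher, because every extra look is another chance for noise to cross the threshold.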

Common Failure Modes

declaring victory from a tiny but significant effect
running an underpowered experiment
testing many metrics without correction or prioritization
peeking early and treating the p-value as valid
confusing causal lift with observational correlation

Edge Cases and Follow-Up Questions

Why is randomization so important?
Why can a p-value below 0.05 still be uninteresting?
Why does peeking break naive fixed-horizon inference?
Why do multiple metrics increase false positives?
Why can an experiment fail to show significance even when a useful effect exists?

What to Practice Saying Out Loud

The difference between statistical and practical significance
Why sample size and power matter
Why A/B testing is really about causal inference in product decisions

Theory

Hypothesis Testing

Null Hypothesis (H₀):

No difference between A and B
Model A = Model B

Alternative Hypothesis (H₁):

There is a difference
Model A ≠ Model B (two-sided), or A > B (one-sided)

Significance Level (α):

Probability of rejecting H₀ when it's true (Type I error)
Typically α = 0.05 (5%)

P-value:

Probability of observing results at least as extreme as those actually seen, assuming H₀ is true
If p < α, reject H₀
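
Putting these definitions together, here is a minimal sketch of a two-sided two-proportion z-test on conversion counts. The counts are made up, and the z-test is only one reasonable choice. It relies on the normal approximation, which needs reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of H0: p_A = p_B. Returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 both arms share one rate, so pool them to estimate the standard error.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: probability of a |z| at least this large under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

alpha = 0.05
# Made-up data: 5.0% vs 5.7% conversion with 10,000 users per arm.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
reject_h0 = p_value < alpha
print(f"z={z:.2f}, p={p_value:.4f}, reject H0: {reject_h0}")
```

For these counts z is about 2.2 and the p-value is below 0.05, so H₀ is rejected at α = 0.05.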

Sample Size Calculation

Formula (sample size per group, for a two-sided test with equal group sizes and a common σ):

n = 2 × (Z_α/2 + Z_β)² × σ² / (μ_A - μ_B)²

Where:
- Z_α/2: Z-score for significance level (1.96 for two-sided α=0.05)
- Z_β: Z-score for power (0.84 for 80% power, i.e. β=0.20)
- σ: Standard deviation of the metric (assumed equal in both groups)
- μ_A - μ_B: Minimum detectable effect
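
The formula translates directly into code. This sketch assumes equal group sizes, a common known σ, and a two-sided test. The function name and example inputs are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.80):
    """Users needed in EACH group to detect an absolute difference of `mde`."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = norm.inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

n = sample_size_per_group(sigma=1.0, mde=0.1)
n_half_mde = sample_size_per_group(sigma=1.0, mde=0.05)
print(n, n_half_mde)  # 1570 6280
```

Because the effect size is squared in the denominator, halving the minimum detectable effect roughly quadruples the required sample. Using the rounded z-scores 1.96 and 0.84 by hand gives 1568 for the first case instead of 1570.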

Factors: