Topic 30: A/B Testing & Experimentation
🔥 For interviews, read these first:
- AB_TESTING_DEEP_DIVE.md: frontier-lab deep dive covering hypothesis tests (z/t/Mann-Whitney/bootstrap), sample size formulas, CUPED, peeking and sequential testing, SUTVA / network effects, SRM check, novelty effects, multiple testing, Bayesian A/B, and ML-specific methods (interleaving, holdback, off-policy / IPS).
- INTERVIEW_GRILL.md: 55 active-recall questions.
What You'll Learn
This topic covers A/B testing for ML:
- Statistical foundations
- Hypothesis testing
- Sample size calculation
- Multiple testing correction
- Interpreting results
- Common pitfalls
- Bayesian A/B testing
Why We Need This
Interview Importance
- Common questions: "How do you A/B test a new model?"
- Practical knowledge: Essential for production ML
- Statistical rigor: Shows scientific approach
Real-World Application
- Model deployment: Test new models before full rollout
- Feature testing: Test new features
- Business decisions: Data-driven decisions
Core Intuition
A/B testing is about causal evidence, not just comparing two averages.
The real question is:
- did the intervention cause the observed difference?
- is that difference large enough to matter?
Why Randomization Matters
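Random assignment makes treatment and control comparable in expectation, on both observed and unobserved characteristics, so the intervention is the only systematic difference between the groups.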
Without it, observed differences may come from:
- selection bias
- seasonality
- user-segment imbalance
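One common way to get stable random assignment is to hash the user ID together with an experiment name. The sketch below is a minimal illustration under that assumption; the function name `assign_variant`, the experiment label, and the user ID format are all hypothetical and not taken from any particular experimentation platform.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments in different
    # experiments from lining up with each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical usage: count assignments for 10,000 made-up user IDs.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "rec-model-v2")] += 1
print(counts)
```

Because the assignment depends only on the user ID and the experiment name, a returning user always sees the same variant, and the split does not depend on time of day, segment, or any user choice.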
Statistical vs Practical Significance
This distinction matters a lot in interviews.
- statistical significance asks whether an effect at least this large would be unlikely if the null hypothesis were true
- practical significance asks whether the effect is worth acting on
Technical Details Interviewers Often Want
Power and Sample Size
An underpowered experiment can miss a real effect.
So "not significant" does not automatically mean "no effect."
Multiple Metrics
Testing many metrics increases false positives unless you:
- predefine a primary metric
- correct for multiple testing
Peeking
Repeatedly checking results and stopping early inflates false positive risk in fixed-horizon testing.
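The inflation can be seen in a small simulation of A/A tests, where there is no true effect and any "win" is a false positive. This is a sketch under simplifying assumptions: unit-variance normal outcomes, equal batches per look, and a known-variance z-statistic. The function name and all parameter values are illustrative.

```python
import random
from statistics import NormalDist

def false_positive_rates(n_sims=20_000, n_looks=10, batch=100, alpha=0.05, seed=0):
    """Simulate A/A tests and compare stopping at any look vs one final analysis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        diff_sum = 0.0  # running sum of (treatment - control) outcomes
        crossed = False
        for look in range(1, n_looks + 1):
            # With unit-variance outcomes and no effect, the difference of
            # two batch sums is distributed N(0, 2 * batch).
            diff_sum += rng.gauss(0.0, (2 * batch) ** 0.5)
            z = diff_sum / (2 * batch * look) ** 0.5
            if abs(z) > z_crit:
                crossed = True  # a peeker would stop and declare a win here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit  # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking, fixed = false_positive_rates()
print(f"stop at first significant look: {peeking:.3f}")
print(f"single analysis at the end:     {fixed:.3f}")
```

In this setup the single final analysis stays close to the nominal 5%, while stopping at the first significant look comes out well above it, even though the data and the test statistic are identical.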
Common Failure Modes
- declaring victory from a tiny but significant effect
- running an underpowered experiment
- testing many metrics without correction or prioritization
- peeking early and treating the p-value as valid
- confusing causal lift with observational correlation
Edge Cases and Follow-Up Questions
- Why is randomization so important?
- Why can a p-value below 0.05 still be uninteresting?
- Why does peeking break naive fixed-horizon inference?
- Why do multiple metrics increase false positives?
- Why can an experiment fail to show significance even when a useful effect exists?
What to Practice Saying Out Loud
- The difference between statistical and practical significance
- Why sample size and power matter
- Why A/B testing is really about causal inference in product decisions
Theory
Hypothesis Testing
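Null Hypothesis (H₀):
- No difference between A and B
- μ_A = μ_B (the primary metric has the same mean under both models)
Alternative Hypothesis (H₁):
- There is a difference
- μ_A ≠ μ_B (two-sided), or μ_B > μ_A (one-sided, if only an improvement from B matters)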
Significance Level (α):
- Probability of rejecting H₀ when it's true (Type I error)
- Typically α = 0.05 (5%)
P-value:
- Probability of observing results at least as extreme as those actually observed, assuming H₀ is true
- It is not the probability that H₀ is true
- If p < α, reject H₀
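For a rate metric such as conversion or CTR, these definitions can be made concrete with a two-proportion z-test. The sketch below assumes one binary outcome per user and samples large enough for the normal approximation; the counts are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test of H0: p_A == p_B. Returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Under H0 both groups share one rate, so pool them for the standard error.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: probability of a |z| at least this large under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 1,000 / 10,000 conversions in A vs 1,100 / 10,000 in B.
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```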
Sample Size Calculation
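Formula (sample size per group, two-sided test, equal group sizes and equal variances):
n = 2 × (Z_{α/2} + Z_β)² × σ² / (μ_A - μ_B)²
Where:
- Z_{α/2}: Z-score for the two-sided significance level (1.96 for α=0.05)
- Z_β: Z-score for power 1 - β (0.84 for 80% power)
- σ: Standard deviation of the metric (for a conversion rate p, σ² = p(1 - p))
- μ_A - μ_B: Minimum detectable effect (absolute difference in means)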
Factors:
- Effect size (how large a difference you want to detect; smaller effects need larger samples)
- Statistical power (1 - β, typically 80%)
- Significance level (α, typically 5%)
- Variance (more variance → larger sample needed)
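The formula translates directly into a few lines of code. The sketch below assumes a two-sided test with equal group sizes; the function name, the 10% baseline rate, and the 1-point minimum detectable effect are hypothetical values chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.80):
    """Users needed per group for a two-sided, two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

# Hypothetical binary metric: 10% baseline rate, smallest lift worth detecting
# is 1 percentage point (absolute).
baseline = 0.10
sigma = (baseline * (1 - baseline)) ** 0.5  # Bernoulli standard deviation
print(sample_size_per_group(sigma, mde=0.01))
print(sample_size_per_group(sigma, mde=0.005))  # halving the MDE
```

Because the effect enters the formula squared, halving the minimum detectable effect roughly quadruples the required sample size.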
Multiple Testing Correction
Problem:
- Testing multiple metrics increases false positive rate
- 20 independent tests at α=0.05, with every null true → ~64% chance of at least one false positive (1 - 0.95²⁰ ≈ 0.64)
Solutions:
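- Bonferroni: Divide α by the number of tests; controls the chance of any false positive, but is conservative
- FDR (False Discovery Rate): Control the expected proportion of false positives among the rejected hypotheses (e.g., Benjamini-Hochberg)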
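Both corrections are short enough to write out by hand. The sketch below shows the family-wise error rate for independent tests, Bonferroni, and the Benjamini-Hochberg step-up procedure as one standard way to control FDR; the p-values are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values, alpha=0.05):
    """Reject test i if p_i <= alpha / m. Controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure. Controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # ... and reject the k_max smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

p_values = [0.001, 0.012, 0.028, 0.045, 0.200]  # hypothetical per-metric p-values
print(round(familywise_error_rate(20), 3))
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these hypothetical p-values Bonferroni rejects only the smallest one, while Benjamini-Hochberg rejects the three smallest, which illustrates the trade-off: FDR control keeps more power in exchange for tolerating some false discoveries.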
Interpreting Results
Statistical Significance:
- p < 0.05: Statistically significant
- But: Statistical ≠ Practical significance
Effect Size:
- How big is the difference?
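- With enough traffic, a 0.1% improvement can be statistically significant yet too small to matter
Confidence Intervals:
- Range of effect sizes that are compatible with the data
- If the 95% CI for the difference excludes 0, the result is significant at α = 0.05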
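A confidence interval makes both questions visible at once: whether the effect is distinguishable from zero and whether it is large enough to act on. The sketch below uses a Wald interval for the difference of two rates, which assumes large samples; the counts and the practical threshold of 0.5 percentage points are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(successes_a, n_a, successes_b, n_b, level=0.95):
    """Wald confidence interval for p_B - p_A (absolute difference in rates)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Unpooled standard error: each group keeps its own variance estimate.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical counts: 1,000 / 10,000 in A vs 1,100 / 10,000 in B.
low, high = diff_confidence_interval(1000, 10_000, 1100, 10_000)
min_practical_effect = 0.005  # hypothetical business threshold (absolute)
print(f"95% CI for the lift: [{low:.4f}, {high:.4f}]")
print("statistically significant:", low > 0 or high < 0)
print("clearly above practical threshold:", low > min_practical_effect)
```

With these numbers the interval excludes 0 but its lower end sits below the practical threshold, so the result is statistically significant without being clearly large enough to matter.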
Common Pitfalls
- Stopping early: Don't stop on an interim result unless the design uses a sequential testing method
- Multiple testing: Need correction
- Sample size: Too small → underpowered
- Selection bias: Non-random assignment
- Novelty effect: Temporary behavior changes
A/B Testing for ML Models
Process:
Step 1: Design Experiment
- Define metrics (primary and secondary)
- Set sample size
- Randomization strategy
- Duration
Step 2: Run Experiment
- Split traffic (50/50 or other)
- Collect data
- Don't stop early based on interim p-values
Step 3: Analyze Results
- Statistical test (t-test for continuous metrics; chi-square or two-proportion z-test for rates)
- Effect size
- Confidence intervals
- Check assumptions
Step 4: Decision
- If significant, positive, and large enough to matter: Roll out
- If not significant: Inconclusive; the effect may be absent or too small for this sample size to detect
- If negative: Don't roll out
Example: Testing New Recommendation Model
Setup:
- Control: Current model (A)
- Treatment: New model (B)
- Metric: Click-through rate (CTR)
- Sample size: 10,000 users per group
- Duration: 2 weeks
Results:
- A: CTR = 2.5%
- B: CTR = 2.8%
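- p-value = 0.02 (achievable at these rates only with roughly 31,000 users per group; with 10,000 per group and one binary click outcome per user, a two-proportion z-test gives p ≈ 0.19)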
- Effect size: 12% relative increase
Interpretation:
- Statistically significant (p < 0.05)
- Practically significant (12% increase)
- Decision: Roll out to 100%
Exercises
- Calculate sample size for experiment
- Analyze A/B test results
- Design experiment for new model
- Handle multiple metrics
Next Steps
- Review all topics
- Practice system design
- Prepare for interviews