Topic 29: System Design for ML

What You'll Learn

This topic covers the main building blocks of production ML system design:

Scalable ML pipelines
Model serving architecture
Real-time vs batch inference
Feature stores
Model versioning
Monitoring and alerting
A/B testing infrastructure
Cost optimization

Why We Need This

Interview Importance

Common questions: "Design a system to serve 1M predictions/second"
System design: Critical for ML engineer roles
Production knowledge: Shows real-world experience

Real-World Application

Production systems: Need to scale
Reliability: Systems must be robust
Cost: Efficient systems save money

System Design Topics

1. Scalable ML Pipeline

Components:

Data ingestion
Feature engineering
Model training
Model serving
Monitoring

Architecture:

Data Sources → Feature Store → Training Pipeline → Model Registry
                                                         ↓
User Requests → Feature Store ───────────────────→ Model Serving → Predictions

2. Model Serving

Options:

Real-time: REST API, gRPC
Batch: Scheduled jobs
Streaming: Kafka, Flink

Considerations:

Latency requirements
Throughput requirements
Cost constraints

3. Feature Stores

Purpose:

Centralized feature storage
Consistent features across training/serving
Feature versioning
Real-time feature computation

Benefits:

Prevent training-serving skew
Reuse features
Faster development

4. Model Versioning

Strategy:

Version models (v1, v2, ...)
Track metadata (metrics, data, hyperparameters)
Easy rollback

Tools:

MLflow
Weights & Biases
Custom solutions
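
The strategy above (numbered versions, tracked metadata, easy rollback) can be sketched with a small in-memory registry. This is an illustration under assumptions, not the API of MLflow or Weights & Biases: ModelRegistry, register, promote, and rollback are hypothetical names, and the artifact paths and metric values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str                      # where the serialized model lives
    metadata: Dict[str, Any] = field(default_factory=dict)


class ModelRegistry:
    """Hypothetical in-memory registry: versions, metadata, rollback."""

    def __init__(self) -> None:
        self._versions: List[ModelVersion] = []
        self._history: List[int] = []      # versions promoted to production, in order

    def register(self, artifact_uri: str, **metadata: Any) -> ModelVersion:
        mv = ModelVersion(len(self._versions) + 1, artifact_uri, dict(metadata))
        self._versions.append(mv)
        return mv

    def promote(self, version: int) -> None:
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._history.append(version)

    def rollback(self) -> int:
        """Return to the previously promoted version."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def production(self) -> Optional[ModelVersion]:
        if not self._history:
            return None
        return self._versions[self._history[-1] - 1]


registry = ModelRegistry()
registry.register("models/v1.pkl", auc=0.81, data_snapshot="2024-01-01")
registry.register("models/v2.pkl", auc=0.83, data_snapshot="2024-02-01")
registry.promote(1)
registry.promote(2)
registry.rollback()                        # v2 misbehaves: go back to v1
```

Keeping the promotion history separate from the list of versions is what makes rollback cheap: nothing is retrained or deleted, only the production pointer moves.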

5. Monitoring

What to monitor:

Prediction latency
Throughput
Error rates
Data drift
Model performance (online metrics, e.g. compared through A/B tests)

Alerting:

Set thresholds
Alert on anomalies
Dashboard for visualization
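
A minimal sketch of threshold-based alerting, assuming metrics arrive as a plain dictionary. The metric names and limits in THRESHOLDS are hypothetical; real values depend on the service's latency and error budgets.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds: metric name -> (comparison, limit).
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "p99_latency_ms": ("max", 100.0),
    "error_rate": ("max", 0.01),
    "throughput_rps": ("min", 500.0),
}


def check_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return one message per metric that breaches its threshold."""
    alerts = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no data")      # a missing metric is itself a signal
        elif kind == "max" and value > limit:
            alerts.append(f"{name}: {value} above {limit}")
        elif kind == "min" and value < limit:
            alerts.append(f"{name}: {value} below {limit}")
    return alerts


alerts = check_alerts({"p99_latency_ms": 140.0, "error_rate": 0.002})
```

Here the latency breach and the missing throughput metric both raise alerts, while the error rate stays quiet.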

Core Intuition

System design questions are not asking for the fanciest architecture.

They are usually asking whether you can reason about constraints.

For ML systems, the core constraints are:

latency
throughput
correctness
cost
reliability
offline/online consistency

The strongest answers start by naming which of those matter most for the problem.

Training vs Serving

One of the easiest mistakes is to mix training concerns with serving concerns.

Training systems optimize for:

throughput
reproducibility
experiment tracking

Serving systems optimize for:

latency
availability
safe rollouts
observability

Feature Stores Matter Because Consistency Matters

A feature store is not just a database of features.

Its real purpose is to reduce training-serving skew:

the feature definition should mean the same thing offline and online
the transformation path should be consistent
timestamps and freshness matter
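
The timestamp and freshness points can be made concrete with a point-in-time lookup. The sketch below assumes a toy in-memory store; FeatureHistory, write, and as_of are hypothetical names, not the interface of any particular feature store product.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


class FeatureHistory:
    """Hypothetical store of (timestamp, value) pairs per entity, sorted by time."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[Tuple[float, float]]] = {}

    def write(self, entity_id: str, ts: float, value: float) -> None:
        rows = self._rows.setdefault(entity_id, [])
        rows.append((ts, value))
        rows.sort()

    def as_of(self, entity_id: str, ts: float,
              max_age: Optional[float] = None) -> Optional[float]:
        """Latest value written at or before ts; None if absent or too stale."""
        rows = self._rows.get(entity_id, [])
        i = bisect_right(rows, (ts, float("inf")))
        if i == 0:
            return None
        written_at, value = rows[i - 1]
        if max_age is not None and ts - written_at > max_age:
            return None
        return value


store = FeatureHistory()
store.write("user_1", ts=100.0, value=3.0)   # e.g. purchases in the last 7 days
store.write("user_1", ts=200.0, value=5.0)

# A training row labelled at t=150 must see the value known then (3.0), not 5.0.
train_value = store.as_of("user_1", ts=150.0)
# An online request at t=1000 with a 300-second freshness limit gets nothing.
stale_value = store.as_of("user_1", ts=1000.0, max_age=300.0)
```

Using the same as_of call to build training rows and to answer online requests is one way to keep the feature definition identical in both paths and to avoid leaking future values into training data.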

Technical Details Interviewers Often Want

Real-Time Serving Trade-Offs

If you batch requests aggressively:

throughput usually improves
tail latency can worsen
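
A rough model of the batching trade-off, under stated assumptions: one worker, a batch cost of a fixed overhead plus a per-item cost, and no queueing in front of the batcher. The function name and the default timings are hypothetical.

```python
def batching_tradeoff(batch_size: int, arrival_rps: float,
                      overhead_ms: float = 5.0, per_item_ms: float = 1.0,
                      max_wait_ms: float = 50.0):
    """Return (max throughput in req/s, worst-case latency in ms) for one worker.

    Assumed cost model: one batch takes overhead_ms + per_item_ms * batch_size.
    """
    compute_ms = overhead_ms + per_item_ms * batch_size
    throughput_rps = batch_size / compute_ms * 1000.0
    # The first request in a batch waits for the remaining slots to fill,
    # capped by the batching timeout.
    fill_ms = min((batch_size - 1) / arrival_rps * 1000.0, max_wait_ms)
    return throughput_rps, fill_ms + compute_ms


for b in (1, 8, 32):
    print(b, batching_tradeoff(b, arrival_rps=1000.0))
```

With these numbers, larger batches amortize the fixed overhead, so throughput rises, while the first request in each batch waits longer, so worst-case latency grows. The batching timeout (max_wait_ms) is the knob that bounds that wait.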

If you cache aggressively:

latency and cost can improve
freshness and personalization can worsen

Monitoring Needs Multiple Layers

It is not enough to monitor infrastructure only.

A real ML system needs:

system metrics: latency, error rate, CPU/GPU, queue depth
data metrics: drift, null rates, feature freshness
model metrics: accuracy, calibration, business KPIs
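
For the data-metrics layer, one commonly used drift score is the population stability index (PSI). The sketch below is a plain-Python version with equal-width bins taken from the reference sample; the bin count and the eps floor are assumptions, and other binning schemes are equally valid.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0          # guard against a constant feature

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]          # training-time feature values
same = psi(reference, reference)                   # identical samples give 0.0
shifted = psi(reference, [0.5 + i / 200 for i in range(100)])
```

A sample compared with itself scores zero, and the shifted sample scores well above it. Alert thresholds on PSI are rules of thumb and should be tuned per feature.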

Rollout Safety

A strong answer often mentions:

canary or shadow deployment
rollback plan
model versioning
experiment analysis before full rollout
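
A minimal sketch of shadow deployment: the candidate model sees the same inputs as the current one, but only the current model's output is returned. The names serve_with_shadow, current, and candidate are hypothetical, and a production version would typically call the shadow model asynchronously so it cannot add latency.

```python
from typing import Any, Callable, Dict, List

Model = Callable[[Dict[str, float]], float]


def serve_with_shadow(features: Dict[str, float], primary: Model, shadow: Model,
                      log: List[Dict[str, Any]]) -> float:
    """Answer with the primary model; record the shadow model's output for comparison."""
    prediction = primary(features)
    try:
        shadow_prediction = shadow(features)
    except Exception as exc:              # a shadow failure must never reach the user
        shadow_prediction = None
        log.append({"shadow_error": repr(exc)})
    log.append({"primary": prediction, "shadow": shadow_prediction})
    return prediction


comparison_log: List[Dict[str, Any]] = []
current = lambda f: 0.2 * f["x"]          # stand-in for the production model
candidate = lambda f: 0.25 * f["x"]       # stand-in for the new version
result = serve_with_shadow({"x": 2.0}, current, candidate, comparison_log)
```

The comparison log is what feeds the experiment analysis before a full rollout. A canary differs in that a small share of users actually receives the candidate's predictions.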

Common Failure Modes

optimizing average latency while ignoring tail latency
forgetting training-serving skew
no rollback or versioning strategy
monitoring only infrastructure and not model quality
underestimating feature freshness issues in online systems
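
The first failure mode is easy to show with numbers. The sketch below uses synthetic latencies (an assumption, not measured data) and a nearest-rank percentile helper.

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: 98% fast requests, 2% slow ones (e.g. cold cache).
latencies = [20.0] * 980 + [900.0] * 20
mean_ms = statistics.mean(latencies)      # 37.6
p50_ms = percentile(latencies, 50)        # 20.0
p99_ms = percentile(latencies, 99)        # 900.0
```

The mean looks comfortably under a 100 ms target while one request in fifty takes 900 ms, which is why latency targets are usually stated as percentiles.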

Edge Cases and Follow-Up Questions

What if latency and accuracy goals conflict?
What if online features arrive late or are missing?
Why can a model pass offline validation but fail in production?
What is the difference between canary, shadow, and A/B deployment?
Why is monitoring data quality as important as monitoring service uptime?

What to Practice Saying Out Loud

How you would structure a serving answer from requirements to architecture
How you would prevent training-serving skew
What you would monitor in the first week after deployment

Design Patterns

Pattern 1: Real-Time Serving

Requirements:

<100ms latency
10K requests/second
99.9% uptime
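
These requirements translate into concrete budgets with a little arithmetic. The sketch below applies Little's law and a simple availability calculation; the per-replica throughput of 500 requests/second and the 30% headroom are assumed values for illustration only.

```python
import math


def in_flight_requests(rps: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return rps * latency_s


def replicas_needed(rps: float, per_replica_rps: float, headroom: float = 0.3) -> int:
    """Replica count with spare capacity for spikes and failed instances."""
    return math.ceil(rps / (per_replica_rps * (1.0 - headroom)))


def downtime_minutes_per_30_days(availability: float) -> float:
    return (1.0 - availability) * 30 * 24 * 60


concurrency = in_flight_requests(10_000, 0.100)     # about 1000 requests in flight
replicas = replicas_needed(10_000, per_replica_rps=500)
budget = downtime_minutes_per_30_days(0.999)        # about 43.2 minutes
```

Stating numbers like these early in an answer shows which constraint dominates. It also helps to say which percentile the latency target applies to, since a p50 target and a p99 target lead to different designs.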

Architecture:

Load Balancer → API Gateway → Feature Service → Model Service → Cache
                                                      ↓
                                                 Database (for logging)

Components:

Load Balancer: Distribute traffic
API Gateway: Rate limiting, authentication
Feature Service: Get features (from store or compute)
Model Service: Run inference
Cache: Store predictions for common requests; look it up before inference so hits skip the Model Service
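
A minimal sketch of the cache component, assuming it is consulted before inference and that entries expire after a fixed time-to-live. PredictionCache and get_or_compute are hypothetical names. The sketch has no size limit or eviction, which a real cache would need.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class PredictionCache:
    """Hypothetical TTL cache placed in front of the model service."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, float]] = {}   # key -> (expiry, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: Hashable, compute: Callable[[], float]) -> float:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()                     # cache miss: run inference
        self._entries[key] = (now + self._ttl_s, value)
        return value


fake_now = [0.0]                              # injectable clock keeps the example deterministic
cache = PredictionCache(ttl_s=60.0, clock=lambda: fake_now[0])
model = lambda: 0.9                           # stand-in for a call to the Model Service

cache.get_or_compute(("user_1", "item_7"), model)   # miss
cache.get_or_compute(("user_1", "item_7"), model)   # hit
fake_now[0] = 61.0
cache.get_or_compute(("user_1", "item_7"), model)   # expired, so a miss again
```

The TTL is the freshness knob: a longer TTL raises the hit rate and lowers cost, but serves older predictions.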

Pattern 2: Batch Inference

Requirements:

Process millions of records
Run daily/weekly
Cost-efficient

Architecture:

Scheduled Job → Data Pipeline → Feature Engineering → Batch Inference → Results Storage

Tools:

Airflow (orchestration)
Spark (processing)
S3/GCS (storage)
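
The batch flow can be sketched without any of these tools to show its shape: read records in chunks, score each chunk, write the results. The function names are hypothetical, and the lambda model and list-based writer stand in for a real model and for S3/GCS output.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def chunks(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size lists so the whole dataset never sits in memory."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_batch_inference(records: Iterable[dict],
                        predict_batch: Callable[[Sequence[dict]], List[float]],
                        write: Callable[[List[dict]], None],
                        chunk_size: int = 10_000) -> int:
    """Score records chunk by chunk and hand each scored chunk to a writer."""
    total = 0
    for batch in chunks(records, chunk_size):
        scores = predict_batch(batch)
        write([{**row, "score": s} for row, s in zip(batch, scores)])
        total += len(batch)
    return total


results: List[dict] = []
n = run_batch_inference(
    ({"id": i, "x": float(i)} for i in range(25)),
    predict_batch=lambda rows: [r["x"] * 2 for r in rows],   # stand-in model
    write=results.extend,                                     # stand-in for storage writes
    chunk_size=10,
)
```

In a scheduled pipeline, an orchestrator such as Airflow would trigger this job and a processing engine such as Spark would distribute the chunks across workers.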

Pattern 3: A/B Testing Infrastructure

Requirements:

Route traffic to different models
Track metrics per variant
Statistical significance testing
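
For a binary metric such as conversion, significance testing can be sketched with a two-proportion z-test. The counts below are made up for illustration, and the test assumes independent users and samples large enough for the normal approximation.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value


z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
```

With these counts (5.0% versus 5.6% conversion) the test gives a z of about 1.89 and a p-value just under 0.06, so the apparent lift would not clear a 0.05 threshold. A 12% relative improvement on 10,000 users per arm is not automatically significant.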

Architecture:

Request → Experiment Service → Model A (50%) / Model B (50%)
                                    ↓
                            Metrics Collection → Analysis
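
The routing step in the Experiment Service is often done by hashing the user ID, so the same user always sees the same variant without any stored state. A minimal sketch, with assign_variant and the experiment name ranker_v2 as hypothetical names:

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to "A" or "B" for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64     # uniform in [0, 1)
    return "B" if bucket < treatment_share else "A"


counts = Counter(assign_variant(f"user_{i}", "ranker_v2") for i in range(10_000))
```

Including the experiment name in the hash input keeps assignments independent across experiments, so a user in variant B of one test is not systematically in variant B of the next.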

Cost Optimization

Strategies:

Caching: Cache predictions for common inputs
Batching: Process multiple requests together
Model optimization: Quantization, pruning
Right-sizing: Use appropriate instance types
Spot instances: For batch jobs
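
A back-of-the-envelope cost model shows how these strategies compound. Every number here is an assumption for illustration: the hourly price, the per-replica throughput, the cache hit rate, and the idea that model optimization doubles per-replica throughput.

```python
import math


def monthly_serving_cost(rps: float, cache_hit_rate: float,
                         per_replica_rps: float, hourly_price: float) -> float:
    """Rough monthly cost: only cache misses reach the model replicas."""
    model_rps = rps * (1.0 - cache_hit_rate)
    replicas = max(1, math.ceil(model_rps / per_replica_rps))
    return replicas * hourly_price * 24 * 30


baseline = monthly_serving_cost(10_000, 0.0, per_replica_rps=500, hourly_price=1.0)
with_cache = monthly_serving_cost(10_000, 0.5, per_replica_rps=500, hourly_price=1.0)
optimized = monthly_serving_cost(10_000, 0.5, per_replica_rps=1_000, hourly_price=1.0)
```

Under these assumptions a 50% cache hit rate halves the replica count, and doubling per-replica throughput halves it again. The model ignores the cost of the cache itself and of any accuracy lost to quantization or pruning, both of which belong in a real comparison.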

Exercises

Design a system that serves 1M predictions/second
Design a feature store
Design an A/B testing infrastructure
Optimize the serving costs of one of your designs

Next Steps

Topic 30: A/B testing and experimentation
Review all system design patterns

ML & LLM Interview Prep — Deep Dives