Topic 29: ML System Design

🔥 For interviews, read these first:

  • ML_SYSTEM_DESIGN_DEEP_DIVE.md — substantially redesigned (2025): 6-step framework + 7 fully-worked platform-scale system designs (YouTube recommender, Google search, ads ranking, Stripe-scale fraud, content moderation, LLM serving platform, semantic image search). Each covers clarification, frame, data, multi-stage architecture with diagrams, serving with latency budgets, monitoring across infra/model/business layers, failure modes, and iteration roadmap.
  • INTERVIEW_GRILL.md — 55 active-recall questions.

This document focuses on platform-scale system design ("design YouTube recommender"). For product/business case studies ("design churn prediction"), see 28_business_use_cases/.

What You'll Learn

This topic covers the open-ended "design an ML system" interview question:

  • A repeatable 6-step framework
  • ML problem framing (classification/regression/ranking/retrieval)
  • Data sources and label leakage
  • Two-stage retrieval pattern (everywhere)
  • Serving patterns (online/batch/streaming/async)
  • Latency budgets and where time goes
  • Monitoring (infra/model/business metrics)
  • Failure modes and fallback strategies

Why This Matters

Big-tech ML system design rounds are open-ended on purpose. Interviewers test whether you ask the right clarifying questions, recognize standard patterns, understand trade-offs, and design for production failure modes. The framework here makes that visible.

Next Steps

  • Topic 30: A/B testing — how you actually decide whether to ship the new system.
  • Topic 49: Generalization & evaluation — what metrics to monitor.
  • Topic 39: RAG — full system design example for retrieval-augmented LLMs.
  • Topic 63: Paged attention & LLM serving internals — deep dive on inference serving.