Topic 61: Large-Scale LLM Systems
🔥 For interviews, read these first:
- LARGE_SCALE_LLM_DEEP_DIVE.md: frontier-lab deep dive covering training memory math ( rule), activation checkpointing, BF16/FP8, ZeRO-1/2/3 / FSDP, Megatron tensor parallelism (column/row), pipeline parallelism + bubble formula, 3D parallelism layout, expert parallelism for MoE, sequence/context parallelism, MFU, and training failure modes (loss spikes, NaNs, hangs, stragglers).
- INTERVIEW_GRILL.md: 60 active-recall questions.
What You'll Learn
This topic covers the systems fluency that research-scientist interviews often probe when the conversation shifts from model ideas to scale constraints.
You will learn:
- memory breakdown in training
- why optimizer states are expensive
- gradient accumulation
- mixed precision
- activation checkpointing
- FSDP and ZeRO intuition
- data, tensor, and pipeline parallelism intuition
- throughput vs latency
- serving trade-offs
- failure modes at scale
Why This Matters
Even research interviews often push on scale:
- "Why did training OOM?"
- "How would you fit a longer context?"
- "Why is serving so expensive?"
- "What do you shard?"
The repo already covers some single-GPU and inference basics. This topic makes the multi-GPU and large-scale reasoning explicit.
Core Intuition
Large-scale LLM systems questions are usually bottleneck questions.
The interviewer is often asking some version of:
"What resource is running out first, and what lever would you pull?"
That resource might be:
- GPU memory
- training throughput
- inference latency
- communication bandwidth
- engineering reliability
The best answers break the system into components, identify the dominant bottleneck, and then explain the trade-off of each mitigation.
Files in This Topic
- large_scale_systems.md: detailed interview explanations
- systems_tradeoffs.py: small memory and throughput calculators
Technical Details Interviewers Often Want
Memory Breakdown
Training memory is not just parameters.
A useful decomposition is:
- model parameters
- gradients
- optimizer states
- activations
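For Adam-style optimizers, optimizer states can be a major memory cost: Adam keeps two extra values per parameter (first and second moment estimates), usually in FP32, and mixed-precision training commonly adds an FP32 master copy of the weights.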
This is why changing the optimizer or sharding optimizer states can have a large effect.
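As a rough illustration of the decomposition above, the sketch below estimates the non-activation training state for a hypothetical 7B-parameter model. The byte counts are assumptions, not measurements: BF16 weights and gradients (2 bytes each) and 12 bytes per parameter for Adam (FP32 master weights, first moment, second moment). The function name `training_state_bytes` is illustrative and not taken from the repo.

```python
GIB = 1024 ** 3


def training_state_bytes(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Estimate per-component training memory in bytes, excluding activations.

    Defaults assume BF16 weights/gradients and Adam with FP32 master weights
    plus two FP32 moment tensors (4 + 4 + 4 = 12 bytes per parameter).
    """
    return {
        "parameters": n_params * weight_bytes,
        "gradients": n_params * grad_bytes,
        "optimizer_states": n_params * optimizer_bytes,
    }


# Hypothetical 7B-parameter model.
breakdown = training_state_bytes(7e9)
for name, size in breakdown.items():
    print(f"{name:>16}: {size / GIB:7.1f} GiB")
print(f"{'total':>16}: {sum(breakdown.values()) / GIB:7.1f} GiB")
```

Under these assumptions the optimizer states are six times larger than the weights themselves, which is why they are the usual first target for sharding.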
Why Long Context Is Expensive
Longer context affects both memory and compute.
In attention-based models, the attention score matrix grows quadratically with sequence length whenever it is materialized, and activation storage grows with the number of tokens.
That means context-length questions are rarely just "more tokens is better."
They are trade-off questions involving:
- memory
- latency
- throughput
- model quality
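To make the memory side of that trade-off concrete, here is a minimal sketch of how the attention score matrix scales with sequence length. It assumes a naive attention implementation that materializes the full score matrix for one layer in 2-byte precision; fused kernels such as FlashAttention avoid storing this matrix, so treat the numbers as an upper-bound illustration. The head count and sequence lengths are hypothetical.

```python
def attention_scores_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes needed to hold one layer's (seq_len x seq_len) score matrices."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


# Hypothetical configuration: batch of 1, 32 attention heads.
for seq_len in (2048, 8192, 32768):
    gib = attention_scores_bytes(batch=1, heads=32, seq_len=seq_len) / 1024 ** 3
    print(f"seq_len={seq_len:>6}: {gib:8.2f} GiB per layer")
```

Doubling the sequence length quadruples this term, while the number of tokens (and so most other activations) only doubles.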
Mixed Precision and Checkpointing
Mixed precision reduces memory and can improve throughput, but lower-precision formats make training more prone to numerical problems such as overflow, underflow, and loss of precision.
Activation checkpointing reduces activation memory by recomputing parts of the forward pass during backpropagation.
That means it trades memory for extra compute.
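The memory-for-compute trade can be sketched with a simple cost model. The assumptions here are deliberately crude: every layer boundary is checkpointed, the backward pass costs about twice the forward pass, and recomputation adds roughly one extra forward pass. The function name and the example sizes are hypothetical.

```python
def checkpointing_tradeoff(n_layers, full_act_per_layer, boundary_act_per_layer,
                           backward_to_forward_ratio=2.0):
    """Compare activation memory and compute with and without checkpointing.

    Memory units are arbitrary (e.g. GiB). Compute is measured in units of
    one forward pass.
    """
    baseline_mem = n_layers * full_act_per_layer
    # Keep only each layer's input, plus one layer's full activations
    # while that layer is being recomputed during the backward pass.
    ckpt_mem = n_layers * boundary_act_per_layer + full_act_per_layer

    baseline_compute = 1.0 + backward_to_forward_ratio
    ckpt_compute = baseline_compute + 1.0  # one extra forward for recomputation

    return {
        "memory_saved_fraction": 1.0 - ckpt_mem / baseline_mem,
        "compute_overhead_fraction": ckpt_compute / baseline_compute - 1.0,
    }


# Hypothetical model: 32 layers, 16 units of activations per layer,
# of which 1 unit is the layer input that must be kept.
result = checkpointing_tradeoff(32, 16.0, 1.0)
print(f"activation memory saved: {result['memory_saved_fraction']:.1%}")
print(f"extra compute:           {result['compute_overhead_fraction']:.1%}")
```

In this model the compute overhead is fixed at about one third, so checkpointing is attractive when activations dominate memory and unattractive when the job is already compute-bound.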
FSDP and ZeRO Intuition
You do not need every implementation detail in an interview, but you should know the high-level purpose:
- shard parameters, gradients, and/or optimizer states across devices
- reduce per-device memory footprint
- accept communication overhead as a trade-off
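The effect of each sharding level on per-device memory can be sketched as follows. This assumes the common mixed-precision Adam accounting of 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer states, and it ignores activations, buffers, and fragmentation. Stage 0 here means plain data parallelism with everything replicated; the function name is illustrative.

```python
def zero_bytes_per_device(n_params, n_devices, stage):
    """Per-device bytes for weights + gradients + optimizer states.

    stage 0: nothing sharded (plain data parallel)
    stage 1: optimizer states sharded
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * n_params
    grads = 2 * n_params
    opt = 12 * n_params
    if stage >= 1:
        opt /= n_devices
    if stage >= 2:
        grads /= n_devices
    if stage >= 3:
        params /= n_devices
    return params + grads + opt


# Hypothetical 1B-parameter model on 8 devices.
for stage in range(4):
    gb = zero_bytes_per_device(1e9, 8, stage) / 1e9
    print(f"stage {stage}: {gb:5.2f} GB per device")
```

Each stage removes more replicated state, but the later stages also require gathering parameters or gradients over the network during the step, which is the communication cost named in the list above.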
Throughput vs Latency
This distinction is frequently misunderstood.
Throughput asks how much total work the system can do over time.
Latency asks how long one request takes.
Batching often helps throughput but can hurt latency.
That is a classic interview trade-off.
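A toy cost model shows why batching pulls the two metrics in opposite directions. It assumes each step has a fixed overhead plus a per-request cost that grows linearly with batch size; both constants are made up for illustration, and real serving systems are not this linear.

```python
def batch_metrics(batch_size, fixed_ms=20.0, per_request_ms=2.0):
    """Return (latency in ms, throughput in requests/s) for one batched step.

    Assumes every request in the batch waits for the whole step to finish.
    """
    step_ms = fixed_ms + per_request_ms * batch_size
    throughput = batch_size / (step_ms / 1000.0)
    return step_ms, throughput


for batch_size in (1, 4, 16, 64):
    latency_ms, rps = batch_metrics(batch_size)
    print(f"batch={batch_size:>3}: latency={latency_ms:6.1f} ms, "
          f"throughput={rps:7.1f} req/s")
```

Larger batches amortize the fixed overhead, so throughput rises, but every request in the batch now waits for the longer step, so per-request latency rises too.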
Common Failure Modes
1. Treating OOM as a Single Problem
Out-of-memory errors can come from different sources:
- activations too large
- optimizer states too large
- sequence length too large
- microbatch too large
The fix depends on which component dominates.
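One way to practice this is to write the breakdown down and read off the largest term before choosing a fix. The sketch below does that for a hypothetical set of per-component sizes; the numbers are invented and the helper name is illustrative.

```python
def dominant_component(breakdown):
    """Return (name, share_of_total) for the largest memory component."""
    name = max(breakdown, key=breakdown.get)
    return name, breakdown[name] / sum(breakdown.values())


# Hypothetical per-device memory breakdown in GiB.
memory_gib = {
    "parameters": 14.0,
    "gradients": 14.0,
    "optimizer_states": 84.0,
    "activations": 40.0,
}

name, share = dominant_component(memory_gib)
print(f"dominant component: {name} ({share:.0%} of total)")
```

If optimizer states dominate, sharding them is the natural first lever; if activations dominate, a smaller microbatch, shorter sequences, or activation checkpointing is more relevant.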
2. Naming Parallelism Without Explaining the Trade-Off
Saying "use FSDP" or "use tensor parallelism" is incomplete unless you also say what cost you are paying, usually communication or implementation complexity.
3. Ignoring Inference Constraints
A model can look good in training discussion and still be impractical at serving time because of KV-cache growth, latency, or hardware cost.
4. Confusing Throughput Improvement with Better User Experience
Higher throughput does not automatically mean lower latency for an individual request.
Edge Cases and Follow-Up Questions
What if training fits but inference is still too expensive?
Then the bottleneck has shifted.
You may need to discuss:
- KV-cache memory
- batching policy
- quantization
- speculative decoding
- serving architecture
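For the first item, a back-of-the-envelope KV-cache estimate is often enough to show where serving memory goes. The sketch assumes a standard decoder that stores one key and one value vector per layer, per KV head, per token, in 2-byte precision. The model shape below is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes of KV cache: keys and values for every layer, head, and token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
for batch_size in (1, 8, 32):
    size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=batch_size)
    print(f"batch={batch_size:>2}: {size / 1024 ** 3:6.1f} GiB")
```

The cache grows linearly with both context length and batch size, so techniques that reduce the number of KV heads or the bytes per element shrink it proportionally.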
What if gradient checkpointing slows training too much?
Then it may be the wrong lever if compute, not memory, is already the limiting resource.
What if communication dominates after sharding?
Then additional sharding may no longer help.
This is why scaling techniques must be evaluated in the context of interconnect speed and cluster topology.
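A bandwidth-only estimate of a ring all-reduce makes the interconnect dependence visible. The sketch uses the standard ring cost of 2(N-1)/N times the payload per device and ignores per-message latency, overlap with compute, and topology effects. The payload and bandwidth figures are hypothetical.

```python
def ring_allreduce_seconds(payload_bytes, n_devices, bandwidth_bytes_per_s):
    """Bandwidth-only time estimate for a ring all-reduce."""
    if n_devices < 2:
        return 0.0  # nothing to communicate on a single device
    traffic_factor = 2 * (n_devices - 1) / n_devices
    return traffic_factor * payload_bytes / bandwidth_bytes_per_s


# Hypothetical: 14 GB of BF16 gradients across 8 devices.
payload = 14e9
for label, bandwidth in (("fast link, 300 GB/s", 300e9),
                         ("slow link, 25 GB/s", 25e9)):
    seconds = ring_allreduce_seconds(payload, 8, bandwidth)
    print(f"{label}: {seconds:.3f} s per all-reduce")
```

When this time approaches or exceeds the compute time of a step and cannot be overlapped, adding more devices or more sharding stops paying off.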
What if the user asks for lower latency and higher throughput at the same time?
You should explain that those goals can conflict and that the right solution depends on workload shape and batching strategy.
Core Interview Pattern
When asked a scale question:
- State the bottleneck.
- Break memory or latency into components.
- Name the levers.
- Explain the trade-off of each lever.
- Pick the best first action for the stated constraint.
That structure makes your answer sound practical instead of vague.
What to Practice Saying Out Loud
- What component is dominating memory here: parameters, activations, gradients, or optimizer state?
- Why does longer context increase both quality potential and systems cost?
- What does checkpointing save, and what does it cost?
- Why can sharding solve memory while worsening communication overhead?
- What is the difference between improving throughput and improving latency?
Suggested Use
For deeper follow-up on frontier methodology and serving-engine internals, continue to: