Large-Scale LLM Systems
1. Training Memory Breakdown
A useful way to break down GPU memory during training:
- model weights
- gradients
- optimizer states
- activations
- temporary buffers / fragmentation
For Adam-style training, per-parameter state is expensive because you usually keep four tensors the size of the model (the two moments are the optimizer state proper):
- parameters
- gradients
- first moment
- second moment
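That is why training memory can be much larger than the parameter count suggests: with everything in fp32, these four tensors alone take about 4x the weight memory, before activations.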
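As a rough sketch of that arithmetic, assuming every tensor is stored in fp32 (4 bytes per value) and ignoring activations and temporary buffers, the four per-parameter tensors can be tallied as below. The function name and the 7-billion-parameter model are illustrative assumptions, not taken from any framework.

```python
def adam_training_bytes(num_params, bytes_per_value=4):
    """Estimate bytes for weights, gradients, and both Adam moments.

    Assumes every tensor uses the same precision; activations and
    temporary buffers are deliberately left out.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    moments = 2 * num_params * bytes_per_value  # first + second moment
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_moments": moments,
        "total": weights + gradients + moments,
    }


estimate = adam_training_bytes(7_000_000_000)  # hypothetical 7B model
for name, size in estimate.items():
    print(f"{name}: {size / 1e9:.0f} GB")
```

Under these assumptions the total is four times the weight memory, which is the number worth quoting before activations enter the picture.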
2. What to Do When You OOM
Common levers:
- reduce batch size
- use gradient accumulation
- use mixed precision
- use activation checkpointing
- shorten sequence length
- shard optimizer states and parameters
3. Gradient Accumulation
Useful explanation:
"Gradient accumulation lets me simulate a larger effective batch size without holding all microbatches in memory at once. The microbatches run one after another, so each optimizer step takes more wall-clock time, but instantaneous memory pressure goes down."
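A minimal sketch of why this works, using a toy one-parameter model (y = w * x) in plain Python rather than a real framework. It assumes equal-sized microbatches; all names here are hypothetical. Averaging the microbatch gradients reproduces the full-batch gradient while only one microbatch is processed at a time.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average per-microbatch gradients, one microbatch at a time.

    Assumes len(xs) is a multiple of micro_batch_size, so every
    microbatch carries equal weight in the average.
    """
    total = 0.0
    num_micro = len(xs) // micro_batch_size
    for i in range(num_micro):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Dividing by num_micro mirrors scaling the loss before backward.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

The division by the number of microbatches is the detail people most often forget: without it, the accumulated gradient is a sum rather than a mean, and the effective learning rate changes.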
4. Activation Checkpointing
Useful explanation:
"Checkpointing saves memory by not storing every intermediate activation. Instead, some activations are recomputed during the backward pass, so I trade extra compute for lower memory."
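The trade can be illustrated with a toy chain of scalar tanh layers in plain Python. This is a sketch under simplifying assumptions (scalar activations, fixed-size segments, hypothetical function names), not how any particular framework implements it. One version stores every layer input; the other keeps only the input to each segment and rebuilds the rest during backward.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def backward_store_all(weights, x0):
    """Keep every layer input; return (d_out/d_x0, inputs stored)."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)


def backward_checkpointed(weights, x0, segment):
    """Keep only each segment's input; recompute the rest in backward.

    Returns (d_out/d_x0, peak inputs held, extra forward calls).
    """
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, peak, recomputed = 1.0, len(checkpoints), 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):  # rebuild the dropped inputs
            inputs.append(layer(weights[i], inputs[-1]))
            recomputed += 1
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grad *= layer_grad(weights[i], inputs[i - start])
    return grad, peak, recomputed


weights = [0.9 + 0.01 * i for i in range(16)]
full_grad, full_stored = backward_store_all(weights, 0.3)
ckpt_grad, ckpt_peak, extra_calls = backward_checkpointed(weights, 0.3, segment=4)
print(full_stored, ckpt_peak, extra_calls)
```

With 16 layers and segments of 4, the checkpointed version holds at most 7 layer inputs instead of 16, at the price of 12 extra forward calls, and both versions produce the same gradient.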
5. Mixed Precision
Useful explanation:
"Mixed precision reduces memory and often improves throughput, but it can introduce numerical instability if scaling and sensitive operations are not handled carefully."
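One concrete instability is gradient underflow in fp16, which loss scaling is meant to prevent. The sketch below uses only the Python standard library (the `struct` module's half-precision format) to round values to fp16. The gradient value and the static scale of 2**16 are assumptions chosen for illustration.

```python
import math
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8       # hypothetical gradient, below fp16's smallest subnormal
loss_scale = 65536.0   # assumed static loss scale (2**16)

unscaled = to_fp16(tiny_grad)              # underflows to 0.0
scaled = to_fp16(tiny_grad * loss_scale)   # large enough to survive in fp16
recovered = scaled / loss_scale            # unscale in higher precision
print(unscaled, recovered)
```

Without scaling, the gradient is silently flushed to zero. With scaling, it survives the fp16 round trip and is recovered to within fp16 rounding error once it is unscaled in higher precision.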
6. FSDP / ZeRO Intuition
FSDP
Shard model parameters, gradients, and optimizer state across devices, gathering full parameters only for the layers currently being computed, so no single GPU holds a full copy all the time.
ZeRO
Partition optimizer state (stage 1), then gradients as well (stage 2), then parameters as well (stage 3) across ranks to reduce redundant memory replication.
Easy interview phrasing:
"The main idea is to avoid every GPU holding a full copy of everything."
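To put rough numbers on that idea, the sketch below estimates per-GPU memory for weights, gradients, and Adam moments under different sharding choices. It assumes fp32 everywhere, perfectly even sharding, and no activations or communication buffers. The function and its flags are hypothetical, not a real FSDP or ZeRO API.

```python
def per_gpu_bytes(num_params, world_size, shard_optimizer=False,
                  shard_gradients=False, shard_parameters=False,
                  bytes_per_value=4):
    """Rough per-GPU bytes for weights, gradients, and Adam moments.

    Anything not sharded is assumed to be fully replicated on each GPU.
    """
    def share(total, sharded):
        return total / world_size if sharded else total

    weights = share(num_params * bytes_per_value, shard_parameters)
    gradients = share(num_params * bytes_per_value, shard_gradients)
    moments = share(2 * num_params * bytes_per_value, shard_optimizer)
    return weights + gradients + moments


params, gpus = 7_000_000_000, 8  # hypothetical model and cluster size
replicated = per_gpu_bytes(params, gpus)
optimizer_only = per_gpu_bytes(params, gpus, shard_optimizer=True)
fully_sharded = per_gpu_bytes(params, gpus, True, True, True)
for label, size in [("replicated", replicated),
                    ("optimizer sharded", optimizer_only),
                    ("fully sharded", fully_sharded)]:
    print(f"{label}: {size / 1e9:.1f} GB per GPU")
```

In this simplified model, sharding everything divides the per-GPU footprint by the number of GPUs. Real systems pay for that with extra communication and with temporary memory for gathered parameters.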
7. Parallelism Types
Data Parallelism
Each GPU gets different data, same model.
Good when:
- model fits on each device
- you want higher throughput
Tensor Parallelism
Split the weight tensors inside a layer across devices, so each device computes part of that layer's output.
Good when:
- a single layer is too large for one device
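A minimal sketch of the tensor-parallel idea in plain Python, assuming a column-wise split of one linear layer's weight matrix across two simulated devices. The helper names are hypothetical, and the "devices" are just list slices. Each shard computes a slice of the output, and concatenating the slices recovers the full result.

```python
def matmul(x, w):
    """Multiply x (batch x in) by w (in x out), both as lists of rows."""
    return [[sum(xi * w[k][j] for k, xi in enumerate(row))
             for j in range(len(w[0]))]
            for row in x]


def split_columns(w, parts):
    """Split w column-wise into equal shards (assumes even division)."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w]
            for p in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

shards = split_columns(w, parts=2)
partials = [matmul(x, shard) for shard in shards]  # one per simulated device
# Concatenate each row's partial outputs, standing in for an all-gather.
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]
print(combined)
```

Each simulated device only ever holds half of the weight matrix, which is the point when a single layer does not fit on one device. The concatenation step is where real systems pay a communication cost.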
Pipeline Parallelism
Split layers into stages across devices.
Good when:
- the model has many layers and does not fit on one device
- you can tolerate pipeline scheduling complexity
8. Long Context Costs
Longer context usually increases:
- activation memory
- attention memory
- latency
Useful answer:
"If context length doubles, attention cost grows more than linearly. In vanilla full attention, compute and score-matrix memory grow quadratically with sequence length, so doubling the context roughly quadruples them."
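The quadratic term is easy to check with back-of-the-envelope arithmetic. The sketch below counts bytes for the seq_len x seq_len score matrices of one layer and one sequence, assuming vanilla attention that materializes the full matrix in 2-byte precision. The head count and sequence lengths are illustrative assumptions.

```python
def attention_score_bytes(seq_len, num_heads, bytes_per_value=2):
    """Bytes for the full score matrices of one layer, one sequence."""
    return num_heads * seq_len * seq_len * bytes_per_value


short = attention_score_bytes(4096, num_heads=32)
long = attention_score_bytes(8192, num_heads=32)
print(f"4k context: {short / 2**30:.0f} GiB, 8k context: {long / 2**30:.0f} GiB")
```

Doubling the sequence length from 4096 to 8192 quadruples this term, which is the scaling the answer above describes.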
9. Serving Trade-Offs
At serving time, common trade-offs are:
- latency vs throughput
- batch size vs tail latency
- model size vs cost
- quantization vs accuracy
- KV cache size vs memory
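For the cache trade-off, assuming the cache in question is a transformer key/value cache, a rough size estimate looks like the sketch below. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_value=2):
    """Bytes for cached keys and values across all layers."""
    # The leading 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


one_sequence = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
batch_of_16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)
print(f"{one_sequence / 2**30:.0f} GiB per sequence, "
      f"{batch_of_16 / 2**30:.0f} GiB for a batch of 16")
```

Because the cache grows linearly with both batch size and sequence length, raising the batch size for throughput competes directly with memory available for longer contexts.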
10. Failure Modes at Scale
Things that often break:
- OOM from activations
- NCCL or communication bottlenecks
- rank desynchronization
- mixed precision instability
- checkpoint corruption
- data pipeline starvation
11. What Interviewers Often Want
Usually they do not need deep framework-specific commands.
They want to hear:
- that you know the bottleneck
- that you know the available levers
- that you understand the trade-off of each lever