Topic 63: Paged Attention and LLM Serving Internals
🔥 For interviews, read these first:
paged_attention_deep_dive.md— descriptive deep dive: serving vs training bottlenecks, KV-cache fragmentation, paging analogy, block tables, prefix sharing, continuous batching, prefill/decode disaggregation.INTERVIEW_GRILL.md— 60 active-recall questions covering KV-cache math (GQA/MQA/MLA savings), PagedAttention internals, continuous batching, prefix caching / RadixAttention, speculative decoding, INT8/INT4/FP8 quantization, vLLM/SGLang/TensorRT-LLM differentiation.
What You'll Learn
This topic goes deeper on the serving internals that matter in strong LLM systems interviews.
You will learn:
- why serving is often memory-bound rather than compute-bound
- how KV-cache growth drives inference cost
- why naive KV allocation causes waste and fragmentation
- how paged attention organizes KV memory into blocks
- how block tables, prefix sharing, and copy-on-write work
- why continuous batching changes serving behavior
- how speculative decoding and paged KV caching fit together
Why This Matters
Interviewers increasingly ask systems questions like:
- "Why is LLM serving expensive?"
- "What is PagedAttention actually fixing?"
- "Why is KV cache such a big bottleneck?"
- "How does vLLM get better throughput?"
Good answers require more than saying "use KV cache" or "batch requests."
They require you to explain the memory model.
Core Intuition
During autoregressive generation, the model keeps producing new keys and values for every token at every layer.
Those keys and values must stay available because future tokens attend to them.
So the serving problem becomes:
"How do we keep large growing per-request memory around without wasting too much space or stalling the GPU?"
Naive serving often reserves memory as if each request will use one large contiguous buffer.
That causes two problems:
- over-reservation: you hold space a request may never use
- fragmentation: free space exists, but not in the right contiguous shape
Paged attention solves this by making KV cache look more like virtual memory:
- break KV storage into fixed-size blocks
- let each request own a logical sequence of blocks
- translate logical positions to physical blocks with a block table
The attention math stays the same.
What changes is how memory is laid out and addressed.
Technical Details Interviewers Often Want
Why KV Cache Dominates Serving
At decode time, compute per new token may be moderate, but memory traffic is large because the model has to read all prior keys and values for the current sequence.
KV-cache cost grows with:
- number of layers
- number of KV heads
- head dimension
- sequence length
- batch size
That is why inference questions often become memory-bandwidth questions.
Fragmentation in Naive Allocation
Suppose each request gets a large contiguous cache reservation for its worst-case length.
Then:
- short requests waste space
- completed requests leave holes
- new requests may not fit even when total free memory looks large
This is classic fragmentation.
Paged KV caching reduces that waste because requests can grow block by block instead of reserving one giant contiguous region.
Block Tables
Each request keeps a mapping from logical token positions to physical KV blocks.
That means:
- the sequence is logically continuous
- the underlying memory can be physically non-contiguous
The key interview point is:
"PagedAttention changes the storage layout, not the semantics of attention."
Prefix Sharing and Copy-on-Write
If multiple requests share the same prefix, they can often share the same KV blocks for that prefix.
That is powerful for:
- branching generations
- beam search
- repeated system prompts
Copy-on-write means the shared blocks stay shared until one branch needs to modify or extend in a way that requires new ownership.
Continuous Batching
Traditional static batching waits for a batch to finish before admitting new work.
Continuous batching admits and retires requests dynamically.
That improves utilization because:
- finished sequences stop consuming decode slots
- shorter requests do not block the entire batch
- the server keeps the GPU busy with a changing set of active requests
The trade-off is that scheduling logic becomes more complex.
How This Relates to Speculative Decoding
Speculative decoding attacks latency by proposing multiple future tokens cheaply and verifying them.
Paged KV caching attacks memory waste and scheduling inefficiency.
They solve different bottlenecks and can complement each other.
Common Failure Modes
1. Explaining KV Cache but Not Memory Layout
That is enough for an entry-level answer, but stronger systems interviews usually want the fragmentation and allocator story too.
2. Treating Paging as a Change to the Attention Equation
Paged attention does not change the mathematical objective of attention.
It changes how KV memory is stored and gathered efficiently.
3. Ignoring Memory Bandwidth
Candidates sometimes talk only about FLOPs.
Serving often bottlenecks on moving KV data around, not just on raw arithmetic.
4. Claiming Throughput Gains Without Mentioning Scheduling
High throughput in a real server depends on:
- batching policy
- request length distribution
- memory allocator behavior
- prefix reuse
5. Forgetting GQA and MQA Effects
Serving cost is strongly affected by the number of KV heads.
That is why MQA and GQA matter not only for architecture, but also for serving efficiency.
Edge Cases and Follow-Up Questions
What if paged allocation reduces fragmentation but block lookup adds overhead?
Then the real question is whether the memory-efficiency gain outweighs the lookup and gather overhead on the target hardware and workload.
What if prefixes are rarely shared?
Then prefix sharing gives little benefit, but paged KV management can still help by reducing fragmentation and improving allocator flexibility.
What if request lengths are very uniform?
Then the scheduling and fragmentation benefit of paging may be smaller than in highly variable workloads.
What if latency matters more than throughput?
Then aggressive batching may not be the right answer, even if it improves device utilization.
What if GQA reduces KV memory but hurts quality?
Then you must discuss the trade-off honestly: better serving efficiency may come at some representational cost.
Boilerplate Code
See:
The Python file contains small interview-friendly helpers for:
- KV-cache memory estimation
- naive contiguous allocation waste estimates
- paged block usage
- a toy continuous-batching scheduler
The goal is not to reimplement vLLM.
The goal is to make the core ideas mechanically clear.
For a longer descriptive explanation of the serving logic, allocator trade-offs, and why paging helps, read paged_attention_deep_dive.md.
What to Practice Saying Out Loud
- Why is LLM serving often memory-bound even when training is compute-heavy?
- What problem does paged attention solve that plain KV caching does not?
- Why does fragmentation matter in a serving engine?
- How does continuous batching improve throughput?
- Why do GQA and MQA matter for KV-cache cost?
- What does prefix sharing buy you, and when might it not help much?
Suggested Use
Use this topic after: