Topic 63: Paged Attention and LLM Serving Internals

🔥 For interviews, read these first:

paged_attention_deep_dive.md — descriptive deep dive: serving vs training bottlenecks, KV-cache fragmentation, paging analogy, block tables, prefix sharing, continuous batching, prefill/decode disaggregation.

INTERVIEW_GRILL.md — 60 active-recall questions covering KV-cache math (GQA/MQA/MLA savings), PagedAttention internals, continuous batching, prefix caching / RadixAttention, speculative decoding, INT8/INT4/FP8 quantization, vLLM/SGLang/TensorRT-LLM differentiation.

What You'll Learn

This topic goes deeper on the serving internals that matter in strong LLM systems interviews.

You will learn:

why serving is often memory-bound rather than compute-bound
how KV-cache growth drives inference cost
why naive KV allocation causes waste and fragmentation
how paged attention organizes KV memory into blocks
how block tables, prefix sharing, and copy-on-write work
why continuous batching changes serving behavior
how speculative decoding and paged KV caching fit together

Why This Matters

Interviewers increasingly ask systems questions like:

"Why is LLM serving expensive?"
"What is PagedAttention actually fixing?"
"Why is KV cache such a big bottleneck?"
"How does vLLM get better throughput?"

Good answers require more than saying "use KV cache" or "batch requests."

They require you to explain the memory model.

Core Intuition

During autoregressive generation, the model keeps producing new keys and values for every token at every layer.

Those keys and values must stay available because future tokens attend to them.

So the serving problem becomes:

"How do we keep large growing per-request memory around without wasting too much space or stalling the GPU?"

Naive serving often reserves memory as if each request will use one large contiguous buffer.

That causes two problems:

over-reservation: you hold space a request may never use
fragmentation: free space exists, but not in the right contiguous shape

Paged attention solves this by making KV cache look more like virtual memory:

break KV storage into fixed-size blocks
let each request own a logical sequence of blocks
translate logical positions to physical blocks with a block table

The attention math stays the same.

What changes is how memory is laid out and addressed.

Technical Details Interviewers Often Want

Why KV Cache Dominates Serving

At decode time, compute per new token may be moderate, but memory traffic is large because the model has to read all prior keys and values for the current sequence.

KV-cache cost grows with:

number of layers
number of KV heads
head dimension
sequence length
batch size

That is why inference questions often become memory-bandwidth questions.

Fragmentation in Naive Allocation

Suppose each request gets a large contiguous cache reservation for its worst-case length.

Then:

short requests waste space
completed requests leave holes
new requests may not fit even when total free memory looks large

This is classic fragmentation.

Paged KV caching reduces that waste because requests can grow block by block instead of reserving one giant contiguous region.

Block Tables

Each request keeps a mapping from logical token positions to physical KV blocks.

That means:

the sequence is logically continuous
the underlying memory can be physically non-contiguous

The key interview point is:

"PagedAttention changes the storage layout, not the semantics of attention."

If multiple requests share the same prefix, they can often share the same KV blocks for that prefix.

That is powerful for:

branching generations
beam search
repeated system prompts

Copy-on-write means the shared blocks stay shared until one branch needs to modify or extend in a way that requires new ownership.

Continuous Batching

Traditional static batching waits for a batch to finish before admitting new work.

Continuous batching admits and retires requests dynamically.

That improves utilization because:

finished sequences stop consuming decode slots
shorter requests do not block the entire batch
the server keeps the GPU busy with a changing set of active requests

The trade-off is that scheduling logic becomes more complex.

How This Relates to Speculative Decoding

Speculative decoding attacks latency by proposing multiple future tokens cheaply and verifying them.

Paged KV caching attacks memory waste and scheduling inefficiency.

They solve different bottlenecks and can complement each other.

Common Failure Modes

1. Explaining KV Cache but Not Memory Layout

That is enough for an entry-level answer, but stronger systems interviews usually want the fragmentation and allocator story too.

2. Treating Paging as a Change to the Attention Equation

Paged attention does not change the mathematical objective of attention.

It changes how KV memory is stored and gathered efficiently.

3. Ignoring Memory Bandwidth

Candidates sometimes talk only about FLOPs.

Serving often bottlenecks on moving KV data around, not just on raw arithmetic.

4. Claiming Throughput Gains Without Mentioning Scheduling

High throughput in a real server depends on:

batching policy
request length distribution
memory allocator behavior
prefix reuse

KV-cache memory estimation
naive contiguous allocation waste estimates
paged block usage
a toy continuous-batching scheduler

The goal is not to reimplement vLLM.

The goal is to make the core ideas mechanically clear.

For a longer descriptive explanation of the serving logic, allocator trade-offs, and why paging helps, read paged_attention_deep_dive.md.

What to Practice Saying Out Loud

Why is LLM serving often memory-bound even when training is compute-heavy?
What problem does paged attention solve that plain KV caching does not?
Why does fragmentation matter in a serving engine?
How does continuous batching improve throughput?
Why do GQA and MQA matter for KV-cache cost?
What does prefix sharing buy you, and when might it not help much?

Suggested Use

Use this topic after:

ML & LLM Interview Prep — Deep Dives

Topic 63: Paged Attention and LLM Serving Internals

What You'll Learn

Why This Matters

Core Intuition

Technical Details Interviewers Often Want

Why KV Cache Dominates Serving

Fragmentation in Naive Allocation

Block Tables

Continuous Batching

How This Relates to Speculative Decoding

Common Failure Modes

1. Explaining KV Cache but Not Memory Layout

2. Treating Paging as a Change to the Attention Equation

3. Ignoring Memory Bandwidth

4. Claiming Throughput Gains Without Mentioning Scheduling

5. Forgetting GQA and MQA Effects

Edge Cases and Follow-Up Questions

What if paged allocation reduces fragmentation but block lookup adds overhead?

What if prefixes are rarely shared?

What if request lengths are very uniform?

What if latency matters more than throughput?

What if GQA reduces KV memory but hurts quality?

Boilerplate Code

What to Practice Saying Out Loud

Suggested Use