Paged Attention Deep Dive

This note is meant to explain paged attention and modern LLM serving in descriptive terms rather than only as a short interview summary.

1. Why LLM Serving Feels Different From Training

Training and serving are often discussed together because both use the same model weights.

But they are bottlenecked by different things.

During training:

you run forward and backward passes
you process many tokens in parallel
you pay heavy compute and activation-memory costs

During autoregressive serving:

generation is sequential
each new token depends on the previous context
you repeatedly read historical state
latency matters directly to the user

That changes the optimization problem.

Training often asks:

"How do I maximize throughput for huge batches while staying numerically stable?"

Serving often asks:

"How do I keep per-request latency acceptable while fitting many active requests into limited fast memory?"

That is why a technique that helps training may not help serving, and vice versa.

2. Why KV Cache Exists in the First Place

Without KV caching, every decode step would recompute the full prefix through the transformer stack.

If the prompt is 1000 tokens long and you want 100 output tokens, the model would keep reprocessing almost the same prefix over and over.

That is wasteful because the old keys and values for previous tokens do not change.

KV caching stores them so that future decode steps only need to compute:

the new query
the new key
the new value

and then attend over cached history.

This dramatically cuts repeated computation.

So KV cache is the first major serving optimization.

But it creates a new problem:

you now have a growing memory object attached to every active request.

3. Why KV Cache Becomes the Serving Bottleneck

The cache grows with:

sequence length
number of layers
number of KV heads
head dimension
batch size

So even when each decode step is only one token at a time, the model may need to read a large amount of historical KV state to produce that token.

That means serving can become limited by:

VRAM capacity
memory bandwidth
cache layout efficiency

This is why a model can be computationally feasible and still be expensive to serve.

4. Why Naive KV Allocation Wastes So Much Memory

Imagine a server that gives each request one large contiguous KV buffer sized for a pessimistic maximum length.

That sounds simple, but it wastes memory in several ways.

Over-Reservation

Many requests finish early or never use the full reserved space.

So the server holds memory for tokens that never arrive.

External Fragmentation

When requests finish, they leave holes in memory.

You may have enough total free memory, but not enough in a large contiguous region for the next request.

Variable-Length Workloads

Real serving traffic is messy:

some requests are very short
some continue for a long time
some share prefixes
some branch into multiple generations

A rigid contiguous allocator handles this badly.

This is why the memory problem is not just "KV cache is large."

It is also "KV cache is dynamic and irregular."

5. Why Paging Is the Right Analogy

Operating systems faced a similar problem long ago.

Programs wanted the illusion of large contiguous memory, but physical memory was limited and fragmented.

The solution was virtual memory:

split logical memory into fixed-size pages
map them onto physical page frames
keep a table that translates logical to physical locations

Paged KV caching borrows exactly this idea.

Instead of giving a request one giant contiguous KV region, the serving engine gives it:

a sequence of fixed-size KV blocks
a block table that maps logical order to physical blocks

This preserves the logical sequence while relaxing the physical layout.

That is the key idea.

6. What a Block Table Is Doing

A block table is conceptually simple.

For a given request, it answers:

"If I need the KV data for logical block 0, 1, 2, ... where do those blocks actually live in memory?"

This means the request can grow incrementally:

first one block
then another
then another

without requiring one giant contiguous region.

That is what makes memory reuse practical.

7. What Paging Fixes and What It Does Not

Paging mainly fixes:

allocator waste
external fragmentation
rigid growth behavior

It does not magically make KV cache small.

The underlying historical information is still there.

So a strong answer should say:

"Paged attention improves memory efficiency and serving utilization, but it does not eliminate the fundamental cost of carrying long context."

That distinction matters.

8. Internal Fragmentation Still Exists

Paging does not remove all waste.

If a request uses 3001 tokens and the block size is 128, the last block will not be full.

So there is still some unused capacity in the tail block.

But this waste is bounded.

Instead of wasting an arbitrarily large contiguous reserve, you waste at most roughly one block per request.

That is a much better trade.

This also creates the block-size trade-off:

larger blocks mean fewer lookups and simpler traversal
smaller blocks mean tighter packing and less tail waste

That is a real systems trade-off, not a theoretical curiosity.

9. Why Paged Attention Is Not a New Attention Formula

This is a very common misunderstanding.

Paged attention does not invent a new probabilistic attention rule.

The model is still attending over the same logical sequence.

What changes is how the kernel gathers the keys and values:

they may be physically scattered
the kernel uses the block table to traverse them in logical order

So the math remains equivalent to standard attention over the same history.

The implementation and memory layout are what change.

Many requests share the same prefix:

the same system prompt
the same conversation stem
the same beam-search history

If each request stored that prefix separately, memory use would explode.

Prefix sharing allows multiple requests to point to the same KV blocks for the common prefix.

That is why block metadata often includes reference counts or similar ownership tracking.

This is especially powerful in:

beam search
branching agent trajectories
repeated enterprise prompts

11. Why Copy-on-Write Matters

Shared blocks are safe only while requests are reading the same prefix.

The moment one request diverges and needs new continuation state, it must stop writing into the shared block.

So the engine allocates fresh blocks for the diverging continuation.

That is copy-on-write at block granularity.

The reason this matters in interviews is that it shows you understand how sharing and mutation coexist safely.

12. Why Continuous Batching Changes Throughput

Static batching is easy to reason about but wasteful for variable-length decoding.

If one request in a batch finishes early, static batching may leave device capacity underused until the whole batch retires.

Continuous batching fixes this by:

retiring completed requests immediately
admitting new requests when capacity opens
maintaining a rolling active set

That usually improves utilization and throughput.

But it also makes the scheduler more important.

This is why a good answer does not say only "continuous batching is better."

It says:

"Continuous batching improves utilization in heterogeneous workloads, but it introduces scheduling complexity and may not be optimal when ultra-low latency per request is the main objective."

13. Why GQA and MQA Matter So Much for Serving

The number of KV heads directly affects KV-cache size.

That means an architecture choice made during model design has a first-order impact on inference cost.

This is one of the best examples of a bridge between model architecture and systems behavior.

If you want to sound strong in interview loops, explicitly connect them:

"GQA helps because it reduces KV-cache footprint and memory bandwidth compared with MHA, while usually preserving more quality than full MQA."

That sentence touches model design, serving, and trade-off reasoning at once.

14. Why These Tricks Do Not Automatically Transfer to Other Workloads

Paged KV caching is powerful because LLM serving has three specific properties:

long-lived per-request memory
variable-length dynamic allocation
a memory-bound decode loop

If another workload is mostly compute-bound or has static tensor shapes, paging-like indirection may add overhead without enough payoff.

This is an important nuance because mature engineers know not to generalize every optimization beyond its natural workload.

15. Where Speculative Decoding Fits In

Speculative decoding solves a different problem.

It tries to reduce latency by letting a cheaper draft model propose multiple tokens and having a stronger model verify them.

Paged KV caching instead tries to improve memory efficiency and scheduling flexibility.

These methods are complementary, not substitutes.

So a strong answer can say:

"Speculative decoding attacks the number of expensive verification steps, while paged KV management attacks the memory and allocator bottlenecks of maintaining many active decode states."

16. How to Answer "Why Is vLLM Faster?"

A strong answer is not:

"Because it uses PagedAttention."

A stronger answer is:

"Because serving performance depends heavily on KV-cache efficiency and scheduling. vLLM improves both. It manages KV memory in blocks instead of large contiguous buffers, which reduces fragmentation and waste, and it pairs that with dynamic scheduling techniques like continuous batching. That lets the server fit more useful active work into the same GPU memory budget and maintain higher utilization."

That is the kind of answer that sounds complete.

17. Questions You Should Be Able to Answer Smoothly

Practice these in full descriptive sentences:

Why is serving often memory-bound while training is often compute-heavy?
Why does plain KV caching solve one problem but create another?
Why is fragmentation such a serious issue for serving engines?
Why does a block-table design help with variable-length requests?
Why does prefix sharing matter for real-world workloads?
Why can a method improve throughput without necessarily improving per-request latency?

ML & LLM Interview Prep — Deep Dives