Serving Notes

Fast Interview Story

If you need a short answer:

A KV cache makes autoregressive decoding faster by reusing the keys and values of past tokens instead of recomputing them at every step.
But KV memory grows with every generated token, so long-running requests get large, and naive contiguous allocation sized for the maximum length wastes space.
Paged attention breaks the KV cache into fixed-size blocks and uses a per-request block table to map logical positions to physical blocks.
That reduces fragmentation and makes dynamic serving features such as prefix sharing and continuous batching efficient.
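If the interviewer asks what "maps logical positions to physical blocks" means in practice, a toy model can help. The sketch below is only an illustration of the bookkeeping, not how any particular engine implements it: `BlockAllocator`, `Sequence`, `fork_prefix`, and `release_all` are hypothetical names, no real tensors are stored, and prefix sharing is simplified to sharing only full blocks through a reference count.

```python
class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a free list (hypothetical)."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() yields block ids 0, 1, 2, ... in order.
        self.free = list(range(num_blocks - 1, -1, -1))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second sequence points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """One request. block_table[i] is the physical block for logical block i."""

    def __init__(self, allocator: BlockAllocator, block_size: int) -> None:
        self.allocator = allocator
        self.block_size = block_size
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new block only when every block held so far is full.
        if self.num_tokens == len(self.block_table) * self.block_size:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, position: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        logical_block, offset = divmod(position, self.block_size)
        return self.block_table[logical_block], offset

    def fork_prefix(self) -> "Sequence":
        """New sequence that reuses this one's full blocks as a shared prefix."""
        child = Sequence(self.allocator, self.block_size)
        full_blocks = self.num_tokens // self.block_size
        child.block_table = [
            self.allocator.share(b) for b in self.block_table[:full_blocks]
        ]
        child.num_tokens = full_blocks * self.block_size
        return child

    def release_all(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0
```

With `block_size=4`, a sequence holding 6 tokens owns two blocks, and `locate(5)` resolves to logical block 1 at offset 1. Sharing only full blocks keeps the sketch simple: a full block is never written again, so no copy-on-write is needed here.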

One-Minute Version

LLM serving is often memory-bound because every active request accumulates KV cache as it generates tokens. The attention math itself has not changed, but the engine has to keep old keys and values resident and quickly accessible. If you reserve one large contiguous buffer per request, sized for the maximum sequence length, most of it sits unused for short requests, and the leftover gaps fragment memory. Paged attention fixes this by allocating fixed-size KV blocks on demand and tracking them through a per-request block table. That gives the server more flexibility, improves memory efficiency, and works especially well with dynamic workloads and prefix reuse.
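A quick back-of-the-envelope calculation makes the memory argument concrete. The sketch below is illustrative arithmetic only. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values), the request lengths, the 4096-token maximum, and the 16-token block size are all assumed numbers, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def contiguous_reserved_tokens(lengths: list[int], max_len: int) -> int:
    # Naive scheme: every request reserves room for the maximum length.
    return len(lengths) * max_len


def paged_reserved_tokens(lengths: list[int], block_size: int) -> int:
    # Paged scheme: each request rounds up to a whole number of blocks.
    return sum(-(-n // block_size) * block_size for n in lengths)


# Assumed example values, not measurements from any real system.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
lengths = [100, 900, 37, 2048]

contiguous = contiguous_reserved_tokens(lengths, max_len=4096)
paged = paged_reserved_tokens(lengths, block_size=16)

gib = 1024 ** 3
print(f"per token:  {per_token / 1024:.0f} KiB")
print(f"contiguous: {contiguous} tokens, {contiguous * per_token / gib:.2f} GiB")
print(f"paged:      {paged} tokens, {paged * per_token / gib:.2f} GiB")
```

Under these assumptions the four requests actually hold 3085 tokens. The contiguous scheme reserves 16384 token slots, while the paged scheme reserves 3120, so the waste per request is bounded by less than one block.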

ML & LLM Interview Prep — Deep Dives

Serving Notes

Fast Interview Story

One-Minute Version

Follow-Up Questions

Why is this different from training?

Why does GQA help serving?

Why does continuous batching matter?