Topic 69: AI Infrastructure Engineering — Production Playbook
Production-flavored counterpart to the research-scientist chapters. Built around the 8-area AI Infrastructure Engineer skill checklist:
- GPU / VRAM fundamentals, quantization & batching
- vLLM / TensorRT-LLM / inference optimization
- KV caching, speculative decoding & token throughput
- Distributed training basics (DDP / FSDP / DeepSpeed)
- Model serving & autoscaling
- Vector DB retrieval pipelines
- Prompt caching & cost optimization
- Observability for LLM apps
🔥 Read these first:
AI_INFRA_ENGINEER_PLAYBOOK.md: 21 sections covering the full production stack:
- GPU hardware mental model with frontier specs (A100/H100/H200/B200)
- Quantization in production (FP8, INT8, INT4 with GPTQ/AWQ/SmoothQuant)
- Batching strategies (static/dynamic/continuous/chunked-prefill/disaggregated)
- Inference engines (vLLM/TensorRT-LLM/TGI/SGLang/LMDeploy with decision matrix)
- KV caching production (PagedAttention, prefix cache, eviction, KV quant math)
- Speculative decoding production gotchas
- SLO metrics (TTFT/TPOT/TPS/ITL)
- Distributed training infra (DDP/FSDP/DeepSpeed/Megatron + slurm/k8s/Ray)
- Serving platforms (Triton/KServe/BentoML/Ray Serve/Modal/Baseten) with autoscaling and multi-tenant LoRA
- Vector DB pipelines (Pinecone/Weaviate/Qdrant/Milvus/pgvector + HNSW/IVF-PQ + hybrid retrieval)
- Prompt caching and cost optimization (10-point checklist + worked GPU-cost example)
- Observability for LLM apps (LangSmith/Langfuse/Helicone/Arize + OpenTelemetry GenAI)
- Capacity planning math
- Reliability patterns (blue-green, canary, shadow, multi-AZ)
- Infra-layer security
- Full production architecture diagram
Plus 100 interview-grill questions across A–M and a 7-day drill plan.
Why this exists
The other chapters (06_llm_inference, 61_large_scale_llm_systems, 62_frontier_training_playbook, 63_paged_attention_and_llm_serving) cover the algorithms and internals. They lean research-scientist. This folder covers the production operations layer an AI Infrastructure Engineer is expected to know:
- Which inference engine and version (vLLM 0.6, TRT-LLM 0.13, etc.)
- Which serving platform (Triton, KServe, Ray Serve)
- Which observability stack (LangSmith, Langfuse, OpenTelemetry GenAI)
- How to compute GPU costs for a planned product
- How to design SLOs (TTFT p95 < 1s, TPOT p95 < 50ms)
- How to do canary / blue-green / shadow deploys
- How to manage multi-tenant LoRA, prompt caching, model routing for cost
- Which vector DB at which scale, with which index type
These are the topics that come up in AI Infrastructure Engineer interviews at OpenAI, Anthropic, Cohere, Together, Fireworks, Anyscale, Modal, Baseten, Replicate, etc.
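As an illustration of the SLO bullet above, here is a minimal sketch of checking measured latencies against the example budgets (TTFT p95 < 1s, TPOT p95 < 50ms). The function names, the nearest-rank percentile method, and the sample data are assumptions made for this example; they are not taken from the playbook.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(ttft_s, tpot_ms, ttft_p95_budget_s=1.0, tpot_p95_budget_ms=50.0):
    """Compare p95 TTFT (seconds) and p95 TPOT (milliseconds) to their budgets.

    Returns (ok, p95_ttft_s, p95_tpot_ms).
    """
    p95_ttft = percentile(ttft_s, 95)
    p95_tpot = percentile(tpot_ms, 95)
    ok = p95_ttft < ttft_p95_budget_s and p95_tpot < tpot_p95_budget_ms
    return ok, p95_ttft, p95_tpot


# Hypothetical measurements: 20 requests, one slow outlier on TTFT.
ttft_samples = [0.2] * 19 + [3.0]
tpot_samples = [30.0] * 20
ok, p95_ttft, p95_tpot = meets_slo(ttft_samples, tpot_samples)
print(ok, p95_ttft, p95_tpot)
```

A single outlier in 20 requests does not move the p95, but two would. That is the practical reason to state SLOs as percentiles rather than averages or maxima.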
Single sentence to remember
AI infra engineering = pick engine + quantization + batching for the SLO budget; KV math determines hardware; PagedAttention + continuous batching + prompt caching are the throughput trinity; multi-replica + canary + observability are the reliability trinity.
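To make "KV math determines hardware" concrete, here is a rough sketch of the standard KV-cache sizing arithmetic: two tensors (K and V) per layer, each holding `num_kv_heads * head_dim` elements per token. The function names and the model shape used below are illustrative assumptions, not values from the playbook, and the estimate ignores engine overhead such as block fragmentation.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 covers the separate K and V tensors.
    bytes_per_elem is 2 for FP16/BF16 and 1 for FP8/INT8.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                 bytes_per_elem=2):
    """Total KV cache in GiB for batch_size sequences of seq_len tokens each."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_elem
    )
    return per_token * seq_len * batch_size / 2**30


# Assumed shape for illustration: 32 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_cache_bytes_per_token(32, 8, 128)           # FP16
fp16_gib = kv_cache_gib(32, 8, 128, seq_len=8192, batch_size=16)
fp8_gib = kv_cache_gib(32, 8, 128, seq_len=8192, batch_size=16, bytes_per_elem=1)
print(per_token // 1024, "KiB/token;", fp16_gib, "GiB FP16;", fp8_gib, "GiB FP8")
```

Under these assumptions the cache costs 128 KiB per token, so 16 concurrent 8K-token sequences need 16 GiB in FP16 and 8 GiB with an FP8 KV cache. Whatever GPU memory is left after weights, divided by the per-token figure, bounds how many tokens can be in flight at once.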
Cross-references
- 06_llm_inference/LLM_INFERENCE_DEEP_DIVE.md: algorithm internals.
- 61_large_scale_llm_systems/EFFICIENT_TRAINING_INFERENCE_PLAYBOOK.md: research-flavored training/inference depth.
- 63_paged_attention_and_llm_serving/: paged attention deep dive.
- 41_mixture_of_experts/MOE_DEEP_DIVE.md: MoE serving considerations.
- 39_rag_retrieval_augmented_generation/RAG_DEEP_DIVE.md: RAG algorithm side; this folder covers the infra side.
- 65_llm_security/LLM_SECURITY_DEEP_DIVE.md: security at the model layer; this folder covers infra-layer security.
How to use
- Read AI_INFRA_ENGINEER_PLAYBOOK.md cover-to-cover once.
- Memorize the GPU spec table (§2.3), inference engine decision matrix (§5.7), and the full stack diagram (§17).
- Be able to derive KV-cache size for any model in seconds.
- Be able to do the GPU cost calculation (§14.2) on a whiteboard for any DAU/QPS scenario.
- Drill the 100-question grill in §20.
- If you've shipped this stack before, lean on specifics in interviews ("at $LASTROLE we ran vLLM 0.6 with FP8 KV cache on H100 TP=2 with chunked prefill enabled, hit p95 TTFT of 800ms at 50 concurrent users"). That kind of concrete answer beats theory every time.
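For practicing the whiteboard GPU cost calculation mentioned in the list above, here is a back-of-envelope sketch that goes from DAU to GPU count to monthly spend. It is not the playbook's worked example (§14.2). Every input below (traffic, peak factor, per-GPU throughput, utilization target, hourly price) is a hypothetical placeholder to be replaced with measured values, and the model only counts output tokens, so prefill cost is ignored.

```python
import math


def gpu_cost_estimate(dau, requests_per_user_per_day, output_tokens_per_request,
                      peak_to_avg, tokens_per_sec_per_gpu, target_utilization,
                      gpu_hourly_usd):
    """Size a fleet for peak decode throughput and price it for a 30-day month.

    Assumes GPUs are provisioned for peak and billed around the clock.
    """
    avg_qps = dau * requests_per_user_per_day / 86_400
    peak_tokens_per_sec = avg_qps * peak_to_avg * output_tokens_per_request
    # Leave headroom: only plan to use a fraction of measured throughput.
    usable_per_gpu = tokens_per_sec_per_gpu * target_utilization
    gpus = math.ceil(peak_tokens_per_sec / usable_per_gpu)
    monthly_usd = gpus * gpu_hourly_usd * 24 * 30
    return {
        "avg_qps": avg_qps,
        "peak_tokens_per_sec": peak_tokens_per_sec,
        "gpus": gpus,
        "monthly_usd": monthly_usd,
    }


# Hypothetical scenario, not a benchmark result.
estimate = gpu_cost_estimate(
    dau=100_000,
    requests_per_user_per_day=10,
    output_tokens_per_request=500,
    peak_to_avg=3,
    tokens_per_sec_per_gpu=2_500,
    target_utilization=0.7,
    gpu_hourly_usd=2.50,
)
print(estimate)
```

With these placeholder inputs the average load is about 11.6 QPS, the peak is roughly 17,400 output tokens per second, and the estimate comes to 10 GPUs at about $18,000 per month. Doing the same chain by hand, in this order, is a reasonable whiteboard structure.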