ML & LLM Learning: Coding Interview Preparation
A comprehensive repository for ML and LLM coding interview preparation with implementations, theory, and interview Q&A.
🔥 Frontier-Lab Interview Prep — Start Here
These deep-dive and grill files are the highest-yield content in the repo for applied scientist / ML engineer interviews at frontier labs and big tech. Each topic has a *_DEEP_DIVE.md (interview-grade theory) plus an INTERVIEW_GRILL.md (50–60 brutal questions with strong answers). Drill the grill files until you can answer 40+ cold.
| Topic | Why it matters | Files |
|---|---|---|
| Learning rate / Gradient descent | The single hyperparameter most likely to make training succeed or fail. Interviewers probe this to test if you actually understand optimization. | 02_gradient_descent/LEARNING_RATE_DEEP_DIVE.md · 02_gradient_descent/INTERVIEW_GRILL.md |
| Optimizers | SGD, Adam, AdamW, Lion, Sophia, Shampoo — what each fixes, when each wins. AdamW vs Adam+L2 is a high-frequency interview question. | 10_optimizers/README.md · 10_optimizers/INTERVIEW_GRILL.md |
| Logistic regression | Simplest model with the richest theoretical structure. Many senior offers turn on five hard logistic-regression questions. | 01_classical_ml/LOGISTIC_REGRESSION_DEEP_DIVE.md · 01_classical_ml/LOGISTIC_REGRESSION_INTERVIEW_GRILL.md |
| LLM inference | Prefill vs decode, KV cache memory math, PagedAttention, FlashAttention, speculative decoding (with rejection-sampling proof), quantization. Critical for serving / infra roles. | 06_llm_inference/LLM_INFERENCE_DEEP_DIVE.md · 06_llm_inference/INTERVIEW_GRILL.md |
| Post-training & alignment | RLHF math, full DPO derivation (whiteboard-ready), the alphabet soup (IPO/KTO/ORPO/SimPO/GRPO), Constitutional AI, reward hacking, KL blowup, alignment tax. | 08_training_techniques/ALIGNMENT_DEEP_DIVE.md · 08_training_techniques/INTERVIEW_GRILL.md |
| Transformers | Scaled-dot-product derivation, multi-head reasoning, FFN role, residual stream view, pre-LN vs post-LN, encoder/decoder/cross-attention, scaling laws. | 04_transformers/TRANSFORMERS_DEEP_DIVE.md · 04_transformers/INTERVIEW_GRILL.md |
| Attention mechanisms | MHA → MQA → GQA → MLA hierarchy, sliding window receptive-field math, sparse and linear attention, induction heads, attention sinks. | 05_attention_mechanisms/ATTENTION_DEEP_DIVE.md · 05_attention_mechanisms/INTERVIEW_GRILL.md |
| Normalization | BN/LN/RMSNorm/GroupNorm, why BN fails for transformers, pre-LN vs post-LN stability, the affine transform, the loss-landscape-smoothing argument. | 44_normalization/NORMALIZATION_DEEP_DIVE.md · 44_normalization/INTERVIEW_GRILL.md |
| Positional embeddings | Sinusoidal/learned/T5-bias/RoPE/ALiBi/NoPE, full RoPE derivation showing relative-position from rotated dot products, NTK scaling, YaRN. | 14_advanced_positional_embeddings/POSITIONAL_DEEP_DIVE.md · 14_advanced_positional_embeddings/INTERVIEW_GRILL.md |
| Tokenization | BPE/WordPiece/Unigram/SentencePiece, byte-level BPE, vocabulary trade-offs, arithmetic and multilingual quirks, glitch tokens, multimodal extensions. | 15_tokenization/TOKENIZATION_DEEP_DIVE.md · 15_tokenization/INTERVIEW_GRILL.md |
| Evaluation metrics | Classification (precision/recall/F1/AUROC/PR-AUC), regression (MSE/MAE/R²/quantile), ranking (MAP/NDCG/MRR), LLM-specific (PPL/pass@k/BLEU/LLM-as-judge), calibration, Goodhart's Law. | 03_evaluation_metrics/EVALUATION_METRICS_DEEP_DIVE.md · 03_evaluation_metrics/INTERVIEW_GRILL.md |
| Regularization | Bias-variance, L1/L2 geometry and Bayesian priors, dropout (3 stories), early stopping ≈ L2, MixUp/CutMix, label smoothing, SAM, implicit regularization of SGD. | 11_regularization/REGULARIZATION_DEEP_DIVE.md · 11_regularization/INTERVIEW_GRILL.md |
| Sampling techniques | Greedy/beam/temperature/top-k/top-p/min-p/typical/Mirostat/penalties, why beam search fails for LLMs, speculative decoding, best-of-N for test-time scaling. | 09_sampling_techniques/SAMPLING_DEEP_DIVE.md · 09_sampling_techniques/INTERVIEW_GRILL.md |
| Language modeling losses | CLM/MLM/Span-corruption/PrefixLM/MoD/ELECTRA, why CLM dominates, why NSP died, how ICL emerges from CLM, multi-token prediction, prompt masking for SFT. | 43_language_modeling_losses/LM_LOSSES_DEEP_DIVE.md · 43_language_modeling_losses/INTERVIEW_GRILL.md |
| Information theory | Entropy/cross-entropy/KL/MI, forward vs reverse KL, why MLE = forward KL, MI for contrastive (InfoNCE/CLIP), KL in VAE/RLHF/distillation, source coding theorem. | 33_information_theory/INFORMATION_THEORY_DEEP_DIVE.md · 33_information_theory/INTERVIEW_GRILL.md |
| RAG | Indexing/retrieve/rerank/generate pipeline, chunking strategies, BM25 vs dense vs hybrid, HNSW/IVF/PQ, embedding models, HyDE, lost-in-the-middle, RAGAS, Self-RAG/GraphRAG. | 39_rag_retrieval_augmented_generation/RAG_DEEP_DIVE.md · 39_rag_retrieval_augmented_generation/INTERVIEW_GRILL.md |
| Mixture of Experts | Top-k routing, load balancing loss derivation, capacity factor / token dropping, expert parallelism + all-to-all, Switch/Mixtral/DeepSeek-V3, auxiliary-loss-free balancing. | 41_mixture_of_experts/MOE_DEEP_DIVE.md · 41_mixture_of_experts/INTERVIEW_GRILL.md |
| State Space Models | Continuous SSM ODE, discretization, recurrent vs convolutional view, HiPPO, S4 (DPLR), Mamba (selectivity + parallel scan), hybrid models (Jamba). | 42_state_space_models/SSM_DEEP_DIVE.md · 42_state_space_models/INTERVIEW_GRILL.md |
| Diffusion models | Forward/reverse processes, why predict noise, score-matching connection, DDIM/DPM-Solver, classifier-free guidance, latent diffusion, DiT, flow matching. | 40_diffusion_models/DIFFUSION_DEEP_DIVE.md · 40_diffusion_models/INTERVIEW_GRILL.md |
| LoRA & PEFT | LoRA math (ΔW = B·A), intrinsic-dimension hypothesis, α/r scaling, QLoRA's NF4 + double quantization + paged optimizer, adapters/prefix/IA³/DoRA/GaLore, multi-LoRA serving. | 25_adapters_lora/LORA_DEEP_DIVE.md · 25_adapters_lora/INTERVIEW_GRILL.md |
| Tree-based methods | Gini/entropy splits, RF (bagging + feature subsampling), GBDT (functional gradient descent), XGBoost (second-order + regularization), LightGBM (histogram + leaf-wise), CatBoost (ordered boosting). | 26_tree_based_methods/TREES_DEEP_DIVE.md · 26_tree_based_methods/INTERVIEW_GRILL.md |
| Kernel functions | Kernel trick, Mercer's theorem, RBF/polynomial/string kernels, SVM dual, RKHS, kernel ridge, NTK, attention-as-kernel-smoothing. | 35_kernel_functions/KERNELS_DEEP_DIVE.md · 35_kernel_functions/INTERVIEW_GRILL.md |
| Clustering (advanced) | K-means as coordinate descent, GMM with full EM derivation, DBSCAN core/border/noise, hierarchical linkage, spectral clustering, evaluation metrics. | 19_advanced_clustering/CLUSTERING_DEEP_DIVE.md · 19_advanced_clustering/INTERVIEW_GRILL.md |
| Dimensionality reduction | PCA (variance-max derivation, SVD, Eckart-Young), kernel PCA, t-SNE (KL with Student-t), UMAP, autoencoders/VAE, ICA, NMF, method-selection guide. | 21_dimensionality_reduction/DIMENSIONALITY_REDUCTION_DEEP_DIVE.md · 21_dimensionality_reduction/INTERVIEW_GRILL.md |
| Neural networks fundamentals | MLP, universal approximation, activations (ReLU/GELU/SwiGLU), He/Xavier init derivations, full backprop, vanishing/exploding gradients, residual connections, modern training tricks. | 31_neural_networks/NEURAL_NETWORKS_DEEP_DIVE.md · 31_neural_networks/INTERVIEW_GRILL.md |
| Statistical inference | Estimators (unbiased/consistent/efficient + CRLB), MLE asymptotics, Wald/bootstrap/credible intervals, hypothesis testing, multiple testing (Bonferroni/BH), Bayesian updates with conjugate priors. | 47_statistical_inference/STATISTICAL_INFERENCE_DEEP_DIVE.md · 47_statistical_inference/INTERVIEW_GRILL.md |
| MLE & MAP | Full MLE derivations (Bernoulli/Gaussian/Poisson/multinomial/linreg/logreg), asymptotic theory, MAP-as-regularization (ridge from Gaussian prior, lasso from Laplace), conjugate priors, MLE = forward KL, RLHF/DPO connections. | 37_mle_map_estimation/MLE_MAP_DEEP_DIVE.md · 37_mle_map_estimation/INTERVIEW_GRILL.md |
| Linear algebra for ML | Rank, eigendecomposition (spectral theorem), SVD (Eckart-Young), positive (semi)definiteness, matrix calculus (OLS gradient + Hessian), conditioning, projections. | 24_linear_algebra_qa/LINEAR_ALGEBRA_DEEP_DIVE.md · 24_linear_algebra_qa/INTERVIEW_GRILL.md |
| Probability for ML | Axioms, Bayes' theorem (with base-rate fallacy), expectations and variance (linearity, total expectation, total variance), common distributions, multivariate Gaussian (marginals/conditionals), LLN/CLT. | 17_probability_math/PROBABILITY_DEEP_DIVE.md · 17_probability_math/INTERVIEW_GRILL.md |
| Picking distributions / GLMs | Which distribution for which data type, exponential family unification, GLMs and canonical links (linreg/logreg/Poisson), heavy-tailed distributions, common pitfalls. | 18_distribution_classification/DISTRIBUTIONS_DEEP_DIVE.md · 18_distribution_classification/INTERVIEW_GRILL.md |
| Generalization & evaluation | Data leakage (4 types), calibration (ECE, Platt/isotonic/temperature), distribution shift (covariate/label/concept), class imbalance, double descent, cross-validation done right, ablations, metric uncertainty. | 49_generalization_and_evaluation/GENERALIZATION_DEEP_DIVE.md · 49_generalization_and_evaluation/INTERVIEW_GRILL.md |
| A/B testing | Hypothesis tests, sample-size formulas, CUPED, peeking and sequential testing, SUTVA / network effects, SRM check, novelty effects, multiple testing, Bayesian A/B, ML-specific (interleaving, holdback, off-policy / IPS). | 30_ab_testing/AB_TESTING_DEEP_DIVE.md · 30_ab_testing/INTERVIEW_GRILL.md |
| Large-scale LLM systems | Training memory math ( rule), activation checkpointing, BF16/FP8, ZeRO-1/2/3 / FSDP, Megatron tensor parallelism, pipeline parallelism + bubble formula, 3D parallelism, expert parallelism for MoE, sequence/context parallelism, MFU, training failure modes. | 61_large_scale_llm_systems/LARGE_SCALE_LLM_DEEP_DIVE.md · 61_large_scale_llm_systems/INTERVIEW_GRILL.md |
| RL fundamentals | MDPs, Bellman equations, value/policy iteration, Q-learning vs SARSA (on vs off-policy), DQN tricks, policy gradient theorem with derivation, REINFORCE + baselines, actor-critic, TRPO/PPO with clipped surrogate, GAE, RLHF connection, GRPO. | 45_rl_fundamentals/RL_DEEP_DIVE.md · 45_rl_fundamentals/INTERVIEW_GRILL.md |
| ML system design | 6-step framework (clarify → frame → data → features+model → serving → monitoring), two-stage retrieval, cold start, cost asymmetry, drift detection, shadow/canary deployment, worked examples (recommender, fraud). | 29_system_design_for_ml/ML_SYSTEM_DESIGN_DEEP_DIVE.md · 29_system_design_for_ml/INTERVIEW_GRILL.md |
| Optimization (deeper) | Convex/strongly-convex/smooth definitions, GD convergence rates, Nesterov acceleration, Newton/BFGS/Gauss-Newton, SGD scaling, Lagrangian + KKT (with SVM dual), deep-learning loss landscape (saddles dominate, flat minima, edge of stability). | 48_optimization_and_matrix_calculus/OPTIMIZATION_DEEP_DIVE.md · 48_optimization_and_matrix_calculus/INTERVIEW_GRILL.md |
| Multimodal & embedding history | BoW/TF-IDF → Word2Vec/GloVe → BERT → Sentence-BERT → CLIP → multimodal LLMs (Flamingo, LLaVA), full CLIP loss derivation, InfoNCE as MI bound, SigLIP, vector search (HNSW/IVF-PQ), hybrid retrieval. | 38_multimodal_models_and_embedding_history/MULTIMODAL_EMBEDDING_DEEP_DIVE.md · 38_multimodal_models_and_embedding_history/INTERVIEW_GRILL.md |
| Statistical learning theory | ERM, PAC learning, VC dimension, Rademacher complexity, bias-variance, double descent, no-free-lunch theorem, regularization-as-inductive-bias, modern bounds (PAC-Bayes, stability, compression). | 52_statistical_learning_theory/STATISTICAL_LEARNING_THEORY_DEEP_DIVE.md · 52_statistical_learning_theory/INTERVIEW_GRILL.md |
| RNNs & LSTMs | Vanilla RNN forward/BPTT, vanishing/exploding gradients (Jacobian product analysis), LSTM gates and cell-state additive update, GRU, bidirectional, seq2seq + attention (Bahdanau/Luong), transformer transition, connection to modern SSMs. | 46_rnn_lstm/RNN_LSTM_DEEP_DIVE.md · 46_rnn_lstm/INTERVIEW_GRILL.md |
| Discriminative vs generative | Modeling the conditional p(y given x) vs the joint p(x, y), Naive Bayes, LDA/QDA decision boundaries, LDA giving a linear boundary of the same form as logistic regression, Ng & Jordan sample-complexity result, HMM, modern generative models (VAE/GAN/diffusion/LLM). | 34_discriminative_vs_generative/DISCRIMINATIVE_VS_GENERATIVE_DEEP_DIVE.md · 34_discriminative_vs_generative/INTERVIEW_GRILL.md |
| Frontier training playbook | Methodology over architecture, scaling laws (Kaplan/Chinchilla), past-Chinchilla for inference cost, MoE/GQA/MLA trade-offs, data dedup + filtering, stability tricks (z-loss, softcapping, QK-norm), staged training, ablation methodology. | 62_frontier_training_playbook/frontier_training_deep_dive.md · 62_frontier_training_playbook/INTERVIEW_GRILL.md |
| Paged attention & LLM serving | KV-cache math (GQA/MQA/MLA savings), PagedAttention internals (block tables, paging analogy), continuous batching, prefix caching / RadixAttention, speculative decoding, INT8/INT4/FP8 quantization, vLLM/SGLang/TensorRT-LLM. | 63_paged_attention_and_llm_serving/paged_attention_deep_dive.md · 63_paged_attention_and_llm_serving/INTERVIEW_GRILL.md |
| Recommendation systems | Collaborative filtering, matrix factorization (BPR), two-tower retrieval (in-batch negatives, ANN serving), sequential models (GRU4Rec/SASRec/BERT4Rec), two-stage retrieval+ranking, GBDT/DeepFM/DLRM, NDCG/MAP/MRR, cold start, echo chamber + exploration. | 22_recommendation_systems/RECOMMENDATION_SYSTEMS_DEEP_DIVE.md · 22_recommendation_systems/INTERVIEW_GRILL.md |
| Anomaly detection | Statistical (z-score, Mahalanobis), density-based (KDE, LOF), Isolation Forest score derivation, One-Class SVM, autoencoder reconstruction, embedding-based AD, time-series anomalies (point/contextual/collective), AUPRC over AUC. | 32_anomaly_detection/ANOMALY_DETECTION_DEEP_DIVE.md · 32_anomaly_detection/INTERVIEW_GRILL.md |
| Business case studies | 9-step case-study framework + canonical templates (churn, fraud, recs, forecasting, pricing, lead scoring, content moderation, search) — end-to-end answers covering data, leakage, model, evaluation, deployment, iteration. | 28_business_use_cases/BUSINESS_CASE_STUDIES_DEEP_DIVE.md · 28_business_use_cases/INTERVIEW_GRILL.md |
| NLP basics | TF-IDF, n-gram language models, smoothing (Laplace, Good-Turing, Kneser-Ney with continuation count), perplexity, Zipf's law, Heaps' law, edit distance DP, BM25 with hyperparameter intuition. | 36_nlp_basics/NLP_BASICS_DEEP_DIVE.md · 36_nlp_basics/INTERVIEW_GRILL.md |
| Advanced ML theory | Bias-variance decomposition with proof, cross-validation theory (k-fold/stratified/group/time-series/nested with LOO closed form), learning curves, AIC vs BIC, ROC/PR curves with cost-aware operating points, F-beta scores. | 27_advanced_theory/ADVANCED_THEORY_DEEP_DIVE.md · 27_advanced_theory/INTERVIEW_GRILL.md |
| LLM problems & mitigations | Long-context (lost-in-the-middle), hallucination overview, prompting (CoT, self-consistency, ToT, ReAct), jailbreaks + defenses, indirect prompt injection, agent architectures and failure modes, multi-turn memory, latency/cost, evaluation. | 07_llm_problems/LLM_PROBLEMS_DEEP_DIVE.md · 07_llm_problems/INTERVIEW_GRILL.md |
| Hallucination detection (LLM) | Full taxonomy (factual / faithfulness / source / logical / self-contradictory; intrinsic vs extrinsic), causes (RLHF-honesty paradox, lost-in-the-middle, citation hallucination), detection methods across 3 families (reference-based: NLI/QA/citation/KG; reference-free: SelfCheckGPT, semantic entropy, CoVe; internal-states: truth probes, EigenScore, SAPLMA), RAG-specific (RAGAS, citation faithfulness, AIS), benchmarks (TruthfulQA, SimpleQA, HaluEval, FactScore, RAGTruth), production cascade design, 90 active-recall questions. | 07_llm_problems/HALLUCINATION_DETECTION_DEEP_DIVE.md · 07_llm_problems/HALLUCINATION_INTERVIEW_GRILL.md |
| LLM evaluation | Why LLM eval is hard, capability benchmarks (MMLU-Pro, GPQA, MATH/AIME, HumanEval+/SWE-Bench-Verified/LiveCodeBench, RULER long-context, MMMU, GAIA, TAU-bench), instruction following (IFEval, MT-Bench, AlpacaEval-2 length-controlled, Arena-Hard-Auto), LLM-as-judge methodology (5 biases + calibration), pairwise / ELO / Bradley-Terry / Chatbot Arena, factuality measurement (FactScore, SAFE, RAGAS, FACTS Grounding), contamination detection (Min-K%-prob, time-shifted benchmarks), robustness, statistical methodology (CIs, pass@k, multiple comparisons), harnesses (lm-eval-harness, HELM, OpenCompass, Inspect), online telemetry, A/B testing for LLM products, full product eval suite case study, 115 active-recall questions. | 07_llm_problems/LLM_EVALUATION_DEEP_DIVE.md · 07_llm_problems/LLM_EVALUATION_INTERVIEW_GRILL.md |
| Build an agent in 30 min | Codable-from-memory agent: 70-line working loop with tool calls + parser; production extensions (memory, parallel tools, planner, observability, streaming); 8 failure modes with mitigations; 5-min interview narrative. | 07_llm_problems/AGENT_IN_30_MIN.md |
| LLM / AI security | Threat model and attack surface; prompt injection (direct, indirect, multi-modal, the lethal trifecta); jailbreak taxonomy with named techniques (GCG, PAIR, AutoDAN, PAP, Crescendo, Skeleton Key, Many-Shot, Best-of-N); poisoning and backdoors (Sleeper Agents, BadLlama); training-data extraction (Carlini, ChatGPT divergence); membership inference, model extraction, embedding inversion (Vec2Text); agent and tool security (confused deputy, AgentDojo); plugin / MCP security; output-handling vulns (XSS / SSRF / RCE / SQLi); defenses across input / model / output / system / deployment (Constitutional Classifiers, Circuit Breakers, Llama Guard, SmoothLLM, Spotlighting, Dual-LLM); red-teaming and benchmarks (HarmBench, JailbreakBench, AgentDojo, StrongREJECT, WMDP, CyberSecEval); privacy and unlearning; frontier safety frameworks (RSP, Preparedness, FSF, AISI, METR); 5-tier production playbook; 9 failure-mode case studies; 135 active-recall questions. | 65_llm_security/LLM_SECURITY_DEEP_DIVE.md · 65_llm_security/INTERVIEW_GRILL.md |
| Frontier alignment + RL (reasoning models) | The most-asked-about frontier topic at OpenAI / DeepMind / Anthropic in 2025. Reasoning models deep dive: paradigm shift; test-time compute scaling (Snell et al.); PRMs vs ORMs (Lightman PRM800K, Math-Shepherd, OmegaPRM); search + RL (STaR, Quiet-STaR, V-STaR, ReST^EM, MCTS-based AlphaProof / AlphaGeometry); R1-Zero pure-RL with the "aha moment"; full R1 four-stage pipeline (cold-start SFT → reasoning-RL → rejection-sampling SFT → final RLHF); o1/o3 inferred details; deliberative alignment; reasoning distillation (R1-Distill); inference-time strategies; long-CoT failure modes. Dedicated RLVR chapter: full 2024-2025 algorithm zoo (PPO, GRPO, Dr.GRPO, DAPO with its 4 tricks, RLOO, REINFORCE++, VinePPO, PRIME, GSPO, Step-RL, Iterative DPO); verifier design across math/code/tool-use/formal-proofs/generative-judges; reward shaping; KL choices; curriculum; failure modes; open-source infra (TRL/veRL/Open-RLHF/Open-R1/vLLM); substantial low-resource multilingual reasoning section with 7 approach families and 6 concrete research project blueprints (BengaliMath-RL, Cross-Lingual Transfer Study, Code-Switched RLVR, Synthetic Multilingual Data, Multilingual PRMs, Tool-Augmented Multilingual). Frontier reward modeling: scalar vs generative vs verifiable; RLAIF and Constitutional AI; self-rewarding LMs; full reward-hacking taxonomy (length, sycophancy, format, refusal, verifier-hack, prompt-injection); reward overoptimization (Gao curve); RewardBench. Open-source playbooks: memorizable 60-90s answers for DeepSeek R1, Tülu 3, Llama 3, Qwen 2.5, Open-R1; synthesized 6-stage interview cookbook. 150 active-recall questions. | 66_frontier_alignment_rl/REASONING_MODELS_DEEP_DIVE.md · 66_frontier_alignment_rl/RLVR_DEEP_DIVE.md · 66_frontier_alignment_rl/FRONTIER_REWARD_MODELING.md · 66_frontier_alignment_rl/OPEN_SOURCE_POSTTRAIN_PLAYBOOKS.md · 66_frontier_alignment_rl/INTERVIEW_GRILL.md |
| Frontier intuitive probability questions | The open-ended Bayesian / probabilistic scenarios that DeepMind / OpenAI / Anthropic interviewers actually ask. 7 frameworks (Bayesian classification + LRT + Neyman-Pearson, MLE / MAP / method-of-moments, concentration inequalities Markov / Chebyshev / Hoeffding / Bernstein / Chernoff, KL divergence as test statistic + Bayes-error exponent, sequential decision / bandits, importance / rejection sampling, Stein's paradox + shrinkage). The canonical DeepMind "two-arrays-from-two-distributions, classify a new sample" question fully worked with 90-second oral answer template (parametric vs KDE vs discriminative, sample complexity , OOD handling). 25 additional frontier-lab worked examples (coin-flip detection, Monty Hall, German tank, AB test pitfalls, change-point detection, KL estimation, etc.). 125 grill questions. The framing checklist for any probabilistic open question. | 67_frontier_intuitive_questions/INTUITIVE_QUESTIONS_DEEP_DIVE.md · 67_frontier_intuitive_questions/INTERVIEW_GRILL.md |
| LeetCode / NeetCode 150 patterns | Pattern-recognition for coding rounds, organized around the 18 NeetCode 150 categories (Arrays & Hashing, Two Pointers, Sliding Window, Stack, Binary Search, Linked List, Trees, Tries, Heap, Backtracking, Graphs, Advanced Graphs incl. Dijkstra/Bellman-Ford/Floyd-Warshall/MST/Topo, 1D DP, 2D DP, Greedy, Intervals, Math/Geometry, Bit Manipulation). The 30-second triage (input shape × output shape × constraints × structural cues → pattern), per-pattern recognition signals and code templates with 5-10 representative problems each, problem-to-pattern transformation tricks ("sort first," "build a graph," "binary search on the answer," "treat 2D as 1D"), complexity reasoning cheatsheet (-to-algorithm mapping), common interview traps, the 5-step problem-solving protocol (Understand → Examples → Brute force → Optimize → Code), 9-week drilling plan, 165 pattern-recognition grill questions. | 68_leetcode_patterns/LEETCODE_PATTERNS_DEEP_DIVE.md · 68_leetcode_patterns/INTERVIEW_GRILL.md |
| AI Infrastructure Engineer playbook | Production-flavored counterpart for AI Infrastructure Engineer interviews. 8 areas: GPU/VRAM fundamentals (A100/H100/H200/B200 specs, NVLink, the prefill-vs-decode regimes); quantization in production (FP8, INT8, INT4 + SmoothQuant/AWQ/GPTQ); batching (static / dynamic / continuous / chunked-prefill / disaggregated); inference engines (vLLM, TensorRT-LLM, TGI, SGLang, LMDeploy with decision matrix); KV caching production (PagedAttention, prefix cache, eviction, KV-quant math, the 327 KB/token formula); speculative decoding production gotchas (Medusa/EAGLE); SLO metrics (TTFT/TPOT/TPS/ITL with target ranges); distributed training infrastructure (FSDP/DeepSpeed/Megatron + slurm/k8s/Ray); serving platforms (Triton/KServe/Ray Serve/BentoML/Modal/Baseten/Together/Fireworks) with autoscaling and S-LoRA multi-tenancy; vector DB retrieval pipelines (Pinecone/Weaviate/Qdrant/Milvus/pgvector + HNSW/IVF-PQ + hybrid retrieval + reranking); prompt caching and cost optimization (10-point cost-cut checklist + worked GPU-cost example for 100K DAU); observability (LangSmith/Langfuse/Helicone/Arize/OpenTelemetry-GenAI) with online + offline eval; capacity planning math; reliability (blue-green / canary / shadow / multi-AZ / circuit breakers); infra-layer security; full production architecture diagram; 100-question grill; 7-day drill plan. | 69_ai_infrastructure_engineering/AI_INFRA_ENGINEER_PLAYBOOK.md |
| Scaling laws | Distilled from CS336's Basic Scaling Laws lecture. Historical lineage (Cortes 1993 → Banko-Brill → Hestness 2017 → Kaplan 2020); math derivation (parametric n^-1, non-parametric n^(-1/D), neural exponents -0.05 to -0.1 ⇒ effective dimension ~10-20); data scaling laws + mixtures + repetition (4-epoch rule); scale-dependent data filtering; architecture scaling (LSTM vs Transformer methodology, Narang 2020); optimizer scaling (slopes don't change!); hyperparameter scaling (aspect ratio ~100 as scale-invariant); the Kaplan parameter-counting footgun (excluding embeddings); MoE scaling; critical batch size (noise-vs-bias regimes, OpenAI estimation procedure); learning rate scaling + μP; upstream vs downstream transfer; joint scaling laws (Rosenfeld + Kaplan functional forms); the full Kaplan-vs-Chinchilla saga with Yair's resolution + Pearce-Song's complementary analysis + Epoch AI's method-3 mystery resolution; the overtraining-for-serving reality (Llama 2/3 at 286:1 / 1875:1 token-per-param vs Chinchilla 20:1); isoflops as the workhorse protocol; pitfalls and senior signals; 70-question grill. | 70_scaling_laws/SCALING_LAWS_DEEP_DIVE.md |
| Clustering evaluation | Internal metrics (silhouette, Davies-Bouldin, Calinski-Harabasz, Dunn), external metrics (ARI, NMI, V-measure, purity), choosing the number of clusters k (elbow / silhouette / gap statistic / stability), bootstrap stability validation, common pitfalls. | 23_clustering_evaluation/CLUSTERING_EVALUATION_DEEP_DIVE.md · 23_clustering_evaluation/INTERVIEW_GRILL.md |
| Cross-topic synthesis | Meta-document: 5 archetype questions (design/train/why-works/debug/tradeoff), bridge topics (cross-entropy, embeddings, attention, bias-variance, data curation), first-principles answer pattern, common mistakes, synthesis cheatsheet. | 64_integrated_ai_ml_interview_synthesis/INTERVIEW_SYNTHESIS_DEEP_DIVE.md · 64_integrated_ai_ml_interview_synthesis/INTERVIEW_GRILL.md |
| ML coding patterns | Stable softmax + log-sum-exp, scaled dot-product attention with masking and multi-head, top-k/top-p sampling, beam search with length normalization, K-means, padding/masking, vectorized cosine similarity, logistic regression, backprop from scratch. | 50_ml_coding_interview_patterns/CODING_PATTERNS_DEEP_DIVE.md · 50_ml_coding_interview_patterns/INTERVIEW_GRILL.md |
| ML debugging | 8-layer debugging tree, loss-curve interpretation (flat / explode / val gap / spike), sanity checks (overfit one batch, tiny dataset), NaN debugging (FP16/log-of-zero/anomaly detection), leakage detection, gradient checking, distribution-shift investigation. | 53_ml_debugging_and_mock_coding/ML_DEBUGGING_DEEP_DIVE.md · 53_ml_debugging_and_mock_coding/INTERVIEW_GRILL.md |
| Training behaviors | Healthy loss curves and pathologies, LR (warmup, decay, finder), batch size effects (linear scaling, critical batch, generalization gap), gradient norm tracking, mixed precision (FP16/BF16/FP8), loss spike recovery, catastrophic forgetting + replay/EWC. | 16_training_behaviors/TRAINING_BEHAVIORS_DEEP_DIVE.md · 16_training_behaviors/INTERVIEW_GRILL.md |
| Whiteboard derivations | Meta-collection of 13 must-master derivations: backprop, attention, OLS, logistic gradient, KL, EM, PCA via SVD, SVM dual, RoPE rotation, DPO, ELBO, bias-variance, info gain — each with step-by-step proof + cross-reference. | 58_whiteboard_derivations/WHITEBOARD_DERIVATIONS_DEEP_DIVE.md · 58_whiteboard_derivations/INTERVIEW_GRILL.md |
| Multi-turn conversation design | Memory strategies (sliding window / summarization / retrieval / hybrid), persona consistency + sycophancy, multi-turn evaluation (simulated users, trajectory metrics), state management, tool integration, prompt template formats, prompt caching, personalization. | 20_multi_turn_conversations/MULTI_TURN_DEEP_DIVE.md · 20_multi_turn_conversations/INTERVIEW_GRILL.md |
Recommended drill sequence: (1) Read each *_DEEP_DIVE.md start to finish. (2) Drill INTERVIEW_GRILL.md until 40+/60 cold. (3) Cycle back through the misses the next day. (4) Ask a friend to randomly pick 10 questions per topic and grill you out loud.
The remaining content of this repo (60+ topic folders) is supporting material. The interview-grade deep-dive and grill files listed above are the highest-leverage files.
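The "327 KB/token formula" named in the AI Infrastructure Engineer playbook row can be reproduced with a few lines of arithmetic. The sketch below uses the standard per-token KV-cache size (one key and one value vector per layer per KV head). The model shape (80 layers, 8 KV heads via GQA, head dimension 128, FP16) is an assumption chosen because it yields that figure; this README does not state which model the playbook uses, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored per token.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 corresponds to FP16/BF16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed shape (not taken from this repo): 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(per_token)  # 327680 bytes, i.e. roughly 327 KB per token

# Same shape with full multi-head attention (64 KV heads), for comparison.
mha_per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
print(mha_per_token // per_token)  # 8
```

Multiplying the per-token figure by context length and batch size gives the total cache footprint, which is the quantity the serving-oriented deep dives reason about.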
🎯 What You'll Learn
This repository covers the core topics for ML/LLM coding interviews:
- Classical ML Algorithms - Simple implementations (pure Python/NumPy + PyTorch)
- Evaluation Metrics - All common metrics with simple code
- Transformers & Attention - Core concepts with simple implementations
- LLM Inference Techniques - KV cache, quantization (simple code)
- Attention Mechanisms - Different types with clear code
- LLM Problem Solving - Long context, efficiency solutions
- Training Techniques - RLHF, DPO (simplified implementations)
- Sampling Techniques - Top-k, top-p (nucleus), temperature (pure Python)
- Optimizers - SGD, Adam, etc. (from scratch)
- Regularization - L1, L2, dropout (simple implementations)
- Theory & Interview Q&A - Comprehensive coverage
- Diffusion Models - Complete theory, training, evaluation, NLP applications
- Mixture of Experts (MoE) - Architecture, routing, load balancing, efficiency
- State Space Models (SSM) - Mamba, linear complexity, long sequence modeling
- Language Modeling Losses - MLM, CLM, NSP implementations and explanations
- Normalization Techniques - BatchNorm and LayerNorm with detailed theory and implementations
- Reinforcement Learning Fundamentals - MDP, Q-Learning, Multi-Armed Bandit, Monte Carlo in easy language
- RNN and LSTM - Simple, short, precise implementations from scratch
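As an illustration of the sampling item in the list above, here is a minimal NumPy sketch of temperature scaling followed by top-p (nucleus) filtering. The function names `top_p_filter` and `sample_top_p` are hypothetical and are not claimed to match the files in `09_sampling_techniques/`.

```python
import numpy as np


def top_p_filter(probs, p=0.9):
    """Zero out everything outside the nucleus and renormalize.

    The nucleus is the smallest set of highest-probability tokens
    whose cumulative probability reaches p.
    """
    order = np.argsort(probs)[::-1]            # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    # First position where the cumulative mass reaches p, plus one to get a count.
    cutoff = np.searchsorted(cumulative, p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


def sample_top_p(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token id from temperature-scaled, nucleus-filtered logits."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = logits / temperature
    scaled = scaled - scaled.max()             # shift for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=top_p_filter(probs, p)))


filtered = top_p_filter(np.array([0.5, 0.3, 0.15, 0.05]), p=0.9)
print(filtered)  # the least likely token is dropped, the rest are renormalized
```

Lower temperature sharpens the distribution before filtering, so the nucleus tends to shrink; with `p=1.0` the filter keeps every token.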
💡 Code Philosophy
All code is kept simple:
- Pure Python/NumPy versions - No heavy dependencies, easy to understand
- PyTorch versions - Simple PyTorch implementations for comparison
- From scratch - Understand how things work internally
- Interview-ready - Code you can write in interviews
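To make the "from scratch, NumPy only" style concrete, here is a short sketch of a numerically stable softmax and log-sum-exp, two of the patterns named in the ML coding patterns row of the table. It is an illustrative example written for this README, not a copy of the repository's own implementation.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """log(sum(exp(x))) computed without overflow by factoring out the max."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def softmax(x, axis=-1):
    """Softmax that subtracts the max first, so exp never sees large inputs."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


# A naive exp(1000.0) overflows to inf; the shifted version stays finite.
logits = np.array([1000.0, 1001.0, 1002.0])
probs = softmax(logits)
print(probs, logsumexp(logits))
```

Subtracting the maximum leaves the result unchanged because softmax is invariant to adding a constant to every logit, which is the point interviewers usually want stated out loud.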
📁 Repository Structure
ml_and_llm_learning/
├── 00_pytorch_fundamentals/ # PyTorch basics (START HERE if new to PyTorch)
├── 01_classical_ml/ # Linear/logistic regression, KNN, K-means
├── 02_gradient_descent/ # Different GD variants
├── 03_evaluation_metrics/ # All evaluation metrics
├── 04_transformers/ # Transformer architecture
├── 05_attention_mechanisms/ # Different attention types
├── 06_llm_inference/ # KV cache, optimization
├── 07_llm_problems/ # Long context, efficiency
├── 08_training_techniques/ # RLHF, DPO, PPO, GRPO
├── 09_sampling_techniques/ # Top-k, top-p (nucleus), temperature
├── 10_optimizers/ # Optimizer implementations
├── 11_regularization/ # Regularization techniques
├── 12_theory/ # Comprehensive theory
├── 13_interview_qa/ # Interview questions & answers
├── ...
├── 47_statistical_inference/ # Estimators, CIs, tests, Bayesian updates
├── 48_optimization_and_matrix_calculus/ # Gradients, Hessians, conditioning
├── 49_generalization_and_evaluation/ # Leakage, calibration, ablations
├── 50_ml_coding_interview_patterns/ # Pressure-friendly coding templates
├── 51_llm_research_interview_prep/ # LLM eval, ablations, research judgment
├── ...
├── 62_frontier_training_playbook/ # Architecture, stability, data, ablations
├── 63_paged_attention_and_llm_serving/ # KV cache, fragmentation, paging, batching
└── 64_integrated_ai_ml_interview_synthesis/ # Cross-topic interview answer patterns
🚀 Quick Start
1. Set Up Environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
2. Learn PyTorch Basics (If Needed)
# If you're new to PyTorch, start here
cd 00_pytorch_fundamentals
python pytorch_basics.py
3. Start Learning
# Classical ML (each snippet below assumes you start from the repository root)
cd 01_classical_ml
python linear_regression.py
# Transformers
cd 04_transformers
python attention.py
# LLM Inference
cd 06_llm_inference
python kv_cache.py
📚 Learning Path
See LEARNING_PATH.md for the complete learning journey.
🎓 Prerequisites
- Python 3.9+
- Basic Python knowledge
- Understanding of linear algebra
- (Optional) PyTorch/TensorFlow experience
🔧 Technologies
- NumPy: Numerical computations (pure Python/NumPy implementations)
- PyTorch: Simple PyTorch versions (optional, for comparison)
- Matplotlib: Visualization (optional)
- Pure Python + NumPy: Most code needs nothing beyond NumPy (minimal dependencies)
📖 Topics Covered
- Classical ML - Linear/logistic regression, KNN, K-means
- Gradient Descent - Batch, SGD, Mini-batch, Adam
- Evaluation Metrics - Accuracy, precision, recall, F1, etc.
- Transformers - Architecture, attention, decoding
- Attention Mechanisms - Self-attention, cross-attention, etc.
- LLM Inference - KV cache, quantization, optimization
- LLM Problems - Long context, efficiency solutions
- Training Techniques - RLHF, DPO, PPO, GRPO
- Sampling Techniques - Top-p (nucleus), temperature
- Optimizers - SGD, Adam, AdamW, etc.
- Regularization - L1, L2, dropout, etc.
- Theory - Comprehensive ML/LLM theory
- Interview Q&A - 100+ interview questions
- Advanced Positional Embeddings - RoPE, ALiBi
- Tokenization - BPE, WordPiece, SentencePiece
- Training Behaviors - Single GPU, loss spikes
- Probability Math - Common probability Q&A
- Distribution Classification - Which distribution?
- Advanced Clustering - Hierarchical, DBSCAN, GMM
- Multi-Turn Conversations - Design & long context
- Dimensionality Reduction - PCA, theory & math
- Recommendation Systems - Matrix factorization, evaluation
- Clustering Evaluation - Silhouette, ARI, NMI
- Linear Algebra Q&A - Eigenvalues, SVD, rank
- Adapters & LoRA - Parameter-efficient fine-tuning
- Tree-Based Methods - Decision Tree, Random Forest, Gradient Boosting, XGBoost
- Advanced Theory - Bias-variance, cross-validation, learning curves
- Business Use Cases - Churn, recommendations, fraud, pricing (detailed solutions)
- System Design for ML - Scalable pipelines, serving, monitoring
- A/B Testing - Statistical testing, sample size, interpretation
- Neural Networks - Forward pass, backpropagation from scratch (detailed)
- Anomaly Detection - Isolation Forest (detailed explanation, when to use)
- Information Theory - Entropy, KL divergence, cross-entropy, mutual information, Gini
- Discriminative vs Generative - Model types, assumptions (Linear, Logistic, SVM), Bayes' theorem
- Kernel Functions - Linear, Polynomial, RBF, Sigmoid (detailed explanations, when to use)
- NLP Basics - TF-IDF, N-grams, Laplace smoothing, L1/L2 priors (detailed explanations)
- MLE and MAP Estimation - Maximum Likelihood, Maximum A Posteriori (detailed derivations)
- Multimodal Models & Embedding History - CLIP, embedding training, NLP evolution (TF-IDF → Word2Vec → GloVe → BERT)
- RAG (Retrieval-Augmented Generation) - Industry-standard architecture, challenges, solutions, evaluation (production-ready)
- Diffusion Models - Complete theory, training, evaluation, NLP applications
- Mixture of Experts (MoE) - Architecture, routing, load balancing, efficiency
- State Space Models (SSM) - Mamba, linear complexity, long sequence modeling
- Language Modeling Losses - MLM, CLM, NSP implementations and explanations
- Normalization Techniques - BatchNorm and LayerNorm with detailed theory and implementations
- Reinforcement Learning Fundamentals - MDP, Q-Learning, Multi-Armed Bandit, Monte Carlo, explained in plain language
- RNN and LSTM - Simple, short, precise implementations from scratch
- Statistical Inference - Estimators, MLE, confidence intervals, hypothesis tests, bootstrap, Bayesian updates
- Optimization and Matrix Calculus - Gradients, Jacobians, Hessians, convexity, conditioning, optimizer intuition
- Generalization and Evaluation - Leakage, calibration, class imbalance, distribution shift, ablations, metric uncertainty
- ML Coding Interview Patterns - Stable softmax, masking, vectorization, top-k/top-p, padding, k-means update templates
- LLM Research Interview Prep - Perplexity, pass@k, retrieval metrics, ablation reasoning, paper discussion structure
- Statistical Learning Theory - Empirical vs population risk, capacity, generalization gap, regularization as inductive bias
- ML Debugging and Mock Coding - Timed coding prompts, NaN/debugging patterns, leakage checks, training failure diagnosis
- Data Manipulation for ML - Pandas feature-table work, joins, groupby, normalization, preprocessing without leakage
- Research Papers and Mock Interviews - Paper discussion prompts, research judgment, and probability questions like distribution membership
- Spoken Interview Question Bank - Live-interview style model answers for ML theory, coding, probability, and LLM research questions
- Meta-Style Mock Interviews - Full simulated technical loops, follow-ups, and scoring rubric
- Whiteboard Derivations - Stepwise must-master derivations and memory skeletons
- Blind Coding Drills - No-search, timed implementation drills from memory
- Research Judgment Rounds - Scenario-based evidence review, ablations, and claim evaluation
- Large-Scale LLM Systems - Training-memory, sharding, parallelism, serving, and scale trade-offs
- Frontier Training Playbook - Methodology for architecture choices, stability, data mixture, and believable ablations
- Paged Attention and LLM Serving Internals - KV-cache fragmentation, block tables, prefix sharing, continuous batching, and serving bottlenecks
- Integrated AI and ML Interview Synthesis - Bridges across theory, coding, systems, and research judgment with answer frameworks
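To give a flavor of the coding patterns named in the list above (stable softmax, temperature, top-p), here is a minimal dependency-free sketch. It is not taken from the repository: the function names `stable_softmax` and `top_p_filter` are hypothetical, and the repository's own versions may well be written with NumPy instead of plain lists.

```python
import math


def stable_softmax(logits, temperature=1.0):
    """Softmax that subtracts the max logit first so exp() cannot overflow."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize over the kept tokens only
    return filtered


if __name__ == "__main__":
    # A naive math.exp(1000.0) would raise OverflowError; the shifted version does not.
    probs = stable_softmax([1000.0, 999.0, 0.0])
    print(probs)
    print(top_p_filter(probs, p=0.7))
```

Top-p is the same technique as nucleus sampling: the cutoff adapts to the shape of the distribution, whereas top-k always keeps a fixed number of tokens. Sampling from the filtered distribution (for example with `random.choices`) is left out to keep the sketch short.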
🎯 New to PyTorch?
Start with 00_pytorch_fundamentals/, which covers the PyTorch concepts used in this repository.
🎯 Learning Goals
After working through this repository, you'll be able to:
- ✅ Implement ML algorithms from scratch
- ✅ Understand transformer architecture deeply
- ✅ Optimize LLM inference
- ✅ Answer interview questions confidently
- ✅ Understand training techniques (RLHF, DPO, etc.)
- ✅ Implement evaluation metrics
- ✅ Understand theory behind ML/LLM
Ready to start? Open LEARNING_PATH.md and begin your journey! 🚀