LLM / AI Security — Interview Grill
100+ active-recall questions. Pair with
LLM_SECURITY_DEEP_DIVE.md. Answer each in <60 seconds out loud. Mark anything you can't answer cleanly and re-read the relevant section.
Section A — Foundations and threat model (Q1–10)
- Why is LLM security different from classical infosec and from classical alignment?
- Why does "instructions and data share a channel" matter?
- Define misuse, confidentiality, integrity, availability attacks against LLMs. Give one example of each.
- What's a confused deputy? Why are LLM agents prone to it?
- What does "the lethal trifecta" mean? Name the three legs.
- Black-box vs grey-box vs white-box LLM attacks — what changes for the attacker?
- Why are open-weights frontier models a security headache?
- Name three pretraining-time attack vectors.
- Name three inference-time attack vectors.
- Why does behavioural alignment evaluation alone not rule out misalignment? (Reference Sleeper Agents.)
Section B — Prompt injection (Q11–20)
- Define direct prompt injection.
- Define indirect prompt injection. Who coined it?
- Give three real channels through which indirect injection can land in context.
- What's multi-modal prompt injection? Give one image-based and one audio-based example.
- Why does "putting the rule in the system prompt" not defend against indirect injection?
- Why does pattern-matching for injection strings fail?
- Walk through how the lethal trifecta enables data exfiltration via an indirectly-injected agent.
- What's the "spotlighting" defense?
- What's the "dual-LLM / quoting" defense?
- Why is indirect injection considered the worst class of LLM attack right now?
Section C — Jailbreaks (Q21–32)
- Define a jailbreak. How is it different from injection?
- What's DAN / persona jailbreak?
- What's prefix injection?
- What's refusal suppression?
- Why do encoding tricks (base64, ROT13, ASCII art) sometimes succeed?
- Walk through Crescendo. Why does it exploit context coherence?
- Walk through Skeleton Key.
- Walk through Many-Shot Jailbreaking. Why does it scale with context length?
- Walk through Best-of-N. Why is it model-agnostic?
- Why do low-resource languages still produce jailbreak vectors?
- Why doesn't more RLHF "fix" jailbreaks once and for all?
- Why is fine-tuning even a small dataset (BadLlama / Qi et al.) a jailbreak?
Section D — Optimization-based adversarial attacks (Q33–40)
- Sketch GCG end-to-end.
- Why do GCG suffixes transfer across models?
- Walk through PAIR.
- What's AutoDAN?
- What's PAP and what's the high-level claim?
- What does "latent-space attack" mean?
- What is a Universal Adversarial Trigger? How does it differ from a per-prompt attack?
- Compare GCG (white-box gradient) vs PAIR (black-box LLM-vs-LLM).
Section E — Defenses against jailbreaks (Q41–50)
- Why is RLHF refusal training only a partial defense?
- What's adversarial training, and what are its limits?
- What are circuit breakers (Zou et al. 2024) and why are they more robust?
- What's latent adversarial training?
- What does Llama Guard do?
- What are Constitutional Classifiers?
- What's SmoothLLM, and what attack does it defeat?
- Output-side classifiers vs input-side classifiers — when do you use each?
- Why is "the system prompt is secret" a fragile defense?
- Defense in depth — what does it mean for an LLM product?
Section F — Data poisoning and backdoors (Q51–58)
- What is pretraining-data poisoning? How can an attacker inject content cheaply?
- What's a backdoor / trojan attack?
- What are sleeper agents? What was Anthropic's headline finding?
- Why does standard safety training fail on sleeper agents?
- Walk through the BadLlama-style fine-tuning attack.
- Why does this make fine-tuning APIs a security perimeter?
- What is RLHF-data poisoning? What's the defense?
- How does deduplication of training data interact with backdoor robustness?
Section G — Memorization, extraction, privacy (Q59–66)
- What is training-data extraction? Cite the canonical paper.
- Walk through the ChatGPT divergence attack (Nasr et al. 2023).
- Why does memorization scale with model size?
- What is membership inference? Two methods.
- What is Min-K%-prob? Why does it work?
- What is logit-extraction stealing (Carlini 2024)? What does it recover?
- What is embedding inversion (Vec2Text)? What's the privacy implication?
- Why are vector DB embeddings PII?
Section H — Agents and tools (Q67–78)
- What's the agent security threat model in one sentence?
- Indirect injection in tool output — give a concrete attack chain.
- What's a tool-arg injection attack?
- Markdown image-fetch exfiltration — how does it work and how do you prevent it?
- What's denial-of-wallet? How do you defend?
- What does AgentDojo measure?
- Why does an agent that browses the web AND reads private files AND can post webhooks have a critical risk?
- How do you architect a coding agent to avoid the lethal trifecta?
- What's the defense pattern for "send email" tools?
- What does human-in-the-loop add and why is it imperfect?
- What attacks does sandboxing protect against? What does it not protect against?
- Capability scoping per task — give an example.
Section I — Output handling and product vulns (Q79–86)
- How does markdown XSS work in chat UIs?
- Why is rendering raw HTML from an LLM dangerous?
- SQL injection via LLM-generated queries — how to prevent?
- SSRF via LLM-proposed URLs — how to prevent?
- Path traversal via LLM-proposed filenames — how to prevent?
- Why is OWASP Top 10 for LLM Applications worth memorizing?
- Why is logging an LLM product subtle from a privacy perspective?
- Code-execution agent — what's the minimum viable sandbox?
Section J — Red-teaming and evaluation (Q87–94)
- Manual vs automated red-teaming — when do you use each?
- What does HarmBench measure? What does JailbreakBench add?
- What's StrongREJECT and why is it harder to fool than a vanilla GPT-judge?
- What's WMDP measuring?
- What's CyberSecEval?
- What's Perez et al. 2022's contribution?
- What does an external pre-deployment AISI evaluation look like?
- Why do bug bounty programs exist for LLMs in 2024+?
Section K — Privacy and unlearning (Q95–100)
- What is differential privacy at training? Why is it impractical at frontier scale?
- What is machine unlearning? Name two methods (TOFU / NPO).
- What's the GDPR right-to-be-forgotten implication for LLMs?
- PII redaction at training-time vs inference-time — what's the difference?
- What's the EU AI Act's treatment of frontier "general purpose AI"?
- What does HIPAA require for an LLM-based medical app?
Section L — Frameworks and policy (Q101–105)
- What's Anthropic's RSP? What is ASL-3?
- What's OpenAI's Preparedness Framework?
- What's DeepMind's Frontier Safety Framework? What are CCLs?
- What's METR? Why does it matter?
- NIST AI RMF + AI 600-1 — what's it for?
Section M — Senior-level scenario questions (Q106–115)
- Scenario. You're shipping a customer-support agent that reads internal docs, searches the web, and can email customers. Walk me through the security architecture.
- Scenario. A pen-tester demonstrates GCG suffix jailbreak on your API. What's your incident response and what do you ship?
- Scenario. Researchers report indirect-injection in your RAG pipeline causing exfiltration via image-fetch. Walk me through root cause and the layered fix.
- Scenario. Your product offers a code-interpreter tool. Design the sandbox.
- Scenario. Your customer wants on-prem deployment with their fine-tunes. What policy controls do you require?
- Scenario. A user reports the model emitted what looks like another customer's PII. What's your investigation and remediation?
- Scenario. You're red-teaming a new release. What benchmarks do you run, and what gates do you put on shipping?
- Scenario. Design the eval suite and gating policy for an agent that controls a browser.
- Scenario. The model is suspected to have been pretrained on contaminated benchmarks. How do you confirm and what do you publish?
- Scenario. Your fine-tuning API is being abused to strip safety training. Design the abuse-detection pipeline.
Quick fire (Q116–135)
- One line: prompt injection.
- One line: indirect prompt injection.
- One line: lethal trifecta.
- One line: GCG.
- One line: PAIR.
- One line: Crescendo.
- One line: Many-Shot Jailbreaking.
- One line: Best-of-N Jailbreaking.
- One line: Skeleton Key.
- One line: Sleeper Agents.
- One line: BadLlama.
- One line: SmoothLLM.
- One line: Circuit Breakers.
- One line: Constitutional Classifiers.
- One line: AgentDojo.
- One line: HarmBench.
- One line: StrongREJECT.
- One line: Min-K%-prob.
- One line: Vec2Text.
- One line: RSP / Preparedness / FSF.
Self-grading
- 110+ correct: ready for frontier-lab security or AI-safety-engineering rounds.
- 80–109: re-read §3 (injection), §4 (jailbreaks), §9 (agents), §12 (defenses), §16 (production).
- 50–79: re-read full deep dive then redo.
- <50: take three days on the deep dive, drill §18 senior signals, then come back.
7-day drill plan
- Day 1: §1–2 (foundations, threat model). Drill A.
- Day 2: §3 (prompt injection) + §4 (jailbreak taxonomy). Drill B, C.
- Day 3: §5 (optimization attacks) + §12 (defenses). Drill D, E.
- Day 4: §6 (poisoning) + §7–8 (extraction/privacy). Drill F, G.
- Day 5: §9 (agents) + §10–11 (plugins, output). Drill H, I.
- Day 6: §13 (red-team/eval) + §14 (privacy) + §15 (frameworks). Drill J, K, L.
- Day 7: §16 (production) + §17 (case studies) + §18 (senior signals). Drill M (scenarios) + Quick fire. Whiteboard a security architecture for one product.