LLM / AI Security — Interview Grill

100+ active-recall questions. Pair with LLM_SECURITY_DEEP_DIVE.md. Answer each in <60 seconds out loud. Mark anything you can't answer cleanly and re-read the relevant section.

Section A — Foundations and threat model (Q1–10)

Why is LLM security different from classical infosec and from classical alignment?
Why does "instructions and data share a channel" matter?
Define misuse, confidentiality, integrity, availability attacks against LLMs. Give one example of each.
What's a confused deputy? Why are LLM agents prone to it?
What does "the lethal trifecta" mean? Name the three legs.
Black-box vs grey-box vs white-box LLM attacks — what changes for the attacker?
Why are open-weights frontier models a security headache?
Name three pretraining-time attack vectors.
Name three inference-time attack vectors.
Why does behavioural alignment evaluation alone not rule out misalignment? (Reference Sleeper Agents.)

Section B — Prompt injection (Q11–20)

Define direct prompt injection.
Define indirect prompt injection. Who coined it?
Give three real channels through which indirect injection can land in context.
What's multi-modal prompt injection? Give one image-based and one audio-based example.
Why does "putting the rule in the system prompt" not defend against indirect injection?
Why does pattern-matching for injection strings fail?
Walk through how the lethal trifecta enables data exfiltration via an indirectly-injected agent.
What's the "spotlighting" defense?
What's the "dual-LLM / quoting" defense?
Why is indirect injection considered the worst class of LLM attack right now?

Section C — Jailbreaks (Q21–32)

Define a jailbreak. How is it different from injection?
What's DAN / persona jailbreak?
What's prefix injection?
What's refusal suppression?
Why do encoding tricks (base64, ROT13, ASCII art) sometimes succeed?
Walk through Crescendo. Why does it exploit context coherence?
Walk through Skeleton Key.
Walk through Many-Shot Jailbreaking. Why does it scale with context length?
Walk through Best-of-N. Why is it model-agnostic?
Why do low-resource languages still produce jailbreak vectors?
Why doesn't more RLHF "fix" jailbreaks once and for all?
Why is fine-tuning even a small dataset (BadLlama / Qi et al.) a jailbreak?

Section D — Optimization-based adversarial attacks (Q33–40)

Sketch GCG end-to-end.
Why do GCG suffixes transfer across models?
Walk through PAIR.
What's AutoDAN?
What's PAP and what's the high-level claim?
What does "latent-space attack" mean?
What is a Universal Adversarial Trigger? How does it differ from a per-prompt attack?
Compare GCG (white-box gradient) vs PAIR (black-box LLM-vs-LLM).

Section E — Defenses against jailbreaks (Q41–50)

Why is RLHF refusal training only a partial defense?
What's adversarial training, and what are its limits?
What are circuit breakers (Zou et al. 2024) and why are they more robust?
What's latent adversarial training?
What does Llama Guard do?
What are Constitutional Classifiers?
What's SmoothLLM, and what attack does it defeat?
Output-side classifiers vs input-side classifiers — when do you use each?
Why is "the system prompt is secret" a fragile defense?
Defense in depth — what does it mean for an LLM product?

Section F — Data poisoning and backdoors (Q51–58)

What is pretraining-data poisoning? How can an attacker inject content cheaply?
What's a backdoor / trojan attack?
What are sleeper agents? What was Anthropic's headline finding?
Why does standard safety training fail on sleeper agents?
Walk through the BadLlama-style fine-tuning attack.
Why does this make fine-tuning APIs a security perimeter?
What is RLHF-data poisoning? What's the defense?
How does deduplication of training data interact with backdoor robustness?

Section G — Memorization, extraction, privacy (Q59–66)

What is training-data extraction? Cite the canonical paper.
Walk through the ChatGPT divergence attack (Nasr et al. 2023).
Why does memorization scale with model size?
What is membership inference? Two methods.
What is Min-K%-prob? Why does it work?
What is logit-extraction stealing (Carlini 2024)? What does it recover?
What is embedding inversion (Vec2Text)? What's the privacy implication?
Why are vector DB embeddings PII?

Section H — Agents and tools (Q67–78)

What's the agent security threat model in one sentence?
Indirect injection in tool output — give a concrete attack chain.
What's a tool-arg injection attack?
Markdown image-fetch exfiltration — how does it work and how do you prevent it?
What's denial-of-wallet? How do you defend?
What does AgentDojo measure?
Why does an agent that browses the web AND reads private files AND can post webhooks have a critical risk?
How do you architect a coding agent to avoid the lethal trifecta?
What's the defense pattern for "send email" tools?
What does human-in-the-loop add and why is it imperfect?
What attacks does sandboxing protect against? What does it not protect against?
Capability scoping per task — give an example.

Section I — Output handling and product vulns (Q79–86)

How does markdown XSS work in chat UIs?
Why is rendering raw HTML from an LLM dangerous?
SQL injection via LLM-generated queries — how to prevent?
SSRF via LLM-proposed URLs — how to prevent?
Path traversal via LLM-proposed filenames — how to prevent?
Why is OWASP Top 10 for LLM Applications worth memorizing?
Why is logging an LLM product subtle from a privacy perspective?
Code-execution agent — what's the minimum viable sandbox?

Section J — Red-teaming and evaluation (Q87–94)

Manual vs automated red-teaming — when do you use each?
What does HarmBench measure? What does JailbreakBench add?
What's StrongREJECT and why is it harder to fool than a vanilla GPT-judge?
What's WMDP measuring?
What's CyberSecEval?
What's Perez et al. 2022's contribution?
What does an external pre-deployment AISI evaluation look like?
Why do bug bounty programs exist for LLMs in 2024+?

Section K — Privacy and unlearning (Q95–100)

What is differential privacy at training? Why is it impractical at frontier scale?
What is machine unlearning? Name two methods (TOFU / NPO).
What's the GDPR right-to-be-forgotten implication for LLMs?
PII redaction at training-time vs inference-time — what's the difference?
What's the EU AI Act's treatment of frontier "general purpose AI"?
What does HIPAA require for an LLM-based medical app?

Section L — Frameworks and policy (Q101–105)

What's Anthropic's RSP? What is ASL-3?
What's OpenAI's Preparedness Framework?
What's DeepMind's Frontier Safety Framework? What are CCLs?
What's METR? Why does it matter?
NIST AI RMF + AI 600-1 — what's it for?

Section M — Senior-level scenario questions (Q106–115)

Scenario. You're shipping a customer-support agent that reads internal docs, searches the web, and can email customers. Walk me through the security architecture.
Scenario. A pen-tester demonstrates GCG suffix jailbreak on your API. What's your incident response and what do you ship?
Scenario. Researchers report indirect-injection in your RAG pipeline causing exfiltration via image-fetch. Walk me through root cause and the layered fix.
Scenario. Your product offers a code-interpreter tool. Design the sandbox.
Scenario. Your customer wants on-prem deployment with their fine-tunes. What policy controls do you require?
Scenario. A user reports the model emitted what looks like another customer's PII. What's your investigation and remediation?
Scenario. You're red-teaming a new release. What benchmarks do you run, and what gates do you put on shipping?
Scenario. Design the eval suite and gating policy for an agent that controls a browser.
Scenario. The model is suspected to have been pretrained on contaminated benchmarks. How do you confirm and what do you publish?
Scenario. Your fine-tuning API is being abused to strip safety training. Design the abuse-detection pipeline.

Quick fire (Q116–135)

One line: prompt injection.
One line: indirect prompt injection.
One line: lethal trifecta.
One line: GCG.
One line: PAIR.
One line: Crescendo.
One line: Many-Shot Jailbreaking.
One line: Best-of-N Jailbreaking.
One line: Skeleton Key.
One line: Sleeper Agents.
One line: BadLlama.
One line: SmoothLLM.
One line: Circuit Breakers.
One line: Constitutional Classifiers.
One line: AgentDojo.
One line: HarmBench.
One line: StrongREJECT.
One line: Min-K%-prob.
One line: Vec2Text.
One line: RSP / Preparedness / FSF.

Self-grading

110+ correct: ready for frontier-lab security or AI-safety-engineering rounds.
80–109: re-read §3 (injection), §4 (jailbreaks), §9 (agents), §12 (defenses), §16 (production).
50–79: re-read full deep dive then redo.
<50: take three days on the deep dive, drill §18 senior signals, then come back.

7-day drill plan

Day 1: §1–2 (foundations, threat model). Drill A.
Day 2: §3 (prompt injection) + §4 (jailbreak taxonomy). Drill B, C.
Day 3: §5 (optimization attacks) + §12 (defenses). Drill D, E.
Day 4: §6 (poisoning) + §7–8 (extraction/privacy). Drill F, G.
Day 5: §9 (agents) + §10–11 (plugins, output). Drill H, I.
Day 6: §13 (red-team/eval) + §14 (privacy) + §15 (frameworks). Drill J, K, L.
Day 7: §16 (production) + §17 (case studies) + §18 (senior signals). Drill M (scenarios) + Quick fire. Whiteboard a security architecture for one product.

ML & LLM Interview Prep — Deep Dives