Hallucination Detection in LLM-Generated Text — Deep Dive

Why this exists. Hallucination is one of the main reasons LLMs fail in production, and "how would you detect / measure / mitigate hallucinations?" is one of the most common applied-scientist interview questions in 2025. This deep dive is an interview-grade reference: definitions, taxonomies, why models hallucinate, the main detection methods (reference-based, reference-free, internal-states-based), benchmarks, mitigation, and a full system-design treatment.

1. What is a hallucination, precisely?

The word "hallucination" gets used loosely. Interview-grade definition:

A hallucination is content generated by an LLM that is unsupported by, or contradicted by, the relevant ground truth.

Two qualifiers matter:

"Relevant ground truth" = the source we're evaluating against. For RAG, it's the retrieved passages. For factual QA, it's world knowledge. For summarization, it's the source document.
"Unsupported" vs "contradicted" matter separately. An unsupported fact may be true; a contradicted one is definitely wrong. Most production detectors check "is this grounded in the source?" rather than "is this true in the world?".

This subtle distinction is a frontier-lab interview probe. Be ready for: "Is a true-but-unsupported claim a hallucination?" Strong answer: depends on the application — for RAG-grounded QA, yes (faithfulness criterion); for general QA, no (factuality criterion).

2. Taxonomy

Hallucinations are not one thing. Senior candidates know the categories.

2.1 By target

Type	Definition	Example
Factual	Wrong about world knowledge	"Einstein won the Nobel Prize for relativity" (it was photoelectric effect)
Faithfulness	Contradicts source documents (in RAG / summarization)	Source says "30 employees"; summary says "30,000 employees"
Logical / reasoning	Internally inconsistent	"X > Y, Y > Z, therefore Z > X"
Source / citation	Invents references, papers, URLs	Cites "Smith et al. 2018" — paper doesn't exist
Self-contradictory	Earlier statement contradicts later one in same response	"She was born in 1990. In 1985, she..."
Instruction misalignment	Doesn't follow the user's request	User asks for 3 bullet points; gets 7 paragraphs
Multilingual mistranslation	Wrong meaning in translation	Common in low-resource language pairs

2.2 Intrinsic vs extrinsic (Maynez et al. 2020)

For grounded generation (summarization, RAG):

Intrinsic hallucination: contradicts the source. Source: "Sales rose 10%." Summary: "Sales fell 10%."
Extrinsic hallucination: not contradicted by the source but not supported either. Source: "Sales rose 10%." Summary: "Sales rose 10% due to a new marketing campaign." (The "due to..." is unsupported.)

Extrinsic is more dangerous because it's harder to detect: the source doesn't contradict it, so you have to verify against external knowledge.

2.3 By severity (production framing)

Critical: medical, legal, financial — could cause real harm.
Significant: factually wrong but bounded impact (e.g., wrong year for a historical event).
Cosmetic: stylistic or marginal (e.g., "this paper showed X" when paper actually showed X but with caveats).

Production systems weight detection by severity. A 0.5% medical hallucination rate is unacceptable; a 0.5% cosmetic-error rate may be fine.

2.4 Reasoning hallucinations (frontier topic)

A separate category that's increasingly important with reasoning models (o1, R1):

Step-level errors: a single reasoning step is wrong even if the final answer is right.
Final-answer errors: chain-of-thought looks plausible, final answer wrong.
Reasoning over hallucinated premises: the model invents a "fact" early and reasons consistently from it.

Detection differs: process-level (PRM) catches step errors; outcome-level catches only the final answer.

3. Why LLMs hallucinate

You'll be asked. Have a structured answer.

3.1 Statistical reasons

The next-token prediction objective doesn't penalize confident errors. During pretraining, the model learns to produce plausible continuations. "Plausible" ≠ "true." If a confident-sounding wrong continuation has higher probability than "I don't know," the model will produce it.

Coverage gaps in training data. The model has read "Einstein won the Nobel Prize" but the reason (photoelectric effect, not relativity) is mentioned much less. The model hallucinates the more salient association.

Long-tail facts are forgotten or misremembered. Pretraining has billions of tokens but a power-law distribution of facts. Rare entities and events are barely represented; the model fills in "best guess" patterns.

3.2 Architectural reasons

Tokenization quirks. Numbers, names, and code can be tokenized inconsistently. A specific phone number or DOI may be tokenized differently each time the model sees it; the model can't memorize it cleanly.

Position bias / lost in the middle. In long context, attention concentrates on early and recent tokens. Mid-context information gets used less reliably → model "fills in" rather than reading.

Greedy / low-temperature decoding doesn't help. The model commits to the highest-probability token at each step even when alternatives are nearly tied.

3.3 Training-objective reasons

RLHF can increase hallucinations (counterintuitive — interviewers love this). The reward model is trained on human preferences. Humans prefer confident-sounding, fluent, complete answers. So RLHF rewards the model for producing confident answers — whether or not they're correct. The model learns to never say "I don't know" because uncertainty is unrewarded.

This is a major reason post-RLHF models (GPT-4, Claude, etc.) are more confident-but-not-more-correct than their SFT predecessors. Calibration worsens with RLHF in many cases.

Mitigation: explicit "I don't know" reward signal; refusal training on hard questions; calibration after RLHF.

3.4 Sampling reasons

Temperature, top-p. Higher temperature / wider nucleus = more diversity but more hallucination risk. Lower = more conservative but more repetitive and may miss correct-but-low-probability tokens.

Stochastic generation gives different answers across runs. Used by detection methods (self-consistency).
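
To see why temperature changes hallucination risk, here is a minimal sketch of temperature-scaled softmax. The logits are made-up values for three candidate tokens, not output from any real model; the point is only that a higher temperature moves probability mass from the top candidate to the weaker ones.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits: one strong candidate and two weaker ones.
logits = [4.0, 2.0, 1.0]
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature the top token dominates; at high temperature the weaker (possibly wrong) candidates are sampled more often, which is also what makes sampling-based detectors informative.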

3.5 Reasoning failures

Compounding errors in long chains. Chain-of-thought multiplies error rates: probability of correct full chain = product of correctness at each step. Long reasoning is fragile.
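
A quick numeric illustration of the product rule, assuming every step is independently correct with the same probability (an idealization; real reasoning steps are correlated):

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct,
    assuming independent steps with equal accuracy (an idealization)."""
    return per_step_accuracy ** num_steps


for steps in (5, 20, 50):
    p = chain_success_probability(0.98, steps)
    print(f"{steps:>2} steps at 98% per-step accuracy: {p:.3f}")
```

Under this assumption, 98% per-step accuracy gives roughly 0.90 for 5 steps, 0.67 for 20 steps, and 0.36 for 50 steps.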

Reward hacking on verifiable rewards (frontier issue): models learn to game the verifier. E.g., math models that produce reasoning that looks correct but uses non-rigorous shortcuts.

3.6 The honest summary

LLMs hallucinate because:

The training objective rewards plausibility, not truth.
The world has long-tail facts the model doesn't fully memorize.
RLHF rewards confident outputs.
Sampling is stochastic.
Long chains compound errors.

Mitigation isn't "make the model not hallucinate" — it's "detect and correct when it does." That's why detection is the focus.

4. Detection methods — the full taxonomy

There are three families. A senior interview answer covers all three.

Family	Idea	Needs ground truth?	Cost
Reference-based	Compare to known truth	Yes	Cheap
Reference-free	Detect via LLM/sampling tricks	No	Medium-high
Internal-states-based	Use model's own activations / logits	No (but needs model access)	Cheap once trained

5. Reference-based detection (when ground truth exists)

The easiest case: you have a reference (gold answer, source document, knowledge base) to compare against.

5.1 String overlap metrics

BLEU, ROUGE, METEOR: n-gram overlap. Not designed for hallucination — high overlap doesn't guarantee correctness; low overlap doesn't guarantee error.
Exact match (EM) and F1 (token-level): used for QA where answers are short.

Why they're weak: They confuse paraphrasing with hallucination. A paraphrased correct answer scores low; a wrong answer that copies words from the question scores high.

Use as baselines, not as hallucination detectors.
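
A minimal sketch of exact match and token-level F1 in the SQuAD style, to make the weakness concrete. The normalization rules (lowercasing, dropping punctuation and English articles) are a common convention, and the example strings are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())   # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The photoelectric effect"
paraphrase = "his work on how light ejects electrons from metals"   # correct, reworded
wrong_copy = "the effect of relativity"                             # wrong, shares a word

print(token_f1(paraphrase, reference))   # correct paraphrase, no token overlap
print(token_f1(wrong_copy, reference))   # wrong answer, partial overlap
```

Here the correct paraphrase scores 0.0 and the wrong answer scores 0.4, which is exactly the failure mode described above.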

5.2 NLI-based detection

Frame each generated sentence as a hypothesis; the source/reference is the premise. Use a Natural Language Inference model to check entailment:

For each sentence S in the generated output:
    For each premise P in the source:
        if NLI(P, S) == "entailment":
            S is supported.
            break
    else:
        S is unsupported (potential hallucination).

Models commonly used:

RoBERTa-large fine-tuned on MNLI (dataset: Williams et al. 2018): off-the-shelf NLI model.
DeBERTa-v3 fine-tuned on MNLI/ANLI: stronger.
Specialized: SummaC (Laban et al. 2022), FactCC (Kryscinski et al. 2020) — trained specifically on summarization-faithfulness data.

Strengths: solid baseline, widely understood, no LLM-judge cost.

Weaknesses:

NLI models can be brittle on long premises.
Numeric reasoning poorly handled by NLI ("$30 million" vs "$30 billion": sometimes scored as entailment).
Can't judge extrinsic additions: added information that the source neither supports nor contradicts comes back as "neutral", so NLI can flag it as unsupported but cannot say whether it is true.

5.3 QA-based detection (FEQA, QAGS, QuestEval)

Generate questions from the candidate text, answer them using the source, and check if the answers match.

For each fact F in the candidate:
    Q = generate_question(F)
    A_candidate = extract_answer_from(F)
    A_source = qa_model(Q, source)
    if A_candidate != A_source:
        F is a potential hallucination

Strengths: catches subtle factual errors better than NLI; numeric and entity consistency tested directly.

Weaknesses: depends on QA model quality and question generation quality; multi-hop questions can fool it.

5.4 Citation verification

For RAG / agentic outputs that cite sources:

For each cited claim, retrieve the cited passage.
Use NLI / LLM-judge to verify the passage actually supports the claim.

Citation faithfulness is a key sub-metric. Modern systems (GPT-4o, Claude, Perplexity) cite sources, but a sizable share of citations (roughly 30-40% in some evaluations) don't actually support the attached claim. Detection here is critical.

5.5 Knowledge graph triple matching

For factual claims about entities:

Extract (subject, relation, object) triples from the candidate.
Look up triples in a knowledge graph (Wikidata, internal KG).
Mismatch → hallucination.

Used in production for entity-rich domains (biomedical, legal).
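
A minimal sketch of the lookup step, assuming triples have already been extracted upstream. The in-memory `TOY_KG` dictionary and the relation names are hypothetical stand-ins for a real graph such as Wikidata. Note the three-way verdict: a missing entry means "cannot verify", not "hallucination".

```python
from enum import Enum

class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNKNOWN = "unknown"          # the KG has no entry: cannot verify either way

# Toy knowledge graph: (subject, relation) -> set of accepted objects.
# Treating a mismatch as a contradiction is only safe for relations
# whose accepted values are fully listed.
TOY_KG = {
    ("albert einstein", "nobel prize awarded for"): {"photoelectric effect"},
    ("albert einstein", "born in"): {"ulm"},
}

def check_triple(subject: str, relation: str, obj: str, kg=TOY_KG) -> Verdict:
    key = (subject.lower(), relation.lower())
    if key not in kg:
        return Verdict.UNKNOWN
    return Verdict.SUPPORTED if obj.lower() in kg[key] else Verdict.CONTRADICTED

claims = [
    ("Albert Einstein", "nobel prize awarded for", "relativity"),
    ("Albert Einstein", "born in", "Ulm"),
    ("Albert Einstein", "favourite colour", "blue"),
]
for triple in claims:
    print(triple, check_triple(*triple).value)
```

In a real system the hard parts are triple extraction and entity linking; the lookup itself is the easy step.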

5.6 Code execution

For code generation:

Run the generated code with test cases.
Failure to execute or wrong output → hallucination.
Static analysis: does the imported function actually exist? Correct signature?

This is the cleanest verification path — truly verifiable. The reason verifiable-reward RL works on code.
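
The static-analysis check can be sketched with the standard library: parse the generated code and try to resolve every import. This is an illustrative sketch, not a full linter. Resolving an import executes that module's top-level code, so run it in a sandbox; it also does not check call signatures.

```python
import ast
import importlib

def find_missing_imports(source: str) -> list[str]:
    """Return 'module' or 'module.name' entries that the generated code imports
    but that cannot be resolved in the current environment."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                try:
                    importlib.import_module(alias.name)
                except ImportError:
                    missing.append(alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            try:
                module = importlib.import_module(node.module)
            except ImportError:
                missing.append(node.module)
                continue
            for alias in node.names:
                # Limitation: a submodule that has not been loaded yet is not an
                # attribute of its package, so it can be reported as missing.
                if alias.name != "*" and not hasattr(module, alias.name):
                    missing.append(f"{node.module}.{alias.name}")
    return missing

generated = (
    "import math\n"
    "from math import sqrt, fast_inverse_sqrt\n"   # second name does not exist
    "import totally_made_up_pkg\n"                 # hypothetical hallucinated package
)
print(find_missing_imports(generated))
```

This catches hallucinated packages and functions before any test is run, which is cheaper than executing the code.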

6. Reference-free detection (no ground truth)

The harder case: production deployments often don't have ground truth. Six major techniques.

6.1 Self-consistency (SelfCheckGPT — Manakul et al. 2023)

Idea: if the model is confident in a fact, it'll produce the same fact across multiple stochastic generations. Hallucinations vary because they're sampled from the model's "confused" probability distribution.

1. Generate the original response with temperature 0 (or low).
2. Generate K=5 additional responses with temperature ~1.0 (high diversity).
3. For each sentence/claim S in the original:
    Compute consistency score: how many of the K samples support S?
    Low consistency → likely hallucination.

Consistency scoring options:

NLI-based: NLI(sample_k, original_sentence) for each k; aggregate.
QA-based: ask same question to each sample; see if answers agree.
N-gram overlap: simple but noisy.
LLM-judge: ask another LLM to compare for consistency.

Strengths: no ground truth needed; works across domains; intuitive.

Weaknesses: multiplies inference cost by K; if the model is confidently wrong (memorized misinformation), all samples will agree — false negative.

Production note: SelfCheckGPT became the default reference-free baseline. Many production systems use a cheaper variant: K=3 with NLI scoring.

6.2 Token-level uncertainty signals

The model's own probabilities at generation time reveal uncertainty.

Mean token log-prob: average over generated tokens. Low → model was uncertain.
Min token log-prob: weakest token in the chain. A single very-low-prob token can flag a hallucinated entity.
Token entropy: full distribution entropy at each step.
Perplexity (inverse of the geometric mean of token probs, i.e. exp of the negative mean log-prob). High → model was uncertain.

Strengths: free (you already have logits during generation); fast.

Weaknesses:

Calibration is unreliable post-RLHF (the model is more confident on hallucinated entities than on rare-but-true ones).
Some hallucinations are high-probability (model is confidently wrong).
Doesn't localize the hallucination cleanly.

In practice: use as a feature in a learned classifier, not as a standalone signal.

6.3 Semantic entropy (Farquhar et al. 2024 — Nature)

A major 2024 advance. The key insight: token-level uncertainty is misleading because different token sequences can mean the same thing. The model can be split between "Paris is the capital of France" and "The capital of France is Paris" — high token-level entropy but zero meaning entropy. Conversely, the model can be split between "Einstein" and "Newton" with low token-level entropy (the names are short) but high meaning entropy.

Algorithm:

1. Sample K=10 responses for the same prompt.
2. Cluster them by semantic equivalence using NLI (bidirectional entailment).
3. Compute entropy over clusters (not over tokens).
   semantic_entropy = -sum(p_cluster * log(p_cluster))
4. High semantic entropy → model is uncertain about meaning → likely hallucination.

Why it works: meaning-level entropy correlates with truth far better than token-level entropy. The Nature paper reported that semantic entropy outperformed the token-level entropy and self-evaluation baselines it was compared against, across several QA domains.

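Cost: K samples plus the NLI calls for clustering. Each equivalence check needs entailment in both directions, so greedy clustering takes about 2(K-1) NLI calls when all samples agree and up to K(K-1) when every sample lands in its own cluster. Comparable to SelfCheckGPT.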

This is the must-know 2024 method. Mention it by name in interviews.

6.4 LLM-as-judge with chain-of-verification (CoVe — Dhuliawala et al. 2023)

Idea: the model itself can detect its own hallucinations if prompted correctly.

1. Generate a draft response.
2. Generate verification questions: "What facts in this response need checking?"
3. For each question, ask the model independently (without the draft as context).
4. Compare draft answers to fresh answers; flag inconsistencies.
5. Generate final response that incorporates corrections.

Strengths: requires only the model itself; can both detect and correct in one pass.

Weaknesses: ~5× the cost of single generation; depends on the model's self-judge ability (frontier models are okay at this, smaller models are unreliable).

Used in production for high-stakes responses where compute budget allows.

6.5 Verifier models

Train a separate classifier to predict "is this output a hallucination?" given the prompt + response.

Inputs: (prompt, response, optional retrieved context).
Outputs: binary (hallucination or not) or per-sentence scores.
Training data: human-labeled hallucination examples (HaluEval, FactScore).

Production examples:

Vectara HHEM (Hughes Hallucination Evaluation Model): widely used, public.
Honest LLM judge (Lin et al. 2024): smaller LLM trained specifically as hallucination detector.
NLI-based commercial offerings: Patronus AI, Galileo, etc.

Strengths: fast at inference (single classifier pass); can be domain-specialized.

Weaknesses: needs labeled training data; quality depends on annotation; out-of-distribution test prompts may fool the verifier.

6.6 Ensemble disagreement

Run multiple LLMs (or one LLM with different prompts/temperatures) on the same query; check agreement.

Strengths: simple; catches systematic biases of any single model.

Weaknesses: expensive; correlated errors (if all models share training data biases, they all hallucinate similarly).
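
A minimal sketch of the agreement check for short answers, assuming the answers have already been collected from each model or prompt variant. The exact-match normalization is a simplification; free-form answers would need semantic clustering (for example with NLI) instead.

```python
from collections import Counter

def ensemble_agreement(answers, normalize=lambda a: a.strip().lower()):
    """Return (majority_answer, agreement_fraction) over a list of answers."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)

# Hypothetical answers from three models to "What is the capital of Australia?"
answers = ["Canberra", "canberra ", "Sydney"]
majority, agreement = ensemble_agreement(answers)
print(majority, round(agreement, 2))
```

A low agreement fraction is the signal to escalate; a high one is weaker evidence than it looks when the models share training data.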

7. Internal-states-based detection (frontier methods)

Use the LLM's own hidden states or attention patterns to predict hallucinations. Faster than reference-free methods at inference (single forward pass), and surprisingly effective.

7.1 Truth probes (Burns et al. 2022 / "Discovering Latent Knowledge")

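Idea: LLMs internally "know" when they're uncertain: there's a direction in activation space that distinguishes truthful from untruthful claims. Burns et al. recover such a direction without labels (contrast-consistent search); the algorithm below is the simpler supervised variant.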

Algorithm:

Collect a dataset of (statement, label∈{true, false}) pairs.
For each statement, run through the LLM and collect activations at a chosen layer.
Train a linear probe (logistic regression) to predict the truth label from activations.
At inference: extract activations from the model's response, apply the probe.

Findings: linear probes on middle layers (e.g., layer 16 of a 32-layer model) often achieve 80-90% accuracy on truth classification — the model internally represents truth even when it generates a falsehood.

Strengths: very cheap at inference (one extra dot product); no extra LLM calls.

Weaknesses: requires labeled training data; probe transfers imperfectly across domains; needs activation access (white-box).

7.2 INSIDE / activation-based hallucination scores

Several papers (INSIDE, EigenScore, SAPLMA — 2023-2024) use functions of internal activations:

EigenScore: spread of representations across multiple samples (sampled responses' activations).
SAPLMA: train a small MLP on activations to predict factuality.
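INSIDE (Chen et al. 2024): the framework that proposes EigenScore, computed from the eigenvalues of the covariance of the sampled responses' internal embeddings; it also clips extreme activations at test time.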

All exploit the observation that the model's internal "uncertainty" is a stronger signal than its output probability distribution (which RLHF corrupts).

7.3 Attention pattern analysis

Hallucinated content correlates with attention-pattern abnormalities:

Attention heads that usually focus on retrieved context spread their attention more uniformly when hallucinating.
Specific "factuality heads" identified in some models.

Used for diagnosis more than production detection.

7.4 Activation steering

Beyond detection: at generation time, add a "truthful" direction to the residual stream (the difference between truthful and untruthful internal representations). This pushes the model's distribution toward truthful outputs. Explored in academic work (e.g., inference-time intervention, Li et al. 2023) and in alignment research at Anthropic and OpenAI.

7.5 Code snippets — major detectors (whiteboardable in 5-10 min each)

You'll be asked to implement these. Below are minimal idiomatic versions.

SelfCheckGPT (NLI variant)

import re

def split_into_sentences(text):
    """Naive regex splitter; swap in a proper sentence tokenizer in production."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def selfcheck_nli(question, original, llm, nli, K=5, T=1.0):
    """
    Sample K alternative responses once, then score each sentence of `original`
    by the fraction of samples that entail it.
    Returns: list of (sentence, support_fraction). Low fraction → likely hallucination.
    """
    samples = [llm(question, temperature=T) for _ in range(K)]   # K diverse responses
    scores = []
    for sent in split_into_sentences(original):
        ent_count = sum(
            1 for s in samples
            if nli(premise=s, hypothesis=sent) == "entailment"
        )
        scores.append((sent, ent_count / K))
    return scores

Production cost: K × (LLM generation) + |sentences| × K × (NLI call). Typically 5-6× single generation.

Semantic entropy (Farquhar et al. 2024)

import numpy as np

def semantic_entropy(question, llm, nli, K=10, T=1.0):
    """
    1) Sample K responses.
    2) Cluster by bidirectional NLI entailment (semantic equivalence).
    3) Entropy over cluster sizes.
    Returns: scalar entropy. High → uncertain about meaning → likely hallucination.
    """
    samples = [llm(question, temperature=T) for _ in range(K)]

    # Cluster by bidirectional NLI entailment
    clusters = []                                                # list of lists of sample indices
    for i, s in enumerate(samples):
        placed = False
        for c in clusters:
            rep = samples[c[0]]
            # bidirectional entailment = "same meaning"
            if (nli(rep, s) == "entailment" and
                nli(s, rep) == "entailment"):
                c.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])

    # Entropy over cluster probabilities
    sizes = np.array([len(c) for c in clusters], dtype=float)
    p = sizes / sizes.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

What to say while coding: "K samples, pairwise bidirectional-NLI clustering, entropy over cluster sizes. Captures meaning-level uncertainty — different token sequences with the same meaning don't add to the entropy."

NLI-based faithfulness check (RAG)

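def chunks_of(text, max_tokens=512, overlap=64):
    """Overlapping windows over whitespace tokens (a stand-in for the NLI model's tokenizer)."""
    tokens = text.split()
    step = max_tokens - overlap
    for start in range(0, max(len(tokens), 1), step):
        yield " ".join(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break

def faithfulness_score(claims, context, nli):
    """
    For each atomic claim in the response, check if `context` entails it.
    Returns: fraction supported (1.0 when there are no claims to check).
    """
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        # Sliding window over context to handle long passages
        for chunk in chunks_of(context, max_tokens=512, overlap=64):
            if nli(premise=chunk, hypothesis=claim) == "entailment":
                supported += 1
                break
    return supported / len(claims)

def extract_claims(response, llm):
    """LLM-prompted decomposition into atomic factual claims."""
    raw = llm(f"List the atomic factual claims in this text, one per line:\n{response}")
    # Drop blank lines so they are not counted as unsupported claims
    return [line.strip() for line in raw.split("\n") if line.strip()]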

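This mirrors the RAGAS faithfulness recipe (claim extraction followed by per-claim verification), here with an NLI model as the verifier.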

Token-level uncertainty (cheap baseline)

import numpy as np

def token_uncertainty(prompt, response, llm_with_logits):
    """Mean and min log-prob across the response tokens."""
    logprobs = llm_with_logits(prompt, response)        # [L] per-token log-probs of the response under the model
    return {
        "mean_logprob": float(np.mean(logprobs)),
        "min_logprob":  float(np.min(logprobs)),
        "perplexity":   float(np.exp(-np.mean(logprobs))),
    }

Cheap (you already get logits during generation). Often used as a feature in a learned classifier alongside other signals.

Chain-of-Verification (CoVe)

def chain_of_verification(question, llm):
    """
    Draft → verification questions → fresh answers → reconcile → final.
    """
    # Step 1: baseline draft
    draft = llm(f"Answer the question:\n{question}")

    # Step 2: generate verification questions
    qs = llm(
        f"List independent verification questions for the facts in this draft:\n{draft}"
    ).split("\n")

    # Step 3: answer each verification question independently (no draft as context)
    fresh = {q: llm(f"Answer concisely: {q}") for q in qs if q.strip()}

    # Step 4: revise the draft using the fresh answers
    final = llm(
        f"Original question: {question}\n"
        f"Initial draft: {draft}\n"
        f"Verification answers: {fresh}\n"
        f"Produce a final answer that corrects any inconsistencies and "
        f"acknowledges uncertainty where the verification doesn't support the draft."
    )
    return final

Cost: 3 LLM calls plus one per verification question, so about 5 or more per query. Best for high-stakes long-form generation.

Truth probe (linear probe on activations)

import numpy as np
from sklearn.linear_model import LogisticRegression

def last_token_activation(model, text, layer):
    """Hidden state at the final token of `text`, as a 1-D NumPy array.
    Assumes `model.forward_hidden` returns a torch tensor of shape [seq_len, d]."""
    h = model.forward_hidden(text, layer=layer)[-1]          # [d]
    return h.detach().cpu().numpy()

def train_truth_probe(model, statements_with_labels, layer=16):
    """
    statements_with_labels: list of (text, label∈{0,1}) where 1=true, 0=false.
    Returns: a logistic-regression probe on activations from `layer`.
    """
    X = np.stack([last_token_activation(model, text, layer)
                  for text, _ in statements_with_labels])
    y = np.array([label for _, label in statements_with_labels])
    return LogisticRegression(max_iter=1000).fit(X, y)

def score_truth(probe, model, text, layer=16):
    """Apply the probe to a new statement; returns p(true)."""
    h = last_token_activation(model, text, layer)
    return float(probe.predict_proba(h.reshape(1, -1))[0, 1])

Cost at inference: one forward pass + one dot product. Very cheap once trained. Often achieves 80-90% truth-classification on benchmarks.

Citation faithfulness check (per-claim)

import re

def verify_citations(response, citations, nli):
    """
    response: str with inline citations like "[1]", "[2]".
    citations: dict[citation_id → cited_passage], keyed by the id as a string ("1", "2", ...).
    Returns: list of (citation_id, claim_around_citation, supported∈bool).
    """
    out = []
    for cite_id in re.findall(r"\[(\d+)\]", response):
        # Sentence containing the citation = the claim being attributed
        # (`sentence_containing_citation` is an assumed helper)
        claim = sentence_containing_citation(response, cite_id)
        passage = citations.get(cite_id)
        # A citation with no matching passage fails the existence check outright
        supported = (passage is not None and
                     nli(premise=passage, hypothesis=claim) == "entailment")
        out.append((cite_id, claim, supported))
    return out

Frontier RAG metric: citation faithfulness = fraction of citations that actually support their claim.

Putting it all together — production cascade

def detect_hallucination(query, response, context=None, llm=None, nli=None,
                         llm_with_logits=None, fast_threshold=0.3, escalate_threshold=0.6):
    """
    Cascade: cheap signals first; escalate to expensive ones if uncertain.
    Returns: dict with a verdict and per-stage scores.
    """
    scores = {}

    # Stage 1 (cheap): token-level signal if a log-prob scorer is available.
    # Recorded as a feature only; it does not decide the verdict here.
    if llm_with_logits is not None:
        scores["token_unc"] = token_uncertainty(query, response, llm_with_logits)

    # Stage 2 (medium): NLI vs context for RAG
    if context is not None and llm is not None and nli is not None:
        claims = extract_claims(response, llm)
        scores["faithfulness"] = faithfulness_score(claims, context, nli)
        # Confident clean → return
        if scores["faithfulness"] > 0.95:
            return {"verdict": "pass", "scores": scores}
        if scores["faithfulness"] < fast_threshold:
            return {"verdict": "fail", "scores": scores}

    # Stage 3 (expensive): semantic entropy needs both a sampler and an NLI model
    if llm is None or nli is None:
        return {"verdict": "unverified", "scores": scores}
    scores["sem_entropy"] = semantic_entropy(query, llm, nli, K=10)
    if scores["sem_entropy"] > escalate_threshold:
        return {"verdict": "escalate_to_human", "scores": scores}

    return {"verdict": "pass", "scores": scores}

This is a common production pattern rather than a fixed recipe. Tune thresholds per domain.

8. RAG-specific hallucination detection

When the model has retrieved context, faithfulness to that context is the primary check. Different from general factuality.

8.1 Faithfulness vs factuality (the key distinction)

Faithfulness: response is supported by retrieved context. (Even if the context is wrong, a faithful response is one that doesn't add unsupported claims.)
Factuality: response is true in the real world.

Frontier-lab interview probe: "Can a faithful response be wrong?" Yes — if retrieved context is wrong, faithful response inherits the error. Faithfulness is the ML problem; factuality is the data problem.

8.2 RAGAS metrics

The standard framework (Es et al. 2023) has four metrics:

Metric	What it measures	How
Faithfulness	Response supported by context?	Extract claims from response; verify each via NLI/LLM-judge against context.
Answer relevance	Response addresses the question?	Generate questions from response; compare similarity to original question.
Context precision	Were retrieved chunks relevant?	LLM-judge each chunk for relevance to question.
Context recall	Did retrieval find all needed info?	Compare retrieved context to a gold answer.

In production: faithfulness is the most-monitored. Context precision/recall are diagnostics.

8.3 Citation correctness

Modern RAG systems should cite. Citation correctness has two parts:

Citation existence: does the cited source exist?
Citation faithfulness: does the cited source support the specific claim?

Citation faithfulness is the harder of the two. Implementation: for each (claim, citation) pair, retrieve the cited passage, run NLI to verify entailment.

Empirical (rough figures that vary by system and benchmark): GPT-4 / Claude RAG outputs have ~70-85% citation faithfulness. Production-grade systems target ≥95%.

8.4 Attribution evaluation (Rashkin et al. 2023, AIS)

A more rigorous framework: Attributable to Identified Sources (AIS). For each claim:

Is the claim interpretable? (Concrete, verifiable.)
Is it attributable to the cited source? (Source supports it.)

Used as the gold standard for evaluating RAG outputs at frontier labs.

9. Benchmarks and datasets

You'll be asked which datasets / benchmarks measure hallucination. Have these ready:

9.1 General factuality

TruthfulQA (Lin et al. 2021): 817 questions designed to elicit common misconceptions ("How do you cause an avalanche?"). Evaluates whether models repeat false-but-popular beliefs.
SimpleQA (OpenAI, 2024): 4,326 short-answer factuality questions, designed to be answerable but adversarial. Most LLMs score 30-60% accuracy.
HaluEval (Li et al. 2023): 35K hallucinated-vs-correct examples for QA, dialogue, summarization.
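FActScore (Min et al. 2023): per-fact factuality scoring for long-form generation; each response is decomposed into atomic facts and the score is the fraction supported by a knowledge source.
FActScore biographies: the original evaluation setting (people biographies checked against Wikipedia), used for fine-grained factuality in long generation.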

9.2 RAG / faithfulness

FEVER (Thorne et al. 2018): claim verification against Wikipedia.
WICE: Wikipedia citation entailment.
RAGTruth (Niu et al. 2024): ~18K hallucinated-vs-faithful RAG outputs across QA / summarization / data2text.
HHEM benchmark (Vectara): hallucination detection evaluation.
FACTS Grounding (Google DeepMind, 2024): a benchmark and leaderboard specifically for grounding/faithfulness evaluation.

9.3 Reasoning / math hallucination

MATH-Verifier: process-level errors in chain-of-thought.
PRM800K (Lightman et al. 2023): step-level annotations for math reasoning.

9.4 Code

HumanEval, MBPP, SWE-Bench: execution-based; "hallucination" = code doesn't pass tests.

9.5 Multilingual / multimodal

MULTIHAL, VLM-Hallucination-Bench: hallucinations beyond English text.

10. Mitigation strategies (the production playbook)

Detection is half the story. Here's what frontier labs deploy.

10.1 Retrieval grounding

Most effective single intervention. Constrain the model to ground its output in retrieved context.

Prompt: "Answer ONLY using information from the context below. If the answer isn't in the context, say 'I don't know'."
Combined with citation requirement: forces explicit attribution.
Reduces hallucination rate by ~50-80% empirically.
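
A minimal sketch of assembling such a prompt, with numbered passages so the citation requirement is checkable afterwards. The function name and exact wording are illustrative, not a standard API.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a context-only prompt with numbered passages for citation."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer ONLY using information from the context below. "
        "Cite the passage number, e.g. [1], after each claim. "
        "If the answer isn't in the context, say 'I don't know'.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_grounded_prompt(
    "How much did sales rise?",
    ["Sales rose 10% year over year.", "Headcount was flat."],
))
```

Numbering the passages in the prompt is what lets a downstream citation check map "[1]" back to the exact text the model was shown.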

10.2 Refusal training

Train the model to say "I don't know" when uncertain. RLHF with explicit reward for refusing hard questions.

Drawback: too aggressive refusal hurts UX; tuning is hard.
Modern approach: calibrated refusal — the model refuses only when its calibrated confidence is below threshold.

10.3 Constitutional / honesty principles

Prompt the model with explicit honesty constraints:

"Acknowledge uncertainty when present."
"Don't fabricate citations."
"If asked about events after [cutoff], note your knowledge limits."

Augmented with constitutional-AI-style critique-and-revise loops.

10.4 Chain-of-Verification (CoVe)

§6.4 above. Effective but expensive (~5×). Used selectively for high-stakes outputs.

10.5 Conservative decoding

Temperature 0 or low.
Top-p narrow.
Deterministic for factual queries; stochastic for creative ones.

10.6 Calibration

Post-hoc calibration of token-level probabilities so they actually mean what they say. Platt scaling on a held-out set; updates the model's stated confidences. Doesn't reduce hallucinations but makes them flaggable.
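
A minimal sketch of Platt scaling in plain Python: fit a one-dimensional logistic regression that maps a raw confidence score to the probability of being correct. The held-out scores and labels below are made-up toy data, and the plain gradient-descent loop stands in for whatever optimizer a real pipeline would use.

```python
import math

def _sigmoid(z: float) -> float:
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p(correct) = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y     # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated_confidence(score: float, a: float, b: float) -> float:
    return _sigmoid(a * score + b)

# Toy held-out set: raw confidence scores and whether the answer was correct (1) or not (0).
held_out_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
held_out_labels = [0, 0, 1, 0, 1, 1, 0, 1]

a, b = fit_platt(held_out_scores, held_out_labels)
for s in (0.5, 2.0, 4.0):
    print(s, round(calibrated_confidence(s, a, b), 3))
```

After fitting, the mean calibrated confidence on the held-out set matches the observed accuracy, which is what makes a fixed threshold on it meaningful.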

10.7 Tool use / verifiable execution

For computable claims (math, code, data lookups): outsource to a tool. The tool either succeeds or fails. Hallucination rate ≈ 0 for the tool-handled portion.

10.8 Honest-trained models

Models trained with honesty as an explicit objective: Anthropic Claude (constitutional AI), OpenAI's o-series models (deliberative alignment).

Empirical claims: o1 reportedly hallucinates less because the long reasoning chain catches its own errors. Mileage varies.

10.9 Rejection sampling

Generate K candidates; verify each with a hallucination detector; return the highest-scoring or refuse if all fail. Best-of-N for factuality.
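
A minimal sketch of best-of-N with refusal. `generate` and `support_score` are hypothetical callables: the first samples one candidate, the second returns a detector score in [0, 1] where higher means better supported. The canned candidates only exist to make the demo runnable.

```python
from typing import Callable, Optional

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    support_score: Callable[[str, str], float],
    n: int = 4,
    min_score: float = 0.8,
) -> Optional[str]:
    """Sample n candidates, score each, return the best one or None (refuse)."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(support_score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= min_score else None

# Demo with canned candidates and scores standing in for a model and a detector.
canned = iter(["Sales fell 10%.", "Sales rose 10%.", "Sales rose 12%."])
toy_scores = {"Sales fell 10%.": 0.1, "Sales rose 10%.": 0.95, "Sales rose 12%.": 0.3}

answer = best_of_n(
    "How did sales change?",
    generate=lambda p: next(canned),
    support_score=lambda p, c: toy_scores[c],
    n=3,
)
print(answer)
```

Returning `None` when no candidate clears `min_score` is the refusal path; the caller decides whether to say "I don't know" or escalate.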

11. Production system design — how to deploy hallucination detection

The interview question: "Design a hallucination-detection system for a production LLM application."

11.1 Where in the pipeline?

user query
  │
  ▼
[generate response]
  │
  ▼
[hallucination detector]
  ├─ confident clean → return
  ├─ borderline      → re-generate / verify / refuse
  └─ confident bad   → block + escalate
  │
  ▼
[post-hoc logging for monitoring]

11.2 The detector stack

Typically a cascade:

fast cheap detectors (token-level uncertainty, small classifier)
  │  if uncertain
  ▼
medium cost (NLI vs retrieved context, citation check)
  │  if still uncertain
  ▼
expensive (semantic entropy, LLM-as-judge, CoVe)
  │  if still uncertain
  ▼
human review (for high-stakes domains)

The latency budget determines how far down the cascade you can afford to go: roughly 100 ms for chat, up to 30 s for background research.
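
The cascade can be sketched as a loop that stops as soon as a stage is confident or the latency budget runs out. Everything here is an assumption for illustration: the `Stage` structure, the per-stage cost estimates, and the 0.2 / 0.8 confidence bands.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Stage:
    name: str
    cost_ms: float                  # assumed typical latency of this stage
    detect: Callable[[str], float]  # returns an estimated P(hallucination)


def run_cascade(
    response: str,
    stages: Sequence[Stage],
    budget_ms: float,
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[str, Optional[float], str]:
    """Run stages cheapest-first; return (verdict, last score, last stage run)."""
    spent = 0.0
    score: Optional[float] = None
    last = "none"
    for stage in stages:
        if spent + stage.cost_ms > budget_ms:
            break  # cannot afford the next tier
        spent += stage.cost_ms
        score = stage.detect(response)
        last = stage.name
        if score < low:
            return ("clean", score, last)
        if score >= high:
            return ("hallucinated", score, last)
    return ("uncertain", score, last)
```

An "uncertain" verdict is what gets routed to the next fallback (re-generation, refusal, or human review in high-stakes domains).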

11.3 RAG-specific detector

generated response
  │
  ▼
[claim extraction] — split response into atomic claims
  │
  ▼
[citation verification] — does each citation actually entail the claim?
  ├─ NLI model
  └─ optional LLM judge for borderline
  │
  ▼
[unsupported-claim detection] — claims without citations or with weak citations
  │
  ▼
[score per claim + aggregate]
  │
  ▼
[action: pass / regenerate / refuse]
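
The pipeline above can be sketched end to end with stand-ins for the model calls. This is a toy under loud assumptions: `extract_claims` just splits sentences, and `toy_support_score` is a token-overlap placeholder where a real system would call an NLI model or LLM judge. All function names and thresholds are hypothetical.

```python
import re
from typing import Dict, List, Sequence


def extract_claims(response: str) -> List[str]:
    """Placeholder claim extraction: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def toy_support_score(claim: str, passage: str) -> float:
    """Stand-in for an NLI model: fraction of claim tokens found in the passage.

    This is NOT real entailment; it only keeps the sketch self-contained.
    """
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    passage_tokens = set(re.findall(r"\w+", passage.lower()))
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & passage_tokens) / len(claim_tokens)


def score_response(
    response: str,
    passages: Sequence[str],
    support_threshold: float = 0.9,
    pass_fraction: float = 1.0,
) -> Dict[str, object]:
    """Score each claim against its best-supporting passage, then aggregate."""
    per_claim = []
    for claim in extract_claims(response):
        best = max((toy_support_score(claim, p) for p in passages), default=0.0)
        per_claim.append(
            {"claim": claim, "score": best, "supported": best >= support_threshold}
        )
    n_supported = sum(1 for c in per_claim if c["supported"])
    faithfulness = n_supported / len(per_claim) if per_claim else 0.0
    action = "pass" if faithfulness >= pass_fraction else "regenerate"
    return {"claims": per_claim, "faithfulness": faithfulness, "action": action}
```

Keeping per-claim scores in the result (rather than only the aggregate) is what lets the UI highlight the specific unsupported sentence or trigger a targeted re-generation.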

11.4 Domain-specific layers

For high-stakes domains, add domain detectors:

Medical: drug-name lookup against drug databases.
Legal: citation verification against legal corpus.
Finance: numerical consistency check (does the claimed % match the underlying data?).
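
As one example of a domain layer, the finance check can be a few lines of deterministic code. This is a sketch: `check_percent_claim` and the half-percentage-point tolerance are illustrative assumptions, and extracting the claimed number from text is out of scope here.

```python
def percent_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    if old == 0:
        raise ValueError("percent change is undefined when the base value is 0")
    return (new - old) / abs(old) * 100.0


def check_percent_claim(
    claimed_pct: float, old: float, new: float, tol_pct_points: float = 0.5
) -> bool:
    """True if the claimed percentage matches the underlying data within tolerance."""
    return abs(percent_change(old, new) - claimed_pct) <= tol_pct_points
```

For example, a generated sentence claiming "revenue grew 25%" would fail this check if the underlying figures were 200 and 230 (an actual change of 15%).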

11.5 Online metrics

Track in production:

Estimated hallucination rate (from sampled audits + verifier).
Refusal rate (high refusal = over-cautious).
User report rate ("this answer was wrong" buttons).
Per-domain breakdown (medical hallucination ≠ general hallucination).
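
Because the hallucination rate is estimated from sampled audits, it helps to report an interval rather than a point estimate. A minimal sketch using the Wilson score interval follows. The function name is an assumption, and `z=1.96` corresponds to roughly 95% confidence.

```python
import math
from typing import Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a rate, given k flagged items out of n audited."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

Computing this per domain makes it obvious when a slice (say, medical queries) has too few audited samples to support any claim about its rate.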

11.6 Feedback loop

User reports → labeled examples → retrain verifier model. The detection system improves over time as it learns from real failures.

12. Evaluation methodology — how to measure the detector itself

A subtle interview probe: "How do you know your hallucination detector works?"

12.1 Ground-truth annotation challenges

Inter-annotator agreement on hallucination labels is often low. Different humans disagree on whether a claim is "supported."
Granularity matters: per-sentence, per-claim, per-response — the same response can have different scores at different granularities.
Domain expertise needed: medical hallucinations require doctors to label.

12.2 Metrics for the detector

Precision: of flagged hallucinations, what fraction are real? (High → few false alarms.)
Recall: of actual hallucinations, what fraction did we catch?
AUPRC: area under the precision-recall curve. Standard for imbalanced detection.
Per-severity: don't average across critical and cosmetic; report separately.
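
These metrics are simple to compute from labeled audit data. The sketch below assumes label 1 means "actually hallucinated" and uses average precision as a step-wise estimate of AUPRC. It does not handle tied scores specially. Function names are illustrative.

```python
from typing import Sequence, Tuple


def precision_recall(labels: Sequence[int], flags: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of binary flags against 0/1 hallucination labels."""
    tp = sum(1 for y, f in zip(labels, flags) if y == 1 and f)
    fp = sum(1 for y, f in zip(labels, flags) if y == 0 and f)
    fn = sum(1 for y, f in zip(labels, flags) if y == 1 and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@rank over the ranks where a true hallucination appears."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    n_pos = sum(labels)
    return total / n_pos if n_pos else 0.0
```

To report per-severity numbers, run the same functions on each severity slice separately instead of on the pooled data.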

12.3 Cost-aware evaluation

In production, false alarms (legitimate responses flagged as hallucinated) are costly: they trigger expensive re-generation or wrong refusals. Missed hallucinations have their own cost. Choose the detector threshold τ that minimizes the total weighted cost:

$\tau^{*} = \arg\min_{\tau} \left[ c_{FN} \cdot FN(\tau) + c_{FP} \cdot FP(\tau) \right]$

Often very different from accuracy-optimal threshold.
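
A small sketch of the threshold sweep, minimizing c_FN · FN(τ) + c_FP · FP(τ) over candidate thresholds on labeled data. It assumes a response is flagged when its detector score is at or above τ, and that label 1 means "actually hallucinated". The name `best_threshold` is illustrative.

```python
from typing import Sequence, Tuple


def best_threshold(
    labels: Sequence[int],
    scores: Sequence[float],
    c_fn: float,
    c_fp: float,
) -> Tuple[float, float]:
    """Return (threshold, cost) minimizing c_fn * FN + c_fp * FP."""
    # Candidate thresholds: every observed score, plus "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_cost = candidates[0], float("inf")
    for t in candidates:
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        cost = c_fn * fn + c_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

Running the sweep with different cost ratios on the same data shows how the chosen threshold moves: when false alarms are weighted more heavily the threshold rises, and when misses are weighted more heavily it falls.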

12.4 Calibration of the detector

The detector outputs a confidence score. Is it calibrated?

Reliability diagram of detector confidence vs realized error rate.
ECE (expected calibration error).

A well-calibrated detector enables risk-based decisions: "if confidence < 0.8, refuse; else return."
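
ECE can be computed with equal-width bins, which is also the data behind a reliability diagram. In this sketch, `confidences` are the detector's confidence that its verdict is right and `correct` marks whether the verdict matched the human label. The function name and the 10-bin default are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean of |avg confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A detector that reports 0.9 confidence and is right 90% of the time contributes nothing to ECE; one that reports 0.9 and is right 60% of the time contributes a gap of 0.3 for that bin.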

13. Common interview gotchas

Question	Strong answer
"Why does RLHF sometimes increase hallucinations?"	It rewards confident-sounding outputs; humans prefer them; the model learns to rarely say "I don't know" → confident wrongness.
"Is a true-but-unsupported claim a hallucination?"	Depends on application: yes for RAG (faithfulness criterion), no for general QA (factuality criterion). Distinction matters.
"Can the model always detect its own hallucinations?"	Sometimes — it has internal uncertainty signals (truth probes, semantic entropy). But for confidently-wrong outputs (memorized misinformation), no — the model is internally certain.
"Why is token-level entropy a weak signal?"	Different token sequences can mean the same thing. Semantic entropy aggregates by meaning, not tokens — much stronger signal (Farquhar et al. 2024).
"What's intrinsic vs extrinsic hallucination?"	Intrinsic = contradicts source. Extrinsic = unsupported by source but not contradicted. Extrinsic is harder to detect because source doesn't contradict it.
"How would you build a hallucination detector from scratch?"	Cascade: fast token-level signal → NLI vs context (if RAG) or self-consistency → LLM-as-judge → human review. Budget by latency / domain.
"RAGAS faithfulness — how is it computed?"	Extract atomic claims from response → for each, verify entailment vs retrieved context (NLI or LLM-judge) → fraction supported = faithfulness score.
"What's semantic entropy?"	Sample K responses, cluster by NLI-based meaning equivalence, compute entropy over clusters. High → uncertain about meaning → likely hallucination. (Farquhar et al. 2024 Nature.)
"What's CoVe?"	Chain-of-Verification: generate draft → generate verification questions → answer them independently → fix inconsistencies → emit final. Reduces hallucinations ~30-50% on factual long-form.
"Why does the model hallucinate citations?"	Pretraining sees citations as a textual pattern (X et al., year). The model learned the form but not the truth-binding — when asked to cite, it produces well-formed but invented references.

14. The 12 most-asked hallucination interview questions

(Summary; the full 50-question grill follows in §15.)

Define hallucination precisely. Content unsupported by relevant ground truth. Distinguish factual / faithfulness / source / logical / self-contradictory.
Why do LLMs hallucinate? 5 reasons: training objective, coverage gaps, RLHF, sampling, compounding errors.
Walk me through reference-based detection methods. String overlap, NLI, QA-based, citation verification, KG matching, code execution.
Walk me through reference-free methods. Self-consistency (SelfCheckGPT), token-level uncertainty, semantic entropy, LLM-as-judge, verifier models.
Walk me through internal-states-based detection. Truth probes, EigenScore, SAPLMA, attention patterns, activation steering.
What's semantic entropy? Sample K → cluster by meaning → entropy over clusters.
What's CoVe? Chain-of-Verification — generate, verify, correct, emit.
How would you measure faithfulness in a RAG system? RAGAS faithfulness: claim extraction → NLI vs context.
Why does RLHF sometimes increase hallucinations? Rewards confident outputs; humans prefer them; model learns to never say "I don't know."
Production design: detect hallucinations in real-time chat. Cascade: fast token-level → NLI vs context → semantic entropy / LLM-judge → human review.
How do you evaluate a hallucination detector? Precision, recall, AUPRC; calibration; cost-weighted thresholds; per-severity.
What's the difference between faithfulness and factuality? Faithful = supported by source. Factual = true in the real world. Faithful response can be factually wrong if source is wrong.

15. Interview grill — 50 questions

Drill these. Aim for 35+/50 cold.

A. Definitions

1. Define hallucination. Content unsupported by, or contradicted by, the relevant ground truth.

2. Five hallucination types? Factual, faithfulness, logical, source/citation, self-contradictory.

3. Intrinsic vs extrinsic? Intrinsic = contradicts source. Extrinsic = unsupported by source. Extrinsic is harder to detect.

4. Faithfulness vs factuality? Faithfulness = supported by retrieved source. Factuality = true in the real world. A faithful response inherits errors from a wrong source.

5. Why is "true but unsupported" still a hallucination in RAG? RAG's contract is "ground in retrieved context." Adding unsupported information violates that contract even if true.

B. Causes

6. Why does next-token prediction lead to hallucinations? It rewards plausibility, not truth. Confident-sounding wrong continuations beat "I don't know."

7. Why does RLHF often increase hallucinations? Reward model trained on human preferences; humans prefer confident answers; model learns to never say "I don't know."

8. How does long-context degrade factuality? Lost-in-the-middle. Attention concentrates on edges; mid-context information used unreliably; model "fills in" instead of attending.

9. Why are citations especially likely to be hallucinated? Pretraining sees citations as a textual pattern; model learned form (Author et al., year) but not truth-binding. When asked to cite, produces well-formed but invented references.

10. Why does sampling temperature affect hallucination rate? Higher temperature = wider exploration = more chances to sample low-probability (often wrong) tokens.

C. Reference-based detection

11. NLI-based detection — how? Each generated sentence as hypothesis; source as premise; check entailment.

12. Common NLI models? RoBERTa-MNLI, DeBERTa-v3, SummaC, FactCC.

13. QA-based detection? Generate questions from candidate; answer with source; check candidate's answers match.

14. When does string overlap (BLEU/ROUGE) fail for hallucination detection? Paraphrasing. High overlap doesn't guarantee correctness; low overlap doesn't guarantee error.

15. Citation verification flow? For each (claim, citation) pair: retrieve cited passage; check NLI entailment; flag unsupported.

D. Reference-free detection

16. SelfCheckGPT idea? Sample K (e.g., 5) additional responses to the same prompt at nonzero temperature. Check whether each claim in the original response is consistent with the samples. Claims the samples fail to support are likely hallucinations.

17. SelfCheckGPT cost? ~5-6× single generation. K samples + K-1 NLI/judge calls.

18. Token-level uncertainty signals? Mean log-prob, min log-prob, entropy, perplexity.

19. Why is token-level uncertainty unreliable post-RLHF? RLHF makes the model more confident on hallucinated outputs; calibration breaks.

20. What's semantic entropy (Farquhar et al. 2024)? Sample K responses; cluster by NLI-based bidirectional entailment; entropy over clusters.

21. Why does semantic entropy beat token entropy? Different tokens can mean the same; semantic clustering captures meaning equivalence.

22. Cite Farquhar et al. — what venue? Nature 2024.

23. What's Chain-of-Verification (CoVe)? Draft → verification questions → answer independently → fix inconsistencies → final.

24. CoVe cost? ~5× single generation.

25. Verifier model approach? Train classifier on (prompt, response) → hallucination label. Vectara HHEM, Patronus AI, Galileo are examples.
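
The semantic entropy recipe from question 20 can be sketched in a few lines. This is the count-based (discrete) variant over sampled answers. `entails` is a hypothetical callable standing in for an NLI model, and `toy_entails` is a deliberately crude placeholder used only to make the example self-contained.

```python
import math
import re
from typing import Callable, List, Sequence


def cluster_by_meaning(
    samples: Sequence[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedy clustering: join a cluster if entailment holds in both directions."""
    clusters: List[List[str]] = []
    for s in samples:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if entails(s, rep) and entails(rep, s):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def semantic_entropy(samples: Sequence[str], entails: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    clusters = cluster_by_meaning(samples, entails)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_entails(a: str, b: str) -> bool:
    """Placeholder for an NLI model: compares only the last word of each answer."""
    def last_word(s: str) -> str:
        words = re.findall(r"\w+", s.lower())
        return words[-1] if words else ""
    return last_word(a) == last_word(b)
```

If all K samples land in one cluster the entropy is zero (the model keeps saying the same thing in different words). If every sample lands in its own cluster the entropy is log K, the signal that the model is guessing.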

E. Internal-states-based

26. What's a truth probe? Linear probe on internal activations trained to predict true vs false. Often achieves 80-90% accuracy at middle layers.

27. Why do truth probes work even when output is wrong? The model "internally knows" — uncertainty is encoded in activations even when softmax produces a confident wrong token.

28. EigenScore? Spread of representations across multiple sampled responses. High spread → uncertain → hallucinatory.

29. SAPLMA? Train a small MLP on activations to predict factuality.

30. Activation steering for mitigation? Add a "truthful" direction (difference between truthful and untruthful representations) to the residual stream during generation.
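
One very simple linear probe is the difference-of-means direction between activations of true and false statements, which is also the kind of direction question 30 describes adding for steering. The sketch below uses plain lists as stand-ins for hidden-state vectors. The function names are illustrative assumptions, and real probes are usually trained (for example with logistic regression) on activations from a specific layer.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _mean_vector(vectors: Sequence[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def fit_mean_difference_probe(
    true_acts: Sequence[Vector], false_acts: Sequence[Vector]
) -> Tuple[List[float], float]:
    """Return (direction, bias) with the decision boundary at the class midpoint."""
    mu_true = _mean_vector(true_acts)
    mu_false = _mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -_dot(direction, midpoint)
    return direction, bias


def probe_predicts_true(act: Vector, direction: Vector, bias: float) -> bool:
    """True if the activation falls on the 'true statement' side of the probe."""
    return _dot(act, direction) + bias > 0
```

The same `direction` vector, scaled and added to the residual stream, is the basic idea behind steering. Whether that helps in practice depends on the model and layer.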

F. RAG-specific

31. RAGAS faithfulness? Extract claims; for each, NLI/judge entailment vs retrieved context; fraction supported.

32. RAGAS context precision? Of retrieved chunks, what fraction are actually relevant to the question?

33. RAGAS context recall? Did the retrieval find all info needed for a gold answer?

34. Citation faithfulness vs citation existence? Existence: does the cited source exist? Faithfulness: does the source support the claim? Faithfulness is the harder problem.

35. Empirical citation faithfulness rate of frontier RAG? Around 70-85% for vanilla GPT-4/Claude. Production-grade systems target ≥95%.

G. Benchmarks

36. TruthfulQA? 817 questions designed to elicit common misconceptions.

37. SimpleQA? OpenAI 2024. 4326 short-answer factuality questions. Most LLMs score 30-60%.

38. HaluEval? 35K hallucinated-vs-correct examples for QA, dialogue, summarization.

39. FactScore? Per-fact factuality scoring for long-form generation.

40. RAGTruth? ~18K hallucinated vs faithful RAG outputs (Niu et al. 2024).

H. Production / system design

41. Hallucination-detection cascade? Fast cheap (token-level, classifier) → medium (NLI vs context) → expensive (semantic entropy, LLM-judge) → human review.

42. Faithfulness vs factuality monitoring in production? Faithfulness directly monitorable from logs; factuality requires human audits or external KB lookups.

43. Cost-weighted detector threshold? $\tau^{*} = \arg\min_{\tau} \left[ c_{FN} \cdot FN(\tau) + c_{FP} \cdot FP(\tau) \right]$. False positives (legit response refused) are often more costly than false negatives in chat UX.

44. Domain-specific layers (medical, legal, finance)? Plug in domain-specific verifiers — drug DB, citation DB, numerical consistency checks.

45. Detector feedback loop? User reports → labeled examples → retrain verifier. Improves with deployment.

I. Mitigations

46. Most effective single mitigation? Retrieval grounding with citation requirement. Cuts hallucination rate ~50-80%.

47. Refusal training trade-off? Aggressive refusal hurts UX; calibrated refusal (refuse only below confidence threshold) is better.

48. Best-of-N for factuality? Generate K candidates; rank by hallucination detector; return top.

49. Tool use for hallucination prevention? Outsource computable claims (math, code, lookups) to tools. Tool either succeeds or fails — eliminates hallucination on tool-handled portion.

50. Why is conservative decoding (low temp) only a partial fix? Reduces variance but doesn't fix the core problem: high-probability outputs can be confidently wrong.

16. Quick-fire (single-line answers)

51. NLI model standard? RoBERTa-MNLI, DeBERTa-v3.

52. SelfCheckGPT K typical? 5.

53. Semantic entropy clustering? Bidirectional NLI entailment.

54. CoVe steps? Draft → verify-Qs → fresh-A → reconcile → final.

55. Token-level signal weakness? Calibration breaks post-RLHF.

56. Truth probe accuracy? 80-90% on labeled benchmarks.

57. RAGAS metric count? 4 (faithfulness, answer relevance, context P, context R).

58. Faithfulness vs factuality: which is easier to monitor? Faithfulness (just check vs context).

59. Most-cited hallucination benchmark? TruthfulQA.

60. Most effective mitigation? Retrieval grounding + citation.

17. The senior-level discussion

When the case is winding down, volunteer 1-2 of these unprompted:

The RLHF-honesty paradox: alignment training increases confident wrongness; new techniques (constitutional AI, deliberative alignment, calibrated refusal) try to fix it.
Semantic entropy as the modern reference-free baseline (cite Farquhar Nature 2024).
Truth probes as an internal-states-based alternative: cheaper than sampling-based reference-free methods, but they require white-box access to activations.
Faithfulness vs factuality distinction — most production systems can only measure faithfulness; factuality is a deeper data-quality problem.
Citation faithfulness gap — even GPT-4 gets ~80%; this is the next frontier in RAG quality.
Cascade architecture — never one detector; always a tier of cheap-to-expensive.
Cost-weighted thresholds — false positives often more expensive than false negatives in chat UX.
The fundamental limit — for confidently-memorized misinformation, no method works; data quality matters most.

18. Drill plan

Master the taxonomy (§2): be able to name 5 hallucination types in 30 seconds.
Master the 3 detection families (§4-7): be able to walk through each with one canonical method.
Master semantic entropy: be able to describe the algorithm in 60 seconds.
Master CoVe: same.
Master RAG faithfulness measurement (§8): RAGAS pipeline.
Drill the 50 grill questions; aim for 35+/50 cold.
Practice the system-design answer (§11) in 5 minutes.

19. Further reading

Foundational papers:

Maynez et al. (2020). On Faithfulness and Factuality in Abstractive Summarization — the intrinsic/extrinsic split.
Lin et al. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods.
Manakul et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection.
Farquhar, Kossen, et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature 630, 625-630. (The most cited modern paper.)
Dhuliawala et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models.

RAG-specific:

Es et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation.
Rashkin et al. (2023). Measuring Attribution in Natural Language Generation Models (AIS framework).
Niu et al. (2024). RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models.

Internal-states / probes:

Burns et al. (2022). Discovering Latent Knowledge in Language Models Without Supervision (CCS).
Azaria & Mitchell (2023). The Internal State of an LLM Knows When It's Lying (SAPLMA).
Chen et al. (2024). INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection.

Surveys:

Ji et al. (2023). Survey of Hallucination in Natural Language Generation. — early but useful.
Huang et al. (2023). A Survey on Hallucination in Large Language Models. — modern.

Production / industry:

Vectara HHEM (Hughes Hallucination Evaluation Model) — public eval.
FACTS Grounding (Google DeepMind, 2024) — leaderboard.
Patronus AI, Galileo, Arize evaluation tooling — commercial.

If you internalize this document, hallucination detection stops being a buzzword and becomes a coherent algorithmic + engineering discipline.
