Topic 7: LLM Problem Solving

🔥 For interviews, read these first:

LLM_PROBLEMS_DEEP_DIVE.md — frontier-lab deep dive: long-context challenges (lost-in-the-middle), hallucination types and mitigations (overview), prompting techniques (CoT, self-consistency, ToT, ReAct), jailbreaks + defenses, indirect prompt injection, agent architectures and failure modes, multi-turn memory, latency/cost, evaluation.

HALLUCINATION_DETECTION_DEEP_DIVE.md — dedicated comprehensive chapter on detecting hallucinations in LLM outputs: full taxonomy, why models hallucinate, reference-based / reference-free / internal-states-based detection (NLI, SelfCheckGPT, semantic entropy, CoVe, truth probes, RAGAS), benchmarks, mitigation, production design. The interview-grade reference.

HALLUCINATION_INTERVIEW_GRILL.md — 90 active-recall questions on hallucination detection (taxonomy → causes → reference-based → reference-free → internal-states → RAG → benchmarks → mitigation → production → eval methodology).

LLM_EVALUATION_DEEP_DIVE.md — frontier-lab deep dive on LLM evaluation: why eval is hard, capability benchmarks (knowledge/reasoning/code/long-context/agent/multimodal), instruction following, LLM-as-judge (biases + calibration), pairwise / ELO / Chatbot Arena, open-ended evaluation, factuality measurement (FactScore/SAFE/RAGAS), contamination detection, robustness, statistical methodology, harnesses (lm-eval-harness/HELM/Inspect), online telemetry, A/B testing, full product eval case study.

LLM_EVALUATION_INTERVIEW_GRILL.md — 115 active-recall questions on LLM eval, organized A–M.

AGENT_IN_30_MIN.md — codable-from-memory agent: 70-line working loop with tool calls + parser + complete walkthrough. Drill until you can write it cold in 25 minutes.

INTERVIEW_GRILL.md — 55 active-recall questions on broader LLM problems.

What You'll Learn

This topic teaches you to solve common LLM problems:

Long context length solutions
Memory efficiency
Speed optimization
Detailed explanations with code

Why We Need This

Interview Importance

Common question: "How do you handle long context?"
Problem-solving: Show you understand challenges
Optimization: Critical for production

Real-World Application

Production constraints: Memory, speed limits
User requirements: Need long context
Cost optimization: Efficient solutions save money

Industry Use Cases

1. Long Context Processing

Use Case: Document analysis, code review

Process entire codebases
Analyze long documents
Multi-document reasoning

2. Memory Efficiency

Use Case: Resource-constrained environments

Edge devices
Cost optimization
Hosting multiple models on the same hardware

3. Speed Optimization

Use Case: Real-time applications

Chatbots
Code completion
Interactive applications

Core Intuition

LLM problems are usually not about one isolated trick. They are about conflicting constraints.

In practice, you often want all of these at once:

longer context
lower latency
lower memory
lower cost
better quality

But those goals fight each other.

That is why interview questions in this area are really trade-off questions.

Long Context

Long context is useful because many tasks need information that does not fit in a short prompt:

codebases
long documents
multi-turn conversations
retrieval-heavy tasks

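But vanilla attention becomes expensive as sequence length grows: every token attends to every other token, so both the compute and the attention score matrix grow quadratically with sequence length.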
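To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes the score matrix is materialized naively in fp16 (2 bytes per element) for a single head in a single layer; the function name is hypothetical and only for illustration.

```python
def attention_scores_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize one (seq_len x seq_len) attention score matrix."""
    return seq_len * seq_len * bytes_per_element


# Growing the sequence 10x grows the score matrix 100x.
for n in (1_000, 10_000, 100_000):
    mib = attention_scores_bytes(n) / 2**20
    print(f"seq_len={n:>7,}: {mib:,.1f} MiB per head per layer")
```

Fused attention kernels avoid materializing the full matrix, so treat this as an upper bound on the naive approach rather than a measurement of any real system.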

Memory Efficiency

Memory pressure appears in both:

training
inference

At inference time, memory is often dominated by:

model weights
KV cache
batching overhead
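A quick way to see why the KV cache matters is to estimate its size. The sketch below uses the standard accounting (one key tensor and one value tensor per layer); the model configuration in the example is hypothetical, chosen only to give round numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_element: int = 2) -> int:
    """Estimate KV cache size. The leading 2 counts keys and values separately."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical config: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(f"{one_request / 2**30:.1f} GiB for one 4096-token request")

# The cache scales linearly with both sequence length and batch size.
batch_of_16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)
print(f"{batch_of_16 / 2**30:.1f} GiB for a batch of 16")
```

Models that share KV heads across query heads (grouped-query attention) shrink this by lowering `num_kv_heads`.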

At training time, memory is also heavily affected by:

activations
gradients
optimizer states
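The gap between training and inference memory can be sketched with simple per-parameter accounting. This assumes mixed-precision training with Adam (fp16 weights and gradients, plus fp32 master weights, momentum, and variance) and ignores activations entirely; the function names and the 7B parameter count are illustrative assumptions.

```python
def training_state_bytes(num_params: int) -> int:
    """Weights + gradients + Adam optimizer state, excluding activations."""
    fp16_weights = 2 * num_params
    fp16_gradients = 2 * num_params
    # fp32 master weights, momentum, and variance: 4 bytes each.
    fp32_optimizer_state = 12 * num_params
    return fp16_weights + fp16_gradients + fp32_optimizer_state


def inference_weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Weights only; the KV cache is accounted for separately."""
    return num_params * bytes_per_param


params = 7_000_000_000  # hypothetical 7B-parameter model
print(f"training state: {training_state_bytes(params) / 1e9:.0f} GB")
print(f"inference weights (fp16): {inference_weight_bytes(params) / 1e9:.0f} GB")
```

This is why techniques that target gradients or optimizer states do nothing for serving: those terms do not exist at inference time.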

Speed Optimization

Speed is not one thing.

You should separate:

latency: how long one request takes
throughput: how many requests you can process overall

Many interview mistakes come from improving one while ignoring the other.
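The tension between the two can be shown with a toy cost model. The constants below are made up for illustration and do not describe any real system; the only assumption that matters is that a decode step has a fixed cost plus a cost that grows with batch size.

```python
def decode_step_seconds(batch_size: int, fixed_s: float = 0.020,
                        per_sequence_s: float = 0.001) -> float:
    """Toy model: time for one decode step that emits one token per sequence."""
    return fixed_s + per_sequence_s * batch_size


def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput across all sequences in the batch."""
    return batch_size / decode_step_seconds(batch_size)


for batch_size in (1, 8, 32, 128):
    per_token_ms = decode_step_seconds(batch_size) * 1000
    print(f"batch={batch_size:>3}: {per_token_ms:6.1f} ms per token per user, "
          f"{tokens_per_second(batch_size):7.1f} tokens/s overall")
```

Larger batches raise overall tokens per second while every individual user waits longer for each token, which is the trade-off to name explicitly in an interview.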

Technical Details Interviewers Often Want

Chunking Trade-Off

Chunking reduces memory pressure by avoiding a single giant attention computation.

But it can hurt because:

dependencies that cross chunk boundaries are weakened or lost
global reasoning can degrade

That is why chunking often needs overlap or retrieval support.
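A minimal sketch of overlapping chunking at the token level is shown below. The function name and parameters are hypothetical; the idea is only that consecutive chunks share `overlap` tokens so that content near a boundary appears in both.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence, chunk_size: int, overlap: int) -> List[list]:
    """Split tokens into chunks of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks


tokens = list(range(10))
for chunk in chunk_with_overlap(tokens, chunk_size=4, overlap=2):
    print(chunk)
```

Overlap costs extra compute (tokens are processed more than once) and only helps with dependencies shorter than the overlap, which is why retrieval is often layered on top.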

Sliding Window Trade-Off

Sliding window attention works well when dependencies are mostly local.

It struggles when:

critical dependencies are far away
the task needs global document structure
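The locality limit is easy to see from the attention mask itself. The sketch below assumes a causal window (each token sees itself and the previous `window_size - 1` tokens), which is one common variant; the function name is hypothetical.

```python
import numpy as np


def causal_sliding_window_mask(seq_len: int, window_size: int) -> np.ndarray:
    """mask[i, j] is True when query i may attend to key j."""
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]  # i - j
    # Allow only keys at or before the query, and within the window.
    return (distance >= 0) & (distance < window_size)


mask = causal_sliding_window_mask(seq_len=8, window_size=3)
print(mask.astype(int))

# Within one layer, the last token cannot see the first token at all.
print("token 7 sees token 0:", bool(mask[7, 0]))
print("token 7 sees token 5:", bool(mask[7, 5]))
```

Each row has at most `window_size` allowed positions, so cost is linear in sequence length, but anything outside the band is invisible to that query in that layer.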

Long Context Does Not Automatically Mean Better Quality

This is a very common misconception.

If the extra context is noisy, redundant, or poorly selected, performance can stay flat or even get worse.

Speed Techniques Have Costs

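KV caching speeds up decoding by avoiding recomputation of past keys and values, but the cache grows with sequence length and batch size
batching improves throughput but can hurt per-request latency
quantization saves memory but can degrade quality
speculative decoding helps only when the draft model is cheap and its tokens are accepted often enough to offset the cost of verification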
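The speculative decoding trade-off can be sketched with the usual simplified analysis. It assumes each draft token is accepted independently with the same probability, and that one draft-model step costs a fixed fraction of one target-model step; all names and example numbers are illustrative.

```python
def expected_tokens_per_step(accept_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens produced per verification step (accepted drafts + 1)."""
    if accept_rate >= 1.0:
        return float(num_draft_tokens + 1)
    return (1 - accept_rate ** (num_draft_tokens + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate: float, num_draft_tokens: int,
                      draft_cost_ratio: float) -> float:
    """Speedup vs. plain decoding; draft_cost_ratio = draft step / target step."""
    step_cost = num_draft_tokens * draft_cost_ratio + 1  # drafts + one verification
    return expected_tokens_per_step(accept_rate, num_draft_tokens) / step_cost


# Cheap, accurate draft model: a clear win.
print(f"{estimated_speedup(0.8, 4, draft_cost_ratio=0.1):.2f}x")
# Expensive, inaccurate draft model: slower than not speculating at all.
print(f"{estimated_speedup(0.3, 4, draft_cost_ratio=0.5):.2f}x")
```

A result below 1.0 means speculation is a net loss, which is the "balance" the bullet above refers to.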

Common Failure Modes

assuming longer context always helps
chunking without overlap and losing important dependencies
optimizing throughput while making per-user latency unacceptable
solving compute cost while creating a KV-cache memory bottleneck
mixing up training memory solutions with inference memory solutions

Edge Cases and Follow-Up Questions

Why can longer context hurt quality?
Why is chunking not a free solution to long context?
What is the difference between latency and throughput in LLM serving?
Why can a memory optimization at training time be irrelevant at inference time?
Why does the KV cache become a bottleneck for long-context serving?

What to Practice Saying Out Loud

Why long-context handling is a trade-off problem
Why more context is only useful if relevant context is selected well
Why speed and memory optimizations usually help one bottleneck by stressing another

Industry-Standard Boilerplate Code

Long Context Solutions

"""
Long Context Solutions
Problem: Standard attention is O(n²), too expensive for long sequences
Solutions: Chunking, sliding window, sparse attention
"""
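import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)  # subtract the max to avoid overflow
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)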

def chunked_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     chunk_size: int, d_k: int) -> np.ndarray:
    """
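    Chunked Attention: Process queries in chunks to reduce peak memory

    Problem: Full attention materializes an (n x n) score matrix, O(n²) memory
    Solution: Process chunk_size query rows at a time, so only a
    (chunk_size x n) score matrix exists at once. Each chunk still attends
    to all keys, so the output matches full attention.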
    """
    seq_len = Q.shape[0]
    outputs = []
    
    for i in range(0, seq_len, chunk_size):
        chunk_end = min(i + chunk_size, seq_len)
        Q_chunk = Q[i:chunk_end]
        
        # Attend to all K, V (or can limit to window)
        scores = Q_chunk @ K.T / np.sqrt(d_k)
        attention_weights = softmax(scores)
        output_chunk = attention_weights @ V
        
        outputs.append(output_chunk)
    
    return np.concatenate(outputs, axis=0)

def sliding_window_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                            window_size: int, d_k: int) -> np.ndarray:
    """
    Sliding Window Attention: Each position only attends to a local window

    Problem: Full attention too expensive
    Solution: Local attention only (this version has no global positions,
    so dependencies outside the window are not visible)
    """
    seq_len = Q.shape[0]
    outputs = []
    
    for i in range(seq_len):
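        # Define a window of up to window_size positions around i, clipped at
        # the sequence edges. It always includes i, even for odd sizes or size 1.
        start = max(0, i - window_size // 2)
        end = min(seq_len, i - window_size // 2 + window_size)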
        
        Q_i = Q[i:i+1]
        K_window = K[start:end]
        V_window = V[start:end]
        
        # Local attention
        scores = Q_i @ K_window.T / np.sqrt(d_k)
        attention_weights = softmax(scores)
        output = attention_weights @ V_window
        
        outputs.append(output)
    
    return np.concatenate(outputs, axis=0)