Topic 9: Sampling Techniques

🔥 For interviews, read these first:

SAMPLING_DEEP_DIVE.md — frontier-lab interview deep dive: greedy/beam/temperature/top-k/top-p/min-p/typical/Mirostat/penalties, why beam search fails for LLMs, speculative decoding, best-of-N for test-time scaling.

INTERVIEW_GRILL.md — 45 active-recall questions.

What You'll Learn

This topic teaches you text generation sampling:

Greedy decoding
Top-k sampling
Top-p (nucleus) sampling
Temperature sampling
Beam search
Implementations from scratch

Why We Need This

Interview Importance

Common question: "Implement top-p sampling"
Understanding: Know how LLMs generate text
Application: Choose right sampling for task

Real-World Application

Text generation: All LLMs use sampling
Quality control: Sampling affects output quality
Creativity vs determinism: Trade-off

Industry Use Cases

1. Greedy Decoding

Use Case: Deterministic tasks

Code generation
Translation
When you want same output

2. Top-p Sampling

Use Case: Most common

ChatGPT, Claude
Balanced creativity/quality
Default for many models

3. Temperature Sampling

Use Case: Control creativity

Low temp = more deterministic
High temp = more creative
Adjustable per use case

Core Intuition

Sampling is the step where model probabilities become actual generated tokens.

That means decoding controls the model's behavior a lot more than many people first realize.

Even with the same model:

greedy decoding may look repetitive
high temperature may look creative but unstable
top-p may feel more natural than top-k

So decoding is not just a post-processing detail. It is part of system behavior.

Greedy Decoding

Greedy decoding always takes the highest-probability token.

That makes it:

deterministic
simple
often too repetitive or myopic

Top-k Sampling

Top-k keeps only the k most likely options and samples from them.

This gives some diversity while preventing very low-probability tokens from being chosen.

Top-p Sampling

Top-p keeps the smallest set of tokens whose cumulative probability mass exceeds p.

This is adaptive:

if the distribution is sharp, the candidate set stays small
if the distribution is broad, the candidate set can grow

That is why top-p often feels more natural than fixed top-k.

Temperature

Temperature reshapes the distribution before sampling.

low temperature sharpens the distribution
high temperature flattens it

That means temperature is not choosing tokens by itself. It changes the probability landscape first.

Technical Details Interviewers Often Want

Why Greedy Can Be Weak

Greedy decoding is locally optimal, not globally optimal for quality or diversity.

It can:

lock into repetitive loops
over-commit early
miss good but slightly lower-probability branches

Top-k vs Top-p

This is a classic follow-up.

Top-k: fixed candidate count
Top-p: variable candidate count based on probability mass

Top-p adapts to uncertainty better, which is why it is common in LLM products.

Temperature Edge Cases

If temperature is very low:

output approaches greedy decoding

If temperature is very high:

the distribution becomes too flat
low-quality tokens become more likely

Beam Search

Beam search is different from random sampling.

It tries to keep multiple high-probability partial sequences alive, which is useful in structured tasks like translation but not always ideal for open-ended chat generation.

Common Failure Modes

high temperature causing nonsense generations
greedy decoding causing repetition
forgetting to renormalize probabilities after top-k or top-p filtering
claiming one sampling method is always best
using beam search for tasks where diversity matters more than likelihood

Edge Cases and Follow-Up Questions

Why is top-p often preferred over top-k?
Why can greedy decoding be repetitive?
What happens when temperature approaches zero?
Why does beam search often look better for translation than for creative chat?
Why must probabilities be renormalized after filtering?

What to Practice Saying Out Loud

Why decoding strategy changes behavior even with the same model
The difference between temperature and top-p
Why "more randomness" is not the same as "better creativity"

Industry-Standard Boilerplate Code

Greedy Decoding

"""
Greedy Decoding
Always pick most likely token
Deterministic but can be repetitive
"""
import numpy as np

def greedy_decode(logits: np.ndarray) -> int:
    """
    Greedy: Pick token with highest probability
    
    Args:
        logits: (vocab_size,) unnormalized scores
    Returns:
        token_id: Most likely token
    """
    return np.argmax(logits)

Top-k Sampling

"""
Top-k Sampling
Sample from top k most likely tokens
"""
import numpy as np

def top_k_sampling(logits: np.ndarray, k: int = 50) -> int:
    """
    Top-k: Sample from top k tokens
    
    Args:
        logits: (vocab_size,) unnormalized scores
        k: Number of top tokens to consider
    Returns:
        token_id: Sampled token
    """
    # Get top k indices
    top_k_indices = np.argsort(logits)[-k:]
    top_k_logits = logits[top_k_indices]
    
    # Softmax over top k
    exp_logits = np.exp(top_k_logits - np.max(top_k_logits))
    probs = exp_logits / np.sum(exp_logits)
    
    # Sample
    sampled_idx = np.random.choice(len(top_k_indices), p=probs)
    return top_k_indices[sampled_idx]

Top-p (Nucleus) Sampling

"""
Top-p (Nucleus) Sampling
Sample from smallest set of tokens with cumulative probability >= p
Most popular method
"""
import numpy as np

def top_p_sampling(logits: np.ndarray, p: float = 0.9) -> int:
    """
    Top-p: Sample from tokens whose cumulative probability >= p
    
    Args:
        logits: (vocab_size,) unnormalized scores
        p: Nucleus probability threshold (0.0 to 1.0)
    Returns:
        token_id: Sampled token
    """
    # Sort logits descending
    sorted_indices = np.argsort(logits)[::-1]
    sorted_logits = logits[sorted_indices]
    
    # Softmax
    exp_logits = np.exp(sorted_logits - np.max(sorted_logits))
    probs = exp_logits / np.sum(exp_logits)
    
    # Cumulative probability
    cum_probs = np.cumsum(probs)
    
    # Find smallest set with cum_prob >= p
    nucleus_size = np.searchsorted(cum_probs, p) + 1
    nucleus_size = min(nucleus_size, len(probs))
    
    # Sample from nucleus
    nucleus_probs = probs[:nucleus_size]
    nucleus_probs = nucleus_probs / np.sum(nucleus_probs)  # Renormalize
    
    sampled_idx = np.random.choice(nucleus_size, p=nucleus_probs)
    return sorted_indices[sampled_idx]

Temperature Sampling

"""
Temperature Sampling
Control randomness by scaling logits
"""
import numpy as np

def temperature_sampling(logits: np.ndarray, temperature: float = 1.0) -> int:
    """
    Temperature: Scale logits before softmax
    
    Args:
        logits: (vocab_size,) unnormalized scores
        temperature: 
            - < 1.0: More deterministic (sharp distribution)
            - = 1.0: Normal
            - > 1.0: More random (flat distribution)
    Returns:
        token_id: Sampled token
    """
    # Scale by temperature
    scaled_logits = logits / temperature
    
    # Softmax
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probs = exp_logits / np.sum(exp_logits)
    
    # Sample
    return np.random.choice(len(logits), p=probs)

Combined: Top-p + Temperature

"""
Combined Sampling: Top-p + Temperature
Most common in practice (ChatGPT, Claude)
"""
def sample_token(logits: np.ndarray, 
                 temperature: float = 1.0,
                 top_p: float = 0.9) -> int:
    """
    Combined: Apply temperature, then top-p
    
    This is what most production LLMs use
    """
    # Apply temperature
    scaled_logits = logits / temperature
    
    # Then top-p
    return top_p_sampling(scaled_logits, top_p)

Theory

Sampling Comparison

Method	Determinism	Creativity	Use Case
Greedy	Very High	Very Low	Code, translation
Top-k	Medium	Medium	General purpose
Top-p	Medium	Medium	Most common
Temperature	Adjustable	Adjustable	Control creativity

When to Use Which

Greedy: Need deterministic output
Top-k: Simple, works well
Top-p: Default choice, adaptive
Temperature: Fine-tune creativity

Exercises

Implement all sampling methods
Compare outputs
Tune temperature
Combine methods

Next Steps

Topic 10: Optimizers
Topic 11: Regularization

ML & LLM Interview Prep — Deep Dives