Topic 16: Training Behaviors & Single GPU Optimization

🔥 For interviews, read these first:

TRAINING_BEHAVIORS_DEEP_DIVE.md — frontier-lab deep dive: healthy loss curves and pathologies, LR (warmup, decay, finder), batch size effects (linear scaling, critical batch, generalization gap), gradient norm tracking, mixed precision (FP16/BF16/FP8), loss spike recovery, catastrophic forgetting + mitigations.

INTERVIEW_GRILL.md — 45 active-recall questions.

What You'll Learn

This topic teaches you:

How to train efficiently on a single GPU
Which parameters to change to reduce memory use
Why loss spikes happen
Gradient accumulation
Mixed precision training
Memory optimization techniques

Why We Need This

Interview Importance

Common question: "How do you fit a large model on a single GPU?"
Practical knowledge: Essential for real training
Problem-solving: Shows you understand training

Real-World Application

Resource constraints: Not everyone has access to multiple GPUs
Cost optimization: Single GPU training saves money
Debugging: Understand training issues

Industry Use Cases

1. Single GPU Training

Use Case: Personal projects, startups

Fit large models in limited memory
Use gradient accumulation
Mixed precision training

2. Memory Optimization

Use Case: All training scenarios

Gradient checkpointing
Parameter sharding
Efficient attention

3. Loss Spike Debugging

Use Case: Training stability

Identify causes
Fix training issues
Improve stability

Core Intuition

Training instability is usually not one mysterious bug.

It is usually one of a small number of things:

the step size is wrong
gradients are too large or too noisy
the batch is problematic
precision or normalization is unstable
memory pressure forces a bad training configuration

Good interview answers in this area are procedural.

You should sound like you know how to isolate causes, not just list buzzwords.

Why Single-GPU Optimization Matters

A lot of real experiments start with resource constraints.

If a model does not fit in memory, you need to decide which trade-off to make:

smaller batch
more accumulation
lower precision
shorter context
checkpointing

Each one changes a different part of the training system.

Technical Details Interviewers Often Want

Gradient Accumulation

Gradient accumulation does not, by itself, shrink the activation memory of a single microbatch; that is set by the microbatch size.

What it does is:

keep microbatches small enough to fit
accumulate their gradients
update less often

That gives a larger effective batch size while respecting memory limits.
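
The equivalence behind this can be checked by hand. The following is a minimal sketch in plain Python, assuming a one-parameter linear model with a mean-squared-error loss and equal-sized microbatches; the function names and data are made up for illustration. It shows that summing microbatch gradients, each scaled by 1/accumulation_steps, reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate microbatch gradients, each scaled by 1/accumulation_steps."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal-sized microbatches"
    accumulation_steps = len(xs) // micro_batch_size
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        mb_x = xs[start:start + micro_batch_size]
        mb_y = ys[start:start + micro_batch_size]
        # Same role as `loss = loss / accumulation_steps` before backward()
        total += grad_mse(w, mb_x, mb_y) / accumulation_steps
    return total


# Hypothetical toy data: 8 examples split into 4 microbatches of 2
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, -2.5, 3.5, 3.0, -1.0, 6.5, 2.0, -4.5]
w = 0.3

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)
print(f"full-batch grad: {full:.6f}, accumulated grad: {accum:.6f}")
```

The match holds for mean-reduced losses with equal-sized microbatches. If the last microbatch is smaller, or the loss is normalized per token, the scaling needs more care.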

Mixed Precision

Mixed precision helps because many tensors can safely use lower precision.

But it introduces risks:

underflow
overflow
gradient-scaling issues

So the correct explanation is:

saves memory (mainly activations)
often improves throughput
can require careful stability handling
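
To make the underflow and overflow risks concrete, here is a small sketch that uses only the Python standard library (`struct` supports the IEEE half-precision format code `"e"`). The helper `to_fp16` and the specific values, including the static loss scale of 1024, are assumptions chosen for illustration rather than anything a framework prescribes.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; treat them as infinity
        return math.copysign(math.inf, x)


tiny_grad = 1e-8          # below the smallest FP16 subnormal (about 6e-8)
big_activation = 70000.0  # above the largest finite FP16 value (65504)
loss_scale = 1024.0       # hypothetical static loss scale

underflowed = to_fp16(tiny_grad)           # flushed to zero
overflowed = to_fp16(big_activation)       # becomes inf
scaled = to_fp16(tiny_grad * loss_scale)   # survives in FP16
recovered = scaled / loss_scale            # unscale in higher precision

print(underflowed, overflowed, scaled, recovered)
```

This is the idea behind loss scaling: multiply the loss so that small gradients land in the representable FP16 range, then divide the gradients by the same factor in higher precision before the optimizer step.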

Gradient Checkpointing

Checkpointing saves memory by discarding some activations after the forward pass and recomputing them during the backward pass.

The trade-off is simple:

lower memory
more compute
slower wall-clock in exchange for larger trainable configuration
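
As a rough illustration of that trade-off, the sketch below wraps each block of a small MLP in `torch.utils.checkpoint.checkpoint` and checks that the gradients match an ordinary forward/backward. The `CheckpointedMLP` class, its sizes, and the CPU-only setup are assumptions for the example; it demonstrates correctness, not the memory saving itself, which only becomes visible on larger models.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Stack of blocks whose inner activations are recomputed during backward."""

    def __init__(self, width: int = 64, depth: int = 4, use_checkpoint: bool = True):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)]
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Only the block input is kept; the block is rerun in backward
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


torch.manual_seed(0)
model = CheckpointedMLP()
model.train()
x = torch.randn(8, 64)

# Gradients with checkpointing enabled
model.use_checkpoint = True
model(x).sum().backward()
grads_ckpt = [p.grad.clone() for p in model.parameters()]

# Gradients from a plain forward/backward on the same weights and input
model.zero_grad()
model.use_checkpoint = False
model(x).sum().backward()
grads_plain = [p.grad.clone() for p in model.parameters()]

same = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(grads_ckpt, grads_plain))
print("gradients match:", same)
```

Each checkpointed block costs roughly one extra forward pass of that block, which is where the slower wall-clock time comes from.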

Common Failure Modes

accumulation used incorrectly so effective learning-rate assumptions break
mixed precision producing NaNs without gradient scaling or stable ops
checkpointing expected to reduce all memory, when activations were not the main bottleneck
loss spikes caused by a few bad batches rather than the whole run
blaming the optimizer when the real issue is data or targets

Edge Cases and Follow-Up Questions

Why can loss spike even if average training looks stable?
Why does reducing batch size not always fix OOM?
What if the model fits but optimizer states do not?
Why can mixed precision help throughput but hurt stability?
What is the difference between effective batch size and microbatch size?

What to Practice Saying Out Loud

How you would debug a sudden loss spike
How you would fit a slightly larger model on the same GPU
Which memory lever you would try first, and why

Industry-Standard Boilerplate Code

Gradient Accumulation

"""
Gradient Accumulation: Simulate larger batch size
"""
import torch
import torch.nn as nn

def train_with_gradient_accumulation(model, dataloader, optimizer, 
                                     accumulation_steps: int = 4):
    """
    Train with gradient accumulation
    
    Accumulates gradients over multiple batches
    Before updating weights
    Effectively increases batch size
    """
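    model.train()
    optimizer.zero_grad()
    pending = 0  # microbatches accumulated since the last optimizer step
    
    for data, target in dataloader:
        output = model(data)
        loss = nn.functional.cross_entropy(output, target)
        
        # Scale loss so the summed gradients match one large-batch average
        loss = loss / accumulation_steps
        
        # Backward pass (adds into the existing .grad buffers)
        loss.backward()
        pending += 1
        
        # Update weights every accumulation_steps microbatches
        if pending == accumulation_steps:
            optimizer.step()
            optimizer.zero_grad()
            pending = 0
    
    # Flush leftover gradients when the number of batches is not a multiple
    # of accumulation_steps (this last update is slightly under-weighted)
    if pending > 0:
        optimizer.step()
        optimizer.zero_grad()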

Mixed Precision Training

"""
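Mixed Precision: run eligible ops in FP16 to cut activation memory
"""
import torch
import torch.nn as nn

def train_mixed_precision(model, dataloader, optimizer):
    """
    Mixed precision training (assumes model and batches are already on a CUDA device)
    
    Forward pass: autocast runs eligible ops in FP16; parameters stay in FP32
    Backward pass: gradients follow the forward dtypes, then land on the FP32 parameters
    Savings come mainly from activations, not a flat 50% of total memory
    """
    # torch.amp.GradScaler needs a recent PyTorch; older releases use torch.cuda.amp.GradScaler()
    scaler = torch.amp.GradScaler("cuda")
    model.train()
    
    for data, target in dataloader:
        optimizer.zero_grad()
        
        # Forward pass under autocast (FP16 where it is numerically safe)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            output = model(data)
            loss = nn.functional.cross_entropy(output, target)
        
        # Scale the loss so small FP16 gradients do not underflow
        scaler.scale(loss).backward()
        # step() unscales gradients and skips the update if they contain inf/NaN
        scaler.step(optimizer)
        # Adjust the scale factor for the next iteration
        scaler.update()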

Memory Optimization Checklist

"""
Memory Optimization Techniques
"""
def optimize_memory_usage():
    """
    Checklist for single GPU training:
    
    1. Reduce batch size
    2. Use gradient accumulation (simulate larger batch)
    3. Use mixed precision (FP16)
    4. Use gradient checkpointing
    5. Reduce sequence length
    6. Use efficient attention (flash attention)
    7. Free unused variables
    8. Use CPU offloading for optimizer states
    """
    pass

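# Parameter changes (savings apply mainly to activation memory, not weights or optimizer states):
# - batch_size: 32 → 8 (~4x less activation memory)
# - gradient_accumulation_steps: 1 → 4 (same effective batch size of 32)
# - precision: fp32 → fp16/bf16 (~2x smaller activations; FP32 master weights remain under AMP)
# - max_seq_len: 2048 → 1024 (~2x less activation memory; attention scores shrink ~4x without flash attention)
# - gradient_checkpointing: False → True (trade compute for memory)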
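
One way to keep these parameter changes honest is to put them in a small config object and check the invariants. This is a sketch only: `TrainConfig` and its fields are hypothetical names mirroring the list above, not the API of any particular trainer.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 32                  # microbatch size per forward/backward
    gradient_accumulation_steps: int = 1
    precision: str = "fp32"
    max_seq_len: int = 2048
    gradient_checkpointing: bool = False

    @property
    def effective_batch_size(self) -> int:
        # Number of examples contributing to each optimizer step
        return self.batch_size * self.gradient_accumulation_steps

    @property
    def tokens_per_microbatch(self) -> int:
        # Activation memory grows roughly in proportion to this number
        return self.batch_size * self.max_seq_len


baseline = TrainConfig()
low_memory = replace(
    baseline,
    batch_size=8,
    gradient_accumulation_steps=4,
    precision="fp16",
    max_seq_len=1024,
    gradient_checkpointing=True,
)

print(baseline.effective_batch_size, low_memory.effective_batch_size)
print(baseline.tokens_per_microbatch // low_memory.tokens_per_microbatch)
```

The effective batch size stays at 32 while the tokens processed per microbatch drop by 8x. Note that halving `max_seq_len` also changes what the model sees, so it is not a free substitution the way accumulation is.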

Why Loss Spikes Happen

Common Causes

Learning Rate Too High
- Solution: Reduce learning rate
- Check: LR schedule, warmup
Gradient Explosion
- Solution: Gradient clipping
- Check: Gradient norms
Bad Batch
- Solution: Skip or downweight
- Check: Batch statistics
Numerical Instability
- Solution: BF16 or FP16 with loss scaling, FP32 for sensitive ops, better initialization
- Check: NaN/Inf values
Scheduler Issues
- Solution: Fix LR schedule
- Check: LR at spike time

Detection Code

"""
Detect and handle loss spikes
"""
import numpy as np

def detect_loss_spike(losses: list, threshold: float = 2.0) -> bool:
    """
    Detect whether the current loss is a spike
    
    Spike = loss > threshold * recent_average
    """
    # Need the current loss plus at least 9 earlier values
    if len(losses) < 10:
        return False
    
    # Average of the 9 losses before the current one
    recent_avg = np.mean(losses[-10:-1])
    current = losses[-1]
    
    return bool(current > threshold * recent_avg)

def handle_loss_spike(model, optimizer, losses):
    """Handle loss spike"""
    if detect_loss_spike(losses):
        # Option 1: Reduce learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] *= 0.5
        
        # Option 2: Skip this update
        # optimizer.zero_grad()
        
        # Option 3: Restore previous checkpoint
        # load_checkpoint(model, optimizer)
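
A possible variant of the detector, sketched here under the assumption that losses are positive (as with cross-entropy): it compares against a rolling median instead of a mean and keeps flagged values out of the baseline, so one spike does not raise the threshold for the next. `SpikeMonitor` and its parameters are hypothetical names, not part of any library.

```python
import math
from collections import deque
from statistics import median


class SpikeMonitor:
    """Flag losses that exceed threshold * rolling median of recent normal losses."""

    def __init__(self, window: int = 50, threshold: float = 2.0, min_history: int = 10):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_history = min_history

    def update(self, loss: float) -> bool:
        """Return True if `loss` looks like a spike (or is NaN/Inf)."""
        if not math.isfinite(loss):
            return True
        is_spike = (
            len(self.history) >= self.min_history
            and loss > self.threshold * median(self.history)
        )
        if not is_spike:
            # Keep spikes out of the baseline so one outlier does not mask the next
            self.history.append(loss)
        return is_spike


monitor = SpikeMonitor(window=20, threshold=2.0, min_history=5)
losses = [2.0, 1.9, 1.8, 1.8, 1.7, 1.7, 5.0, 1.6]
flags = [monitor.update(value) for value in losses]
print(flags)
```

In a training loop, a `True` from `update` would be the trigger for one of the options above: skip the step, lower the learning rate, or roll back to a checkpoint.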

Single GPU Training Strategy

Memory Budget Breakdown

Total GPU Memory (e.g., 24GB), FP32 training with Adam:
- Model weights: ~7GB (~1.75B params × 4 bytes)
- Optimizer states: ~14GB (Adam: 2x model size)
- Activations: ~2GB (scales with batch × seq_len)
- Gradients: ~7GB (same as weights)
- Overhead: ~1GB
- Total: ~31GB, more than the 24GB available
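
The static part of this budget (weights, gradients, optimizer states) follows from simple arithmetic on the parameter count. Here is a minimal sketch, assuming FP32 everywhere, Adam with two state tensors per parameter, and 1GB = 1e9 bytes; the function name is made up for illustration. A weight footprint of about 7GB at 4 bytes per parameter corresponds to roughly 1.75B parameters.

```python
def estimate_training_memory_gb(
    num_params: float,
    bytes_per_param: int = 4,        # FP32
    adam_states_per_param: int = 2,  # first and second moment estimates
) -> dict:
    """Rough static memory for training, excluding activations and overhead."""
    gb = 1e9
    weights = num_params * bytes_per_param / gb
    gradients = num_params * bytes_per_param / gb
    optimizer = num_params * bytes_per_param * adam_states_per_param / gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_static": weights + gradients + optimizer,
    }


budget = estimate_training_memory_gb(1.75e9)
for name, value in budget.items():
    print(f"{name}: {value:.1f} GB")
```

The same function gives about 28GB for the weights alone at 7B FP32 parameters, which is why full FP32 Adam training of a model that size does not fit on a 24GB card regardless of batch size.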

Optimization Steps

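Reduce Batch Size: 32 → 8 (activation memory drops ~4x; weights, gradients, and optimizer states are unchanged)
Gradient Accumulation: Accumulate 4 microbatches of 8 (same effective batch of 32)
Mixed Precision: FP32 → FP16/BF16 (activations roughly halve; under AMP the FP32 weights and Adam states keep their size)
Gradient Checkpointing: Trade compute for memory (store a subset of activations and recompute the rest)
Reduce Sequence Length: 2048 → 1024 (activations roughly halve)
Optimizer-State Offloading: Move Adam states to CPU when weights, gradients, and optimizer states dominate the budget (activation savings alone will not close the gap)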

Exercises

Implement gradient accumulation
Add mixed precision
Detect loss spikes
Optimize memory usage

Next Steps

Topic 17: Probability math Q&A
Topic 18: Distribution classification

ML & LLM Interview Prep — Deep Dives