Topic 25: Adapters & LoRA (Parameter-Efficient Fine-tuning)

🔥 For interviews, read these first:

LORA_DEEP_DIVE.md — frontier-lab interview deep dive: LoRA math (ΔW = B·A), intrinsic-dimension hypothesis, α/r scaling, QLoRA's three innovations (NF4, double quantization, paged optimizer), adapter modules, prefix tuning, IA³, DoRA, GaLore, multi-LoRA serving (S-LoRA, Punica).

INTERVIEW_GRILL.md — 40 active-recall questions.

What You'll Learn

This topic teaches you parameter-efficient fine-tuning:

Adapters
LoRA (Low-Rank Adaptation)
How they work
When to use them
Simple implementations

Why We Need This

Interview Importance

Hot topic: LoRA is widely used
Efficiency: Shows understanding of efficient training
Practical knowledge: Used in production

Real-World Application

Fine-tuning: Fine-tune large models efficiently
Cost savings: Much cheaper than full fine-tuning
Multiple tasks: Train multiple adapters for different tasks

Industry Use Cases

1. Adapters

Use Case: Task-specific fine-tuning

Add small adapter layers
Freeze main model
Train only adapters

2. LoRA

Use Case: Most popular PEFT method

Low-rank decomposition
Train only small matrices
Can combine multiple LoRAs

Core Intuition

Parameter-efficient fine-tuning exists because full fine-tuning of large models is expensive in:

memory
optimizer state
storage
deployment complexity

The key idea is:

keep most pretrained weights frozen
learn a small set of task-specific updates

Adapters

Adapters insert small trainable modules into the network.

The intuition is:

the backbone already knows a lot
a small bottleneck module can steer behavior for a task

LoRA

LoRA does not insert a whole new transformation in the same way adapters do.

Instead, it learns a low-rank update to an existing weight matrix.

That is why LoRA feels lightweight:

frozen base weight
small trainable update
task adaptation with far fewer parameters

Technical Details Interviewers Often Want

Why Low Rank Might Work

LoRA assumes the important task-specific update can often be represented in a much lower-rank subspace than a full dense update.

That is the key modeling assumption behind the method.
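A small numerical check can make the assumption concrete: a d × d update built from two thin factors can never exceed rank r, whereas an unconstrained dense update of the same shape is generally full rank. The sizes below are arbitrary, and float64 is used only to keep the numerical rank estimate stable.

```python
import torch

torch.manual_seed(0)
d, r = 256, 4

# Two thin factors, as in a low-rank update.
B = torch.randn(d, r, dtype=torch.float64)
A = torch.randn(r, d, dtype=torch.float64)
delta_W = B @ A  # full-size (d x d) matrix, but built from rank-r factors

low_rank = torch.linalg.matrix_rank(delta_W).item()

# A random dense matrix of the same shape, for comparison.
dense = torch.randn(d, d, dtype=torch.float64)
dense_rank = torch.linalg.matrix_rank(dense).item()

print(f"shape={tuple(delta_W.shape)}, rank of BA={low_rank}, rank of dense={dense_rank}")
```

Whether a rank-r update is expressive enough for a given task is an empirical question; the check only shows what the constraint is.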

Why LoRA Is Operationally Attractive

LoRA is popular because it often gives:

very low trainable parameter count
lower optimizer memory
easy swapping of task-specific adapters
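The optimizer-memory point can be sketched with back-of-envelope arithmetic. The function below assumes fp32 gradients (4 bytes per trainable parameter) and an Adam-style optimizer with two fp32 moment estimates (8 bytes per trainable parameter). It ignores the weights themselves and activations, and the 4M-parameter LoRA budget is a hypothetical figure.

```python
def adam_training_bytes(trainable_params: int,
                        grad_bytes: int = 4,
                        adam_state_bytes: int = 8) -> int:
    """Bytes for gradients plus Adam's two moment estimates (assumed fp32)."""
    return trainable_params * (grad_bytes + adam_state_bytes)

GIB = 1024 ** 3
full_ft = adam_training_bytes(7_000_000_000)  # every weight is trainable
lora = adam_training_bytes(4_000_000)         # hypothetical LoRA budget

print(f"full fine-tuning: {full_ft / GIB:.1f} GiB of gradients + optimizer state")
print(f"LoRA:             {lora / GIB:.3f} GiB of gradients + optimizer state")
```

Actual numbers depend on precision, optimizer, and sharding strategy; the ratio between the two cases is the part worth remembering.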

Adapter vs LoRA

This is a common follow-up.

Adapters add trainable modules to the network path
LoRA modifies an existing linear transform through a low-rank update

Both are PEFT, but they intervene differently.

Common Failure Modes

treating PEFT as always equivalent to full fine-tuning
choosing rank too low and underfitting the task
applying LoRA to the wrong target modules
ignoring inference-time or deployment composition issues with many adapters
assuming fewer trainable parameters always means equal final quality

Edge Cases and Follow-Up Questions

Why can LoRA work with so few trainable parameters?
What does the rank r control?
Why might full fine-tuning still outperform PEFT?
How are adapters different from LoRA conceptually?
Why is PEFT especially valuable for multi-task or resource-constrained setups?

What to Practice Saying Out Loud

Why PEFT exists at all
The core idea behind low-rank adaptation
The difference between "cheaper training" and "equally expressive training"

Theory

Adapters

Concept:

Add small bottleneck layers inside the transformer, typically after the attention and feed-forward sublayers of each block
Freeze original model weights
Train only adapter parameters

Architecture:

Original: X → Sublayer (attention or feed-forward) → Y
With Adapter: X → Sublayer → Adapter → Y

Parameters:

adapter_size: Hidden (bottleneck) dimension of the adapter (e.g., 64, 128)
adapter_layers: Which layers receive adapters (all or a chosen subset)
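One common placement can be sketched as follows: a bottleneck adapter after the attention sublayer and another after the feed-forward sublayer, each applied before the block's residual add. The `Bottleneck` and `BlockWithAdapters` classes, the post-norm layout, and all sizes are assumptions for illustration. Some adapter recipes also train the layer norms, which this sketch does not do.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Down-project, nonlinearity, up-project, with a residual connection."""
    def __init__(self, d_model: int, adapter_size: int):
        super().__init__()
        self.down = nn.Linear(d_model, adapter_size)
        self.up = nn.Linear(adapter_size, d_model)
        nn.init.zeros_(self.up.weight)  # start as an identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class BlockWithAdapters(nn.Module):
    """Hypothetical post-norm transformer block with two adapters."""
    def __init__(self, d_model=64, n_heads=4, adapter_size=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.adapter_attn = Bottleneck(d_model, adapter_size)
        self.adapter_ffn = Bottleneck(d_model, adapter_size)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.adapter_attn(attn_out))   # adapter after attention
        x = self.norm2(x + self.adapter_ffn(self.ffn(x)))  # adapter after FFN
        return x

block = BlockWithAdapters()
# Freeze everything except the adapter parameters.
for name, p in block.named_parameters():
    p.requires_grad = "adapter" in name

y = block(torch.randn(2, 10, 64))
n_adapter = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(y.shape, n_adapter)
```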

LoRA (Low-Rank Adaptation)

Concept:

Instead of updating all weights W, learn a low-rank update to them
W' = W + ΔW, where ΔW = BA (low-rank)
Train only the B and A matrices; W stays frozen

Mathematical Formulation:

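Original: h = Wx
LoRA: h = Wx + ΔWx = Wx + (α/r)·BAx

Where:
- W: Original weight matrix (d × d here; in general d_out × d_in), frozen
- B: Low-rank matrix (d × r), r << d, initialized to zero so that ΔW = 0 at the start of training
- A: Low-rank matrix (r × d), initialized randomly
- r: Rank (small relative to d; values from 1 to 64 are common, with 8 a frequent default)
- α: Scaling constant; the update is multiplied by α/r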

Why it works:

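Low-rank assumption: Task-specific weight updates are assumed to have low intrinsic rank
Far fewer parameters: 2 × r × d (A and B together) instead of d × d
Example: d=4096, r=8 → 8×4096×2 = 65,536 (~65K) params vs 4096² ≈ 16.8M params, about 0.4% of the full matrix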
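The arithmetic in the example can be checked directly. The helper name `lora_param_count` is made up for this sketch; it counts A (r × d_in) plus B (d_out × r) for a single adapted matrix.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

d, r = 4096, 8
lora_params = lora_param_count(d, d, r)
full_params = d * d

print(lora_params)                                # 65536
print(full_params)                                # 16777216
print(f"{100 * lora_params / full_params:.2f}%")  # 0.39%
```

For a square matrix the ratio simplifies to 2r/d, so doubling the rank doubles the trainable share.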

Parameters:

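rank (r): Rank of the decomposition (often 1-64; 8 is a common default)
alpha: Scaling factor; the update is multiplied by alpha / r (commonly set to r or 2r)
target_modules: Which linear layers receive LoRA (attention projections, MLP layers, etc.)
dropout: Dropout applied to the input of the LoRA branch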
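To show how these four settings fit together, here is a rough sketch that wraps selected `nn.Linear` layers by attribute name. `LoRAWrapped`, `apply_lora`, and `TinyAttention` are hypothetical names written for this example, and the layer names `q_proj`, `k_proj`, `v_proj` are an assumed naming convention. Libraries that offer this feature may structure it differently.

```python
import torch
import torch.nn as nn

class LoRAWrapped(nn.Module):
    """Hypothetical wrapper: frozen nn.Linear plus a scaled low-rank branch."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0, dropout=0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.dropout = nn.Dropout(dropout)  # applied to the LoRA branch input
        self.scale = alpha / rank

    def forward(self, x):
        update = (self.dropout(x) @ self.lora_A.T) @ self.lora_B.T
        return self.base(x) + self.scale * update

def apply_lora(model: nn.Module, target_modules, **lora_kwargs) -> int:
    """Freeze the model, then wrap each nn.Linear whose name is targeted."""
    for p in model.parameters():
        p.requires_grad = False
    replaced = 0
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, nn.Linear) and name in target_modules:
                setattr(parent, name, LoRAWrapped(child, **lora_kwargs))
                replaced += 1
    return replaced

class TinyAttention(nn.Module):
    """Toy container whose layer names mimic attention projections."""
    def __init__(self, d=32):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

model = nn.Sequential(TinyAttention(), TinyAttention())
n_wrapped = apply_lora(model, target_modules={"q_proj", "v_proj"},
                       rank=4, alpha=8.0)
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_wrapped, n_trainable)
```

Because `lora_B` starts at zero, each wrapped layer initially produces exactly the output of its frozen base layer.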

Industry-Standard Boilerplate Code

Adapter Implementation

"""
Adapter: Small trainable layers
"""
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """
    Adapter layer
    
    Architecture:
    - Down projection: d → adapter_size
    - Activation: ReLU
    - Up projection: adapter_size → d
    - Residual connection
    """
    
    def __init__(self, d_model: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(d_model, adapter_size)
        self.activation = nn.ReLU()
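        self.up_proj = nn.Linear(adapter_size, d_model)
        # Zero-init the up projection so the adapter starts as an identity map
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)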
    
    def forward(self, x):
        # Adapter: down → activation → up
        adapter_out = self.up_proj(self.activation(self.down_proj(x)))
        # Residual connection
        return x + adapter_out

LoRA Implementation

"""
LoRA: Low-Rank Adaptation
"""
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """
    LoRA layer
    
    W' = W + (alpha / rank) * BA
    Where B (d × r) and A (r × d) are trainable and W stays frozen
    """
    
    def __init__(self, d_model: int, rank: int = 8, alpha: int = 8):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        
        # Low-rank matrices
        self.A = nn.Parameter(torch.randn(rank, d_model) * 0.02)
        self.B = nn.Parameter(torch.zeros(d_model, rank))
        
        # Scaling factor
        self.scale = alpha / rank
    
    def forward(self, x, W):
        """
        Forward pass
        
        Args:
            x: Input (batch, seq_len, d_model)
            W: Original weight matrix (frozen)
        """
        # Original: Wx
        original_out = x @ W.T
        
        # LoRA: BAx
        lora_out = (x @ self.A.T) @ self.B.T
        lora_out = lora_out * self.scale
        
        # Combined: Wx + (alpha / rank) * BAx
        return original_out + lora_out

class LoRALinear(nn.Module):
    """
    LoRA linear layer that owns both the frozen base weight and the LoRA matrices.
    Bias, dropout, and weight merging are omitted for brevity.
    """
    
    def __init__(self, in_features: int, out_features: int, 
                 rank: int = 8, alpha: int = 8):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        
        # Original weight (frozen). Random values are a placeholder here;
        # in practice, copy the pretrained weight into this tensor.
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        
        # LoRA matrices
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        self.scale = alpha / rank
    
    def forward(self, x):
        # Original: xW^T
        out = x @ self.weight.T
        
        # LoRA: xA^TB^T * scale
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T
        lora_out = lora_out * self.scale
        
        return out + lora_out
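Because the LoRA branch is linear, the trained update can be folded into the base weight for deployment. The sketch below uses random stand-ins for trained factors and checks that the merged matrix reproduces the two-branch forward pass. float64 is used only to make the comparison tight.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, alpha = 16, 12, 4, 8.0
scale = alpha / rank
f64 = torch.float64

W = torch.randn(d_out, d_in, dtype=f64)  # frozen base weight
A = torch.randn(rank, d_in, dtype=f64)   # stand-ins for trained LoRA factors
B = torch.randn(d_out, rank, dtype=f64)
x = torch.randn(3, d_in, dtype=f64)

# Two-branch forward pass, as used during training.
two_branch = x @ W.T + scale * ((x @ A.T) @ B.T)

# Fold the update into a single matrix: W' = W + (alpha / rank) * BA.
W_merged = W + scale * (B @ A)
merged = x @ W_merged.T

# Subtracting the same term recovers the base weight (up to rounding).
W_restored = W_merged - scale * (B @ A)

print(torch.allclose(two_branch, merged), torch.allclose(W_restored, W))
```

After merging, inference uses a single matrix multiply, so the low-rank branch adds no extra latency.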

Parameter Comparison

Full Fine-tuning

Parameters: All model parameters (7B model = 7B trainable params)
Memory: High (gradients and optimizer states for every parameter)
Time: Slowest per step

Adapters

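Parameters: ~0.1-1% of model, depending on bottleneck size (7B model = 7-70M params)
Memory: Lower than full fine-tuning (gradients and optimizer states only for adapter parameters)
Time: Faster training steps than full fine-tuning, but the extra sequential layers add some inference latency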

LoRA

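Parameters: ~0.01-0.1% of model at low rank on a few target modules (7B model = 0.7-7M params); more as the rank or the number of target modules grows
Memory: Low (gradients and optimizer states only for A and B; the frozen base weights still occupy memory)
Time: Faster training steps than full fine-tuning, although forward and backward passes still traverse the full model; no added inference latency once BA is merged into W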
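These percentages can be sanity-checked with a rough count. The shape below (32 layers, hidden size 4096) is an assumed 7B-class configuration, and both helper functions are illustrative. They treat every adapted matrix as d_model × d_model and count two adapters per layer.

```python
def lora_trainable_params(n_layers, d_model, rank, modules_per_layer):
    """LoRA params when each adapted matrix is d_model x d_model."""
    per_matrix = 2 * rank * d_model
    return n_layers * modules_per_layer * per_matrix

def adapter_trainable_params(n_layers, d_model, adapter_size,
                             adapters_per_layer=2):
    """Bottleneck adapter params, including both bias vectors."""
    per_adapter = 2 * d_model * adapter_size + adapter_size + d_model
    return n_layers * adapters_per_layer * per_adapter

TOTAL = 7_000_000_000         # nominal "7B" parameter count
N_LAYERS, D_MODEL = 32, 4096  # assumed 7B-class shape

lora = lora_trainable_params(N_LAYERS, D_MODEL, rank=8, modules_per_layer=2)
adapters = adapter_trainable_params(N_LAYERS, D_MODEL, adapter_size=64)

for name, n in [("LoRA (r=8, 2 matrices/layer)", lora),
                ("Adapters (size 64, 2/layer)", adapters)]:
    print(f"{name}: {n:,} params = {100 * n / TOTAL:.3f}% of 7B")
```

Raising the rank or targeting more matrices per layer scales the LoRA count linearly, so the percentage ranges above are rules of thumb rather than fixed properties of each method.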

When to Use

Use Adapters When:

Need task-specific layers
Want modular design
Multiple adapters for different tasks

Use LoRA When:

Want maximum efficiency
Need to combine multiple LoRAs
Limited compute resources

Exercises

Implement adapter layer
Implement LoRA layer
Compare parameter counts
Fine-tune with LoRA

Prompt Tuning and Prefix Tuning

New Comprehensive Content:

prompt_prefix_tuning.md: Complete detailed guide
- What is prompt tuning and prefix tuning
- Why they work (theory and intuition)
- Mathematical formulations with detailed explanations
- Architecture details
- Initialization strategies
- Best practices and tips
- Comparison with other methods
prompt_prefix_code.py: Complete implementations
- PromptTuning class with full code
- PrefixTuning class with full code
- Training functions for both methods
- Parameter comparison utilities
- Usage examples
prompt_prefix_qa.md: Comprehensive interview Q&A
- 10 detailed questions and answers
- Comparisons with LoRA and full fine-tuning
- Implementation details
- Complexity analysis
- Parameter efficiency comparisons

Key Concepts:

Prompt tuning: Adds trainable embeddings at the input (roughly 0.01% of parameters)
Prefix tuning: Adds trainable key-value prefixes at each attention layer (roughly 0.3% of parameters)
Both keep the base model frozen and train very few parameters
Can approach full fine-tuning quality, most reliably with large backbone models
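A minimal sketch of the prompt tuning idea, assuming a model that accepts input embeddings directly: a small trainable matrix is prepended to the frozen token embeddings. The `SoftPrompt` class and all sizes are hypothetical and are not the code from prompt_prefix_code.py.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Hypothetical module: prepends n_tokens trainable embeddings."""
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: (batch, seq_len, d_model)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

vocab_size, d_model, n_tokens = 1000, 64, 20

# Stand-in for the frozen embedding table of a pretrained model.
embedding = nn.Embedding(vocab_size, d_model)
embedding.weight.requires_grad = False

soft_prompt = SoftPrompt(n_tokens, d_model)
token_ids = torch.randint(0, vocab_size, (2, 10))
extended = soft_prompt(embedding(token_ids))

n_trainable = sum(p.numel() for p in soft_prompt.parameters())
print(extended.shape, n_trainable)  # torch.Size([2, 30, 64]) 1280
```

In a real pipeline the attention mask, and the labels during training, would need to be extended by the same number of prompt positions.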

Next Steps

Topic 26: Tree-based methods
Review parameter-efficient methods

ML & LLM Interview Prep — Deep Dives