Topic 25: Adapters & LoRA (Parameter-Efficient Fine-tuning)
🔥 For interviews, read these first:
LORA_DEEP_DIVE.md— frontier-lab interview deep dive: LoRA math (ΔW = B·A), intrinsic-dimension hypothesis, α/r scaling, QLoRA's three innovations (NF4, double quantization, paged optimizer), adapter modules, prefix tuning, IA³, DoRA, GaLore, multi-LoRA serving (S-LoRA, Punica).INTERVIEW_GRILL.md— 40 active-recall questions.
What You'll Learn
This topic teaches you parameter-efficient fine-tuning:
- Adapters
- LoRA (Low-Rank Adaptation)
- How they work
- When to use them
- Simple implementations
Why We Need This
Interview Importance
- Hot topic: LoRA is widely used
- Efficiency: Shows understanding of efficient training
- Practical knowledge: Used in production
Real-World Application
- Fine-tuning: Fine-tune large models efficiently
- Cost savings: Much cheaper than full fine-tuning
- Multiple tasks: Train multiple adapters for different tasks
Industry Use Cases
1. Adapters
Use Case: Task-specific fine-tuning
- Add small adapter layers
- Freeze main model
- Train only adapters
2. LoRA
Use Case: Most popular PEFT method
- Low-rank decomposition
- Train only small matrices
- Can combine multiple LoRAs
Core Intuition
Parameter-efficient fine-tuning exists because full fine-tuning of large models is expensive in:
- memory
- optimizer state
- storage
- deployment complexity
The key idea is:
- keep most pretrained weights frozen
- learn a small set of task-specific updates
Adapters
Adapters insert small trainable modules into the network.
The intuition is:
- the backbone already knows a lot
- a small bottleneck module can steer behavior for a task
LoRA
LoRA does not insert a whole new transformation in the same way adapters do.
Instead, it learns a low-rank update to an existing weight matrix.
That is why LoRA feels lightweight:
- frozen base weight
- small trainable update
- task adaptation with far fewer parameters
Technical Details Interviewers Often Want
Why Low Rank Might Work
LoRA assumes the important task-specific update can often be represented in a much lower-rank subspace than a full dense update.
That is the key modeling assumption behind the method.
Why LoRA Is Operationally Attractive
LoRA is popular because it often gives:
- very low trainable parameter count
- lower optimizer memory
- easy swapping of task-specific adapters
Adapter vs LoRA
This is a common follow-up.
- Adapters add trainable modules to the network path
- LoRA modifies an existing linear transform through a low-rank update
Both are PEFT, but they intervene differently.
Common Failure Modes
- treating PEFT as always equivalent to full fine-tuning
- choosing rank too low and underfitting the task
- applying LoRA to the wrong target modules
- ignoring inference-time or deployment composition issues with many adapters
- assuming fewer trainable parameters always means equal final quality
Edge Cases and Follow-Up Questions
- Why can LoRA work with so few trainable parameters?
- What does the rank
rcontrol? - Why might full fine-tuning still outperform PEFT?
- How are adapters different from LoRA conceptually?
- Why is PEFT especially valuable for multi-task or resource-constrained setups?
What to Practice Saying Out Loud
- Why PEFT exists at all
- The core idea behind low-rank adaptation
- The difference between "cheaper training" and "equally expressive training"
Theory
Adapters
Concept:
- Add small adapter layers between transformer layers
- Freeze original model weights
- Train only adapter parameters
Architecture:
Original: X → Transformer → Y
With Adapter: X → Transformer → Adapter → Y
Parameters:
adapter_size: Hidden dimension of adapter (e.g., 64, 128)adapter_layers: Which layers to add adapters (all or specific)
LoRA (Low-Rank Adaptation)
Concept:
- Instead of updating all weights W, update low-rank matrices
- W' = W + ΔW, where ΔW = BA (low-rank)
- Train only B and A matrices
Mathematical Formulation:
Original: h = Wx
LoRA: h = Wx + ΔWx = Wx + BAx
Where:
- W: Original weight matrix (d × d)
- B: Low-rank matrix (d × r), r << d
- A: Low-rank matrix (r × d)
- r: Rank (typically 1-16)
Why it works:
- Low-rank assumption: Weight updates have low intrinsic rank
- Much fewer parameters: r × d instead of d × d
- Example: d=4096, r=8 → 8×4096×2 = 65K params vs 16M params
Parameters:
rank(r): Rank of decomposition (1-16, typically 8)alpha: Scaling factor (usually = rank)target_modules: Which layers to apply LoRA (attention, MLP, etc.)dropout: Dropout in LoRA layers
Industry-Standard Boilerplate Code
Adapter Implementation
"""
Adapter: Small trainable layers
"""
import torch
import torch.nn as nn
class Adapter(nn.Module):
"""
Adapter layer
Architecture:
- Down projection: d → adapter_size
- Activation: ReLU
- Up projection: adapter_size → d
- Residual connection
"""
def __init__(self, d_model: int, adapter_size: int = 64):
super().__init__()
self.down_proj = nn.Linear(d_model, adapter_size)
self.activation = nn.ReLU()
self.up_proj = nn.Linear(adapter_size, d_model)
def forward(self, x):
# Adapter: down → activation → up
adapter_out = self.up_proj(self.activation(self.down_proj(x)))
# Residual connection
return x + adapter_out
LoRA Implementation
"""
LoRA: Low-Rank Adaptation
"""
import torch
import torch.nn as nn
class LoRALayer(nn.Module):
"""
LoRA layer
W' = W + BA
Where B (d × r) and A (r × d) are trainable
"""
def __init__(self, d_model: int, rank: int = 8, alpha: int = 8):
super().__init__()
self.rank = rank
self.alpha = alpha
# Low-rank matrices
self.A = nn.Parameter(torch.randn(rank, d_model) * 0.02)
self.B = nn.Parameter(torch.zeros(d_model, rank))
# Scaling factor
self.scale = alpha / rank
def forward(self, x, W):
"""
Forward pass
Args:
x: Input (batch, seq_len, d_model)
W: Original weight matrix (frozen)
"""
# Original: Wx
original_out = x @ W.T
# LoRA: BAx
lora_out = (x @ self.A.T) @ self.B.T
lora_out = lora_out * self.scale
# Combined: Wx + BAx
return original_out + lora_out
class LoRALinear(nn.Module):
"""
LoRA Linear layer (complete implementation)
"""
def __init__(self, in_features: int, out_features: int,
rank: int = 8, alpha: int = 8):
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.rank = rank
self.alpha = alpha
# Original weight (frozen)
self.weight = nn.Parameter(torch.randn(out_features, in_features))
self.weight.requires_grad = False # Freeze
# LoRA matrices
self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
self.scale = alpha / rank
def forward(self, x):
# Original: xW^T
out = x @ self.weight.T
# LoRA: xA^TB^T * scale
lora_out = (x @ self.lora_A.T) @ self.lora_B.T
lora_out = lora_out * self.scale
return out + lora_out
Parameter Comparison
Full Fine-tuning
- Parameters: All model parameters (7B model = 7B params)
- Memory: High (gradients + optimizer states)
- Time: Slow
Adapters
- Parameters: ~0.1-1% of model (7B model = 7-70M params)
- Memory: Medium
- Time: Medium
LoRA
- Parameters: ~0.01-0.1% of model (7B model = 0.7-7M params)
- Memory: Low
- Time: Fast
When to Use
Use Adapters When:
- Need task-specific layers
- Want modular design
- Multiple adapters for different tasks
Use LoRA When:
- Want maximum efficiency
- Need to combine multiple LoRAs
- Limited compute resources
Exercises
- Implement adapter layer
- Implement LoRA layer
- Compare parameter counts
- Fine-tune with LoRA
Prompt Tuning and Prefix Tuning
New Comprehensive Content:
-
prompt_prefix_tuning.md: Complete detailed guide- What is prompt tuning and prefix tuning
- Why they work (theory and intuition)
- Mathematical formulations with detailed explanations
- Architecture details
- Initialization strategies
- Best practices and tips
- Comparison with other methods
-
prompt_prefix_code.py: Complete implementationsPromptTuningclass with full codePrefixTuningclass with full code- Training functions for both methods
- Parameter comparison utilities
- Usage examples
-
prompt_prefix_qa.md: Comprehensive interview Q&A- 10 detailed questions and answers
- Comparisons with LoRA and full fine-tuning
- Implementation details
- Complexity analysis
- Parameter efficiency comparisons
Key Concepts:
- Prompt tuning: Adds trainable embeddings at input (0.01% parameters)
- Prefix tuning: Adds trainable key-value at each layer (0.3% parameters)
- Both keep model frozen, extremely efficient
- Can achieve similar performance to full fine-tuning
Next Steps
- Topic 26: Tree-based methods
- Review parameter-efficient methods