Prompt Tuning and Prefix Tuning: Interview Q&A

Q1: What is prompt tuning? How does it differ from fine-tuning?

Answer:

Prompt Tuning:

  • Parameter-efficient fine-tuning method
  • Adds trainable "soft prompts" (continuous embeddings) to input
  • Keeps entire pre-trained model frozen
  • Only trains prompt embeddings (typically 20-100 tokens)
  • Extremely parameter-efficient: <0.01% of model parameters

Key Differences from Fine-tuning:

1. Parameters Updated:

  • Fine-tuning: Updates all model parameters (billions)
  • Prompt tuning: Updates only prompt embeddings (thousands)
  • Efficiency: Prompt tuning uses 100-1000x fewer parameters

2. Storage:

  • Fine-tuning: Need full model copy per task (GBs)
  • Prompt tuning: Only store small prompt embeddings (MBs)
  • Benefit: Can deploy many tasks with same base model

3. Training:

  • Fine-tuning: Updates all parameters, risk of catastrophic forgetting
  • Prompt tuning: Model stays frozen, preserves pre-trained knowledge
  • Benefit: Model can still do original tasks

4. Multi-task:

  • Fine-tuning: Separate model per task
  • Prompt tuning: Same model, different prompts per task
  • Benefit: Efficient multi-task serving

How It Works:

  1. Prepend trainable prompt embeddings to input
  2. Pass [prompt; input] through frozen model
  3. Only update prompt embeddings during training
  4. Prompt learns to encode task-specific information

Example:

  • Model: GPT-2 (125M parameters)
  • Prompt tuning: 20 tokens × 768 dim = 15,360 parameters
  • Efficiency: 15,360 / 125,000,000 = 0.012% of parameters

Q2: What is prefix tuning? How does it differ from prompt tuning?

Answer:

Prefix Tuning:

  • Similar to prompt tuning but adds parameters at every layer
  • Adds trainable "prefix" key-value pairs at each transformer layer
  • More expressive than prompt tuning
  • Still parameter-efficient compared to full fine-tuning

Key Differences:

1. Where Parameters Are Added:

  • Prompt tuning: Only at input layer (embeddings)
  • Prefix tuning: At every transformer layer (attention)
  • Impact: Prefix can influence model at multiple levels

2. What's Added:

  • Prompt tuning: Prompt embeddings (added to input)
  • Prefix tuning: Prefix keys and values (added to attention)
  • Impact: Prefix directly modifies attention computation

3. Parameters:

  • Prompt tuning: p × d_model (p = prompt length)
  • Prefix tuning: L × p × 2d_model (L = layers, for K and V)
  • Example: 12 layers, 20 tokens, 768 dim
    • Prompt: 20 × 768 = 15,360
    • Prefix: 12 × 20 × 2 × 768 = 368,640
    • Still much less than full model

4. Expressiveness:

  • Prompt tuning: Simpler, may be less expressive
  • Prefix tuning: More complex, often better performance
  • Trade-off: More parameters for better performance

5. Implementation:

  • Prompt tuning: Simple concatenation at input
  • Prefix tuning: Modify attention at each layer
  • Complexity: Prefix tuning more complex to implement

When to Use:

  • Prompt tuning: Simple tasks, maximum efficiency
  • Prefix tuning: Complex tasks, need better performance

Q3: Explain the mathematical formulation of prompt tuning.

Answer:

Mathematical Setup:

Given:

  • Pre-trained model with frozen parameters θ
  • Input tokens: x = [x₁, x₂, ..., xₙ]
  • Prompt length: p
  • Model dimension: d_model

Step 1: Token Embedding

E_input = Embedding_θ(x)
Shape: (n, d_model)

Step 2: Prompt Embedding (Trainable)

P = [p₁, p₂, ..., pₚ]  # Learnable parameters
Shape: (p, d_model)

Step 3: Concatenate

E_combined = [P; E_input]
Shape: (p + n, d_model)

Step 4: Forward Pass (Model Frozen)

output = Model_θ(E_combined)
# All parameters θ are frozen, no gradients

Step 5: Loss and Update

loss = CrossEntropy(output, target)
∇P = ∂loss/∂P  # Only gradients for prompt P
P ← P - α∇P  # Update only prompt embeddings
# θ remains unchanged

Key Equations:

Attention with Prompt:

Q = E_combined W_q
K = E_combined W_k
V = E_combined W_v

Attention = softmax(QK^T / √d_k) V
# Prompt tokens influence attention to input tokens

Parameter Count:

Trainable parameters = p × d_model
# Example: 20 × 768 = 15,360 parameters

Why It Works:

  • Prompt tokens attend to and influence input processing
  • Model learns to interpret prompt as task instruction
  • Prompt encodes task-specific information efficiently

Q4: Explain the mathematical formulation of prefix tuning.

Answer:

Mathematical Setup:

Given:

  • Pre-trained model with L layers, frozen parameters θ
  • Input tokens: x = [x₁, x₂, ..., xₙ]
  • Prefix length: p
  • Model dimension: d_model

At Each Layer l (l = 1, ..., L):

Step 1: Standard Q, K, V Computation

Q_l = X_l W_q^l  # Queries from sequence
K_l = X_l W_k^l  # Keys from sequence
V_l = X_l W_v^l  # Values from sequence
Shape: (n, d_model)

Step 2: Add Prefix (Trainable)

P_l^K = PrefixKey_l  # Learnable prefix keys
P_l^V = PrefixValue_l  # Learnable prefix values
Shape: (p, d_model) each

K_l = [P_l^K; K_l]  # Concatenate prefix keys
V_l = [P_l^V; V_l]  # Concatenate prefix values
Shape: (p + n, d_model)
# Q_l remains unchanged

Step 3: Attention with Prefix

Attention_l = softmax(Q_l K_l^T / √d_k) V_l
# Q_l attends to both prefix and sequence tokens
Shape: (n, d_model)

Total Parameters:

For each layer:
- Prefix embeddings: p × d_model (reparameterized: p × d_model/2)
- K projection: d_model × d_model
- V projection: d_model × d_model

Total = L × (p × d_model/2 + 2 × d_model²)
# With reparameterization

Example:

  • L = 12 layers
  • p = 20 tokens
  • d_model = 768
  • Total ≈ 12 × (20 × 384 + 2 × 768²) ≈ 14M parameters
  • Still much less than full model (125M+)

Why It Works:

  • Prefix influences attention at every layer
  • Can guide model behavior at multiple abstraction levels
  • More flexible than prompt tuning (only input layer)

Q5: What are the advantages and disadvantages of prompt tuning vs prefix tuning?

Answer:

Prompt Tuning Advantages:

  • Maximum efficiency: Fewest parameters (p × d_model)
  • Simple implementation: Just concatenate at input
  • Fast training: Fewer parameters to update
  • Easy to deploy: Very small storage per task
  • Good for simple tasks: Sufficient for many applications

Prompt Tuning Disadvantages:

  • Less expressive: Only influences input layer
  • May underperform: On complex tasks compared to prefix/full fine-tuning
  • Limited capacity: Small number of parameters may not capture complex patterns

Prefix Tuning Advantages:

  • More expressive: Influences every layer
  • Better performance: Often matches full fine-tuning
  • Multi-level influence: Can guide model at different abstraction levels
  • Still efficient: Much fewer parameters than full fine-tuning

Prefix Tuning Disadvantages:

  • More parameters: L × p × 2d_model vs p × d_model
  • More complex: Need to modify attention at each layer
  • Slower training: More parameters to update
  • More storage: Larger than prompt tuning (but still small)

Comparison Table:

AspectPrompt TuningPrefix TuningFull Fine-tuning
Parametersp × d_modelL × p × 2d_modelAll parameters
EfficiencyHighestHighLow
PerformanceGoodVery GoodBest
ComplexitySimpleModerateSimple
StorageSmallestSmallLarge
Use CaseSimple tasksComplex tasksMaximum performance

Recommendation:

  • Start with prompt tuning (simpler, more efficient)
  • If performance insufficient, try prefix tuning
  • Use full fine-tuning only if needed and resources available

Q6: How do you initialize prompt/prefix embeddings? What strategies work best?

Answer:

Initialization Strategies:

1. Random Initialization:

P ~ N(0, 0.02²)  # Small random values
  • Pros: Simple, unbiased
  • Cons: May require more training, slower convergence
  • Use: Default, works for most cases

2. Vocabulary-Based Initialization:

Sample random tokens from vocabulary
Use their embeddings as initial prompt
  • Pros: Starts with semantic information
  • Cons: May bias towards specific tokens
  • Use: Often works better than random

3. Task-Specific Initialization:

Use embeddings from task-related tokens
E.g., for sentiment: "sentiment", "positive", "negative"
  • Pros: Better starting point, faster convergence
  • Cons: Requires domain knowledge
  • Use: When you know relevant tokens

4. Learned Initialization (Transfer):

Train on related task first
Use learned prompts as initialization
  • Pros: Transfers knowledge from related tasks
  • Cons: Requires related task data
  • Use: Multi-task scenarios

5. Reparameterization (Prefix Tuning):

Learn in smaller space (d_model/2)
Project up to full dimension
  • Pros: More stable training
  • Cons: Slightly more parameters
  • Use: Prefix tuning, improves stability

Best Practices:

  • Prompt tuning: Start with vocabulary-based
  • Prefix tuning: Use reparameterization + random init
  • Experiment: Try different strategies, use validation performance
  • Task-specific: Use domain knowledge when available

Q7: What is the optimal prompt/prefix length? How do you choose it?

Answer:

Prompt/Prefix Length Selection:

Typical Ranges:

  • Prompt tuning: 20-100 tokens (commonly 20-50)
  • Prefix tuning: 10-50 tokens (commonly 10-20 per layer)

Factors to Consider:

1. Task Complexity:

  • Simple tasks (binary classification): 20 tokens often sufficient
  • Complex tasks (QA, generation): 50-100 tokens may be needed
  • Rule of thumb: More complex task → longer prompt/prefix

2. Dataset Size:

  • Large datasets: Can support longer prompts (less overfitting risk)
  • Small datasets: Shorter prompts (avoid overfitting)
  • Balance: Enough capacity but not too much

3. Model Size:

  • Larger models: Can utilize longer prompts effectively
  • Smaller models: Shorter prompts may be sufficient
  • Match capacity: Prompt capacity should match model capacity

Selection Process:

1. Start with Moderate Length:

  • Prompt: 20-30 tokens
  • Prefix: 10-20 tokens per layer

2. Validation Experiment:

  • Try different lengths: [10, 20, 50, 100]
  • Train on each, evaluate on validation set
  • Choose length with best validation performance

3. Consider Trade-offs:

  • Longer: More capacity, more parameters, risk of overfitting
  • Shorter: Less capacity, fewer parameters, less overfitting risk

4. Practical Guidelines:

  • Minimum: 10 tokens (may not have enough capacity)
  • Common: 20-50 tokens (good balance)
  • Maximum: 100+ tokens (diminishing returns, overfitting risk)

Empirical Finding:

  • Performance improves with length up to a point
  • Then plateaus or degrades (overfitting)
  • Sweet spot: 20-50 tokens for most tasks

Q8: Compare prompt tuning, prefix tuning, LoRA, and full fine-tuning.

Answer:

Parameter Efficiency:

Full Fine-tuning:

  • Parameters: 100% of model
  • Example: 125M parameters for GPT-2
  • Storage: Full model per task

LoRA:

  • Parameters: Low-rank matrices (r × d_model)
  • Example: r=8, d_model=768 → ~6K per layer
  • Total: ~0.1-1% of model parameters
  • Storage: Small adapter weights

Prefix Tuning:

  • Parameters: L × p × 2d_model
  • Example: 12 × 20 × 2 × 768 ≈ 368K
  • Total: ~0.3% of model parameters
  • Storage: Prefix embeddings per task

Prompt Tuning:

  • Parameters: p × d_model
  • Example: 20 × 768 = 15K
  • Total: ~0.01% of model parameters
  • Storage: Smallest (just prompt embeddings)

Performance:

Full Fine-tuning:

  • Best performance (all parameters optimized)
  • Risk of catastrophic forgetting
  • Requires most resources

LoRA:

  • Near full fine-tuning performance
  • Good balance of efficiency and performance
  • Most popular method currently

Prefix Tuning:

  • Very good performance (often matches full fine-tuning)
  • More expressive than prompt tuning
  • Good for complex tasks

Prompt Tuning:

  • Good performance (may be slightly lower)
  • Sufficient for many tasks
  • Maximum efficiency

Use Cases:

Full Fine-tuning:

  • Maximum performance needed
  • Have resources and data
  • Single task deployment

LoRA:

  • Best balance of efficiency and performance
  • Most common in practice
  • Multi-task scenarios

Prefix Tuning:

  • Complex tasks, need good performance
  • Can afford slightly more parameters
  • Multi-task scenarios

Prompt Tuning:

  • Simple tasks
  • Maximum efficiency needed
  • Many tasks, limited resources

Comparison Table:

MethodParametersPerformanceComplexityStorage
Full Fine-tuning100%BestSimpleLarge
LoRA0.1-1%ExcellentModerateSmall
Prefix Tuning0.3%Very GoodModerateSmall
Prompt Tuning0.01%GoodSimpleSmallest

Q9: How do you implement prompt tuning? Show the key code.

Answer:

Key Implementation Steps:

1. Freeze Model:

for param in model.parameters():
    param.requires_grad = False

2. Create Prompt Embeddings:

prompt_length = 20
d_model = 768
prompt_embeddings = nn.Parameter(
    torch.randn(prompt_length, d_model) * 0.02
)

3. Concatenate with Input:

input_embeddings = model.transformer.wte(input_ids)
# Shape: (batch, seq_len, d_model)

prompt = prompt_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
# Shape: (batch, prompt_length, d_model)

combined = torch.cat([prompt, input_embeddings], dim=1)
# Shape: (batch, prompt_length + seq_len, d_model)

4. Forward Pass:

outputs = model.transformer(inputs_embeds=combined)
logits = model.lm_head(outputs.last_hidden_state)

5. Training:

optimizer = torch.optim.Adam([prompt_embeddings], lr=0.3)
# Only prompt_embeddings are updated

Complete Code: See prompt_prefix_code.py for full implementation!

Key Points:

  • Only prompt_embeddings requires gradients
  • Model parameters stay frozen
  • Very simple implementation
  • Extremely parameter-efficient

Q10: How do you implement prefix tuning? What's the complexity?

Answer:

Key Implementation Steps:

1. Freeze Model:

for param in model.parameters():
    param.requires_grad = False

2. Create Prefix for Each Layer:

# Reparameterized prefix
prefix_emb = nn.Parameter(
    torch.randn(prefix_length, d_model // 2) * 0.02
)
prefix_proj = nn.Linear(d_model // 2, d_model)

# Project to K and V for each layer
prefix_k_proj = nn.ModuleList([
    nn.Linear(d_model, d_model) for _ in range(num_layers)
])
prefix_v_proj = nn.ModuleList([
    nn.Linear(d_model, d_model) for _ in range(num_layers)
])

3. Modify Attention at Each Layer:

for layer_idx, layer in enumerate(model.transformer.h):
    # Get prefix for this layer
    prefix_k, prefix_v = get_prefix_kv(layer_idx)
    
    # Standard Q, K, V
    Q = compute_queries(hidden_states)
    K = compute_keys(hidden_states)
    V = compute_values(hidden_states)
    
    # Add prefix
    K = torch.cat([prefix_k, K], dim=1)
    V = torch.cat([prefix_v, V], dim=1)
    
    # Attention
    attention = softmax(Q @ K.T / sqrt(d_k)) @ V

Complexity:

Parameters:

  • Prefix embeddings: p × d_model/2 (reparameterized)
  • Projection: (d_model/2) × d_model
  • K/V projections per layer: 2 × d_model²
  • Total: L × (p × d_model/2 + 2 × d_model²)

Time Complexity:

  • Same as standard attention: O(n²d)
  • Prefix adds p tokens, so O((n+p)²d)
  • Typically p << n, so similar to standard

Space Complexity:

  • Store prefix embeddings: O(p × d_model)
  • Per-layer projections: O(L × d_model²)
  • Much less than full model

Implementation Complexity:

  • More complex than prompt tuning
  • Need to modify attention at each layer
  • Requires understanding of transformer internals

See prompt_prefix_code.py for complete implementation!


Summary

Prompt tuning and prefix tuning are powerful parameter-efficient fine-tuning methods. Prompt tuning adds trainable embeddings at the input layer, making it extremely efficient (0.01% parameters) and simple to implement. Prefix tuning adds trainable key-value pairs at every layer, providing more expressiveness (0.3% parameters) and often matching full fine-tuning performance. Both methods keep the pre-trained model frozen, enabling efficient multi-task deployment and preserving pre-trained knowledge. The choice between them depends on the task complexity, available resources, and performance requirements.