Mixture of Experts: Interview Q&A

Q1: What is Mixture of Experts? How does it work?

Answer:

Mixture of Experts (MoE):

  • Architecture with multiple expert networks
  • Router decides which experts to activate
  • Only a subset of experts processes each input
  • Enables models with trillions of parameters

How It Works:

1. Multiple Experts:

  • Typically 8 to 128 feed-forward networks (experts) per MoE layer
  • Each expert has its own independent weights
  • All experts share the same architecture

2. Router/Gating:

  • Takes input, outputs expert scores
  • Computes probability distribution over experts
  • Selects top-k experts with highest scores

3. Sparse Activation:

  • Only k experts activated per token
  • Typically k=1 or k=2
  • Most experts remain inactive

4. Weighted Combination:

  • Process input through selected experts
  • Weighted combination of expert outputs
  • Weights from router probabilities

Mathematical Formulation:

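scores = Router(x)                      # Expert logits, shape (num_experts,)
probs = softmax(scores)                 # Probabilities over all experts
top_k_indices = topk(probs, k)          # Indices of the k most probable experts
weights = probs[top_k_indices] / sum(probs[top_k_indices])  # Renormalize to sum to 1
output = sum(weights[j] * Expert[i](x) for j, i in enumerate(top_k_indices))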

Key Insight:

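  • Total expert parameters: num_experts × params_per_expert (large)
  • Active expert parameters: k × params_per_expert (small)
  • Shared parameters (attention, embeddings) are used by every token and count toward both totals
  • Parameter count can grow without a proportional increase in per-token compute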
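
The counting rule above can be sketched in a few lines. This is a minimal illustration, not a description of any specific model: `moe_param_counts` is a hypothetical helper, and the sizes passed to it are made-up round numbers chosen to keep the arithmetic easy to follow.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, k):
    """Return (total, active) parameter counts for a sparse MoE model.

    shared_params: parameters every token uses (attention, embeddings, norms).
    params_per_expert: parameters of one expert, summed over all MoE layers.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active


# Hypothetical sizes, for illustration only.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=8,
    k=2,
)
print(f"total:  {total / 1e9:.0f}B")   # total:  42B
print(f"active: {active / 1e9:.0f}B")  # active: 12B
```

Adding experts raises `total` but leaves `active` unchanged, which is the scaling property the bullets describe.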

Q2: How does MoE reduce computation compared to dense models?

Answer:

Dense Model:

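  • All parameters used for every input
  • FFN computation: O(d_model²) per token per layer
  • Example: 7B parameters, all active

MoE Model:

  • Total expert parameters: num_experts × params_per_expert
  • Active expert parameters: k × params_per_expert
  • FFN computation: O(k × d_model²) per token per layer when each expert is the size of the dense FFN; this does not grow with num_experts (the router adds a small extra cost)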

Example: Mixtral-8x7B

  • 8 experts per layer, top-2 routing (k=2)
  • Only the FFN is replicated; attention and embeddings are shared, so the total is not 8 × 7B = 56B
  • Total: about 46.7B parameters
  • Active: about 12.9B parameters per token

Efficiency:

  • Total capacity: about 47B parameters
  • Computation: about 13B parameters per token
  • About 3.6× more parameters than a 13B dense model, at similar per-token computation

Memory:

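  • During training: Need all expert parameters, plus their gradients and optimizer state
  • During inference: All experts must still be resident (or offloaded and swapped in), because different tokens and layers select different experts; only the computation drops to the active subset
  • KV cache: Same as a dense model with the same attention configuration (not affected by MoE)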
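
As a rough, weights-only illustration of the memory point, the sketch below converts parameter counts into gigabytes. It assumes 2 bytes per parameter (fp16/bf16) and uses Mixtral 8x7B's reported counts of about 46.7B total and 12.9B active parameters. `weight_memory_gb` is a hypothetical helper, and activations, KV cache, and optimizer state are deliberately ignored.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Weight storage in GB (10^9 bytes); 2 bytes per parameter means fp16/bf16."""
    return num_params * bytes_per_param / 1e9


# Assumption: reported Mixtral 8x7B parameter counts.
resident = weight_memory_gb(46.7e9)  # all experts kept in memory
touched = weight_memory_gb(12.9e9)   # weights actually used for one token

print(f"resident weights: {resident:.1f} GB")  # resident weights: 93.4 GB
print(f"touched per token: {touched:.1f} GB")  # touched per token: 25.8 GB
```

The gap between the two numbers is why MoE saves computation and memory bandwidth per token but not memory capacity.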

Reduction:

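  • Expert FFN computation: (num_experts / k)× less than a dense model with the same total expert parameters
  • Example: 8 experts, k=2 → 4× reduction in expert FFN computation
  • But total expert parameters are num_experts× those of a single expert, so memory does not shrink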
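
The 4× figure applies to the expert layers only; the whole-model ratio is smaller because shared parameters are always active. As a back-of-the-envelope sketch, the reported Mixtral 8x7B totals (about 46.7B total, 12.9B active) can be split into shared and per-expert parts. `split_shared_and_expert` is a hypothetical helper, router parameters are ignored, and the results are estimates rather than published figures.

```python
def split_shared_and_expert(total, active, num_experts, k):
    """Solve total = S + num_experts * E and active = S + k * E for (S, E)."""
    per_expert = (total - active) / (num_experts - k)
    shared = active - k * per_expert
    return shared, per_expert


shared, per_expert = split_shared_and_expert(46.7e9, 12.9e9, num_experts=8, k=2)
expert_layer_ratio = 8 / 2            # reduction within the expert FFNs
whole_model_ratio = 46.7e9 / 12.9e9   # total vs. active parameters

print(f"per expert: {per_expert / 1e9:.2f}B")         # per expert: 5.63B
print(f"shared:     {shared / 1e9:.2f}B")             # shared:     1.63B
print(f"whole-model ratio: {whole_model_ratio:.2f}")  # whole-model ratio: 3.62
```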

Q3: What is the routing mechanism? How does top-k routing work?

Answer:

Routing Mechanism:

  • Router (gating network) decides which experts to use
  • Takes input, outputs scores for each expert
  • Selects experts based on scores

Top-k Routing Algorithm:

1. Compute Scores:

scores = Router(x)  # (num_experts,) - logits
probs = softmax(scores)  # Probabilities

2. Select Top-k:

top_k_probs, top_k_indices = torch.topk(probs, k)
# Select k experts with highest probabilities

3. Renormalize:

top_k_probs = top_k_probs / top_k_probs.sum()
# Renormalize so probabilities sum to 1

4. Weighted Combination:

output = 0
for i, expert_idx in enumerate(top_k_indices):
    expert_output = Expert[expert_idx](x)
    output += top_k_probs[i] * expert_output

Example:

  • 8 experts, k=2
  • Router probabilities (after softmax): [0.1, 0.3, 0.05, 0.2, 0.15, 0.1, 0.05, 0.05]
  • Top-2: experts 1 and 3, counting from 0 (probabilities 0.3 and 0.2)
  • Renormalize: 0.3 / 0.5 = 0.6 and 0.2 / 0.5 = 0.4 (for experts 1 and 3)
  • Output: 0.6 × Expert1(x) + 0.4 × Expert3(x)
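
The worked example can be checked with a small pure-Python sketch. The `top_k_route` helper and the toy experts (expert i simply multiplies its input by i + 1) are assumptions made for illustration; real experts are feed-forward networks acting on vectors.

```python
def top_k_route(probs, k):
    """Return (indices, weights) for the k largest probabilities, renormalized."""
    # Sort expert indices by probability, highest first, and keep the first k.
    indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    selected_mass = sum(probs[i] for i in indices)
    weights = [probs[i] / selected_mass for i in indices]
    return indices, weights


probs = [0.1, 0.3, 0.05, 0.2, 0.15, 0.1, 0.05, 0.05]
indices, weights = top_k_route(probs, k=2)
print(indices)                         # [1, 3]
print([round(w, 2) for w in weights])  # [0.6, 0.4]

# Toy experts (hypothetical): expert i scales its input by (i + 1).
experts = [lambda x, scale=i + 1: scale * x for i in range(8)]
x = 1.0
output = sum(w * experts[i](x) for i, w in zip(indices, weights))
print(round(output, 2))                # 2.8
```

Only two of the eight toy experts are ever called, which is the source of the compute savings.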

Why Top-k?

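  • Hard (top-k) routing: Only k experts run per token, so compute stays fixed as experts are added
  • Soft routing: All experts run and are mixed by weight, so compute grows with num_experts
  • The choice of k trades efficiency (small k) against flexibility (larger k)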

Q4: What is load balancing? Why is it important?

Answer:

Load Balancing Problem:

  • Without balancing, router might always select same experts
  • Some experts never used (waste of parameters)
  • Others overloaded (bottleneck)
  • Expert collapse: Only few experts ever used

Load Balancing Solution:

  • Encourage uniform expert usage
  • Ensure all experts are utilized
  • Prevent expert collapse

Load Balancing Loss:

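L_balance = num_experts * sum_i(f_i * P_i)

Where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability assigned to expert i over the batch (the form used in Switch Transformer). The fractions f_i alone are not differentiable, so gradients reach the router through P_i. The loss equals 1 under perfectly uniform routing and rises toward num_experts as routing collapses onto a single expert.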

Goal:

  • Minimize variance of expert usage
  • Distribute tokens evenly across experts
  • All experts should be used roughly equally
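
A minimal sketch of a Switch-Transformer-style auxiliary loss, num_experts × Σ f_i × P_i, shows how the penalty reacts to imbalance. It assumes top-1 routing and plain Python lists; `load_balance_loss`, the toy probabilities, the main loss value, and the coefficient `alpha` are all hypothetical.

```python
def load_balance_loss(assignments, router_probs, num_experts):
    """Auxiliary loss: num_experts * sum_i f_i * P_i.

    assignments: chosen expert index for each token (top-1 routing).
    router_probs: per-token probability vectors from the router softmax.
    """
    num_tokens = len(assignments)
    loss = 0.0
    for i in range(num_experts):
        f_i = sum(1 for a in assignments if a == i) / num_tokens  # token fraction
        p_i = sum(p[i] for p in router_probs) / num_tokens        # mean probability
        loss += f_i * p_i
    return num_experts * loss


# Balanced: each of 4 tokens prefers a different expert.
balanced_probs = [[0.7 if j == t else 0.1 for j in range(4)] for t in range(4)]
balanced = load_balance_loss([0, 1, 2, 3], balanced_probs, num_experts=4)

# Collapsed: every token prefers expert 0.
collapsed_probs = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
collapsed = load_balance_loss([0, 0, 0, 0], collapsed_probs, num_experts=4)

print(round(balanced, 2))   # 1.0
print(round(collapsed, 2))  # 2.8

# Combined objective with a hypothetical main loss and coefficient.
alpha = 0.01
total_loss = 2.0 + alpha * collapsed
```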

Why Important:

  • Without balancing: Routing can collapse, e.g. experts 0-2 take nearly all tokens while experts 3-7 are rarely used
  • With balancing: All experts used roughly equally
  • Better parameter utilization
  • Prevents expert collapse

Training:

  • Add load balancing loss to total loss
  • L_total = L_main + α * L_balance, with α small (e.g. 0.01) so the auxiliary term does not dominate
  • Encourages router to distribute tokens

Q5: Compare MoE with dense models. What are the trade-offs?

Answer:

Comparison:

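Aspect              | Dense Model | MoE Model
Total Parameters    | P           | num_experts × P
Active Parameters   | P (all)     | k × P
Computation         | O(P)        | O(k × P)
Memory (Training)   | P           | num_experts × P
Memory (Inference)  | P           | num_experts × P (all experts resident)
Quality             | Baseline    | Typically better at equal per-token compute; typically somewhat worse than a dense model with the same total parameters
Training            | Simple      | Complex (need balancing)

Here P is the size of one expert, taken equal to the dense model's FFN. Shared attention and embedding parameters are the same in both columns and are omitted.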

Trade-offs:

MoE Advantages:

  • Can have many more parameters (trillions)
  • Only use subset per input (efficient)
  • Experts can specialize
  • Better for diverse inputs

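MoE Disadvantages:

  • More complex training (load balancing)
  • Higher memory in both training and inference (all experts must be stored)
  • Routing overhead (small), plus communication cost when experts are sharded across devices
  • Typically lower quality than a dense model with the same total parameter count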

When to Use:

  • Dense: Small-medium models, simplicity
  • MoE: Large models, need efficiency, diverse inputs

Q6: How is MoE used in modern LLMs like GPT-4 and Mixtral?

Answer:

GPT-4 (Rumored):

  • OpenAI has not published the architecture; MoE use is reported but unconfirmed
  • Reported to use multiple experts with top-k routing
  • Parameter counts in the trillions are likewise unconfirmed

Mixtral-8x7B:

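  • 8 experts per layer; the "7B" refers to the base model size, not to the size of each expert
  • Total: about 46.7B parameters (attention and embeddings are shared)
  • Top-2 routing (k=2)
  • Active: about 12.9B parameters per token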

Architecture:

  • Replace standard FFN with MoE-FFN
  • In Mixtral, every transformer block has an MoE layer; some other designs place one in every other block
  • Router decides which experts per token
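
A toy sketch of an MoE-FFN over a batch of tokens shows the dispatch pattern such a layer typically follows: route each token, group tokens by expert, run each expert once on its group, then scatter the weighted results back. Everything here is illustrative: tokens are scalars instead of vectors, and `moe_ffn`, `router`, and the scaling experts are hypothetical stand-ins for learned networks.

```python
import math
from collections import defaultdict


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_ffn(tokens, router, experts, k=2):
    """Apply a toy MoE-FFN to scalar tokens; return (outputs, tokens per expert)."""
    # Step 1: route every token and record (position, weight) per expert.
    dispatch = defaultdict(list)
    for pos, x in enumerate(tokens):
        probs = softmax(router(x))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        mass = sum(probs[i] for i in top)
        for i in top:
            dispatch[i].append((pos, probs[i] / mass))

    # Step 2: each selected expert processes only the tokens sent to it.
    outputs = [0.0] * len(tokens)
    for i, items in dispatch.items():
        results = experts[i]([tokens[pos] for pos, _ in items])
        # Step 3: scatter the weighted results back to their token positions.
        for (pos, weight), y in zip(items, results):
            outputs[pos] += weight * y
    return outputs, {i: len(items) for i, items in dispatch.items()}


# Hypothetical router and experts: 4 experts, expert j scales inputs by (j + 1).
router = lambda x: [x, -x, 0.5 * x, 0.0]
experts = [lambda batch, s=s: [s * v for v in batch] for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, -2.0, 0.5]

outputs, counts = moe_ffn(tokens, router, experts, k=2)
print(sorted(counts.items()))  # [(0, 2), (1, 1), (2, 2), (3, 1)]
```

Each token contributes exactly k entries to the dispatch table, so the total expert workload is len(tokens) × k regardless of how many experts exist.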

Efficiency:

  • Total capacity: about 47B parameters
  • Computation: about 13B parameters active per token
  • Similar computation to a 13B dense model
  • But about 3.6× more capacity

Quality:

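  • Achieves quality comparable to larger dense models (Mistral reports Mixtral matching or outperforming Llama 2 70B on most benchmarks)
  • With the per-token computation of a much smaller model
  • The cost is memory: all experts must still be stored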

Q7: What are the challenges in training MoE models?

Answer:

1. Expert Collapse:

  • Router might always select same experts
  • Other experts never trained
  • Solution: Load balancing loss

2. Gradient Flow:

  • Only active experts receive gradients
  • Inactive experts don't learn
  • Solution: Expert sampling, auxiliary losses

3. Routing Instability:

  • Router decisions can be unstable
  • Experts might not converge
  • Solution: Temperature annealing, regularization

4. Load Imbalance:

  • Uneven expert usage
  • Some experts overloaded
  • Solution: Load balancing loss, expert capacity limits

5. Memory:

  • Need to store all expert parameters
  • Higher memory than a dense model with the same per-token compute
  • Solution: Expert sharding, gradient checkpointing
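
The expert capacity limit mentioned under Load Imbalance can be sketched as follows. This assumes the common formulation capacity = ceil(capacity_factor × num_tokens × k / num_experts) with first-come-first-served filling under top-1 routing. `expert_capacity` and `apply_capacity` are hypothetical helpers, and what happens to overflow tokens (often they skip the expert and pass through the residual connection) varies by implementation.

```python
import math


def expert_capacity(num_tokens, num_experts, k, capacity_factor=1.25):
    """Maximum number of tokens one expert accepts per batch."""
    return math.ceil(capacity_factor * num_tokens * k / num_experts)


def apply_capacity(assignments, num_experts, capacity):
    """Split token positions into (kept, dropped) given a per-expert capacity."""
    counts = [0] * num_experts
    kept, dropped = [], []
    for pos, expert in enumerate(assignments):
        if counts[expert] < capacity:
            counts[expert] += 1
            kept.append(pos)
        else:
            dropped.append(pos)  # overflow: this expert is already full
    return kept, dropped


# 8 tokens, 4 experts, top-1 routing, no slack (capacity_factor=1.0).
assignments = [0, 0, 0, 1, 2, 0, 3, 1]
capacity = expert_capacity(len(assignments), num_experts=4, k=1, capacity_factor=1.0)
kept, dropped = apply_capacity(assignments, num_experts=4, capacity=capacity)

print(capacity)  # 2
print(kept)      # [0, 1, 3, 4, 6, 7]
print(dropped)   # [2, 5]
```

A larger capacity factor drops fewer tokens but reserves more compute and memory per expert.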

Training Techniques:

  • Load balancing loss
  • Expert sampling or router noise (occasionally route to non-top experts so they keep receiving gradients)
  • Temperature annealing (soft → hard routing)
  • Gradient clipping
  • Careful initialization
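
Temperature annealing can be illustrated with a temperature-scaled softmax: a high temperature spreads probability across experts (soft routing), while a low temperature concentrates it on the top expert (approaching hard routing). The linear schedule and its start and end values below are hypothetical choices, not taken from any particular model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature gives sharper output."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear_anneal(step, total_steps, start=2.0, end=0.5):
    """Hypothetical schedule: move linearly from start to end temperature."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)


logits = [2.0, 1.0, 0.5, 0.0]
for step in (0, 500, 1000):
    t = linear_anneal(step, total_steps=1000)
    probs = softmax_with_temperature(logits, t)
    print(step, round(t, 2), [round(p, 2) for p in probs])
```

Across the three printed steps the largest probability grows, showing the router's distribution sharpening as training proceeds.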

Summary

Mixture of Experts enables training models with very large parameter counts while keeping per-token computation modest. By activating only a subset of experts for each input, MoE gets much of the capacity of a large model at the computation of a much smaller one, at the price of storing every expert in memory. Key components include multiple expert networks, a routing mechanism for expert selection, and load balancing to ensure all experts are utilized. Mixtral-8x7B uses MoE in this way, and GPT-4 is reported, though not confirmed, to do so as well.