Mixture of Experts: Interview Q&A

Q1: What is Mixture of Experts? How does it work?

Answer:

Mixture of Experts (MoE):

Architecture with multiple expert networks
Router decides which experts to activate
Only subset of experts process each input
Enables models with trillions of parameters

How It Works:

1. Multiple Experts:

8-128 feed-forward networks (experts)
Each expert is independent
All experts have same architecture

2. Router/Gating:

Takes input, outputs expert scores
Computes probability distribution over experts
Selects top-k experts with highest scores

3. Sparse Activation:

Only k experts activated per token
Typically k=1 or k=2
Most experts remain inactive

4. Weighted Combination:

Process input through selected experts
Weighted combination of expert outputs
Weights from router probabilities

Mathematical Formulation:

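scores = Router(x)  # Expert logits, shape (num_experts,)
probs = softmax(scores)  # Probabilities over experts
top_k_indices = topk(probs, k)  # Indices of the k highest-probability experts
# Many implementations renormalize probs over the selected experts first (see Q3)
output = sum(probs[i] * Expert[i](x) for i in top_k_indices)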

Key Insight:

Total parameters: num_experts × params_per_expert, plus shared layers (large)
Active parameters: k × params_per_expert, plus shared layers (small)
Enables scaling parameter count without a proportional compute increase
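
The total-versus-active distinction can be sketched as a small helper. The function name and the shared_params argument are illustrative assumptions, not part of any library; shared_params stands for everything that is not replicated per expert.

```python
def moe_param_counts(num_experts, k, params_per_expert, shared_params=0):
    """Return (total, active-per-token) parameter counts for an MoE model.

    shared_params covers weights that are not replicated per expert
    (attention, embeddings, router); they are used for every token.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active


# Hypothetical sizes: 64 experts of 100M parameters each, top-2 routing.
total, active = moe_param_counts(num_experts=64, k=2, params_per_expert=100_000_000)
print(total, active)  # 6400000000 200000000
```

Total grows with num_experts while active grows only with k, which is the whole point of sparse activation.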

Q2: How does MoE reduce computation compared to dense models?

Answer:

Dense Model:

All parameters used for every input
Computation: O(d_model²) per token per FFN layer (the FFN width is a constant multiple of d_model)
Example: 7B parameters, all active

MoE Model:

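Total parameters: num_experts × params_per_expert (plus shared attention and embedding weights)
Active parameters: k × params_per_expert (plus the same shared weights)
Computation: O(k × d_model²) per token per MoE layer, independent of num_experts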

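Example: Mixtral-8x7B

8 experts per MoE layer; only the FFN is replicated, attention and embeddings are shared
Total: about 47B parameters (not 8 × 7B = 56B, because shared weights are counted once)
Active: k=2, so about 13B parameters per token
Computation: Only about 13B parameters active (not 47B!)

Efficiency:

Total capacity: about 47B parameters
Computation: Only about 13B parameters per token
About 3.6× more parameters than are active, with computation similar to a 13B dense model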
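
The naive 8 × 7B count overstates Mixtral's size because only the feed-forward blocks are replicated per expert. The estimate below is a rough sketch: the configuration values are assumptions taken from the publicly released Mixtral-8x7B config as I understand it (they are not stated in this document), and normalization weights are ignored.

```python
# Assumed Mixtral-8x7B configuration values (not stated in the text above).
D_MODEL = 4096                   # hidden size
D_FF = 14336                     # inner size of each expert FFN
N_LAYERS = 32
N_EXPERTS = 8
TOP_K = 2
VOCAB = 32000
N_KV_HEADS, HEAD_DIM = 8, 128    # grouped-query attention


def mixtral_param_estimate():
    """Rough (total, active-per-token) parameter counts; norm weights ignored."""
    expert = 3 * D_MODEL * D_FF                        # gate, up and down projections
    kv_dim = N_KV_HEADS * HEAD_DIM
    attention = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim   # q, o plus k, v
    router = D_MODEL * N_EXPERTS
    embeddings = 2 * VOCAB * D_MODEL                   # input embeddings + output head
    shared = N_LAYERS * (attention + router) + embeddings
    total = shared + N_LAYERS * N_EXPERTS * expert
    active = shared + N_LAYERS * TOP_K * expert
    return total, active


total, active = mixtral_param_estimate()
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
# total ~ 46.7B, active ~ 12.9B
```

Under these assumptions the experts account for roughly 45B of the total, and the shared attention, router and embedding weights for the remaining 1.6B or so.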

Memory:

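During training: Need all expert parameters, plus their gradients and optimizer state
During inference: All experts must still be in memory, because routing changes per token and per layer; MoE saves compute, not weight memory
KV cache: Same as a dense model with the same attention configuration (not affected by MoE)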

Reduction:

Expert FFN computation: (num_experts / k)× lower than running every expert on every token
Example: 8 experts, k=2 → 4× reduction in expert FFN computation
But total expert parameters: num_experts× those of a single expert

Q3: What is the routing mechanism? How does top-k routing work?

Answer:

Routing Mechanism:

Router (gating network) decides which experts to use
Takes input, outputs scores for each expert
Selects experts based on scores

Top-k Routing Algorithm:

1. Compute Scores:

scores = Router(x)  # (num_experts,) - logits
probs = softmax(scores)  # Probabilities

2. Select Top-k:

top_k_probs, top_k_indices = torch.topk(probs, k)
# Select k experts with highest probabilities

3. Renormalize:

top_k_probs = top_k_probs / top_k_probs.sum()
# Renormalize so probabilities sum to 1

4. Weighted Combination:

output = 0
for i, expert_idx in enumerate(top_k_indices):
    expert_output = Expert[expert_idx](x)
    output += top_k_probs[i] * expert_output

Example:

8 experts, k=2
Router probabilities (after softmax): [0.1, 0.3, 0.05, 0.2, 0.15, 0.1, 0.05, 0.05]
Top-2: experts 1 and 3, zero-indexed (probabilities 0.3 and 0.2)
Renormalize: 0.3 / 0.5 = 0.6 and 0.2 / 0.5 = 0.4 (for experts 1 and 3)
Output: 0.6 × Expert1(x) + 0.4 × Expert3(x)
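
The worked example can be reproduced with a dependency-free sketch. The function names and the scalar toy experts are hypothetical stand-ins for real feed-forward networks; the probabilities are assumed to be the router's softmax output.

```python
def top_k_route(probs, k):
    """Return [(expert_index, renormalized_weight), ...] for the k largest probs."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


def moe_forward(x, probs, experts, k):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(probs, k))


probs = [0.1, 0.3, 0.05, 0.2, 0.15, 0.1, 0.05, 0.05]
# Toy scalar "experts": expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(8)]

routes = top_k_route(probs, k=2)           # experts 1 and 3, weights of about 0.6 and 0.4
y = moe_forward(1.0, probs, experts, k=2)  # about 0.6 * 2 + 0.4 * 4 = 2.8
print(routes, y)
```

A real layer does the same per token inside a batch, usually by grouping the tokens assigned to each expert so that every expert runs once on its own slice.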

Why Top-k?

Hard routing: Only the k selected experts run (efficient)
Soft routing: All experts run and are weighted (no compute savings)
Top-k with learned weights keeps the efficiency of hard routing while still blending several experts

Q4: What is load balancing? Why is it important?

Answer:

Load Balancing Problem:

Without balancing, router might always select same experts
Some experts never used (waste of parameters)
Others overloaded (bottleneck)
Expert collapse: Only few experts ever used

Load Balancing Solution:

Encourage uniform expert usage
Ensure all experts are utilized
Prevent expert collapse

Load Balancing Loss:

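L_balance = (1/num_experts) * sum_i(load_i²)

Where load_i is the fraction of tokens routed to expert i. Because the load_i always sum to 1, the sum of squares is smallest when every load_i equals 1/num_experts.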
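
A small numeric check, reading the formula above as a sum of squared loads (an interpretation: a plain squared sum of the fractions would always equal 1). The function names are illustrative.

```python
from collections import Counter


def load_fractions(assignments, num_experts):
    """Fraction of routed tokens sent to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) / len(assignments) for e in range(num_experts)]


def sum_of_squares_balance(loads):
    """(1/N) * sum(load_i^2); minimal, at 1/N^2, when every load_i equals 1/N."""
    return sum(load * load for load in loads) / len(loads)


balanced = load_fractions([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
collapsed = load_fractions([0] * 8, num_experts=4)
print(sum_of_squares_balance(balanced))   # 0.0625
print(sum_of_squares_balance(collapsed))  # 0.25
```

Token counts are not differentiable with respect to the router, so a loss built only from them cannot be trained directly; practical variants multiply in the router probabilities.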

Goal:

Minimize variance of expert usage
Distribute tokens evenly across experts
All experts should be used roughly equally

Why Important:

Without balancing: Experts 0-2 always used, 3-7 never used
With balancing: All experts used roughly equally
Better parameter utilization
Prevents expert collapse

Training:

Add load balancing loss to total loss
L_total = L_main + α * L_balance
Encourages router to distribute tokens
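
One widely used concrete form is the Switch Transformer auxiliary loss, N × sum_i(f_i × P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below uses that swapped-in formulation rather than the sum-of-squares form given earlier; the function names and the default alpha are illustrative assumptions.

```python
def switch_style_aux_loss(router_probs, assignments, num_experts):
    """N * sum_i f_i * P_i (Switch Transformer style load-balancing loss).

    router_probs: per-token probability lists, each of length num_experts
    assignments:  per-token index of the expert actually chosen (top-1)
    """
    n_tokens = len(assignments)
    f = [sum(1 for a in assignments if a == e) / n_tokens for e in range(num_experts)]
    p = [sum(tok[e] for tok in router_probs) / n_tokens for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


def total_loss(main_loss, aux_loss, alpha=0.01):
    """L_total = L_main + alpha * L_balance; alpha is a small tuning weight."""
    return main_loss + alpha * aux_loss


uniform = switch_style_aux_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
collapsed = switch_style_aux_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)
print(uniform, collapsed)  # 1.0 4.0
```

In an autodiff framework the gradient reaches the router through P_i, since f_i comes from a hard selection. The loss equals 1 under perfectly uniform routing and rises toward num_experts as routing collapses onto one expert.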

Q5: Compare MoE with dense models. What are the trade-offs?

Answer:

Comparison:

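Aspect	Dense Model	MoE Model
Total Parameters	P	num_experts × P
Active Parameters	P (all)	k × P
Computation per Token	O(P)	O(k × P)
Memory (Training)	P	num_experts × P
Memory (Inference)	P	num_experts × P (all experts resident; only k × P read per token)
Quality	Baseline	Typically above a dense model with the same active parameters, below one with the same total parameters
Training	Simple	Complex (need balancing)

Here P is the parameter count of one expert, taken equal to the dense model's size; shared attention and embedding weights are ignored.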

Trade-offs:

MoE Advantages:

Can have many more parameters (trillions)
Only use subset per input (efficient)
Experts can specialize
Better for diverse inputs

MoE Disadvantages:

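More complex training (load balancing)
Higher memory in training and inference (all experts must be stored)
Routing overhead (small compute cost, but notable communication cost when experts are sharded across devices)
Slight quality trade-off relative to a dense model with the same total parameters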

When to Use:

Dense: Small-medium models, simplicity
MoE: Large models, need efficiency, diverse inputs

Q6: How is MoE used in modern LLMs like GPT-4 and Mixtral?

Answer:

GPT-4 (Rumored):

Reported to use an MoE architecture; OpenAI has not confirmed this or published details
Multiple experts (count unconfirmed)
Top-k routing (unconfirmed)
Would enable a very large model (parameter count unconfirmed)

Mixtral-8x7B:

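8 experts per MoE layer, top-2 routing (k=2)
Only the FFN is replicated per expert; attention and embeddings are shared
Total: about 47B parameters (not 8 × 7B = 56B)
Active: about 13B parameters per token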

Architecture:

Replace standard FFN with MoE-FFN
Each transformer block has MoE layer
Router decides which experts per token

Efficiency:

Total capacity: about 47B parameters
Computation: Only about 13B parameters active per token
Similar computation to a 13B dense model
But about 3.6× more capacity

Quality:

Reported by Mistral AI to match or outperform Llama 2 70B on most benchmarks
With the per-token computation of a roughly 13B dense model
The cost: all experts must still be held in memory

Q7: What are the challenges in training MoE models?

Answer:

1. Expert Collapse:

Router might always select same experts
Other experts never trained
Solution: Load balancing loss

2. Gradient Flow:

Only active experts receive gradients
Inactive experts don't learn
Solution: Expert sampling, auxiliary losses

3. Routing Instability:

Router decisions can be unstable
Experts might not converge
Solution: Temperature annealing, regularization

4. Load Imbalance:

Uneven expert usage
Some experts overloaded
Solution: Load balancing loss, expert capacity limits

5. Memory:

Need to store all expert parameters
Higher memory than a dense model with the same active parameters
Solution: Expert sharding, gradient checkpointing
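
The "expert capacity limits" listed under Load Imbalance can be sketched as follows. This is a simplified, first-come-first-served version with hypothetical function names; the capacity factor of 1.25 is an illustrative value, not a recommendation.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Maximum number of tokens a single expert accepts in one batch."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


def dispatch_with_capacity(assignments, num_experts, capacity):
    """Assign tokens in order; tokens arriving at a full expert are dropped.

    Returns (per-expert lists of token indices, list of dropped token indices).
    """
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token_idx, expert_idx in enumerate(assignments):
        if len(buckets[expert_idx]) < capacity:
            buckets[expert_idx].append(token_idx)
        else:
            dropped.append(token_idx)
    return buckets, dropped


assignments = [0, 0, 0, 0, 0, 1, 2, 3]    # expert 0 is oversubscribed
cap = expert_capacity(len(assignments), num_experts=4)   # ceil(1.25 * 8 / 4) = 3
buckets, dropped = dispatch_with_capacity(assignments, 4, cap)
print(cap, buckets, dropped)  # 3 [[0, 1, 2], [5], [6], [7]] [3, 4]
```

In schemes of this kind, dropped tokens commonly skip the expert layer and pass through the residual connection unchanged. A larger capacity factor drops fewer tokens at the price of more padding and memory.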

Training Techniques:

Load balancing loss
Expert sampling (random experts sometimes)
Temperature annealing (soft → hard routing)
Gradient clipping
Careful initialization
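
Temperature annealing can be illustrated with a temperature-scaled softmax: a high temperature spreads probability across experts (soft routing), and lowering it concentrates probability on the top expert (approaching hard routing). The linear schedule and its start and end values are assumptions chosen for illustration only.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Softmax over scores / temperature; lower temperature gives a sharper result."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                                # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def linear_anneal(step, total_steps, t_start=2.0, t_end=0.5):
    """Linearly interpolate the temperature from t_start to t_end."""
    frac = min(step / total_steps, 1.0)
    return t_start + frac * (t_end - t_start)


scores = [1.0, 2.0, 0.5, 1.5]
early = softmax_with_temperature(scores, linear_anneal(0, 1000))     # temperature 2.0
late = softmax_with_temperature(scores, linear_anneal(1000, 1000))   # temperature 0.5
print(max(early) < max(late))  # True
```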

Summary

Mixture of Experts makes it possible to train models with very large parameter counts while keeping per-token computation modest. By activating only a subset of experts for each token, MoE gets the capacity of a large model at roughly the compute cost of a much smaller one, although all experts must still be stored in memory. Key components include multiple expert networks, a routing mechanism for expert selection, and load balancing to ensure all experts are utilized. Mixtral-8x7B is a public example of the approach, and GPT-4 is reported, though not confirmed, to use it as well.
