Topic 14: Advanced Positional Embeddings (RoPE, ALiBi)
🔥 For interviews, read these first:
- POSITIONAL_DEEP_DIVE.md: frontier-lab interview deep dive covering sinusoidal, learned, T5-bias, RoPE, ALiBi, and NoPE embeddings, a full RoPE derivation showing relative position from rotated dot products, NTK scaling, YaRN, and length extrapolation.
- INTERVIEW_GRILL.md: 45 active-recall questions.
What You'll Learn
This topic teaches you advanced positional embedding methods:
- RoPE (Rotary Position Embedding)
- ALiBi (Attention with Linear Biases)
- How they differ from sinusoidal encodings and when they help
- Simple implementations
Why We Need This
Interview Importance
- Hot topic: RoPE is used in LLaMA and GPT-NeoX
- Recent advancement: Shows you know the latest techniques
- Depth: Shows you understand more than basic sinusoidal positional encoding
Real-World Application
- LLaMA: Uses RoPE
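- Long context: RoPE-based models are usually extended to longer contexts with scaling methods such as NTK scaling or YaRN
- Extrapolation: Relative schemes such as RoPE and ALiBi tend to handle lengths beyond training better than absolute sinusoidal encodings, although plain RoPE still degrades well past its training length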
Industry Use Cases
1. RoPE (Rotary Position Embedding)
Use Case: LLaMA, GPT-NeoX, PaLM
- Often better behavior on longer sequences than absolute encodings
- Relative position encoding
- Used in modern LLMs
2. ALiBi (Attention with Linear Biases)
Use Case: Alternative to positional embeddings
- No positional embeddings needed
- Add bias to attention scores
- Works well for long sequences
Core Intuition
Transformers need positional information because self-attention alone does not know token order.
Basic sinusoidal encodings work, but modern LLMs often use alternatives because positional handling affects:
- long-context behavior
- extrapolation beyond training length
- implementation simplicity
- inductive bias about relative distance
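The claim that self-attention alone does not know token order can be checked numerically. The sketch below is a minimal single-head, unmasked attention with random weights (the function name `self_attention` and all shapes are illustrative assumptions): shuffling the input rows only shuffles the output rows in the same way, so nothing in the computation depends on where a token sits.

```python
import numpy as np

def self_attention(x: np.ndarray, w_q: np.ndarray,
                   w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """Single-head, unmasked self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Permuting the input tokens permutes the output rows identically.
order_blind = np.allclose(out[perm], out_shuffled)
print(order_blind)
```

This holds for unmasked attention; a causal mask breaks the symmetry and leaks some order information on its own.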
RoPE
RoPE injects position by rotating query and key vectors.
The important intuition is:
- position affects how queries and keys line up
- relative offsets emerge naturally from the rotation structure
That is why RoPE is often described as giving relative position behavior inside the attention computation itself.
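For a single pair of dimensions, the relative-offset behavior follows from how 2D rotations compose. The notation here is an assumption for illustration: `m` and `n` are the query and key positions, `\theta` is the frequency of the pair, and `R` is the standard 2D rotation matrix.

```latex
R(\alpha) =
\begin{pmatrix}
\cos\alpha & -\sin\alpha \\
\sin\alpha & \cos\alpha
\end{pmatrix},
\qquad
\tilde{q}_m = R(m\theta)\, q,
\qquad
\tilde{k}_n = R(n\theta)\, k

\tilde{q}_m^{\top} \tilde{k}_n
  = q^{\top} R(m\theta)^{\top} R(n\theta)\, k
  = q^{\top} R\big((n - m)\theta\big)\, k
```

The second step uses `R(\alpha)^{\top} = R(-\alpha)` and `R(\alpha) R(\beta) = R(\alpha + \beta)`, so the score depends on the positions only through the offset `n - m`.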
ALiBi
ALiBi does not add a positional vector to token embeddings.
Instead, it changes attention scores directly using a distance-dependent bias.
Intuition:
- farther positions receive a stronger penalty
- attention naturally prefers closer positions unless evidence is strong
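A small numeric sketch of this intuition, for one query and one head. The slope value and the scores are made up for illustration; only the shape of the bias (a penalty proportional to distance) comes from the description above.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

slope = 0.5                           # hypothetical slope for one head
query_pos = 5                         # the query is the last token
key_pos = np.arange(query_pos + 1)    # causal: keys at positions 0..5

# Penalty grows linearly with distance; zero for the query's own position.
bias = -slope * (query_pos - key_pos)

# Case 1: every key has the same content score, so only distance matters.
equal_scores = np.zeros(len(key_pos))
weights_plain = softmax(equal_scores)
weights_alibi = softmax(equal_scores + bias)

# Case 2: a distant key (position 0) has strong content evidence.
strong_scores = equal_scores.copy()
strong_scores[0] = 5.0
weights_strong = softmax(strong_scores + bias)

print(weights_plain.round(3))
print(weights_alibi.round(3))
print(int(weights_strong.argmax()))
```

With equal scores the biased weights rise monotonically toward the query's own position, while in the second case the distant key still receives the largest weight because its content score outweighs the penalty.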
Technical Details Interviewers Often Want
Why Modern LLMs Often Prefer RoPE
RoPE is attractive because it:
- integrates with attention cleanly
- captures relative position effects
- often extrapolates better than simple absolute schemes
Why ALiBi Is Interesting
ALiBi is interesting because it is simple:
- no learned positional embedding table
- no explicit sinusoidal addition
- just score biasing based on relative distance
This simplicity is part of why it is a common interview discussion point.
Positional Method Trade-Off
The real interview answer is not "RoPE is better."
It is:
- different methods encode order differently
- some favor relative structure
- some extrapolate better
- some are simpler to implement or reason about
Common Failure Modes
- describing RoPE as if it were just another additive embedding
- ignoring that positional methods affect extrapolation behavior
- assuming better long-context extrapolation on paper always means better end-task behavior
- forgetting that context scaling tricks can interact with positional encoding choices
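As one illustration of the last point, linear position interpolation is a context scaling trick that works by rescaling positions before RoPE angles are computed, so it is tied to a rotation-based scheme. The sketch below is a simplified version under stated assumptions: `rope_angles`, the `scale` parameter, and the lengths are hypothetical names and values, not taken from any particular library.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Rotation angles of shape (len(positions), dim // 2).

    scale < 1 compresses positions (linear position interpolation).
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions * scale, inv_freq)

train_len, target_len, dim = 2048, 8192, 64  # hypothetical lengths
positions = np.arange(target_len)

plain = rope_angles(positions, dim)
interpolated = rope_angles(positions, dim, scale=train_len / target_len)

# The fastest-rotating pair has inv_freq == 1, so its angle equals the
# effective position: unscaled it runs far past the training range,
# interpolated it stays below train_len.
print(plain.max(), interpolated.max())
```

The trade-off is that neighboring tokens end up closer together in angle, which is one reason such methods are usually paired with some fine-tuning. ALiBi has no rotation angles, so this particular trick does not carry over to it.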
Edge Cases and Follow-Up Questions
- Why does self-attention need position information at all?
- Why is RoPE often described as relative rather than purely absolute?
- Why can positional encoding choice matter for long-context extrapolation?
- How is ALiBi different from adding learned position embeddings?
- Why is "works for longer context" not the same as "works better on all long-context tasks"?
What to Practice Saying Out Loud
- Why RoPE is useful in modern LLMs
- How ALiBi changes attention without explicit positional embeddings
- Why positional encoding choice is really an inductive-bias choice
Industry-Standard Boilerplate Code
RoPE Implementation
"""
RoPE: Rotary Position Embedding
Interview question: "Implement RoPE"
"""
import numpy as np


def apply_rope(q: np.ndarray, k: np.ndarray, position: int,
               dim: int, base: float = 10000.0) -> tuple:
    """
    Apply RoPE to the query and key of the token at `position`.

    RoPE rotates each adjacent pair of dimensions by a position-dependent angle.
    Relative position appears when a query rotated at one position is dotted
    with a key rotated at another position.

    Args:
        q: Query vector (d_model,)
        k: Key vector (d_model,)
        position: Position of the token that q and k belong to
        dim: Dimension of model
        base: Base for frequency calculation

    Returns:
        (q_rot, k_rot): rotated copies of q and k
    """
    # One rotation frequency per pair of dimensions
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = position * inv_freq

    # Float outputs so integer inputs are not truncated
    q_rot = np.zeros_like(q, dtype=float)
    k_rot = np.zeros_like(k, dtype=float)

    # Rotate each (i, i+1) pair by its angle
    for i in range(0, dim, 2):
        if i + 1 < dim:
            cos_angle = np.cos(angles[i // 2])
            sin_angle = np.sin(angles[i // 2])
            # Rotate pair
            q_rot[i] = q[i] * cos_angle - q[i+1] * sin_angle
            q_rot[i+1] = q[i] * sin_angle + q[i+1] * cos_angle
            k_rot[i] = k[i] * cos_angle - k[i+1] * sin_angle
            k_rot[i+1] = k[i] * sin_angle + k[i+1] * cos_angle
        else:
            # Odd dimension: the last element has no partner, so copy it unchanged
            q_rot[i] = q[i]
            k_rot[i] = k[i]
    return q_rot, k_rot
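A natural follow-up is to check that the rotated dot product depends only on the offset between positions. The sketch below is a vectorized variant of the same adjacent-pair rotation, written to be self-contained; `rope_rotate` and `score` are illustrative names, and an even dimension is assumed.

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray,
                base: float = 10000.0) -> np.ndarray:
    """Rotate adjacent pairs of each row by that row's position angles.

    x: (seq_len, dim) with even dim; positions: (seq_len,)
    """
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)  # (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(1, 16))

def score(m: int, n: int) -> float:
    """Dot product of q placed at position m and k placed at position n."""
    q_m = rope_rotate(q, np.array([m]))
    k_n = rope_rotate(k, np.array([n]))
    return (q_m @ k_n.T).item()

# Same offset (n - m = -3) at two different absolute locations.
same_offset = np.isclose(score(10, 7), score(110, 107))
# A different offset generally gives a different score.
different_offset = np.isclose(score(10, 7), score(10, 2))
print(same_offset, different_offset)
```

Shifting both positions by the same amount leaves the score unchanged, which is the relative-position property in practice.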
ALiBi Implementation
"""
ALiBi: Attention with Linear Biases
No positional embeddings, just add bias to attention scores
"""
import numpy as np


def alibi_attention_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """
    Generate ALiBi bias matrix

    Adds linear bias based on distance between positions
    Closer positions get less negative bias

    Args:
        seq_len: Sequence length
        num_heads: Number of attention heads (slope formula assumes a power of two)
    Returns:
        Bias matrix (num_heads, seq_len, seq_len)
    """
    # Different slopes for different heads: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # (the head count must be num_heads here, not a hardcoded 8)
    slopes = np.array([2.0 ** (-8.0 * i / num_heads)
                       for i in range(1, num_heads + 1)])
    bias = np.zeros((num_heads, seq_len, seq_len))
    for head in range(num_heads):
        for i in range(seq_len):
            for j in range(seq_len):
                # Linear bias based on distance; symmetric in |j - i|,
                # so a causal model still needs its own mask for j > i
                distance = j - i
                bias[head, i, j] = -slopes[head] * abs(distance)
    return bias
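The bias only matters once it is added to the attention scores before the softmax. The sketch below shows one way that might look for causal multi-head attention; it is self-contained, builds the bias with broadcasting instead of loops, and assumes a power-of-two head count. The names `alibi_slopes` and `causal_attention_with_alibi` are hypothetical.

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ...; assumes a power-of-two n."""
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def causal_attention_with_alibi(q: np.ndarray, k: np.ndarray,
                                v: np.ndarray) -> tuple:
    """q, k, v: (num_heads, seq_len, head_dim). Returns (output, weights)."""
    num_heads, seq_len, head_dim = q.shape
    pos = np.arange(seq_len)
    distance = np.abs(pos[None, :] - pos[:, None])             # (seq, seq)
    bias = -alibi_slopes(num_heads)[:, None, None] * distance  # (heads, seq, seq)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim) + bias
    future = pos[None, :] > pos[:, None]        # True where the key is after the query
    scores = np.where(future, -np.inf, scores)  # causal mask

    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 6, 8)) for _ in range(3))  # hypothetical shapes
out, weights = causal_attention_with_alibi(q, k, v)
print(out.shape, weights.shape)
```

No positional vector is added to the inputs anywhere; the only place position enters is the `bias` term.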
Theory
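Why RoPE Is Often Preferred
Sinusoidal Problems:
- Added to token embeddings as absolute positions, so relative offsets are only implicit
- Tends to extrapolate poorly beyond the training length
- Computable for any length, unlike learned absolute embeddings, which do have a fixed maximum length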
RoPE Advantages:
- Relative position encoding
- Better extrapolation
- Used in modern LLMs (LLaMA)
How RoPE Works
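- Rotates Q and K by position-dependent angles
- The query-key dot product then depends only on the difference between positions
- Extends to longer sequences more gracefully than absolute schemes, usually with a scaling method on top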
Exercises
- Implement RoPE
- Compare with sinusoidal
- Test extrapolation
- Implement ALiBi
Next Steps
- Topic 15: Tokenization methods
- Topic 16: Training behaviors