Topic 15: Tokenization Methods
🔥 For interviews, read these first:
TOKENIZATION_DEEP_DIVE.md— frontier-lab interview deep dive: BPE/WordPiece/Unigram/SentencePiece, byte-level BPE, vocabulary trade-offs, arithmetic and multilingual quirks, glitch tokens, special tokens, multimodal extensions.INTERVIEW_GRILL.md— 45 active-recall questions.
What You'll Learn
This topic teaches you different tokenization methods:
- BPE (Byte Pair Encoding)
- WordPiece
- SentencePiece
- Character-level
- Subword tokenization
- Simple implementations
Why We Need This
Interview Importance
- Common question: "How does BPE work?"
- Understanding: Tokenization is fundamental
- Implementation: May ask to implement BPE
Real-World Application
- All LLMs: Use tokenization
- GPT: Uses BPE
- BERT: Uses WordPiece
- T5: Uses SentencePiece
Industry Use Cases
1. BPE (Byte Pair Encoding)
Use Case: GPT-2, GPT-3, GPT-4
- Most common in modern LLMs
- Learns subword units
- Handles unknown words
2. WordPiece
Use Case: BERT
- Similar to BPE
- Different merging strategy
- Used in BERT family
3. SentencePiece
Use Case: T5, mT5
- Language-agnostic
- Handles multiple languages
- Used in multilingual models
Core Intuition
Tokenization decides how raw text becomes model input units.
That sounds like a preprocessing detail, but it affects:
- vocabulary size
- sequence length
- memory cost
- softmax cost
- how well rare words and multilingual text are handled
This is why tokenization shows up in serious LLM interviews.
Why Not Word-Level Tokens?
Word-level vocabularies have a major problem:
- too many rare or unseen words
Character-level vocabularies avoid that problem, but sequences become much longer.
Subword tokenization is the compromise:
- shorter than characters
- more flexible than words
BPE
BPE repeatedly merges the most useful symbol pairs.
Intuition:
- common fragments become single units
- rare words can still be decomposed into smaller pieces
WordPiece
WordPiece is similar in spirit to BPE, but the scoring of merges differs.
The interview-safe answer is:
- both build subwords
- they differ in how they choose which merges to keep
SentencePiece
SentencePiece is useful because it can operate without assuming whitespace-separated words in the same way many older tokenizers do.
That makes it strong for:
- multilingual data
- languages with different segmentation properties
Technical Details Interviewers Often Want
Vocabulary Size Trade-Off
Large vocabulary:
- shorter sequences
- larger embedding and output layers
Small vocabulary:
- longer sequences
- more compositional coverage
This is one of the most important tokenization trade-offs to explain clearly.
Tokenization Affects Everything Downstream
Bad tokenization can hurt:
- efficiency
- context length usage
- handling of code or multilingual text
- alignment between pretraining and downstream tasks
"Unknown Word" Handling
Subword tokenization is powerful because it avoids a hard unknown-word failure mode for many tokens by decomposing unfamiliar words into smaller known pieces.
Common Failure Modes
- treating tokenization as an unimportant preprocessing detail
- confusing BPE and WordPiece as identical
- ignoring the cost trade-off between vocab size and sequence length
- choosing tokenization poorly for multilingual or code data
- assuming more tokens always means more information rather than just more fragmentation
Edge Cases and Follow-Up Questions
- Why not use pure word-level tokens?
- Why not use pure character-level tokens?
- How does vocabulary size trade off against sequence length?
- Why can tokenization matter a lot for multilingual models?
- Why can tokenization affect both quality and cost?
What to Practice Saying Out Loud
- Why subword tokenization became dominant
- The practical trade-off between vocabulary size and sequence length
- Why tokenization is a modeling choice, not just a preprocessing choice
Industry-Standard Boilerplate Code
BPE (Simplified)
"""
BPE: Byte Pair Encoding
Interview question: "Implement BPE"
"""
from collections import Counter, defaultdict
class BPE:
"""
Simple BPE implementation
"""
def __init__(self, vocab_size: int = 1000):
self.vocab_size = vocab_size
self.vocab = {}
self.merges = []
def get_word_freqs(self, corpus: list) -> dict:
"""Get word frequencies"""
word_freqs = Counter()
for text in corpus:
words = text.split()
word_freqs.update(words)
return dict(word_freqs)
def get_splits(self, word: str) -> list:
"""Split word into characters"""
return list(word) + ['</w>'] # End token
def get_pairs(self, word: list) -> set:
"""Get all pairs of consecutive symbols"""
pairs = set()
for i in range(len(word) - 1):
pairs.add((word[i], word[i+1]))
return pairs
def merge_vocab(self, pair: tuple, vocab: dict) -> dict:
"""Merge most frequent pair"""
new_vocab = {}
bigram = ''.join(pair)
for word in vocab:
new_word = word.replace(' '.join(pair), bigram)
new_vocab[new_word] = vocab[word]
return new_vocab
def train(self, corpus: list):
"""Train BPE"""
# Initialize vocabulary
word_freqs = self.get_word_freqs(corpus)
vocab = {}
for word, freq in word_freqs.items():
vocab[' '.join(self.get_splits(word))] = freq
# Build vocabulary
num_merges = self.vocab_size - 256 # Start with 256 base chars
for i in range(num_merges):
# Count pairs
pairs = defaultdict(int)
for word, freq in vocab.items():
word_pairs = self.get_pairs(word.split())
for pair in word_pairs:
pairs[pair] += freq
if not pairs:
break
# Get most frequent pair
best_pair = max(pairs, key=pairs.get)
self.merges.append(best_pair)
# Merge
vocab = self.merge_vocab(best_pair, vocab)
self.vocab = vocab
def tokenize(self, text: str) -> list:
"""Tokenize text"""
words = text.split()
tokens = []
for word in words:
word_split = self.get_splits(word)
# Apply merges
for pair in self.merges:
bigram = ' '.join(pair)
if bigram in ' '.join(word_split):
word_split = ' '.join(word_split).replace(bigram, ''.join(pair)).split()
tokens.extend(word_split)
return tokens
WordPiece (Simplified)
"""
WordPiece: Similar to BPE but different scoring
"""
class WordPiece:
"""
WordPiece tokenization
Uses likelihood instead of frequency
"""
def __init__(self, vocab_size: int = 1000):
self.vocab_size = vocab_size
self.vocab = {}
self.merges = []
def get_score(self, pair: tuple, vocab: dict) -> float:
"""
Score pair using likelihood
score = count(pair) / (count(pair[0]) * count(pair[1]))
"""
pair_count = sum(vocab.get(''.join(pair), 0) for vocab in [vocab])
first_count = sum(1 for word in vocab if pair[0] in word)
second_count = sum(1 for word in vocab if pair[1] in word)
if first_count * second_count == 0:
return 0
return pair_count / (first_count * second_count)
# Similar structure to BPE but uses score instead of frequency
Theory
BPE Algorithm
- Start with character vocabulary
- Count all pairs of consecutive symbols
- Merge most frequent pair
- Repeat until desired vocab size
Why Subword Tokenization
- Handles unknown words: Can break into subwords
- Balanced vocabulary: Not too many (word-level) or too few (char-level)
- Efficient: Good compression
Exercises
- Implement BPE
- Compare BPE vs WordPiece
- Tokenize sample text
- Measure vocabulary size
Next Steps
- Topic 16: Training behaviors
- Topic 17: Probability math