Topic 15: Tokenization Methods

🔥 For interviews, read these first:

TOKENIZATION_DEEP_DIVE.md — frontier-lab interview deep dive: BPE/WordPiece/Unigram/SentencePiece, byte-level BPE, vocabulary trade-offs, arithmetic and multilingual quirks, glitch tokens, special tokens, multimodal extensions.

INTERVIEW_GRILL.md — 45 active-recall questions.

What You'll Learn

This topic teaches you different tokenization methods:

BPE (Byte Pair Encoding)
WordPiece
SentencePiece
Character-level
Subword tokenization
Simple implementations

Why We Need This

Interview Importance

Common question: "How does BPE work?"
Understanding: Tokenization is fundamental
Implementation: May ask to implement BPE

Real-World Application

All LLMs: Use tokenization
GPT: Uses BPE
BERT: Uses WordPiece
T5: Uses SentencePiece

Industry Use Cases

1. BPE (Byte Pair Encoding)

Use Case: GPT-2, GPT-3, GPT-4

Most common in modern LLMs
Learns subword units
Handles unknown words

2. WordPiece

Use Case: BERT

Similar to BPE
Different merging strategy
Used in BERT family

3. SentencePiece

Use Case: T5, mT5

Language-agnostic
Handles multiple languages
Used in multilingual models

Core Intuition

Tokenization decides how raw text becomes model input units.

That sounds like a preprocessing detail, but it affects:

vocabulary size
sequence length
memory cost
softmax cost
how well rare words and multilingual text are handled

This is why tokenization shows up in serious LLM interviews.

Why Not Word-Level Tokens?

Word-level vocabularies have a major problem:

too many rare or unseen words

Character-level vocabularies avoid that problem, but sequences become much longer.

Subword tokenization is the compromise:

shorter than characters
more flexible than words

BPE

BPE repeatedly merges the most useful symbol pairs.

Intuition:

common fragments become single units
rare words can still be decomposed into smaller pieces

WordPiece

WordPiece is similar in spirit to BPE, but the scoring of merges differs.

The interview-safe answer is:

both build subwords
they differ in how they choose which merges to keep

SentencePiece

SentencePiece is useful because it can operate without assuming whitespace-separated words in the same way many older tokenizers do.

That makes it strong for:

multilingual data
languages with different segmentation properties

Technical Details Interviewers Often Want

Vocabulary Size Trade-Off

Large vocabulary:

shorter sequences
larger embedding and output layers

Small vocabulary:

longer sequences
more compositional coverage

This is one of the most important tokenization trade-offs to explain clearly.

Tokenization Affects Everything Downstream

Bad tokenization can hurt:

efficiency
context length usage
handling of code or multilingual text
alignment between pretraining and downstream tasks

"Unknown Word" Handling

Subword tokenization is powerful because it avoids a hard unknown-word failure mode for many tokens by decomposing unfamiliar words into smaller known pieces.

Common Failure Modes

treating tokenization as an unimportant preprocessing detail
confusing BPE and WordPiece as identical
ignoring the cost trade-off between vocab size and sequence length
choosing tokenization poorly for multilingual or code data
assuming more tokens always means more information rather than just more fragmentation

Edge Cases and Follow-Up Questions

Why not use pure word-level tokens?
Why not use pure character-level tokens?
How does vocabulary size trade off against sequence length?
Why can tokenization matter a lot for multilingual models?
Why can tokenization affect both quality and cost?

What to Practice Saying Out Loud

Why subword tokenization became dominant
The practical trade-off between vocabulary size and sequence length
Why tokenization is a modeling choice, not just a preprocessing choice

Industry-Standard Boilerplate Code

BPE (Simplified)

"""
BPE: Byte Pair Encoding
Interview question: "Implement BPE"
"""
from collections import Counter, defaultdict

class BPE:
    """
    Simple BPE implementation
    """
    
    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.merges = []
    
    def get_word_freqs(self, corpus: list) -> dict:
        """Get word frequencies"""
        word_freqs = Counter()
        for text in corpus:
            words = text.split()
            word_freqs.update(words)
        return dict(word_freqs)
    
    def get_splits(self, word: str) -> list:
        """Split word into characters"""
        return list(word) + ['</w>']  # End token
    
    def get_pairs(self, word: list) -> set:
        """Get all pairs of consecutive symbols"""
        pairs = set()
        for i in range(len(word) - 1):
            pairs.add((word[i], word[i+1]))
        return pairs
    
    def merge_vocab(self, pair: tuple, vocab: dict) -> dict:
        """Merge most frequent pair"""
        new_vocab = {}
        bigram = ''.join(pair)
        
        for word in vocab:
            new_word = word.replace(' '.join(pair), bigram)
            new_vocab[new_word] = vocab[word]
        
        return new_vocab
    
    def train(self, corpus: list):
        """Train BPE"""
        # Initialize vocabulary
        word_freqs = self.get_word_freqs(corpus)
        vocab = {}
        for word, freq in word_freqs.items():
            vocab[' '.join(self.get_splits(word))] = freq
        
        # Build vocabulary
        num_merges = self.vocab_size - 256  # Start with 256 base chars
        
        for i in range(num_merges):
            # Count pairs
            pairs = defaultdict(int)
            for word, freq in vocab.items():
                word_pairs = self.get_pairs(word.split())
                for pair in word_pairs:
                    pairs[pair] += freq
            
            if not pairs:
                break
            
            # Get most frequent pair
            best_pair = max(pairs, key=pairs.get)
            self.merges.append(best_pair)
            
            # Merge
            vocab = self.merge_vocab(best_pair, vocab)
        
        self.vocab = vocab
    
    def tokenize(self, text: str) -> list:
        """Tokenize text"""
        words = text.split()
        tokens = []
        
        for word in words:
            word_split = self.get_splits(word)
            
            # Apply merges
            for pair in self.merges:
                bigram = ' '.join(pair)
                if bigram in ' '.join(word_split):
                    word_split = ' '.join(word_split).replace(bigram, ''.join(pair)).split()
            
            tokens.extend(word_split)
        
        return tokens

WordPiece (Simplified)

"""
WordPiece: Similar to BPE but different scoring
"""
class WordPiece:
    """
    WordPiece tokenization
    Uses likelihood instead of frequency
    """
    
    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.merges = []
    
    def get_score(self, pair: tuple, vocab: dict) -> float:
        """
        Score pair using likelihood
        score = count(pair) / (count(pair[0]) * count(pair[1]))
        """
        pair_count = sum(vocab.get(''.join(pair), 0) for vocab in [vocab])
        first_count = sum(1 for word in vocab if pair[0] in word)
        second_count = sum(1 for word in vocab if pair[1] in word)
        
        if first_count * second_count == 0:
            return 0
        
        return pair_count / (first_count * second_count)
    
    # Similar structure to BPE but uses score instead of frequency

Theory

BPE Algorithm

Start with character vocabulary
Count all pairs of consecutive symbols
Merge most frequent pair
Repeat until desired vocab size

Why Subword Tokenization

Handles unknown words: Can break into subwords
Balanced vocabulary: Not too many (word-level) or too few (char-level)
Efficient: Good compression

Exercises

Implement BPE
Compare BPE vs WordPiece
Tokenize sample text
Measure vocabulary size

Next Steps

Topic 16: Training behaviors
Topic 17: Probability math

ML & LLM Interview Prep — Deep Dives