Retrieval Methods: BM25, TF-IDF, and Hybrid Search

Overview

RAG systems use various retrieval methods to find relevant documents. This document covers sparse retrieval (BM25, TF-IDF) and how to combine them with dense retrieval (hybrid search).

Retrieval Methods Comparison

Method	Type	How it Works	Pros	Cons
TF-IDF	Sparse	Term frequency × inverse document frequency	Simple, interpretable	No semantic understanding
BM25	Sparse	Advanced TF-IDF with saturation	Better than TF-IDF, handles term frequency better	Still no semantics
Dense (Embeddings)	Dense	Vector similarity (cosine, dot product)	Semantic understanding	Needs good embeddings
Hybrid	Both	Combine sparse + dense	Best of both worlds	More complex, higher cost

1. TF-IDF Retrieval

What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that reflects how important a word is to a document in a collection.

Components:

Term Frequency (TF):

TF(t, d) = count(t in d) / total_terms_in_d

How often word appears in document
Higher TF = more important to document

Inverse Document Frequency (IDF):

IDF(t, D) = log(N / documents_containing_t)

How rare word is across documents
Common words → low IDF
Rare words → high IDF

TF-IDF:

TF-IDF(t, d) = TF(t, d) × IDF(t, D)

How to Use for Retrieval

Step 1: Build TF-IDF Index

Compute TF-IDF for all terms in all documents
Store in sparse matrix

Step 2: Query Processing

Compute TF-IDF for query terms
Match with document TF-IDF scores

Step 3: Scoring

Score(query, document) = Σ TF-IDF(term, document) for term in query

Step 4: Ranking

Sort documents by score
Return top-K

Use Cases

When to use:

✅ Keyword-based queries
✅ Exact term matching important
✅ Fast retrieval needed
✅ Interpretable results

Limitations:

❌ No semantic understanding
❌ Can't handle synonyms
❌ Misses related concepts

2. BM25 (Best Matching 25)

What is BM25?

BM25 is an advanced ranking function that improves upon TF-IDF. It's the de facto standard for sparse retrieval in search engines.

Key Improvements over TF-IDF:

Term frequency saturation: Prevents very frequent terms from dominating
Document length normalization: Handles variable document lengths
Tunable parameters: Can adjust for different domains

BM25 Formula

Score for a single term:

BM25(t, d) = IDF(t) × (f(t, d) × (k₁ + 1)) / (f(t, d) + k₁ × (1 - b + b × |d|/avgdl))

Where:
- f(t, d): Term frequency in document
- |d|: Document length
- avgdl: Average document length
- k₁: Term frequency saturation parameter (usually 1.2-2.0)
- b: Length normalization parameter (usually 0.75)
- IDF(t): Inverse document frequency

IDF in BM25:

IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))

Where:
- N: Total number of documents
- df(t): Number of documents containing term t

Total score:

BM25(query, document) = Σ BM25(term, document) for term in query

Why BM25 is Better

1. Term Frequency Saturation:

TF-IDF: Linear with frequency (can be dominated by one term)
BM25: Saturates (bounded growth)
Example: Term appears 10x vs 20x → Similar scores (not 2x)

2. Document Length Normalization:

Long documents naturally have more terms
BM25 normalizes by document length
Prevents bias toward long documents

3. Tunable Parameters:

k₁: Controls term frequency saturation
- Higher k₁: More weight to frequency
- Lower k₁: Less weight to frequency
b: Controls length normalization
- b = 0: No normalization
- b = 1: Full normalization
- Usually b = 0.75

Intuitive Explanation

Term Frequency Saturation:

TF-IDF: 10 occurrences → score = 10
        20 occurrences → score = 20 (2x)

BM25:   10 occurrences → score = 8.5
        20 occurrences → score = 9.2 (only slightly higher)

Why this helps:

Prevents one very frequent term from dominating
More balanced scoring
Better ranking

Document Length Normalization:

Short doc (100 words): "machine learning" appears 2x
Long doc (1000 words): "machine learning" appears 2x

TF-IDF: Same score (doesn't account for length)
BM25: Short doc gets higher score (more relevant per word)

Use Cases

When to use:

✅ Keyword-based search
✅ Exact term matching
✅ Production search systems
✅ When you need interpretable results

Industry Usage:

Elasticsearch (default ranking)
Lucene (Apache)
Most search engines

Limitations:

❌ Still no semantic understanding
❌ Can't handle synonyms well
❌ Misses related concepts

3. Dense Retrieval (Embeddings)

What is Dense Retrieval?

Dense retrieval uses vector embeddings to find semantically similar documents.

How it works:

Encode documents to embeddings
Encode query to embedding
Compute similarity (cosine, dot product)
Rank by similarity

Advantages:

✅ Semantic understanding
✅ Handles synonyms
✅ Finds related concepts
✅ Works with modern LLMs

Disadvantages:

❌ Requires good embeddings
❌ Can miss exact matches
❌ More expensive

Similarity Metrics

Cosine Similarity:

similarity = (A · B) / (||A|| × ||B||)

Measures angle between vectors
Range: [-1, 1] (usually [0, 1] for normalized)
Most common for embeddings

Dot Product:

similarity = A · B

Simpler computation
Depends on vector magnitude
Use when embeddings not normalized

Euclidean Distance:

distance = ||A - B||

Measures distance
Lower = more similar
Less common for embeddings

4. Hybrid Search (Sparse + Dense)

Why Hybrid Search?

Problem:

Sparse (BM25): Good for exact matches, keywords
Dense (Embeddings): Good for semantic similarity
Neither alone is perfect

Solution:

Combine both!
Get benefits of both methods
Better overall retrieval

How Hybrid Search Works

Step 1: Retrieve from Both

Sparse retrieval: Top-K from BM25
Dense retrieval: Top-K from embeddings

Step 2: Combine Scores

Final_Score = α × BM25_Score + (1 - α) × Dense_Score

Where:
- α: Weight for sparse (usually 0.3-0.7)
- (1 - α): Weight for dense

Step 3: Re-rank

Sort by combined score
Return top-K

Score Normalization

Problem: BM25 and dense scores have different scales

Solution: Normalize before combining

Min-Max Normalization:

normalized_score = (score - min) / (max - min)

Z-score Normalization:

normalized_score = (score - mean) / std

Common Approach:

Normalize both to [0, 1]
Then combine with weights

Weight Selection

How to choose α:

More sparse (α = 0.7):

Keyword-heavy queries
Exact matching important
Domain-specific terms

More dense (α = 0.3):

Semantic queries
Synonyms important
General understanding

Balanced (α = 0.5):

General use case
Good default
Works for most queries

Tuning:

Test on validation set
Measure retrieval accuracy
Choose α that maximizes performance

Industry Examples

Pinecone:

Hybrid search with BM25 + dense
Automatic score normalization
Tunable weights

Weaviate:

Hybrid search support
BM25 + vector search
Configurable weights

Elasticsearch:

BM25 (default) + vector search
Can combine in queries

5. Other Retrieval Methods

1. Query Expansion

What it is:

Add related terms to query
Synonyms, related concepts
Improve recall

Methods:

Thesaurus: Use synonym dictionary
Word embeddings: Find similar words
LLM-based: Generate related terms

Example:

Original: "machine learning"
Expanded: "machine learning, AI, artificial intelligence, ML, deep learning"

2. Re-ranking

What it is:

Initial retrieval: Fast, approximate (BM25, dense)
Re-ranking: Slow, accurate (cross-encoder)

Two-stage retrieval:

Stage 1: Retrieve top-100 (fast, approximate)
Stage 2: Re-rank top-100 → top-10 (slow, accurate)

Re-ranking models:

Cross-encoder: More accurate than bi-encoder
Learning-to-rank: Trained models
LLM-based: Use LLM to score

3. Multi-Vector Retrieval

What it is:

Multiple embeddings per document
Different granularities (sentence, paragraph, document)
Better coverage

How it works:

Generate embeddings at multiple levels
Query matches against all
Aggregate scores

4. Learned Sparse Retrieval

What it is:

Learn sparse representations (like BM25 but learned)
SPLADE, ColBERT
Better than BM25, interpretable

Advantages:

Learned (optimized for task)
Still interpretable (sparse)
Better than BM25

When to Use Each Method

Use BM25 When:

✅ Keyword-based queries
✅ Exact term matching important
✅ Fast retrieval needed
✅ Interpretable results
✅ Domain-specific terms

Use Dense Retrieval When:

✅ Semantic queries
✅ Synonyms important
✅ Related concepts matter
✅ General understanding needed
✅ Good embeddings available

Use Hybrid Search When:

✅ Production systems
✅ Want best of both
✅ Can afford complexity
✅ Mixed query types
✅ High accuracy needs

Best Practices

1. Start with BM25

Simple, fast, interpretable
Good baseline
Works for keyword queries

2. Add Dense if Needed

If BM25 insufficient
Semantic understanding needed
Good embeddings available

3. Use Hybrid for Production

Best overall performance
Handles both keyword and semantic
Industry standard

4. Tune Parameters

BM25: k₁, b parameters
Hybrid: α weight
Test on validation set

5. Normalize Scores

Before combining
Min-max or z-score
Ensure fair combination

Summary

Key Methods:

TF-IDF: Simple, interpretable, keyword-based
BM25: Advanced TF-IDF, industry standard
Dense: Semantic, embeddings-based
Hybrid: Best of both (BM25 + Dense)

Recommendation:

Start: BM25 (simple, fast)
Upgrade: Add dense retrieval
Production: Hybrid search (BM25 + Dense)

Key Insight:

Sparse (BM25): Exact matches, keywords
Dense: Semantic similarity
Hybrid: Combines both for best results

ML & LLM Interview Prep — Deep Dives