RAG Architecture: Industry-Standard Design
Overview
RAG (Retrieval-Augmented Generation) combines retrieval of relevant documents with language model generation. This document covers the complete architecture used in production systems.
Core Components
1. Document Ingestion Pipeline
Purpose: Process and store documents for retrieval
Steps:
a) Document Loading:
- PDF, DOCX, HTML, Markdown, etc.
- Web scraping
- Database queries
- API integrations
b) Text Extraction:
- OCR for scanned documents
- Table extraction
- Code extraction
- Metadata extraction
c) Preprocessing:
- Clean text (remove headers, footers)
- Normalize whitespace
- Handle special characters
- Language detection
d) Chunking:
- Split documents into smaller pieces
- Multiple strategies (see chunking strategies)
e) Metadata Extraction:
- Document source
- Creation date
- Author
- Section titles
- Page numbers
2. Embedding Generation
Purpose: Convert text chunks into vector representations
Components:
a) Embedding Model:
- Text embeddings: sentence-transformers, OpenAI embeddings
- Multilingual: mBERT, multilingual models
- Domain-specific: Fine-tuned on domain data
b) Embedding Strategy:
- Single embedding: One vector per chunk
- Multi-vector: Multiple embeddings per chunk (different granularities)
- Hybrid: Dense + sparse (BM25) embeddings
c) Embedding Storage:
- Vector database (Pinecone, Weaviate, Qdrant, Chroma)
- Metadata filtering
- Index optimization
3. Query Processing
Purpose: Process user queries for retrieval
Steps:
a) Query Understanding:
- Intent detection
- Entity extraction
- Query expansion
- Query rewriting
b) Embedding Generation:
- Same embedding model as documents
- Query-specific preprocessing
c) Hybrid Search:
- Dense retrieval (vector similarity)
- Sparse retrieval (keyword matching, BM25)
- Combine both
4. Retrieval System
Purpose: Find relevant documents for query
Components:
a) Sparse Retrieval (BM25, TF-IDF):
- BM25: Industry standard for keyword-based retrieval
- Term frequency saturation
- Document length normalization
- Better than TF-IDF
- TF-IDF: Simple, interpretable
- Use: Exact term matching, keywords
b) Dense Retrieval (Embeddings):
- Vector similarity (cosine, dot product)
- Semantic understanding
- Use: Related concepts, synonyms
c) Hybrid Search:
- Combine sparse + dense
- Weighted combination: α × BM25 + (1-α) × Dense
- Use: Production systems (best of both)
d) Vector Search:
- Approximate nearest neighbor (ANN) search
- FAISS, HNSW, IVF
- Fast retrieval for large datasets
e) Filtering:
- Metadata filters (date, source, type)
- Access control
- Relevance thresholds
f) Re-ranking:
- Cross-encoder for better accuracy
- Learning-to-rank models
- Multi-stage retrieval
See retrieval_methods.md and retrieval_implementations.py for detailed implementations!
5. Context Assembly
Purpose: Prepare context for generation
Steps:
a) Chunk Selection:
- Top-K retrieval results
- Diversity selection
- Deduplication
b) Context Ordering:
- Relevance-based ordering
- Chronological ordering
- Hierarchical ordering
c) Context Truncation:
- Fit within model context window
- Priority-based truncation
- Summary for overflow
6. Generation
Purpose: Generate answer using retrieved context
Components:
a) Prompt Engineering:
- System prompts
- Context formatting
- Few-shot examples
- Instructions
b) Generation Parameters:
- Temperature
- Top-p sampling
- Max tokens
- Stop sequences
c) Post-processing:
- Answer extraction
- Citation generation
- Confidence scoring
- Hallucination detection
Complete RAG Pipeline
User Query
↓
Query Processing (embedding, expansion)
↓
Retrieval (vector search + filtering)
↓
Re-ranking (optional, for accuracy)
↓
Context Assembly (top-K chunks, ordering)
↓
Generation (LLM with context)
↓
Post-processing (extraction, citations)
↓
Final Answer
Industry-Standard Architecture Patterns
Pattern 1: Simple RAG
Use Case: Basic Q&A, small document sets
Flow:
- Query → Embedding
- Vector search → Top-K chunks
- Context + Query → LLM
- Return answer
Pros:
- Simple to implement
- Fast
- Low cost
Cons:
- Limited accuracy
- No re-ranking
- Basic chunking
Pattern 2: Advanced RAG
Use Case: Production systems, large document sets
Flow:
- Query → Query expansion/rewriting
- Hybrid search (dense + sparse)
- Re-ranking with cross-encoder
- Context assembly with diversity
- Generation with citations
- Answer validation
Pros:
- High accuracy
- Better retrieval
- Citations
Cons:
- More complex
- Higher latency
- Higher cost
Pattern 3: Multi-Stage RAG
Use Case: Very large document sets, high accuracy needs
Flow:
- Query → Query understanding
- Stage 1: Coarse retrieval (ANN, top-100)
- Stage 2: Re-ranking (cross-encoder, top-10)
- Stage 3: Fine-grained selection (top-5)
- Context assembly
- Generation
- Answer validation + citations
Pros:
- Best accuracy
- Handles large scale
- Efficient
Cons:
- Most complex
- Multiple models
- Higher latency
Key Design Decisions
1. Chunking Strategy
Options:
- Fixed-size chunks
- Sentence-based
- Paragraph-based
- Semantic chunking
- Hierarchical chunking
Considerations:
- Document type
- Query type
- Context window size
- Retrieval accuracy
2. Embedding Model
Options:
- General-purpose (OpenAI, sentence-transformers)
- Domain-specific (fine-tuned)
- Multilingual
- Multimodal
Considerations:
- Domain match
- Language support
- Cost
- Latency
3. Vector Database
Options:
- Pinecone (managed)
- Weaviate (self-hosted)
- Qdrant (self-hosted)
- Chroma (open-source)
- FAISS (library)
Considerations:
- Scale
- Latency
- Cost
- Features (filtering, metadata)
4. Retrieval Strategy
Options:
- Dense only (vector similarity)
- Sparse only (BM25)
- Hybrid (dense + sparse)
- Multi-vector
Considerations:
- Query type
- Document type
- Accuracy needs
- Latency requirements
5. Re-ranking
Options:
- No re-ranking
- Cross-encoder
- Learning-to-rank
- LLM-based re-ranking
Considerations:
- Accuracy needs
- Latency budget
- Cost budget
Production Considerations
1. Scalability
Challenges:
- Large document sets (millions)
- High query volume
- Real-time updates
Solutions:
- Distributed vector databases
- Caching
- Incremental indexing
- Sharding
2. Latency
Challenges:
- Vector search time
- Re-ranking time
- Generation time
Solutions:
- ANN search (approximate)
- Parallel retrieval
- Caching frequent queries
- Model optimization
3. Cost
Challenges:
- Embedding API costs
- LLM API costs
- Vector database costs
Solutions:
- Self-hosted models
- Batch processing
- Caching
- Efficient chunking
4. Accuracy
Challenges:
- Retrieval accuracy
- Generation quality
- Hallucination
Solutions:
- Better embeddings
- Re-ranking
- Prompt engineering
- Answer validation
5. Monitoring
Metrics:
- Retrieval accuracy
- Generation quality
- Latency
- Cost per query
- User satisfaction
Tools:
- Logging
- Analytics
- A/B testing
- User feedback
Summary
Key Components:
- Document ingestion and chunking
- Embedding generation and storage
- Query processing
- Retrieval system
- Context assembly
- Generation
Architecture Patterns:
- Simple RAG: Basic implementation
- Advanced RAG: Production-ready
- Multi-Stage RAG: High accuracy
Production Considerations:
- Scalability, latency, cost, accuracy, monitoring