RAG Architecture: Industry-Standard Design

Overview

RAG (Retrieval-Augmented Generation) combines retrieval of relevant documents with language model generation. This document covers the complete architecture used in production systems.

Core Components

1. Document Ingestion Pipeline

Purpose: Process and store documents for retrieval

Steps:

a) Document Loading:

  • PDF, DOCX, HTML, Markdown, etc.
  • Web scraping
  • Database queries
  • API integrations

b) Text Extraction:

  • OCR for scanned documents
  • Table extraction
  • Code extraction
  • Metadata extraction

c) Preprocessing:

  • Clean text (remove headers, footers)
  • Normalize whitespace
  • Handle special characters
  • Language detection

d) Chunking:

  • Split documents into smaller pieces
  • Multiple strategies (see chunking strategies)

e) Metadata Extraction:

  • Document source
  • Creation date
  • Author
  • Section titles
  • Page numbers

2. Embedding Generation

Purpose: Convert text chunks into vector representations

Components:

a) Embedding Model:

  • Text embeddings: sentence-transformers, OpenAI embeddings
  • Multilingual: mBERT, multilingual models
  • Domain-specific: Fine-tuned on domain data

b) Embedding Strategy:

  • Single embedding: One vector per chunk
  • Multi-vector: Multiple embeddings per chunk (different granularities)
  • Hybrid: Dense + sparse (BM25) embeddings

c) Embedding Storage:

  • Vector database (Pinecone, Weaviate, Qdrant, Chroma)
  • Metadata filtering
  • Index optimization

3. Query Processing

Purpose: Process user queries for retrieval

Steps:

a) Query Understanding:

  • Intent detection
  • Entity extraction
  • Query expansion
  • Query rewriting

b) Embedding Generation:

  • Same embedding model as documents
  • Query-specific preprocessing

c) Hybrid Search:

  • Dense retrieval (vector similarity)
  • Sparse retrieval (keyword matching, BM25)
  • Combine both

4. Retrieval System

Purpose: Find relevant documents for query

Components:

a) Sparse Retrieval (BM25, TF-IDF):

  • BM25: Industry standard for keyword-based retrieval
    • Term frequency saturation
    • Document length normalization
    • Better than TF-IDF
  • TF-IDF: Simple, interpretable
  • Use: Exact term matching, keywords

b) Dense Retrieval (Embeddings):

  • Vector similarity (cosine, dot product)
  • Semantic understanding
  • Use: Related concepts, synonyms

c) Hybrid Search:

  • Combine sparse + dense
  • Weighted combination: α × BM25 + (1-α) × Dense
  • Use: Production systems (best of both)

d) Vector Search:

  • Approximate nearest neighbor (ANN) search
  • FAISS, HNSW, IVF
  • Fast retrieval for large datasets

e) Filtering:

  • Metadata filters (date, source, type)
  • Access control
  • Relevance thresholds

f) Re-ranking:

  • Cross-encoder for better accuracy
  • Learning-to-rank models
  • Multi-stage retrieval

See retrieval_methods.md and retrieval_implementations.py for detailed implementations!

5. Context Assembly

Purpose: Prepare context for generation

Steps:

a) Chunk Selection:

  • Top-K retrieval results
  • Diversity selection
  • Deduplication

b) Context Ordering:

  • Relevance-based ordering
  • Chronological ordering
  • Hierarchical ordering

c) Context Truncation:

  • Fit within model context window
  • Priority-based truncation
  • Summary for overflow

6. Generation

Purpose: Generate answer using retrieved context

Components:

a) Prompt Engineering:

  • System prompts
  • Context formatting
  • Few-shot examples
  • Instructions

b) Generation Parameters:

  • Temperature
  • Top-p sampling
  • Max tokens
  • Stop sequences

c) Post-processing:

  • Answer extraction
  • Citation generation
  • Confidence scoring
  • Hallucination detection

Complete RAG Pipeline

User Query
    ↓
Query Processing (embedding, expansion)
    ↓
Retrieval (vector search + filtering)
    ↓
Re-ranking (optional, for accuracy)
    ↓
Context Assembly (top-K chunks, ordering)
    ↓
Generation (LLM with context)
    ↓
Post-processing (extraction, citations)
    ↓
Final Answer

Industry-Standard Architecture Patterns

Pattern 1: Simple RAG

Use Case: Basic Q&A, small document sets

Flow:

  1. Query → Embedding
  2. Vector search → Top-K chunks
  3. Context + Query → LLM
  4. Return answer

Pros:

  • Simple to implement
  • Fast
  • Low cost

Cons:

  • Limited accuracy
  • No re-ranking
  • Basic chunking

Pattern 2: Advanced RAG

Use Case: Production systems, large document sets

Flow:

  1. Query → Query expansion/rewriting
  2. Hybrid search (dense + sparse)
  3. Re-ranking with cross-encoder
  4. Context assembly with diversity
  5. Generation with citations
  6. Answer validation

Pros:

  • High accuracy
  • Better retrieval
  • Citations

Cons:

  • More complex
  • Higher latency
  • Higher cost

Pattern 3: Multi-Stage RAG

Use Case: Very large document sets, high accuracy needs

Flow:

  1. Query → Query understanding
  2. Stage 1: Coarse retrieval (ANN, top-100)
  3. Stage 2: Re-ranking (cross-encoder, top-10)
  4. Stage 3: Fine-grained selection (top-5)
  5. Context assembly
  6. Generation
  7. Answer validation + citations

Pros:

  • Best accuracy
  • Handles large scale
  • Efficient

Cons:

  • Most complex
  • Multiple models
  • Higher latency

Key Design Decisions

1. Chunking Strategy

Options:

  • Fixed-size chunks
  • Sentence-based
  • Paragraph-based
  • Semantic chunking
  • Hierarchical chunking

Considerations:

  • Document type
  • Query type
  • Context window size
  • Retrieval accuracy

2. Embedding Model

Options:

  • General-purpose (OpenAI, sentence-transformers)
  • Domain-specific (fine-tuned)
  • Multilingual
  • Multimodal

Considerations:

  • Domain match
  • Language support
  • Cost
  • Latency

3. Vector Database

Options:

  • Pinecone (managed)
  • Weaviate (self-hosted)
  • Qdrant (self-hosted)
  • Chroma (open-source)
  • FAISS (library)

Considerations:

  • Scale
  • Latency
  • Cost
  • Features (filtering, metadata)

4. Retrieval Strategy

Options:

  • Dense only (vector similarity)
  • Sparse only (BM25)
  • Hybrid (dense + sparse)
  • Multi-vector

Considerations:

  • Query type
  • Document type
  • Accuracy needs
  • Latency requirements

5. Re-ranking

Options:

  • No re-ranking
  • Cross-encoder
  • Learning-to-rank
  • LLM-based re-ranking

Considerations:

  • Accuracy needs
  • Latency budget
  • Cost budget

Production Considerations

1. Scalability

Challenges:

  • Large document sets (millions)
  • High query volume
  • Real-time updates

Solutions:

  • Distributed vector databases
  • Caching
  • Incremental indexing
  • Sharding

2. Latency

Challenges:

  • Vector search time
  • Re-ranking time
  • Generation time

Solutions:

  • ANN search (approximate)
  • Parallel retrieval
  • Caching frequent queries
  • Model optimization

3. Cost

Challenges:

  • Embedding API costs
  • LLM API costs
  • Vector database costs

Solutions:

  • Self-hosted models
  • Batch processing
  • Caching
  • Efficient chunking

4. Accuracy

Challenges:

  • Retrieval accuracy
  • Generation quality
  • Hallucination

Solutions:

  • Better embeddings
  • Re-ranking
  • Prompt engineering
  • Answer validation

5. Monitoring

Metrics:

  • Retrieval accuracy
  • Generation quality
  • Latency
  • Cost per query
  • User satisfaction

Tools:

  • Logging
  • Analytics
  • A/B testing
  • User feedback

Summary

Key Components:

  1. Document ingestion and chunking
  2. Embedding generation and storage
  3. Query processing
  4. Retrieval system
  5. Context assembly
  6. Generation

Architecture Patterns:

  • Simple RAG: Basic implementation
  • Advanced RAG: Production-ready
  • Multi-Stage RAG: High accuracy

Production Considerations:

  • Scalability, latency, cost, accuracy, monitoring