RAG (Retrieval-Augmented Generation) Architecture

Grounding LLMs in knowledge: Combine document retrieval + LLM generation to answer questions with up-to-date, verifiable information.

Pipeline Overview

  1. User Query → “What are Anthropic’s latest products?”
  2. Retrieval → Search knowledge base for relevant documents
  3. Augmentation → Insert retrieved documents into prompt
  4. Generation → LLM generates answer based on context
  5. Output → “Anthropic released Claude 3.5 Sonnet in June 2024…”
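The five steps above can be sketched as a minimal, self-contained pipeline. Here `embed`, `search`, and `generate` are toy stand-ins (word overlap instead of a neural encoder, an echo instead of an LLM call) just to show how the pieces fit together:

```python
def embed(text):
    # Toy "embedding": a lowercase word set (real systems use a neural encoder).
    return set(text.lower().split())

def search(query, docs, k=1):
    # Retrieval: rank documents by word overlap with the query, keep top-k.
    q = embed(query)
    return sorted(docs, key=lambda d: len(q & embed(d)), reverse=True)[:k]

def generate(prompt):
    # Placeholder for an LLM call; here we just echo the grounded prompt.
    return "Answer based on:\n" + prompt

def rag_answer(query, docs):
    context = search(query, docs)                  # 2. retrieval
    prompt = "\n".join(context) + "\nQ: " + query  # 3. augmentation
    return generate(prompt)                        # 4. generation

docs = [
    "Anthropic released Claude 3.5 Sonnet in June 2024.",
    "The weather in Paris is mild in spring.",
]
answer = rag_answer("latest Anthropic products", docs)
```

The structure is the important part: retrieval narrows the corpus to relevant context, and the prompt carries that context into generation.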

Key Components

| Component | Purpose | Technology |
|---|---|---|
| Document Store | Stores all knowledge | Vector DB (Pinecone, Weaviate), Elasticsearch |
| Encoder | Converts text → embeddings | BERT, OpenAI embeddings (text-embedding-3-large) |
| Retriever | Finds similar documents | Semantic search, BM25, hybrid search |
| Prompt Template | Formats context + query | Simple string formatting, LangChain |
| LLM | Generates response | GPT-4, Claude, open-source (LLaMA) |
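Under the hood, the encoder/retriever pair reduces to nearest-neighbor search over vectors, most commonly by cosine similarity. A sketch with hypothetical 3-dimensional embeddings (real encoders emit hundreds to thousands of dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed embeddings; names are illustrative only.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "pricing_doc": [0.8, 0.2, 0.1],
    "hr_doc": [0.0, 0.1, 0.9],
}
best = max(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]))
```

A vector DB does exactly this ranking, but with approximate-nearest-neighbor indexes so it stays fast over millions of documents.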

Why RAG Matters

Problem: LLMs have knowledge cutoffs. A model whose training data ends at a fixed date can't answer "What happened yesterday?"

Solution: Retrieve up-to-date documents, feed to LLM.

Benefits:

  • ✅ Answers grounded in provided documents (verifiable, not hallucinated)
  • ✅ Easy to update—just add new documents
  • ✅ Cheaper than fine-tuning—no retraining
  • ✅ Attribution—can cite sources
  • ✅ Reduces hallucination by ~30–50%

Production Impact:

  • OpenAI’s ChatGPT: Uses RAG for plugins + web browsing
  • Google Gemini: Integrates with search results
  • Stripe: Customer support bot uses RAG over documentation

Implementation Example

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load and split documents
documents = DirectoryLoader("knowledge_base/").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 2. Embed and store
embeddings = OpenAIEmbeddings()
vector_store = Pinecone.from_documents(chunks, embeddings, index_name="rag-index")

# 3. Create retrieval chain
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
llm = OpenAI(temperature=0)  # low temperature keeps answers grounded in context
chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

# 4. Query
response = chain.run("What's Anthropic's latest product?")

Advanced Techniques

Reranking: Retrieve top-50, rerank with cross-encoder, use top-5

  • Improves relevance by ~10%
  • Trade-off: +50ms latency
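The rerank step can be sketched as follows. `toy_score` is a hypothetical stand-in for a real cross-encoder, which would jointly encode each (query, document) pair and is slower but more accurate than the first-stage retriever:

```python
def rerank(query, candidates, score_fn, top_n=5):
    # Re-score every candidate with the (expensive) scorer, keep the best top_n.
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

# Toy scorer: shared-word count. A real cross-encoder would be a model call.
toy_score = lambda q, d: len(set(q.split()) & set(d.split()))

docs = ["retrieval augmented generation", "cats and dogs", "generation of power"]
top = rerank("retrieval augmented generation pipeline", docs, toy_score, top_n=1)
```

This is why the pattern is "retrieve 50, rerank, keep 5": the cheap retriever casts a wide net, and the expensive scorer only runs on the shortlist.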

Hybrid Search: Combine semantic (vector) + keyword (BM25) search

  • Captures both semantic similarity and exact term matches
  • Weaviate, Elasticsearch support natively

Iterative Retrieval: Multi-hop reasoning

  • Query 1: “Who founded OpenAI?”
  • Query 2: “What are their latest products?” (based on answer to Q1)

Production Metrics

| Metric | Target | How to Measure |
|---|---|---|
| Latency | <500ms end-to-end | Retrieval + LLM inference |
| Retrieval Precision@5 | >80% | Relevance of top-5 docs |
| Answer Accuracy | >85% | Human evaluation vs sources |
| Hallucination Rate | <5% | Claims not supported by retrieved docs |

Key Properties Table

| Property | Details |
|---|---|
| Retrieval Latency | 30-100ms (vector search) + 20-50ms (cross-encoder reranking) |
| Indexing Cost | One-time: embeddings for all docs; incremental: O(1) per new doc |
| Context Window Limit | GPT-4: 128K tokens (~96K words); Claude 3.5: 200K tokens (~150K words) |
| Hallucination Reduction | RAG reduces hallucination by 30-50% vs base LLM |
| Retrieval Precision@5 | Target >80% (top-5 docs are relevant) |
| Cost per Query | ~$0.001-0.01 (depends on retriever + LLM) |
| Update Latency | <1 minute for new docs to be searchable |

When to Use / Avoid RAG

Use RAG when:

  • ✅ Knowledge cutoff matters (need current information)
  • ✅ Verifiability is critical (citations required)
  • ✅ Data changes frequently (documents added/updated regularly)
  • ✅ Hallucination risk is high (factual tasks)
  • ✅ Cost per query is critical (cheaper than fine-tuning)

Avoid RAG when:

  • ❌ Response latency <100ms required (retrieval adds overhead)
  • ❌ Domain knowledge is proprietary/encoded in weights (fine-tuning better)
  • ❌ Knowledge base is <100 documents (prompting with examples sufficient)
  • ❌ Perfect real-time updates needed (vector indices have indexing delay)

How Real Systems Use This

Perplexity AI (Web Search RAG): Perplexity built a conversational search engine that retrieves web results in real-time. For each user query (“What are today’s top tech news?”), they: (1) rewrite the query to optimize retrieval, (2) search the web using Bing API (~50ms), (3) retrieve the top 5-10 results, (4) encode results into embeddings, (5) feed to Claude with citations. The system handles ~50M queries/month. Latency is P50=800ms, P99=2s because web search adds 200-300ms overhead. Why Perplexity chose RAG: Web data is constantly updated; grounding all answers in sources provides credibility; chat-based interaction requires conversation memory which RAG handles naturally.

Notion AI (Document-Aware Generation): Notion integrated RAG to make their AI assistant aware of user workspace documents. When a user types “@AI summarize my meeting notes”, the system: (1) searches user’s workspace for documents matching query context, (2) ranks by recency and relevance, (3) retrieves top-5 documents, (4) augments prompt: “Based on these documents from your workspace: [doc1] [doc2]…, summarize key action items”, (5) generates response grounded in user’s actual data. They use pgvector (PostgreSQL vector extension) for embeddings, storing ~1B vectors across their user base. Latency: P50=200ms (local vector search), P99=500ms. Cost is $0.0001-0.0003/query. Why Notion chose RAG: Users want AI scoped to their own content, not general knowledge; built on existing PostgreSQL infrastructure; easy to add new document types.

GitHub Copilot (Code Context RAG): GitHub Copilot uses RAG when suggesting completions. When a developer types in an editor, the system: (1) retrieves similar code patterns from the user’s repo (using semantic search), (2) retrieves from public open-source projects (billions of examples), (3) retrieves recent files the user edited, (4) augments prompt with relevant code context, (5) generates completion grounded in codebase patterns. They use a custom vector index on 50M+ GitHub repositories. Latency: P50=50ms (in-memory retrieval), P99=200ms. Why they chose RAG: Generated code must match project’s style/patterns; hallucinated code is worse than no suggestion; context from repo prevents generic suggestions.

Stripe Support AI (Documentation RAG): Stripe built an AI support assistant grounded in their documentation. When a developer asks “How do I handle failed payments?”, the system: (1) embeds the query with OpenAI text-embedding-3-large, (2) searches their documentation index (1000+ pages) using hybrid search (semantic + keyword BM25), (3) ranks results by relevance and recency, (4) retrieves top-3 docs, (5) adds to prompt: “Answer based on these docs: [doc1] [doc2] [doc3]”. Latency: P50=150ms, P99=400ms. Metrics: 92% of generated answers cite sources correctly; 8% hallucinate despite retrieval. Why RAG: Stripe docs update frequently (new API features weekly); developers need accurate, version-specific answers; citations allow users to verify or dig deeper.

Anthropic Constitutional AI with RAG: Anthropic uses RAG internally for customer support, augmenting Claude responses with product documentation and FAQ. For each customer query, they: (1) search a knowledge base of 500+ documents (docs, FAQs, past issues), (2) use dense retrieval (embeddings) + sparse retrieval (BM25), (3) rerank with a cross-encoder model (improves precision by ~8%), (4) truncate to fit context window, (5) generate response with citations. Latency: P50=300ms, P99=800ms (includes cross-encoding). Accuracy: 94% of answers are factually correct per human eval. Why RAG: Internal systems benefit from guaranteed grounding; customer trust requires citations; easy to iterate on knowledge base without model retraining.


Advanced Techniques Deep Dive

Reranking Strategy: Retrieve 50 candidates, rerank top-20 with cross-encoder

  • ColBERT-style re-ranking improves relevance by 10-15%
  • Latency cost: 50ms for batch re-ranking
  • Trade-off: Better quality vs higher latency

Chunking Strategies:

  • Fixed-size chunks (512 tokens): Simple, fast indexing
  • Semantic chunks (break at paragraph/section boundaries): Better context preservation, slightly slower
  • Sliding window chunks (256-token stride): Captures cross-boundary context, 2x memory cost
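A sliding-window chunker can be written in a few lines, assuming the document is already tokenized (the integer token list below is a stand-in for real tokenizer output):

```python
def sliding_chunks(tokens, chunk_size=8, stride=4):
    # Overlapping windows: each chunk starts `stride` tokens after the last,
    # so text near a chunk boundary appears in two chunks.
    chunks = []
    for start in range(0, max(len(tokens) - chunk_size, 0) + 1, stride):
        chunks.append(tokens[start : start + chunk_size])
    return chunks

tokens = list(range(16))  # stand-in for a tokenized document
chunks = sliding_chunks(tokens, chunk_size=8, stride=4)
```

With stride = chunk_size / 2 every token lands in two chunks, which is where the roughly 2x memory cost mentioned above comes from.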

Multi-hop Retrieval: Use conversation history to inform subsequent retrievals

  • Query 1: “Who founded OpenAI?” → Retrieve about Sam Altman
  • Query 2: “What companies did they found before?” → Reuse Query 1 context to guide new search
  • Improves answer quality by 15-20% for complex multi-step questions
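One way to sketch multi-hop retrieval: fold the answer from hop 1 into the retrieval query for hop 2. The keyword-lookup `retrieve` and the one-line `answer` below are hypothetical stand-ins for a vector store and an LLM call:

```python
def multi_hop(q1, q2_template, retrieve, answer):
    # Hop 1: retrieve for the first question and extract an answer.
    a1 = answer(q1, retrieve(q1))
    # Hop 2: substitute hop 1's answer into the follow-up query, then retrieve.
    return retrieve(q2_template.format(a1))

# Toy knowledge base keyed by query keywords (stand-in for a vector store).
kb = {
    "founded OpenAI": ["Sam Altman co-founded OpenAI"],
    "Sam Altman": ["Sam Altman previously led Y Combinator"],
}
retrieve = lambda q: next((docs for key, docs in kb.items() if key in q), [])
answer = lambda q, docs: "Sam Altman" if docs else "unknown"

docs = multi_hop("Who founded OpenAI?", "{} companies founded", retrieve, answer)
```

The key point is that the second retrieval query is grounded in retrieved evidence, not just the raw user text.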

Hybrid Search Pattern:

Dense Retrieval (vector): Top 20 semantically similar docs
Sparse Retrieval (BM25): Top 20 exact term matches
Union: Combine and deduplicate (40 docs max)
Rerank: Cross-encoder scores the union, keep top 5
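One common way to implement the combine-and-deduplicate step is Reciprocal Rank Fusion (RRF), which needs only the rank positions from each retriever, no score calibration. This is a fusion sketch, not the cross-encoder rerank itself:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank)) across
    # every ranked list it appears in; duplicates merge automatically.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector search results
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 results
fused = rrf_fuse([dense, sparse])
```

Documents ranked highly by both retrievers (like doc_b here) float to the top, which is exactly the behavior hybrid search is after.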

Production Metrics

| Metric | Target | How to Measure |
|---|---|---|
| Retrieval Latency | <100ms | Log time from query to retrieved docs |
| Reranking Latency | <50ms | Cross-encoder inference time |
| End-to-end Latency | <500ms | Query start to LLM output |
| Retrieval Precision@5 | >80% | Human evaluation: are top-5 docs relevant? |
| Retrieval Recall@20 | >90% | Did search miss relevant docs? |
| Answer Accuracy | >85% | Does answer match retrieved docs? |
| Hallucination Rate | <5% | % of claims contradicted by sources |
| Citation Accuracy | >95% | Are cited documents actually used? |
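Precision@k and Recall@k from the table can be computed directly, given a labeled set of relevant documents for each test query:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved docs that are actually relevant.
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / len(top)

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant docs that appear in the top-k results.
    hits = sum(doc in relevant for doc in retrieved[:k])
    return hits / len(relevant)

retrieved = ["d1", "d2", "d3", "d4", "d5"]   # ranked retriever output
relevant = {"d1", "d3", "d4", "d9"}          # human-labeled relevant set

p5 = precision_at_k(retrieved, relevant, 5)  # 3 of top-5 are relevant
r5 = recall_at_k(retrieved, relevant, 5)     # 3 of 4 relevant docs found
```

Averaging these over a held-out query set (say, 100 labeled queries) gives the numbers to track against the targets above.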

Implementation Best Practices

  1. Separate indexing from retrieval: Index documents asynchronously, query synchronously
  2. Monitor retrieval quality: Track P@5, P@20, user feedback on retrieved results
  3. Implement query expansion: Rephrase user query for better retrieval
  4. Use query routing: Route queries to appropriate knowledge base (product docs vs internal FAQ vs web search)
  5. Cache frequently asked questions: Skip retrieval for known Q&A pairs
  6. Monitor hallucination: Flag answers that contradict sources; collect human feedback
  7. Implement feedback loops: Use user feedback (thumbs up/down) to improve retrieval weights
  8. Batch indexing: Index documents in bulk (1000 at a time) for 10-20x faster throughput vs individual inserts
  9. Implement rate limiting on retrievals: Limit vector DB queries/sec to prevent runaway costs
  10. Version knowledge base: Track document versions to enable rollback if bad data indexed
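Practice 5 (caching frequently asked questions) can be as simple as memoizing the answer function. `full_rag_pipeline` below is a hypothetical stand-in for the retrieval + generation call:

```python
from functools import lru_cache

calls = {"pipeline_runs": 0}

def full_rag_pipeline(query):
    # Hypothetical stand-in for retrieval + LLM generation.
    calls["pipeline_runs"] += 1
    return f"answer to: {query}"

@lru_cache(maxsize=1024)
def cached_answer(query):
    # Identical queries hit the cache and skip retrieval entirely.
    return full_rag_pipeline(query)

a = cached_answer("How do refunds work?")
b = cached_answer("How do refunds work?")  # cache hit: pipeline not re-run
```

Production systems usually add a TTL and normalize queries (lowercase, strip punctuation) before the cache lookup so near-duplicate phrasings share entries.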

Common Failure Modes & Mitigations

Retrieval Failure (No Relevant Docs Found):

  • Symptom: Query returns top-5 docs with low similarity scores (<0.5)
  • Root cause: Knowledge base doesn’t contain relevant information
  • Mitigation: Expand knowledge base, use query expansion to rephrase query, implement fallback to prompting-only mode
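The fallback-to-prompting-only mitigation might look like this, reusing the 0.5 similarity floor from the symptom above and assuming the retriever returns (doc, similarity) pairs:

```python
SIMILARITY_FLOOR = 0.5  # below this, treat retrieval as a miss

def answer_with_fallback(query, retrieve, generate):
    # If no retrieved doc clears the floor, prompt without context
    # rather than grounding the answer in irrelevant documents.
    hits = retrieve(query)  # list of (doc, similarity) pairs
    good = [doc for doc, score in hits if score >= SIMILARITY_FLOOR]
    if good:
        return generate("Answer using these docs: " + "; ".join(good) + "\nQ: " + query)
    return generate("Q: " + query)  # prompting-only fallback

# Toy stand-ins: a retriever that only finds a weak match, and an echo "LLM".
retrieve = lambda q: [("unrelated doc", 0.2)]
generate = lambda prompt: prompt

out = answer_with_fallback("obscure question", retrieve, generate)
```

Logging how often the fallback fires is also a cheap signal that the knowledge base has a coverage gap.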

Hallucination Despite RAG (LLM Invents Facts):

  • Symptom: Answer contradicts retrieved documents
  • Root cause: LLM ignores context, generates from training data instead
  • Mitigation: Implement strict “grounding” requirement (model trained to cite sources), use smaller context windows to force model to focus on documents

Context Window Overflow (Retrieved Context Too Large):

  • Symptom: Latency spikes, or requests fail outright, when context approaches the model’s window (e.g., >120K tokens for GPT-4)
  • Root cause: Retrieved too many documents
  • Mitigation: Limit retrieval to top-3-5 docs, implement token-aware chunking (split at sentence/paragraph boundaries to preserve meaning in truncated context)
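A token-aware packing sketch for the mitigation above: keep whole documents until the budget runs out rather than truncating mid-document (`count_tokens` here is a whitespace word count standing in for a real tokenizer):

```python
def fit_context(docs, budget, count_tokens=lambda s: len(s.split())):
    # Greedily pack whole documents (in retrieval-rank order) until the
    # token budget is exhausted; never cut a document in half.
    kept, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept

docs = ["one two three", "four five", "six seven eight nine"]
kept = fit_context(docs, budget=6)  # third doc would overflow the budget
```

Because documents arrive in relevance order, dropping from the tail sacrifices the least useful context first.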

Stale/Incorrect Context (Knowledge Base Contains Outdated Info):

  • Symptom: Answers reference outdated information (e.g., “product is $99” but price changed to $79)
  • Root cause: Knowledge base not updated when source data changes
  • Mitigation: Implement document versioning, set aggressive TTL on embeddings (re-index every day), monitor document freshness

Latency Regression (Slow Retrievals):

  • Symptom: Retrieval latency suddenly increases from 50ms to 500ms
  • Root cause: Vector DB performance degraded (overloaded, disk slow), or network latency spike
  • Mitigation: Implement timeout on retrieval (fail-fast if >200ms), add retrieval latency SLO to monitoring, scale vector DB horizontally if consistently over-loaded

Integration Patterns with LLM Applications

Pattern 1: Conversational RAG (Multi-turn)

  • Maintain conversation history
  • On each turn, extract entities from user message
  • Retrieve documents relevant to current + previous context
  • Augment prompt with conversation history + retrieved docs
  • Example: Customer support chatbot remembering previous issues

Pattern 2: Query Classification + RAG

  • Classify incoming query: “Is this factual/knowledge question or opinion/creative?”
  • Route: factual → RAG (retrieves docs), opinion → prompt-only (no retrieval overhead)
  • Benefits: Skip retrieval for questions that don’t need it, faster response
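A deliberately crude keyword router illustrating Pattern 2; a production system would more likely use a small classifier model, but the routing logic is the same:

```python
FACTUAL_CUES = ("what", "when", "who", "how", "where")

def route(query):
    # Heuristic classification: factual-looking questions go through RAG,
    # everything else skips retrieval and goes straight to the LLM.
    first_word = query.strip().lower().split()[0]
    return "rag" if first_word in FACTUAL_CUES else "prompt_only"

r1 = route("What is the refund policy?")
r2 = route("Write me a poem about autumn")
```

Even a rough router pays off: every query it sends down the prompt-only path saves the full retrieval latency and vector DB cost.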

Pattern 3: Ensemble Retrieval

  • Multiple retrievers (dense, sparse, lexical, re-rank)
  • Aggregate scores, return top-k union
  • Benefits: Catch results missed by any single retriever

Deployment Checklist

Before deploying RAG system to production:

  1. Retrieval Quality: Benchmark P@5, P@20 on 100 test queries (target >80% precision)
  2. Hallucination Rate: Evaluate 50 generated answers, flag contradictions (<5% target)
  3. Latency SLO: P99 <500ms, P50 <200ms (set alerts if violated)
  4. Cost Budget: $XXX/month for vector DB + API calls (monitor daily spend)
  5. Monitoring: Dashboard for retrieval quality, hallucination detection, latency trends
  6. Fallback: If retrieval fails (no good docs), gracefully degrade to prompting-only mode
  7. Knowledge Base Freshness: Document last-update timestamps, alert if docs stale >30 days
  8. User Feedback Loop: Collect thumbs up/down on answers, analyze failure patterns monthly

References

  • 📄 RAG Survey (Lewis et al., 2021)
  • 📄 Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
  • 📄 ColBERT: Contextualized Late Interaction over BERT (Khattab & Zaharia, 2020)
  • 📄 Hybrid Search Survey (Guo et al., 2016)
  • 🔗 LangChain Documentation
  • 🔗 LlamaIndex Documentation
  • 🎥 RAG Deep Dive (Jeremy Howard, fast.ai)

This post is licensed under CC BY 4.0 by the author.