RAG (Retrieval-Augmented Generation) Architecture
Grounding LLMs in knowledge: Combine document retrieval + LLM generation to answer questions with up-to-date, verifiable information.
Pipeline Overview
- User Query → “What are Anthropic’s latest products?”
- Retrieval → Search knowledge base for relevant documents
- Augmentation → Insert retrieved documents into prompt
- Generation → LLM generates answer based on context
- Output → “Anthropic released Claude 3.5 Sonnet in June 2024…”
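The five steps above can be sketched as a minimal loop. This is an illustrative skeleton, not a production implementation: `index.search` and `llm` are hypothetical stand-ins for a real vector store and model client.

```python
def answer(query, index, llm, k=3):
    """Minimal RAG loop: retrieve, augment, generate."""
    # 1. Retrieval: find the k most relevant documents
    docs = index.search(query, k=k)
    # 2. Augmentation: insert retrieved text into the prompt
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt = (f"Answer using only these documents:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    # 3. Generation: the LLM answers grounded in the context
    return llm(prompt)
```

Everything else in a RAG system (chunking, reranking, caching) is an elaboration of one of these three steps.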
Key Components
| Component | Purpose | Technology |
|---|---|---|
| Document Store | Stores all knowledge | Vector DB (Pinecone, Weaviate), Elasticsearch |
| Encoder | Converts text → embeddings | BERT, OpenAI embeddings (text-embedding-3-large) |
| Retriever | Finds similar documents | Semantic search, BM25, hybrid search |
| Prompt Template | Formats context + query | Simple string formatting, LangChain |
| LLM | Generates response | GPT-4, Claude, open-source (LLaMA) |
Why RAG Matters
Problem: LLMs have knowledge cutoffs. A model trained on data up to a fixed date cannot answer “What happened yesterday?”
Solution: Retrieve up-to-date documents, feed to LLM.
Benefits:
- ✅ Answers grounded in provided documents (verifiable, not hallucinated)
- ✅ Easy to update—just add new documents
- ✅ Cheaper than fine-tuning—no retraining
- ✅ Attribution—can cite sources
- ✅ Reduces hallucination by ~30–50%
Production Impact:
- OpenAI’s ChatGPT: Uses RAG for plugins + web browsing
- Google Gemini: Integrates with search results
- Stripe: Customer support bot uses RAG over documentation
Implementation Example
```python
# Uses the classic LangChain API; package layout may differ in newer versions.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load and split documents into ~512-token chunks
documents = DirectoryLoader("knowledge_base/").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 2. Embed and store in a vector index
embeddings = OpenAIEmbeddings()
vector_store = Pinecone.from_documents(chunks, embeddings, index_name="rag-index")

# 3. Create a retrieval chain over the top-3 documents
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
llm = OpenAI(temperature=0)  # low temperature for factual answers
chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

# 4. Query
response = chain.run("What's Anthropic's latest product?")
```
Advanced Techniques
Reranking: Retrieve top-50, rerank with cross-encoder, use top-5
- Improves relevance by ~10%
- Trade-off: +50ms latency
Hybrid Search: Combine semantic (vector) + keyword (BM25) search
- Captures both semantic similarity and exact term matches
- Weaviate, Elasticsearch support natively
Iterative Retrieval: Multi-hop reasoning
- Query 1: “Who founded OpenAI?”
- Query 2: “What are their latest products?” (based on answer to Q1)
Production Metrics
| Metric | Target | How to Measure |
|---|---|---|
| Latency | <500ms end-to-end | Retrieval + LLM inference |
| Retrieval Precision@5 | >80% | Relevance of top-5 docs |
| Answer Accuracy | >85% | Human evaluation vs sources |
| Hallucination Rate | <5% | Claims not supported by retrieved docs |
Key Properties Table
| Property | Details |
|---|---|
| Retrieval Latency | 30-100ms (vector search) + 20-50ms (cross-encoder reranking) |
| Indexing Cost | One-time: embeddings for all docs; incremental: O(1) per new doc |
| Context Window Limit | GPT-4 Turbo: 128K tokens; Claude 3.5: 200K tokens (roughly 4 characters of English text per token) |
| Hallucination Reduction | RAG reduces hallucination by 30-50% vs base LLM |
| Retrieval Precision@5 | Target >80% (top-5 docs are relevant) |
| Cost per Query | ~$0.001-0.01 (depends on retriever + LLM) |
| Update Latency | <1 minute for new docs to be searchable |
When to Use / Avoid RAG
Use RAG when:
- ✅ Knowledge cutoff matters (need current information)
- ✅ Verifiability is critical (citations required)
- ✅ Data changes frequently (documents added/updated regularly)
- ✅ Hallucination risk is high (factual tasks)
- ✅ Cost per query is critical (cheaper than fine-tuning)
Avoid RAG when:
- ❌ Response latency <100ms required (retrieval adds overhead)
- ❌ Domain knowledge is proprietary/encoded in weights (fine-tuning better)
- ❌ Knowledge base is <100 documents (prompting with examples sufficient)
- ❌ Perfect real-time updates needed (vector indices have indexing delay)
How Real Systems Use This
Perplexity AI (Web Search RAG): Perplexity built a conversational search engine that retrieves web results in real-time. For each user query (“What are today’s top tech news?”), they: (1) rewrite the query to optimize retrieval, (2) search the web using Bing API (~50ms), (3) retrieve the top 5-10 results, (4) encode results into embeddings, (5) feed to Claude with citations. The system handles ~50M queries/month. Latency is P50=800ms, P99=2s because web search adds 200-300ms overhead. Why Perplexity chose RAG: Web data is constantly updated; grounding all answers in sources provides credibility; chat-based interaction requires conversation memory which RAG handles naturally.
Notion AI (Document-Aware Generation): Notion integrated RAG to make their AI assistant aware of user workspace documents. When a user types “@AI summarize my meeting notes”, the system: (1) searches user’s workspace for documents matching query context, (2) ranks by recency and relevance, (3) retrieves top-5 documents, (4) augments prompt: “Based on these documents from your workspace: [doc1] [doc2]…, summarize key action items”, (5) generates response grounded in user’s actual data. They use pgvector (PostgreSQL vector extension) for embeddings, storing ~1B vectors across their user base. Latency: P50=200ms (local vector search), P99=500ms. Cost is $0.0001-0.0003/query. Why Notion chose RAG: Users want AI scoped to their own content, not general knowledge; built on existing PostgreSQL infrastructure; easy to add new document types.
GitHub Copilot (Code Context RAG): GitHub Copilot uses RAG when suggesting completions. When a developer types in an editor, the system: (1) retrieves similar code patterns from the user’s repo (using semantic search), (2) retrieves from public open-source projects (billions of examples), (3) retrieves recent files the user edited, (4) augments prompt with relevant code context, (5) generates completion grounded in codebase patterns. They use a custom vector index on 50M+ GitHub repositories. Latency: P50=50ms (in-memory retrieval), P99=200ms. Why they chose RAG: Generated code must match project’s style/patterns; hallucinated code is worse than no suggestion; context from repo prevents generic suggestions.
Stripe Support AI (Documentation RAG): Stripe built an AI support assistant grounded in their documentation. When a developer asks “How do I handle failed payments?”, the system: (1) embeds the query with OpenAI text-embedding-3-large, (2) searches their documentation index (1000+ pages) using hybrid search (semantic + keyword BM25), (3) ranks results by relevance and recency, (4) retrieves top-3 docs, (5) adds to prompt: “Answer based on these docs: [doc1] [doc2] [doc3]”. Latency: P50=150ms, P99=400ms. Metrics: 92% of generated answers cite sources correctly; 8% hallucinate despite retrieval. Why RAG: Stripe docs update frequently (new API features weekly); developers need accurate, version-specific answers; citations allow users to verify or dig deeper.
Anthropic Constitutional AI with RAG: Anthropic uses RAG internally for customer support, augmenting Claude responses with product documentation and FAQ. For each customer query, they: (1) search a knowledge base of 500+ documents (docs, FAQs, past issues), (2) use dense retrieval (embeddings) + sparse retrieval (BM25), (3) rerank with a cross-encoder model (improves precision by ~8%), (4) truncate to fit context window, (5) generate response with citations. Latency: P50=300ms, P99=800ms (includes cross-encoding). Accuracy: 94% of answers are factually correct per human eval. Why RAG: Internal systems benefit from guaranteed grounding; customer trust requires citations; easy to iterate on knowledge base without model retraining.
Advanced Techniques Deep Dive
Reranking Strategy: Retrieve 50 candidates, rerank top-20 with cross-encoder
- ColBERT-style reranking improves relevance by 10-15%
- Latency cost: 50ms for batch re-ranking
- Trade-off: Better quality vs higher latency
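The two-stage pattern can be sketched as follows. `search` and `rerank_score` are hypothetical stand-ins; in practice they would be a vector index lookup and a cross-encoder model (e.g., a ColBERT or MiniLM-style scorer).

```python
def retrieve_and_rerank(query, search, rerank_score,
                        n_candidates=50, n_rerank=20, top_k=5):
    """Stage 1: cheap retrieval of many candidates.
    Stage 2: expensive cross-encoder scoring of the best candidates."""
    candidates = search(query, n_candidates)   # fast, approximate
    to_rerank = candidates[:n_rerank]          # cap the reranking cost
    scored = sorted(to_rerank,
                    key=lambda doc: rerank_score(query, doc),
                    reverse=True)
    return scored[:top_k]
```

The design point is the budget split: the bi-encoder is cheap enough to scan the whole index, while the cross-encoder (which attends over query and document jointly) only ever sees `n_rerank` pairs.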
Chunking Strategies:
- Fixed-size chunks (512 tokens): Simple, fast indexing
- Semantic chunks (break at paragraph/section boundaries): Better context preservation, slightly slower
- Sliding window chunks (256-token stride): Captures cross-boundary context, 2x memory cost
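A minimal sketch of the fixed-size and sliding-window strategies, operating on a pre-tokenized list (any tokenizer works; token IDs are assumed here for simplicity):

```python
def fixed_chunks(tokens, chunk_size=512):
    """Non-overlapping fixed-size chunks: simplest and cheapest to index."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def sliding_chunks(tokens, chunk_size=512, stride=256):
    """Overlapping chunks: each window starts `stride` tokens after the
    previous one, so text near a boundary appears in two chunks."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # last window already reaches the end of the text
    return chunks
```

With `stride = chunk_size / 2`, every token (except at the edges) is indexed twice, which is where the 2x memory cost comes from.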
Multi-hop Retrieval: Use conversation history to inform subsequent retrievals
- Query 1: “Who founded OpenAI?” → Retrieve about Sam Altman
- Query 2: “What companies did they found before?” → Reuse Query 1 context to guide new search
- Improves answer quality by 15-20% for complex multi-step questions
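One simple way to implement this carry-forward is to prepend the previous hop's answer to the next retrieval query. A sketch, with `retrieve` and `llm` as hypothetical stand-ins:

```python
def multi_hop_answer(questions, retrieve, llm):
    """Answer a chain of questions, feeding each answer into the next retrieval."""
    answers, context = [], ""
    for q in questions:
        # Enrich follow-up queries with what was learned on earlier hops,
        # so "their latest products" resolves to a concrete entity
        docs = retrieve(f"{context} {q}".strip())
        answer = llm(q, docs)
        answers.append(answer)
        context = answer   # carried forward to guide the next hop
    return answers
```

Production systems usually do something smarter than raw concatenation (entity extraction, query rewriting with the LLM itself), but the retrieval-conditioned-on-prior-answers structure is the same.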
Hybrid Search Pattern:
Dense Retrieval (vector): Top 20 semantically similar docs
Sparse Retrieval (BM25): Top 20 exact term matches
Union: Combine and deduplicate (40 docs max)
Rerank: Cross-encoder scores the deduplicated set, keep top 5
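The union-and-rerank steps reduce to a few lines once the dense and sparse hit lists exist. A sketch, with toy `(doc_id, score)` lists standing in for real vector-DB and BM25 results, and `rerank_score` standing in for a cross-encoder:

```python
def hybrid_retrieve(dense_hits, sparse_hits, rerank_score, top_k=5):
    """Union dense + sparse candidates, deduplicate, rerank, keep top_k."""
    # Union and deduplicate by document id (at most 40 docs from 20 + 20)
    candidates = {doc_id for doc_id, _ in dense_hits + sparse_hits}
    # Score every surviving candidate with the (stand-in) cross-encoder
    ranked = sorted(candidates, key=rerank_score, reverse=True)
    return ranked[:top_k]
```

Note that the original retrieval scores are deliberately discarded: dense cosine similarities and BM25 scores live on incompatible scales, which is exactly why the cross-encoder does the final ranking.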
Production Metrics
| Metric | Target | How to Measure |
|---|---|---|
| Retrieval Latency | <100ms | Log time from query to retrieved docs |
| Reranking Latency | <50ms | Cross-encoder inference time |
| End-to-end Latency | <500ms | Query start to LLM output |
| Retrieval Precision@5 | >80% | Human evaluation: are top-5 docs relevant? |
| Retrieval Recall@20 | >90% | Did search miss relevant docs? |
| Answer Accuracy | >85% | Does answer match retrieved docs? |
| Hallucination Rate | <5% | % of claims contradicted by sources |
| Citation Accuracy | >95% | Are cited documents actually used? |
Implementation Best Practices
- Separate indexing from retrieval: Index documents asynchronously, query synchronously
- Monitor retrieval quality: Track P@5, P@20, user feedback on retrieved results
- Implement query expansion: Rephrase user query for better retrieval
- Use query routing: Route queries to appropriate knowledge base (product docs vs internal FAQ vs web search)
- Cache frequently asked questions: Skip retrieval for known Q&A pairs
- Monitor hallucination: Flag answers that contradict sources; collect human feedback
- Implement feedback loops: Use user feedback (thumbs up/down) to improve retrieval weights
- Batch indexing: Index documents in bulk (1000 at a time) for 10-20x faster throughput vs individual inserts
- Implement rate limiting on retrievals: Limit vector DB queries/sec to prevent runaway costs
- Version knowledge base: Track document versions to enable rollback if bad data indexed
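The FAQ-caching practice above can be sketched as a normalized lookup in front of the pipeline. `run_rag` is a hypothetical stand-in for the full chain; real systems often cache on an embedding-similarity match rather than exact normalized strings.

```python
import re

faq_cache = {}

def normalize(query):
    """Lowercase and strip punctuation so near-identical phrasings share a key."""
    return re.sub(r"[^a-z0-9 ]+", "", query.lower()).strip()

def cached_answer(query, run_rag):
    key = normalize(query)
    if key not in faq_cache:           # cache miss: run the full RAG pipeline
        faq_cache[key] = run_rag(query)
    return faq_cache[key]
```

A production cache would also need a TTL so cached answers expire when the knowledge base updates (see the stale-context failure mode below).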
Common Failure Modes & Mitigations
Retrieval Failure (No Relevant Docs Found):
- Symptom: Query returns top-5 docs with low similarity scores (<0.5)
- Root cause: Knowledge base doesn’t contain relevant information
- Mitigation: Expand knowledge base, use query expansion to rephrase query, implement fallback to prompting-only mode
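The prompting-only fallback can be sketched as a similarity threshold check; `retrieve` and `llm` are hypothetical stand-ins, and the 0.5 cutoff is the symptom threshold from above, which would need tuning per embedding model.

```python
def answer_with_fallback(query, retrieve, llm, min_score=0.5):
    """Degrade to prompting-only mode when retrieval finds nothing relevant."""
    hits = retrieve(query)   # list of (document, similarity) pairs
    good = [doc for doc, score in hits if score >= min_score]
    if good:
        prompt = ("Answer from these documents:\n" + "\n".join(good) +
                  f"\n\nQuestion: {query}")
    else:
        # No relevant docs: answer from the model alone, flagged as ungrounded
        prompt = f"Question (no supporting documents found): {query}"
    return llm(prompt)
```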
Hallucination Despite RAG (LLM Invents Facts):
- Symptom: Answer contradicts retrieved documents
- Root cause: LLM ignores context, generates from training data instead
- Mitigation: Implement a strict “grounding” requirement (model trained or instructed to cite sources), use smaller context windows to force the model to focus on the retrieved documents
Context Window Overflow (Retrieved Context Too Large):
- Symptom: Latency spikes when context exceeds model’s window (e.g., >120K tokens for GPT-4)
- Root cause: Retrieved too many documents
- Mitigation: Limit retrieval to top-3-5 docs, implement token-aware chunking (split at sentence/paragraph boundaries to preserve meaning in truncated context)
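A simple greedy version of the mitigation: add whole documents in relevance order until a token budget is hit, rather than truncating the concatenated context mid-document. `count_tokens` is a stand-in for a real tokenizer.

```python
def fit_context(docs, token_budget, count_tokens):
    """Greedily add whole documents (in relevance order) until the budget
    is exhausted, so no document is cut off mid-sentence."""
    selected, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            break
        selected.append(doc)
        used += cost
    return selected
```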
Stale/Incorrect Context (Knowledge Base Contains Outdated Info):
- Symptom: Answers reference outdated information (e.g., “product is $99” but price changed to $79)
- Root cause: Knowledge base not updated when source data changes
- Mitigation: Implement document versioning, set aggressive TTL on embeddings (re-index every day), monitor document freshness
Latency Regression (Slow Retrievals):
- Symptom: Retrieval latency suddenly increases from 50ms to 500ms
- Root cause: Vector DB performance degraded (overloaded, disk slow), or network latency spike
- Mitigation: Implement timeout on retrieval (fail-fast if >200ms), add retrieval latency SLO to monitoring, scale vector DB horizontally if consistently over-loaded
Integration Patterns with LLM Applications
Pattern 1: Conversational RAG (Multi-turn)
- Maintain conversation history
- On each turn, extract entities from user message
- Retrieve documents relevant to current + previous context
- Augment prompt with conversation history + retrieved docs
- Example: Customer support chatbot remembering previous issues
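A minimal sketch of the pattern, with `retrieve` and `llm` as hypothetical stand-ins. The key detail is that retrieval sees recent turns, not just the latest message, so pronouns resolve correctly:

```python
def conversational_rag(history, user_msg, retrieve, llm, window=4):
    """Retrieve with the current message plus recent turns, then answer."""
    # Recent turns disambiguate pronouns like "it" or "that error"
    recent = " ".join(history[-window:] + [user_msg])
    docs = retrieve(recent)
    prompt = ("Conversation so far:\n" + "\n".join(history) +
              "\n\nRelevant documents:\n" + "\n".join(docs) +
              f"\n\nUser: {user_msg}\nAssistant:")
    return llm(prompt)
```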
Pattern 2: Query Classification + RAG
- Classify incoming query: “Is this factual/knowledge question or opinion/creative?”
- Route: factual → RAG (retrieves docs), opinion → prompt-only (no retrieval overhead)
- Benefits: Skip retrieval for questions that don’t need it, faster response
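A toy version of the classifier, using keyword heuristics. Production routers are usually a small trained classifier or an LLM call, but the routing structure is the same; the marker list here is an illustrative assumption.

```python
FACTUAL_MARKERS = ("what", "when", "who", "where", "which", "how many", "how do")

def route(query):
    """Toy heuristic router: factual questions go through RAG; everything
    else is answered prompt-only, skipping retrieval latency entirely."""
    q = query.lower().strip()
    if any(q.startswith(m) for m in FACTUAL_MARKERS):
        return "rag"
    return "prompt_only"
```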
Pattern 3: Ensemble Retrieval
- Multiple retrievers (dense, sparse, lexical, re-rank)
- Aggregate scores, return top-k union
- Benefits: Catch results missed by any single retriever
Deployment Checklist
Before deploying RAG system to production:
- Retrieval Quality: Benchmark P@5, P@20 on 100 test queries (target >80% precision)
- Hallucination Rate: Evaluate 50 generated answers, flag contradictions (<5% target)
- Latency SLO: P99 <500ms, P50 <200ms (set alerts if violated)
- Cost Budget: $XXX/month for vector DB + API calls (monitor daily spend)
- Monitoring: Dashboard for retrieval quality, hallucination detection, latency trends
- Fallback: If retrieval fails (no good docs), gracefully degrade to prompting-only mode
- Knowledge Base Freshness: Document last-update timestamps, alert if docs stale >30 days
- User Feedback Loop: Collect thumbs up/down on answers, analyze failure patterns monthly
References
- 📄 RAG Survey (Lewis et al., 2021)
- 📄 Lewis et al. (2020), “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”
- 📄 Khattab & Zaharia (2020), “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT”
- 📄 Hybrid Search Survey (Guo et al., 2016)
- 🔗 LangChain Documentation
- 🔗 LlamaIndex Documentation
- 🎥 RAG Deep Dive (Jeremy Howard, fast.ai)