RAG (Retrieval-Augmented Generation) Architecture

Grounding LLMs in knowledge: Combine document retrieval + LLM generation to answer questions with up-to-date, verifiable information.

Pipeline Overview

  1. User Query → “What are Anthropic’s latest products?”
  2. Retrieval → Search knowledge base for relevant documents
  3. Augmentation → Insert retrieved documents into prompt
  4. Generation → LLM generates answer based on context
  5. Output → “Anthropic released Claude 3.5 Sonnet in June 2024…”
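The five steps above can be sketched as a minimal, self-contained pipeline. Here `embed`, `search`, and `generate` are toy stand-ins (word overlap instead of a neural encoder, an echo instead of an LLM call) just to show how the pieces fit together:

```python
def embed(text):
    # Toy "embedding": a lowercase word set (real systems use a neural encoder).
    return set(text.lower().split())

def search(query, docs, k=1):
    # Retrieval: rank documents by word overlap with the query, keep top-k.
    q = embed(query)
    return sorted(docs, key=lambda d: len(q & embed(d)), reverse=True)[:k]

def generate(prompt):
    # Placeholder for an LLM call; here we just echo the grounded prompt.
    return "Answer based on:\n" + prompt

def rag_answer(query, docs):
    context = search(query, docs)                  # 2. retrieval
    prompt = "\n".join(context) + "\nQ: " + query  # 3. augmentation
    return generate(prompt)                        # 4. generation

docs = [
    "Anthropic released Claude 3.5 Sonnet in June 2024.",
    "The weather in Paris is mild in spring.",
]
answer = rag_answer("latest Anthropic products", docs)
```

The structure is the important part: retrieval narrows the corpus to relevant context, and the prompt carries that context into generation.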

Key Components

| Component | Purpose | Technology |
|---|---|---|
| Document Store | Stores all knowledge | Vector DB (Pinecone, Weaviate), Elasticsearch |
| Encoder | Converts text → embeddings | BERT, OpenAI embeddings (text-embedding-3-large) |
| Retriever | Finds similar documents | Semantic search, BM25, hybrid search |
| Prompt Template | Formats context + query | Simple string formatting, LangChain |
| LLM | Generates response | GPT-4, Claude, open-source (LLaMA) |
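Under the hood, the encoder/retriever pair reduces to nearest-neighbor search over vectors, most commonly by cosine similarity. A sketch with hypothetical 3-dimensional embeddings (real encoders emit hundreds to thousands of dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed embeddings; names are illustrative only.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "pricing_doc": [0.8, 0.2, 0.1],
    "hr_doc": [0.0, 0.1, 0.9],
}
best = max(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]))
```

A vector DB does exactly this ranking, but with approximate-nearest-neighbor indexes so it stays fast over millions of documents.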

Why RAG Matters

Problem: LLMs have knowledge cutoffs. A model whose training data ends at a fixed date can't answer "What happened yesterday?"

Solution: Retrieve up-to-date documents, feed to LLM.

Benefits:

  • ✅ Answers grounded in provided documents (verifiable, not hallucinated)
  • ✅ Easy to update—just add new documents
  • ✅ Cheaper than fine-tuning—no retraining
  • ✅ Attribution—can cite sources
  • ✅ Reduces hallucination by ~30–50%

Production Impact:

  • OpenAI’s ChatGPT: Uses RAG for plugins + web browsing
  • Google Gemini: Integrates with search results
  • Stripe: Customer support bot uses RAG over documentation

Implementation Example

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load and split documents
documents = DirectoryLoader("knowledge_base/").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 2. Embed and store
embeddings = OpenAIEmbeddings()
vector_store = Pinecone.from_documents(chunks, embeddings, index_name="rag-index")

# 3. Create retrieval chain
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
llm = OpenAI(temperature=0)  # low temperature keeps answers grounded in context
chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

# 4. Query
response = chain.run("What's Anthropic's latest product?")

Advanced Techniques

Reranking: Retrieve top-50, rerank with cross-encoder, use top-5

  • Improves relevance by ~10%
  • Trade-off: +50ms latency
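The rerank step can be sketched as follows. `toy_score` is a hypothetical stand-in for a real cross-encoder, which would jointly encode each (query, document) pair and is slower but more accurate than the first-stage retriever:

```python
def rerank(query, candidates, score_fn, top_n=5):
    # Re-score every candidate with the (expensive) scorer, keep the best top_n.
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

# Toy scorer: shared-word count. A real cross-encoder would be a model call.
toy_score = lambda q, d: len(set(q.split()) & set(d.split()))

docs = ["retrieval augmented generation", "cats and dogs", "generation of power"]
top = rerank("retrieval augmented generation pipeline", docs, toy_score, top_n=1)
```

This is why the pattern is "retrieve 50, rerank, keep 5": the cheap retriever casts a wide net, and the expensive scorer only runs on the shortlist.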

Hybrid Search: Combine semantic (vector) + keyword (BM25) search

  • Captures both semantic similarity and exact term matches
  • Weaviate, Elasticsearch support natively

Iterative Retrieval: Multi-hop reasoning

  • Query 1: “Who founded OpenAI?”
  • Query 2: “What are their latest products?” (based on answer to Q1)

Production Metrics

| Metric | Target | How to Measure |
|---|---|---|
| Latency | <500ms end-to-end | Retrieval + LLM inference |
| Retrieval Precision@5 | >80% | Relevance of top-5 docs |
| Answer Accuracy | >85% | Human evaluation vs sources |
| Hallucination Rate | <5% | Claims not supported by retrieved docs |

Key Properties Table

| Property | Details |
|---|---|
| Retrieval Latency | 30-100ms (vector search) + 20-50ms (cross-encoder reranking) |
| Indexing Cost | One-time: embeddings for all docs; incremental: O(1) per new doc |
| Context Window Limit | GPT-4: 128K tokens (~96K words); Claude 3.5: 200K tokens (~150K words) |
| Hallucination Reduction | RAG reduces hallucination by 30-50% vs base LLM |
| Retrieval Precision@5 | Target >80% (top-5 docs are relevant) |
| Cost per Query | ~$0.001-0.01 (depends on retriever + LLM) |
| Update Latency | <1 minute for new docs to be searchable |

When to Use / Avoid RAG

Use RAG when:

  • ✅ Knowledge cutoff matters (need current information)
  • ✅ Verifiability is critical (citations required)
  • ✅ Data changes frequently (documents added/updated regularly)
  • ✅ Hallucination risk is high (factual tasks)
  • ✅ Cost per query is critical (cheaper than fine-tuning)

Avoid RAG when:

  • ❌ Response latency <100ms required (retrieval adds overhead)
  • ❌ Domain knowledge is proprietary/encoded in weights (fine-tuning better)
  • ❌ Knowledge base is <100 documents (prompting with examples sufficient)
  • ❌ Perfect real-time updates needed (vector indices have indexing delay)

How Real Systems Use This

Perplexity AI (Web Search RAG): Perplexity built a conversational search engine that retrieves web results in real-time. For each user query (“What are today’s top tech news?”), they: (1) rewrite the query to optimize retrieval, (2) search the web using Bing API (~50ms), (3) retrieve the top 5-10 results, (4) encode results into embeddings, (5) feed to Claude with citations. The system handles ~50M queries/month. Latency is P50=800ms, P99=2s because web search adds 200-300ms overhead. Why Perplexity chose RAG: Web data is constantly updated; grounding all answers in sources provides credibility; chat-based interaction requires conversation memory which RAG handles naturally.

Notion AI (Document-Aware Generation): Notion integrated RAG to make their AI assistant aware of user workspace documents. When a user types “@AI summarize my meeting notes”, the system: (1) searches user’s workspace for documents matching query context, (2) ranks by recency and relevance, (3) retrieves top-5 documents, (4) augments prompt: “Based on these documents from your workspace: [doc1] [doc2]…, summarize key action items”, (5) generates response grounded in user’s actual data. They use pgvector (PostgreSQL vector extension) for embeddings, storing ~1B vectors across their user base. Latency: P50=200ms (local vector search), P99=500ms. Cost is $0.0001-0.0003/query. Why Notion chose RAG: Users want AI scoped to their own content, not general knowledge; built on existing PostgreSQL infrastructure; easy to add new document types.

GitHub Copilot (Code Context RAG): GitHub Copilot uses RAG when suggesting completions. When a developer types in an editor, the system: (1) retrieves similar code patterns from the user’s repo (using semantic search), (2) retrieves from public open-source projects (billions of examples), (3) retrieves recent files the user edited, (4) augments prompt with relevant code context, (5) generates completion grounded in codebase patterns. They use a custom vector index on 50M+ GitHub repositories. Latency: P50=50ms (in-memory retrieval), P99=200ms. Why they chose RAG: Generated code must match project’s style/patterns; hallucinated code is worse than no suggestion; context from repo prevents generic suggestions.

Stripe Support AI (Documentation RAG): Stripe built an AI support assistant grounded in their documentation. When a developer asks “How do I handle failed payments?”, the system: (1) embeds the query with OpenAI text-embedding-3-large, (2) searches their documentation index (1000+ pages) using hybrid search (semantic + keyword BM25), (3) ranks results by relevance and recency, (4) retrieves top-3 docs, (5) adds to prompt: “Answer based on these docs: [doc1] [doc2] [doc3]”. Latency: P50=150ms, P99=400ms. Metrics: 92% of generated answers cite sources correctly; 8% hallucinate despite retrieval. Why RAG: Stripe docs update frequently (new API features weekly); developers need accurate, version-specific answers; citations allow users to verify or dig deeper.

Anthropic Constitutional AI with RAG: Anthropic uses RAG internally for customer support, augmenting Claude responses with product documentation and FAQ. For each customer query, they: (1) search a knowledge base of 500+ documents (docs, FAQs, past issues), (2) use dense retrieval (embeddings) + sparse retrieval (BM25), (3) rerank with a cross-encoder model (improves precision by ~8%), (4) truncate to fit context window, (5) generate response with citations. Latency: P50=300ms, P99=800ms (includes cross-encoding). Accuracy: 94% of answers are factually correct per human eval. Why RAG: Internal systems benefit from guaranteed grounding; customer trust requires citations; easy to iterate on knowledge base without model retraining.


Advanced Techniques Deep Dive

Reranking Strategy: Retrieve 50 candidates, rerank top-20 with cross-encoder

  • ColBERT-style re-ranking improves relevance by 10-15%
  • Latency cost: 50ms for batch re-ranking
  • Trade-off: Better quality vs higher latency

Chunking Strategies:

  • Fixed-size chunks (512 tokens): Simple, fast indexing
  • Semantic chunks (break at paragraph/section boundaries): Better context preservation, slightly slower
  • Sliding window chunks (256-token stride): Captures cross-boundary context, 2x memory cost
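A sliding-window chunker can be written in a few lines, assuming the document is already tokenized (the integer token list below is a stand-in for real tokenizer output):

```python
def sliding_chunks(tokens, chunk_size=8, stride=4):
    # Overlapping windows: each chunk starts `stride` tokens after the last,
    # so text near a chunk boundary appears in two chunks.
    chunks = []
    for start in range(0, max(len(tokens) - chunk_size, 0) + 1, stride):
        chunks.append(tokens[start : start + chunk_size])
    return chunks

tokens = list(range(16))  # stand-in for a tokenized document
chunks = sliding_chunks(tokens, chunk_size=8, stride=4)
```

With stride = chunk_size / 2 every token lands in two chunks, which is where the roughly 2x memory cost mentioned above comes from.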

Multi-hop Retrieval: Use conversation history to inform subsequent retrievals

  • Query 1: “Who founded OpenAI?” → Retrieve about Sam Altman
  • Query 2: “What companies did they found before?” → Reuse Query 1 context to guide new search
  • Improves answer quality by 15-20% for complex multi-step questions
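One way to sketch multi-hop retrieval: fold the answer from hop 1 into the retrieval query for hop 2. The keyword-lookup `retrieve` and the one-line `answer` below are hypothetical stand-ins for a vector store and an LLM call:

```python
def multi_hop(q1, q2_template, retrieve, answer):
    # Hop 1: retrieve for the first question and extract an answer.
    a1 = answer(q1, retrieve(q1))
    # Hop 2: substitute hop 1's answer into the follow-up query, then retrieve.
    return retrieve(q2_template.format(a1))

# Toy knowledge base keyed by query keywords (stand-in for a vector store).
kb = {
    "founded OpenAI": ["Sam Altman co-founded OpenAI"],
    "Sam Altman": ["Sam Altman previously led Y Combinator"],
}
retrieve = lambda q: next((docs for key, docs in kb.items() if key in q), [])
answer = lambda q, docs: "Sam Altman" if docs else "unknown"

docs = multi_hop("Who founded OpenAI?", "{} companies founded", retrieve, answer)
```

The key point is that the second retrieval query is grounded in retrieved evidence, not just the raw user text.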

Hybrid Search Pattern:

Dense Retrieval (vector): Top 20 semantically similar docs
Sparse Retrieval (BM25): Top 20 exact term matches
Union: Combine and deduplicate (40 docs max)
Rerank: Cross-encoder scores the union, keep top 5
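One common way to implement the combine-and-deduplicate step is Reciprocal Rank Fusion (RRF), which needs only the rank positions from each retriever, no score calibration. This is a fusion sketch, not the cross-encoder rerank itself:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank)) across
    # every ranked list it appears in; duplicates merge automatically.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector search results
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 results
fused = rrf_fuse([dense, sparse])
```

Documents ranked highly by both retrievers (like doc_b here) float to the top, which is exactly the behavior hybrid search is after.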

Production Metrics

| Metric | Target | How to Measure |
|---|---|---|
| Retrieval Latency | <100ms | Log time from query to retrieved docs |
| Reranking Latency | <50ms | Cross-encoder inference time |
| End-to-end Latency | <500ms | Query start to LLM output |
| Retrieval Precision@5 | >80% | Human evaluation: are top-5 docs relevant? |
| Retrieval Recall@20 | >90% | Did search miss relevant docs? |
| Answer Accuracy | >85% | Does answer match retrieved docs? |
| Hallucination Rate | <5% | % of claims contradicted by sources |
| Citation Accuracy | >95% | Are cited documents actually used? |
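Precision@k and Recall@k from the table can be computed directly, given a labeled set of relevant documents for each test query:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved docs that are actually relevant.
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / len(top)

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant docs that appear in the top-k results.
    hits = sum(doc in relevant for doc in retrieved[:k])
    return hits / len(relevant)

retrieved = ["d1", "d2", "d3", "d4", "d5"]   # ranked retriever output
relevant = {"d1", "d3", "d4", "d9"}          # human-labeled relevant set

p5 = precision_at_k(retrieved, relevant, 5)  # 3 of top-5 are relevant
r5 = recall_at_k(retrieved, relevant, 5)     # 3 of 4 relevant docs found
```

Averaging these over a held-out query set (say, 100 labeled queries) gives the numbers to track against the targets above.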

Implementation Best Practices

  1. Separate indexing from retrieval: Index documents asynchronously, query synchronously
  2. Monitor retrieval quality: Track P@5, P@20, user feedback on retrieved results
  3. Implement query expansion: Rephrase user query for better retrieval
  4. Use query routing: Route queries to appropriate knowledge base (product docs vs internal FAQ vs web search)
  5. Cache frequently asked questions: Skip retrieval for known Q&A pairs
  6. Monitor hallucination: Flag answers that contradict sources; collect human feedback
  7. Implement feedback loops: Use user feedback (thumbs up/down) to improve retrieval weights
  8. Batch indexing: Index documents in bulk (1000 at a time) for 10-20x faster throughput vs individual inserts
  9. Implement rate limiting on retrievals: Limit vector DB queries/sec to prevent runaway costs
  10. Version knowledge base: Track document versions to enable rollback if bad data indexed
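Practice 5 (caching frequently asked questions) can be as simple as memoizing the answer function. `full_rag_pipeline` below is a hypothetical stand-in for the retrieval + generation call:

```python
from functools import lru_cache

calls = {"pipeline_runs": 0}

def full_rag_pipeline(query):
    # Hypothetical stand-in for retrieval + LLM generation.
    calls["pipeline_runs"] += 1
    return f"answer to: {query}"

@lru_cache(maxsize=1024)
def cached_answer(query):
    # Identical queries hit the cache and skip retrieval entirely.
    return full_rag_pipeline(query)

a = cached_answer("How do refunds work?")
b = cached_answer("How do refunds work?")  # cache hit: pipeline not re-run
```

Production systems usually add a TTL and normalize queries (lowercase, strip punctuation) before the cache lookup so near-duplicate phrasings share entries.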

Common Failure Modes & Mitigations

Retrieval Failure (No Relevant Docs Found):

  • Symptom: Query returns top-5 docs with low similarity scores (<0.5)
  • Root cause: Knowledge base doesn’t contain relevant information
  • Mitigation: Expand knowledge base, use query expansion to rephrase query, implement fallback to prompting-only mode
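The fallback-to-prompting-only mitigation might look like this, reusing the 0.5 similarity floor from the symptom above and assuming the retriever returns (doc, similarity) pairs:

```python
SIMILARITY_FLOOR = 0.5  # below this, treat retrieval as a miss

def answer_with_fallback(query, retrieve, generate):
    # If no retrieved doc clears the floor, prompt without context
    # rather than grounding the answer in irrelevant documents.
    hits = retrieve(query)  # list of (doc, similarity) pairs
    good = [doc for doc, score in hits if score >= SIMILARITY_FLOOR]
    if good:
        return generate("Answer using these docs: " + "; ".join(good) + "\nQ: " + query)
    return generate("Q: " + query)  # prompting-only fallback

# Toy stand-ins: a retriever that only finds a weak match, and an echo "LLM".
retrieve = lambda q: [("unrelated doc", 0.2)]
generate = lambda prompt: prompt

out = answer_with_fallback("obscure question", retrieve, generate)
```

Logging how often the fallback fires is also a cheap signal that the knowledge base has a coverage gap.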

Hallucination Despite RAG (LLM Invents Facts):

  • Symptom: Answer contradicts retrieved documents
  • Root cause: LLM ignores context, generates from training data instead
  • Mitigation: Implement strict “grounding” requirement (model trained to cite sources), use smaller context windows to force model to focus on documents

Context Window Overflow (Retrieved Context Too Large):

  • Symptom: Latency spikes, or requests fail outright, when context approaches the model’s window (e.g., >120K tokens for GPT-4)
  • Root cause: Retrieved too many documents
  • Mitigation: Limit retrieval to top-3-5 docs, implement token-aware chunking (split at sentence/paragraph boundaries to preserve meaning in truncated context)
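A token-aware packing sketch for the mitigation above: keep whole documents until the budget runs out rather than truncating mid-document (`count_tokens` here is a whitespace word count standing in for a real tokenizer):

```python
def fit_context(docs, budget, count_tokens=lambda s: len(s.split())):
    # Greedily pack whole documents (in retrieval-rank order) until the
    # token budget is exhausted; never cut a document in half.
    kept, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept

docs = ["one two three", "four five", "six seven eight nine"]
kept = fit_context(docs, budget=6)  # third doc would overflow the budget
```

Because documents arrive in relevance order, dropping from the tail sacrifices the least useful context first.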

Stale/Incorrect Context (Knowledge Base Contains Outdated Info):

  • Symptom: Answers reference outdated information (e.g., “product is $99” but price changed to $79)
  • Root cause: Knowledge base not updated when source data changes
  • Mitigation: Implement document versioning, set aggressive TTL on embeddings (re-index every day), monitor document freshness

Latency Regression (Slow Retrievals):

  • Symptom: Retrieval latency suddenly increases from 50ms to 500ms
  • Root cause: Vector DB performance degraded (overloaded, disk slow), or network latency spike
  • Mitigation: Implement timeout on retrieval (fail-fast if >200ms), add retrieval latency SLO to monitoring, scale vector DB horizontally if consistently over-loaded

Integration Patterns with LLM Applications

Pattern 1: Conversational RAG (Multi-turn)

  • Maintain conversation history
  • On each turn, extract entities from user message
  • Retrieve documents relevant to current + previous context
  • Augment prompt with conversation history + retrieved docs
  • Example: Customer support chatbot remembering previous issues

Pattern 2: Query Classification + RAG

  • Classify incoming query: “Is this factual/knowledge question or opinion/creative?”
  • Route: factual → RAG (retrieves docs), opinion → prompt-only (no retrieval overhead)
  • Benefits: Skip retrieval for questions that don’t need it, faster response
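A deliberately crude keyword router illustrating Pattern 2; a production system would more likely use a small classifier model, but the routing logic is the same:

```python
FACTUAL_CUES = ("what", "when", "who", "how", "where")

def route(query):
    # Heuristic classification: factual-looking questions go through RAG,
    # everything else skips retrieval and goes straight to the LLM.
    first_word = query.strip().lower().split()[0]
    return "rag" if first_word in FACTUAL_CUES else "prompt_only"

r1 = route("What is the refund policy?")
r2 = route("Write me a poem about autumn")
```

Even a rough router pays off: every query it sends down the prompt-only path saves the full retrieval latency and vector DB cost.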

Pattern 3: Ensemble Retrieval

  • Multiple retrievers (dense, sparse, lexical, re-rank)
  • Aggregate scores, return top-k union
  • Benefits: Catch results missed by any single retriever

Deployment Checklist

Before deploying RAG system to production:

  1. Retrieval Quality: Benchmark P@5, P@20 on 100 test queries (target >80% precision)
  2. Hallucination Rate: Evaluate 50 generated answers, flag contradictions (<5% target)
  3. Latency SLO: P99 <500ms, P50 <200ms (set alerts if violated)
  4. Cost Budget: $XXX/month for vector DB + API calls (monitor daily spend)
  5. Monitoring: Dashboard for retrieval quality, hallucination detection, latency trends
  6. Fallback: If retrieval fails (no good docs), gracefully degrade to prompting-only mode
  7. Knowledge Base Freshness: Document last-update timestamps, alert if docs stale >30 days
  8. User Feedback Loop: Collect thumbs up/down on answers, analyze failure patterns monthly

References

  • 📄 RAG Survey (Lewis et al., 2021)
  • 📄 Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
  • 📄 ColBERT: Contextualized Late Interaction over BERT (Khattab & Zaharia, 2020)
  • 📄 Hybrid Search Survey (Guo et al., 2016)
  • 🔗 LangChain Documentation
  • 🔗 LlamaIndex Documentation
  • 🎥 RAG Deep Dive (Jeremy Howard, fast.ai)

This post is licensed under CC BY 4.0 by the author.