Fine-tuning vs RAG vs Prompting
There are three main ways to customize LLMs, each with different trade-offs in cost, latency, and quality.
Quick Comparison
| Method | Training | Cost | Latency | Quality | Updates | Best For |
|---|---|---|---|---|---|---|
| Prompting | None | $0.001–0.05/query | <1s | Variable (task-dependent) | Instant | Quick experiments, APIs |
| Few-Shot | None | $0.001–0.05/query | <1s | Good–excellent with examples | Instant | Specialized tasks, demos |
| RAG | Index docs | $0–1000 setup | <500ms | Grounded, current | Easy (add docs) | Knowledge-heavy, fact-based |
| Fine-tuning | Expensive | $100–100K | <500ms | Best for domain | Days–weeks | Domain-specific, proprietary |
1. Prompting (Zero-Shot)
Definition: Ask LLM directly without examples or context.
User: "Classify this email as spam or not: 'Click here to claim your FREE MONEY!!!'"
LLM: "Spam"
Pros: Instant, free, simple
Cons: Quality varies, no guarantees
Cost: ~$0.002 per query (GPT-3.5: ~$0.001 per 1K tokens)
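The per-query figures above reduce to simple arithmetic. A minimal sketch (the function name and the flat per-1K-token rate are illustrative; check your provider's current pricing):

```python
def query_cost(input_tokens: int, output_tokens: int,
               price_per_1k: float = 0.001) -> float:
    """Estimate the dollar cost of one LLM call at a flat per-1K-token rate."""
    return (input_tokens + output_tokens) / 1000 * price_per_1k

# A short zero-shot classification: ~30 prompt tokens, ~5 output tokens.
cost = query_cost(30, 5)                  # tiny per-query cost
monthly = query_cost(30, 5) * 1_000_000   # what it becomes at 1M queries/month
```

At ~2K tokens per query this lands near the ~$0.002 figure quoted above; short prompts are far cheaper, which is why token counts dominate prompting costs.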
2. Few-Shot Prompting
Definition: Provide examples in the prompt.
User: "Classify emails as spam or not.
Example 1: 'Re: Meeting tomorrow at 3pm' → Not spam
Example 2: 'Click here for FREE money!!!' → Spam
Example 3: 'Confirm your password here' → Spam
Classify: 'Order confirmation: Your package arrives tomorrow'"
LLM: "Not spam"
Pros: Better quality, still instant, no training
Cons: Token cost increases with examples
Cost: ~$0.01 per query (examples + input + output)
Production Impact: Google found few-shot prompting improves accuracy by 10–30% on specialized tasks.
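A few-shot prompt like the one above is just string assembly. A minimal sketch (function name and formatting are illustrative, not a specific library's API):

```python
def build_few_shot_prompt(task: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble a few-shot prompt: task description, labeled examples, new input."""
    lines = [task]
    for i, (text, label) in enumerate(examples, 1):
        lines.append(f"Example {i}: '{text}' -> {label}")
    lines.append(f"Classify: '{query}'")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify emails as spam or not.",
    [("Re: Meeting tomorrow at 3pm", "Not spam"),
     ("Click here for FREE money!!!", "Spam")],
    "Order confirmation: Your package arrives tomorrow",
)
```

Every example added lengthens every query, which is the direct source of the higher per-query cost noted above.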
3. Retrieval-Augmented Generation (RAG)
Definition: Retrieve relevant documents, add to context, then generate.
User Query: "What are the latest updates?"
Retrieval: Find 3 relevant docs from knowledge base
Augmented Prompt: "Based on these documents: [doc1] [doc2] [doc3], answer: ..."
LLM: Generates answer grounded in documents
Pros:
- ✅ Uses current information (no knowledge cutoff)
- ✅ Verifiable answers (citable sources)
- ✅ Easy to update (add/remove docs, no retraining)
- ✅ Reduces hallucination by ~30–50%
Cons:
- ❌ Retrieval latency (~50–200ms)
- ❌ Requires infrastructure (vector DB)
Cost: $0–1000 setup (vector DB) + $0.001–0.01/query
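The retrieve-then-augment loop can be sketched end to end. This toy version ranks documents by bag-of-words cosine similarity; production systems use embedding models and a vector DB, but the control flow is the same:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def rag_prompt(query: str, docs: list[str]) -> str:
    """Augment the prompt with retrieved context before calling the LLM."""
    context = "\n".join(f"[doc] {d}" for d in retrieve(query, docs))
    return f"Based on these documents:\n{context}\nAnswer: {query}"
```

Swapping the toy `cosine`/`retrieve` pair for an embedding model plus vector index is exactly the "setup cost" the figures above refer to; the prompt-assembly step is unchanged.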
Production Examples:
- OpenAI ChatGPT (plugins, web browsing)
- Stripe Support AI (internal docs)
- Notion AI (personal documents)
4. Fine-Tuning
Definition: Update model weights on task-specific data.
Process:
- Collect labeled dataset (50–10K examples depending on task)
- Train for 3–10 epochs
- Deploy custom model
Cost Breakdown:
- OpenAI Fine-tuning: $0.008–0.04 per 1K tokens (training)
- Example: 10K examples × 500 tokens avg = 5M tokens = $40–200
- Per-query inference: 2–5x cheaper than base model
Quality Gain: 5–20% improvement over base model for specialized domains
Production Examples:
- Jasper AI: Fine-tuned on brand guidelines + marketing templates
- Amazon/Alibaba: Fine-tuned on product catalogs for recommendations
- Companies with proprietary data/style
When It Pays Off:
- High-volume queries (>1M/month)
- Domain-specific language style
- Proprietary knowledge (trade secrets)
- Cost-sensitive production
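The "when it pays off" intuition can be checked with back-of-envelope arithmetic, using the illustrative prices from the cost breakdown above:

```python
def breakeven_queries(training_cost: float,
                      base_cost_per_query: float,
                      ft_cost_per_query: float) -> float:
    """Queries needed before per-query savings repay the one-time training cost."""
    savings = base_cost_per_query - ft_cost_per_query
    if savings <= 0:
        return float("inf")  # fine-tuning never pays off on cost alone
    return training_cost / savings

# $200 training run; base model $0.01/query; fine-tuned model $0.002/query.
n = breakeven_queries(200, 0.01, 0.002)  # ~25,000 queries
```

Real break-even points are higher once data collection and experimentation costs are included, which is why the text recommends fine-tuning mainly at high volume.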
Decision Framework
Can you solve with good examples in prompt?
├─ Yes → Use Few-Shot Prompting
└─ No: Does the answer need current information?
├─ Yes → Use RAG
└─ No: Do you have 50+ labeled examples?
├─ Yes → Fine-tune
└─ No: Stay with prompting, collect more data
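The decision tree above translates directly into code. A deliberately simplified sketch (real decisions also weigh latency and budget):

```python
def choose_method(solvable_with_examples: bool,
                  needs_current_info: bool,
                  labeled_examples: int) -> str:
    """Encode the decision framework's three questions, in order."""
    if solvable_with_examples:
        return "few-shot prompting"
    if needs_current_info:
        return "RAG"
    if labeled_examples >= 50:
        return "fine-tuning"
    return "prompting (collect more data)"
```

The ordering matters: cheaper, faster-to-iterate options are tried before anything that requires infrastructure or training.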
Hybrid Approaches
Fine-tuning + RAG: Best of both
Fine-tuned base model (domain knowledge) + RAG (current facts)
Benefits: Domain expertise + up-to-date answers
Cost: Both setup costs
Common in: Enterprise (legal, healthcare)
Few-Shot + RAG: Few examples + retrieved context
Prompt: "Examples: [ex1] [ex2] Based on context: [doc1] [doc2], answer..."
Benefits: Examples guide style, context provides facts
Cost: Cheap (no training), better quality
Common in: Most production RAG systems
Cost Comparison: Detailed Breakdown
Prompting (Zero-Shot):
- Cost per query: $0.001-0.005 (GPT-3.5) or $0.01-0.05 (GPT-4)
- Example: 1M queries/month × $0.002 = $2K/month
- Setup cost: $0 (use API immediately)
- Quality: Baseline (~70-80% accuracy on tasks)
Few-Shot Prompting:
- Cost per query: $0.005-0.02 (more tokens due to examples)
- Example: 1M queries/month × $0.01 = $10K/month
- Setup cost: $0 (no training)
- Quality: Improved (~80-90% accuracy, 10-20% lift)
- Example: Including 3 examples adds roughly 150+ tokens to every query, which is where the extra per-query cost comes from
RAG (Retrieval + Prompting):
- Retrieval cost: $0-100/month (if using self-hosted Weaviate/Qdrant)
- LLM cost: $0.001-0.01 per query (same as prompting)
- Vector DB cost: $0-1000/month (Pinecone serverless: $0.08 per 100K vectors stored)
- Example: 1B vectors × 1536 dims = $800/month + $10K API calls = $10.8K/month
- Setup cost: $1-5K (infrastructure, indexing pipeline)
- Quality: Excellent (~85-95% accuracy, grounded in sources)
Fine-tuning:
- Training cost: $0.008-0.04 per 1K tokens (OpenAI)
- Example: 10K training examples × 500 tokens = 5M tokens = $40-200
- One-time cost (amortized)
- Inference cost: 2-5x cheaper per query after fine-tuning
- Example: 1M queries/month × ($0.002 fine-tuned) = $2K/month (vs $10K base model)
- Break-even: ~500K-1M queries (payback in 1-2 months at high volume)
- Setup cost: $1K-5K (data collection, labeling, experimentation)
- Quality: Best for domain-specific tasks (~85-95% accuracy, higher ceiling)
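The breakdown above can be pulled into one side-by-side monthly estimate at 1M queries/month (all prices are the illustrative figures from this section; the 12-month amortization of training cost is an assumption):

```python
def monthly_cost(queries: int, per_query: float, fixed: float = 0.0) -> float:
    """Total monthly cost = fixed infrastructure + per-query API spend."""
    return fixed + queries * per_query

q = 1_000_000
costs = {
    "prompting":  monthly_cost(q, 0.002),
    "few-shot":   monthly_cost(q, 0.01),
    "RAG":        monthly_cost(q, 0.01, fixed=800),       # vector DB + API calls
    "fine-tuned": monthly_cost(q, 0.002, fixed=200 / 12), # $200 training over 12 months
}
```

At this volume the fine-tuned model's cheaper inference dominates its one-time training cost, matching the break-even claim above.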
Recommendation Matrix:
| Query Volume | Latency | Cost Sensitivity | Recommendation |
|---|---|---|---|
| <10K/mo | Any | Yes | Prompting |
| 10K–100K/mo | <200ms | Yes | Few-shot + RAG |
| 100K–1M/mo | <100ms | No | RAG only |
| >1M/mo | <50ms | Yes | Fine-tuning + RAG |
How Real Systems Use This
Jasper AI (Fine-tuning + Few-Shot): Jasper, an AI copywriting platform, combines fine-tuning with few-shot prompting. They fine-tuned GPT-3 on 50K marketing templates (copywriting style, tone, format patterns). For each user request (“Write a product description for running shoes”), the system: (1) retrieves 2-3 similar templates from database (semantic search), (2) includes them as few-shot examples in prompt, (3) applies fine-tuned model to generate output. Result: Domain-specific tone matching, 30% higher user satisfaction vs base GPT-3. Cost: Fine-tuning one-time ($200), inference at scale ($5K/month for 1M queries). Why combo: Fine-tuning captures marketing style/voice; few-shot examples show format/tone within context window.
Amazon Customer Support (RAG + Rule-based Routing): Amazon uses a hybrid approach for their support bots. Incoming tickets are routed: (1) if known issue (FAQ match), return RAG-augmented answer from docs, (2) if escalation needed, route to human. The RAG system retrieves support docs using semantic search, combines top-3 with LLM to generate response. Latency: P50=300ms, P99=800ms. Accuracy: 85% of auto-responses rated “helpful” by customers. Cost: RAG infrastructure ($50K/month), API calls ($20K/month). Why RAG: New issues arise daily; fine-tuning would require retraining; docs update constantly; citations provide transparency for escalations.
Google BERT Fine-tuning for Search Ranking: Google fine-tuned BERT on 100M search queries to improve ranking. They collected labeled training data (query → relevant documents) and trained a custom ranking model. Fine-tuning took 3 weeks on TPU clusters. Result: 5-10% improvement in click-through-rate (CTR) over base BERT. Cost: Infrastructure ($5M), data labeling ($1M). Ongoing: $500K/month compute for inference. Why fine-tuning: Scale of queries (10B/day) makes fine-tuning cost-effective; proprietary ranking task (competitive advantage); one-time training investment amortized across billions of queries.
OpenAI ChatGPT (Hybrid: Few-Shot + RLHF): ChatGPT combines few-shot examples in the system prompt with RLHF alignment rather than task-specific fine-tuning. For each conversation: (1) the system prompt can include 2-3 examples of helpful responses, (2) the underlying model was trained with RLHF on 50K human preference pairs (reward-model-guided optimization rather than supervised training on labeled input-output pairs). This is post-training alignment rather than task-specific fine-tuning. Result: High helpfulness, safety compliance, conversational quality. Cost: One-time RLHF ($1-5M), inference costs ($100M+/year). Why hybrid: Few-shot + RLHF is more scalable than traditional fine-tuning; generalizes better to unseen tasks; aligns with human values.
Anthropic Constitutional AI (Rule-based + RAG): Anthropic uses Constitutional AI (criteria-based generation) combined with RAG for safety. Instead of fine-tuning on human feedback, they: (1) define constitution of rules (e.g., “be helpful, harmless, honest”), (2) generate candidate responses, (3) critique responses against constitution (LLM critique), (4) rank by constitution adherence, (5) augment with RAG for factual accuracy. Cost: One-time constitution definition ($50K), inference with RAG ($500K/month). Why constitutional approach: Scalable (no human labeling), transparent (rules are explicit), safer (principles-based alignment vs implicit from data).
When to Combine Methods
Fine-tuning + Few-Shot + RAG (Enterprise Best Practice):
Complex customer support system:
1. RAG Layer: Retrieve top-5 relevant docs (knowledge base)
2. Few-Shot Examples: Include 2-3 similar resolved tickets
3. Fine-tuned Model: LLM fine-tuned on company ticket data
4. Prompt: "Here are similar issues: [ex1] [ex2]
Knowledge base: [doc1] [doc2]
Resolution for new ticket..."
5. Output: Company-specific, grounded in docs, stylistically consistent
Result:
- Fine-tuning: 10-15% quality improvement (domain knowledge)
- Few-shot: Additional 5-10% improvement (in-context examples)
- RAG: Adds verifiability + freshness (docs always current)
- Combined: 95%+ accuracy, all benefits
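The prompt in step 4 above is the glue between the three layers. A minimal sketch of its assembly (function name and formatting are illustrative):

```python
def hybrid_prompt(ticket: str,
                  similar_tickets: list[str],
                  kb_docs: list[str]) -> str:
    """Combine few-shot examples (resolved tickets) with RAG context (KB docs)."""
    examples = "\n".join(f"- {t}" for t in similar_tickets)
    context = "\n".join(f"- {d}" for d in kb_docs)
    return (
        f"Here are similar resolved issues:\n{examples}\n"
        f"Knowledge base:\n{context}\n"
        f"Resolve this new ticket: {ticket}"
    )
```

The fine-tuned model then consumes this prompt, so the style comes from its weights while the facts come from the retrieved docs and examples.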
Few-Shot + RAG (Cost-Effective Majority):
Document Q&A system (Wikipedia, product docs):
1. RAG: Retrieve relevant passages (no training needed)
2. Few-Shot: Include 2 Q&A examples in prompt
3. Generate: LLM answers based on context + examples
4. Cost: $0 training + $5K/month infrastructure + $10K API
Good for:
- Rapidly changing knowledge (docs update weekly)
- Budget-conscious projects
- When domain is broad (100+ topics)
Trade-off Decision Matrix
| Scenario | Method | Rationale |
|---|---|---|
| Prototyping, <1K examples | Prompting | No training cost, immediate feedback |
| <100K/month queries | Few-shot | Cheap, good quality, instant iterations |
| Knowledge-heavy, changing docs | RAG | Freshness + verifiability, no retraining |
| Domain-specific, high volume (>500K/mo) | Fine-tuning | Amortizes training cost, cheaper inference |
| Legal/Finance/Medical (accuracy critical) | RAG + Fine-tuning | Best of both: grounded + domain-expert |
| Real-time requirements (<50ms) | Fine-tuning only | RAG retrieval adds latency |
References
- 📄 Language Models are Few-Shot Learners (GPT-3 Paper, Brown et al., 2020)
- 📄 Fine-Tuning Language Models (OpenAI Docs)
- 📄 Constitutional AI: Harmlessness from AI Feedback (Anthropic, 2022)
- 📄 A Survey on In-Context Learning (Dong et al., 2022)
- 🔗 LangChain: Prompting Best Practices
- 🎥 Fine-tuning vs RAG (Jeremy Howard)