
Fine-tuning vs RAG vs Prompting

Three ways to customize LLMs, each with different trade-offs in cost, latency, and quality.

Quick Comparison

Method      | Training   | Cost              | Latency | Quality                 | Updates         | Best For
Prompting   | None       | $0.001–0.05/query | <1s     | Good (with examples)    | Instant         | Quick experiments, APIs
Few-Shot    | None       | $0.001–0.05/query | <1s     | Excellent with examples | Instant         | Specialized tasks, demos
RAG         | Index docs | $0–1000 setup     | <500ms  | Grounded, current       | Easy (add docs) | Knowledge-heavy, fact-based
Fine-tuning | Expensive  | $100–100K         | <500ms  | Best for domain         | Days–weeks      | Domain-specific, proprietary

1. Prompting (Zero-Shot)

Definition: Ask LLM directly without examples or context.

User: "Classify this email as spam or not: 'Click here to claim your FREE MONEY!!!'"
LLM: "Spam"

Pros: Instant, free, simple
Cons: Quality varies, no guarantees

Cost: ~$0.002 per query (GPT-3.5: ~$0.001 per 1K tokens)
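The per-query arithmetic above is easy to sketch. A minimal helper, assuming a flat price per 1K tokens (the default below mirrors the ~$0.001/1K GPT-3.5 figure, which varies by model and provider):

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               price_per_1k: float = 0.001) -> float:
    """Estimated USD cost of one query at a flat per-1K-token price."""
    return (prompt_tokens + completion_tokens) / 1000 * price_per_1k

# A ~2K-token query at $0.001 per 1K tokens lands on the ~$0.002 figure above.
print(f"${query_cost(1500, 500):.4f}")  # → $0.0020
```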


2. Few-Shot Prompting

Definition: Provide examples in the prompt.

User: "Classify emails as spam or not.

Example 1: 'Re: Meeting tomorrow at 3pm' → Not spam
Example 2: 'Click here for FREE money!!!' → Spam
Example 3: 'Confirm your password here' → Spam

Classify: 'Order confirmation: Your package arrives tomorrow'"

LLM: "Not spam"

Pros: Better quality, still instant, no training
Cons: Token cost increases with examples

Cost: ~$0.01 per query (examples + input + output)

Production Impact: Few-shot prompting is commonly reported to improve accuracy by 10–30% on specialized tasks.
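A few-shot prompt is just string assembly. Below is a minimal sketch of the spam-classifier prompt above; the `build_few_shot_prompt` helper and its exact layout are illustrative, not a library API:

```python
def build_few_shot_prompt(task: str, examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble a task description, labeled examples, and the new input."""
    lines = [task, ""]
    for i, (text, label) in enumerate(examples, 1):
        lines.append(f"Example {i}: '{text}' → {label}")
    lines.append("")
    lines.append(f"Classify: '{query}'")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify emails as spam or not.",
    [("Re: Meeting tomorrow at 3pm", "Not spam"),
     ("Click here for FREE money!!!", "Spam"),
     ("Confirm your password here", "Spam")],
    "Order confirmation: Your package arrives tomorrow",
)
print(prompt)
```

Each added example grows the prompt (and the per-query token cost), which is the trade-off noted above.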


3. Retrieval-Augmented Generation (RAG)

Definition: Retrieve relevant documents, add to context, then generate.

User Query: "What are the latest updates?"

Retrieval: Find 3 relevant docs from knowledge base
Augmented Prompt: "Based on these documents: [doc1] [doc2] [doc3], answer: ..."
LLM: Generates answer grounded in documents
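The retrieve-augment-generate flow can be sketched end to end. The toy retriever below scores documents by word overlap with the query; a production system would use an embedding model and a vector database instead, and the knowledge-base strings are invented for illustration:

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Rank docs by shared-word count with the query; return the top k."""
    q = _tokens(query)
    return sorted(docs, key=lambda d: len(q & _tokens(d)), reverse=True)[:k]

def augment(query: str, docs: list[str]) -> str:
    """Build the augmented prompt from the retrieved documents."""
    context = " ".join(f"[{d}]" for d in docs)
    return f"Based on these documents: {context}, answer: {query}"

kb = ["latest updates: v2.1 adds faster indexing",
      "pricing for this quarter is unchanged",
      "office relocation notice"]
top = retrieve("What are the latest updates?", kb, k=1)
print(augment("What are the latest updates?", top))
```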

Pros:

  • ✅ Uses current information (no knowledge cutoff)
  • ✅ Verifiable answers (citable sources)
  • ✅ Easy to update (add/remove docs, no retraining)
  • ✅ Reduces hallucination by ~30–50%

Cons:

  • ❌ Retrieval latency (~50–200ms)
  • ❌ Requires infrastructure (vector DB)

Cost: $0–1000 setup (vector DB) + $0.001–0.01/query

Production Examples:

  • OpenAI ChatGPT (plugins, web browsing)
  • Stripe Support AI (internal docs)
  • Notion AI (personal documents)

4. Fine-Tuning

Definition: Update model weights on task-specific data.

Process:

  1. Collect labeled dataset (50–10K examples depending on task)
  2. Train for 3–10 epochs
  3. Deploy custom model
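Step 1 in practice means serializing the labeled examples, commonly as chat-style JSONL. The schema below follows OpenAI's fine-tuning format (one `messages` object per line); the spam rows are illustrative:

```python
import json

examples = [
    ("Click here for FREE money!!!", "Spam"),
    ("Re: Meeting tomorrow at 3pm", "Not spam"),
]

# One JSON object per line: a user turn plus the desired assistant reply.
lines = [
    json.dumps({"messages": [
        {"role": "user", "content": f"Classify this email: {text}"},
        {"role": "assistant", "content": label},
    ]})
    for text, label in examples
]
jsonl = "\n".join(lines)
print(jsonl.splitlines()[0])
```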

Cost Breakdown:

  • OpenAI Fine-tuning: $0.008–0.04 per 1K tokens (training)
  • Example: 10K examples × 500 tokens avg = 5M tokens = $40–200
  • Per-query inference: 2–5x cheaper than base model

Quality Gain: 5–20% improvement over base model for specialized domains

Production Examples:

  • Jasper AI: Fine-tuned on brand guidelines + marketing templates
  • Amazon/Alibaba: Fine-tuned on product catalogs for recommendations
  • Companies with proprietary data/style

When It Pays Off:

  • High-volume queries (>1M/month)
  • Domain-specific language style
  • Proprietary knowledge (trade secrets)
  • Cost-sensitive production

Decision Framework

Can you solve with good examples in prompt?
  ├─ Yes → Use Few-Shot Prompting
  └─ No: Does the answer need current information?
        ├─ Yes → Use RAG
        └─ No: Do you have 50+ labeled examples?
              ├─ Yes → Fine-tune
              └─ No: Stay with prompting, collect more data
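The tree above reads directly as a function; the boolean parameters mirror the three branch questions (names are illustrative):

```python
def choose_method(solvable_with_examples: bool,
                  needs_current_info: bool,
                  has_labeled_data: bool) -> str:
    """Walk the decision tree and return the recommended method."""
    if solvable_with_examples:
        return "few-shot prompting"
    if needs_current_info:
        return "RAG"
    if has_labeled_data:  # 50+ labeled examples available
        return "fine-tuning"
    return "prompting (collect more data)"

print(choose_method(False, True, False))  # → RAG
```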

Hybrid Approaches

Fine-tuning + RAG: Best of both

Fine-tuned base model (domain knowledge) + RAG (current facts)
Benefits: Domain expertise + up-to-date answers
Cost: Both setup costs
Common in: Enterprise (legal, healthcare)

Few-Shot + RAG: Few examples + retrieved context

Prompt: "Examples: [ex1] [ex2] Based on context: [doc1] [doc2], answer..."
Benefits: Examples guide style, context provides facts
Cost: Cheap (no training), better quality
Common in: Most production RAG systems

Cost Comparison: Detailed Breakdown

Prompting (Zero-Shot):

  • Cost per query: $0.001-0.005 (GPT-3.5) or $0.01-0.05 (GPT-4)
  • Example: 1M queries/month × $0.002 = $2K/month
  • Setup cost: $0 (use API immediately)
  • Quality: Baseline (~70-80% accuracy on tasks)

Few-Shot Prompting:

  • Cost per query: $0.005-0.02 (more tokens due to examples)
  • Example: 1M queries/month × $0.01 = $10K/month
  • Setup cost: $0 (no training)
  • Quality: Improved (~80-90% accuracy, 10-20% lift)
  • Example: Including 3 examples adds ~150 tokens per query; at GPT-4-class input pricing (~$0.03 per 1K tokens) that is roughly $0.0045 extra per query

RAG (Retrieval + Prompting):

  • Retrieval cost: $0-100/month (if using self-hosted Weaviate/Qdrant)
  • LLM cost: $0.001-0.01 per query (same as prompting)
  • Vector DB cost: $0-1000/month (Pinecone serverless: $0.08 per 100K vectors stored)
  • Example: 1B vectors × 1536 dims = $800/month + $10K API calls = $10.8K/month
  • Setup cost: $1-5K (infrastructure, indexing pipeline)
  • Quality: Excellent (~85-95% accuracy, grounded in sources)

Fine-tuning:

  • Training cost: $0.008-0.04 per 1K tokens (OpenAI)
  • Example: 10K training examples × 500 tokens = 5M tokens = $40-200
  • One-time cost (amortized)
  • Inference cost: 2-5x cheaper per query after fine-tuning
  • Example: 1M queries/month × ($0.002 fine-tuned) = $2K/month (vs $10K base model)
  • Break-even: ~500K-1M queries (payback in 1-2 months at high volume)
  • Setup cost: $1K-5K (data collection, labeling, experimentation)
  • Quality: Best for domain-specific tasks (~85-95% accuracy, higher ceiling)
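The break-even point is just one-time cost divided by per-query savings. A sketch reusing the section's assumed figures ($0.01/query base vs $0.002 fine-tuned, ~$3.2K combined setup and training cost):

```python
def break_even_queries(one_time_cost: float, base_per_query: float,
                       tuned_per_query: float) -> float:
    """Queries needed before fine-tuning pays for itself."""
    return one_time_cost / (base_per_query - tuned_per_query)

n = break_even_queries(3200, 0.01, 0.002)
print(f"{n:,.0f} queries")  # → 400,000 queries
```

At 1M queries/month, that one-time cost is recovered in well under two months, consistent with the payback estimate above.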

Recommendation Matrix:

Query Volume | Latency | Cost Sensitivity | Recommendation
<10K/mo      | Any     | Yes              | Prompting
10K-100K/mo  | <200ms  | Yes              | Few-shot + RAG
100K-1M/mo   | <100ms  | No               | RAG only
>1M/mo       | <50ms   | Yes              | Fine-tuning + RAG

How Real Systems Use This

Jasper AI (Fine-tuning + Few-Shot): Jasper, an AI copywriting platform, combines fine-tuning with few-shot prompting. They fine-tuned GPT-3 on 50K marketing templates (copywriting style, tone, format patterns). For each user request (“Write a product description for running shoes”), the system: (1) retrieves 2-3 similar templates from database (semantic search), (2) includes them as few-shot examples in prompt, (3) applies fine-tuned model to generate output. Result: Domain-specific tone matching, 30% higher user satisfaction vs base GPT-3. Cost: Fine-tuning one-time ($200), inference at scale ($5K/month for 1M queries). Why combo: Fine-tuning captures marketing style/voice; few-shot examples show format/tone within context window.

Amazon Customer Support (RAG + Rule-based Routing): Amazon uses a hybrid approach for their support bots. Incoming tickets are routed: (1) if known issue (FAQ match), return RAG-augmented answer from docs, (2) if escalation needed, route to human. The RAG system retrieves support docs using semantic search, combines top-3 with LLM to generate response. Latency: P50=300ms, P99=800ms. Accuracy: 85% of auto-responses rated “helpful” by customers. Cost: RAG infrastructure ($50K/month), API calls ($20K/month). Why RAG: New issues arise daily; fine-tuning would require retraining; docs update constantly; citations provide transparency for escalations.

Google BERT Fine-tuning for Search Ranking: Google fine-tuned BERT on 100M search queries to improve ranking. They collected labeled training data (query → relevant documents) and trained a custom ranking model. Fine-tuning took 3 weeks on TPU clusters. Result: 5-10% improvement in click-through-rate (CTR) over base BERT. Cost: Infrastructure ($5M), data labeling ($1M). Ongoing: $500K/month compute for inference. Why fine-tuning: Scale of queries (10B/day) makes fine-tuning cost-effective; proprietary ranking task (competitive advantage); one-time training investment amortized across billions of queries.

OpenAI ChatGPT (Hybrid: Few-Shot + RLHF): ChatGPT combines few-shot examples in the system prompt with RLHF alignment rather than traditional supervised fine-tuning. For each conversation: (1) the system prompt can include 2-3 examples of helpful responses, (2) the underlying model was aligned with RLHF on ~50K human preference pairs (preference optimization rather than supervised training on labeled input-output pairs). This is post-training alignment rather than task-specific fine-tuning. Result: High helpfulness, safety compliance, conversational quality. Cost: One-time RLHF ($1-5M), inference costs ($100M+/year). Why hybrid: Few-shot + RLHF is more scalable than traditional fine-tuning; generalizes better to unseen tasks; aligns with human values.

Anthropic Constitutional AI (Rule-based + RAG): Anthropic uses Constitutional AI (criteria-based generation) combined with RAG for safety. Instead of fine-tuning on human feedback, they: (1) define constitution of rules (e.g., “be helpful, harmless, honest”), (2) generate candidate responses, (3) critique responses against constitution (LLM critique), (4) rank by constitution adherence, (5) augment with RAG for factual accuracy. Cost: One-time constitution definition ($50K), inference with RAG ($500K/month). Why constitutional approach: Scalable (no human labeling), transparent (rules are explicit), safer (principles-based alignment vs implicit from data).


When to Combine Methods

Fine-tuning + Few-Shot + RAG (Enterprise Best Practice):

Complex customer support system:

1. RAG Layer: Retrieve top-5 relevant docs (knowledge base)
2. Few-Shot Examples: Include 2-3 similar resolved tickets
3. Fine-tuned Model: LLM fine-tuned on company ticket data
4. Prompt: "Here are similar issues: [ex1] [ex2]
           Knowledge base: [doc1] [doc2]
           Resolution for new ticket..."
5. Output: Company-specific, grounded in docs, stylistically consistent

Result:
- Fine-tuning: 10-15% quality improvement (domain knowledge)
- Few-shot: Additional 5-10% improvement (in-context examples)
- RAG: Adds verifiability + freshness (docs always current)
- Combined: 95%+ accuracy, all benefits
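Step 4 of the flow above is again prompt assembly. A minimal sketch, with placeholder strings standing in for real retrieved tickets, knowledge-base docs, and the fine-tuned model call:

```python
def support_prompt(examples: list[str], docs: list[str], ticket: str) -> str:
    """Combine few-shot tickets and retrieved docs into one prompt."""
    ex = " ".join(f"[{e}]" for e in examples)
    kb = " ".join(f"[{d}]" for d in docs)
    return (f"Here are similar issues: {ex}\n"
            f"Knowledge base: {kb}\n"
            f"Resolution for new ticket: {ticket}")

print(support_prompt(["reset link expired"], ["password policy doc"],
                     "user cannot log in"))
```

The assembled prompt would then be sent to the fine-tuned model, which supplies the company-specific style the raw context cannot.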

Few-Shot + RAG (Cost-Effective Majority):

11
Document Q&A system (Wikipedia, product docs):

1. RAG: Retrieve relevant passages (no training needed)
2. Few-Shot: Include 2 Q&A examples in prompt
3. Generate: LLM answers based on context + examples
4. Cost: $0 training + $5K/month infrastructure + $10K API

Good for:
- Rapidly changing knowledge (docs update weekly)
- Budget-conscious projects
- When domain is broad (100+ topics)

Trade-off Decision Matrix

Scenario                                  | Method            | Rationale
Prototyping, <1K examples                 | Prompting         | No training cost, immediate feedback
<100K/month queries                       | Few-shot          | Cheap, good quality, instant iterations
Knowledge-heavy, changing docs            | RAG               | Freshness + verifiability, no retraining
Domain-specific, high volume (>500K/mo)   | Fine-tuning       | Amortizes training cost, cheaper inference
Legal/Finance/Medical (accuracy critical) | RAG + Fine-tuning | Best of both: grounded + domain-expert
Real-time requirements (<50ms)            | Fine-tuning only  | RAG retrieval adds latency

References

📄 Language Models are Few-Shot Learners (GPT-3 Paper, Brown et al., 2020)
📄 Fine-Tuning Language Models (OpenAI Docs)
📄 Constitutional AI (Anthropic, 2023)
📄 In-Context Learning in Large Language Models (Dong et al., 2022)
🔗 LangChain: Prompting Best Practices
🎥 Fine-tuning vs RAG (Jeremy Howard)

This post is licensed under CC BY 4.0 by the author.