Cost Control & Optimization
A single agentic workflow can burn through $5 of API credits in one task if you're not careful -- and at enterprise scale with thousands of daily tasks, that's the difference between a viable product and a budget crisis.
Why Agent Costs Explode
Agents are expensive because they make multiple LLM calls per task. A simple chatbot: 1 call. An agent that researches, plans, executes 3 tool calls, and synthesizes: 5-8 calls. A multi-agent system with a supervisor plus 3 specialists: 10-20 calls. Each call resends the full conversation history, so cumulative token usage grows quadratically with the number of turns.
Cost anatomy of a single agent task:
- System prompt: 500-2000 tokens (sent every call)
- Conversation history: grows with each turn
- Tool definitions: 200-1000 tokens per tool (sent every call)
- Tool results: varies wildly (a web search result could be 5000 tokens)
- Agent reasoning: 500-2000 tokens output per turn
A 5-turn agent interaction with Claude Sonnet might cost $0.02-0.10. Sounds cheap until you multiply by 50,000 daily users.
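The anatomy above can be turned into a back-of-the-envelope cost model. This is a sketch with illustrative token counts and prices (the `PRICE_PER_M` values are placeholders, not a current rate card):

```python
# Rough per-task cost model for a multi-turn agent. Prices are
# illustrative ($ per million tokens), not current rate-card values.
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def task_cost(turns: int, system_tokens: int = 1500, tool_def_tokens: int = 600,
              history_growth: int = 500, output_per_turn: int = 500) -> float:
    """Estimate the cost of one agent task; history grows every turn."""
    total = 0.0
    history = 0
    for _ in range(turns):
        input_tokens = system_tokens + tool_def_tokens + history
        total += input_tokens / 1e6 * PRICE_PER_M["input"]
        total += output_per_turn / 1e6 * PRICE_PER_M["output"]
        history += history_growth + output_per_turn  # new user/tool + assistant tokens
    return total

print(f"${task_cost(5):.3f} for a 5-turn task")
```

With these assumed defaults a 5-turn task lands near the top of the $0.02-0.10 range quoted above, and the quadratic history growth is visible if you increase `turns`.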
Strategy 1: Model Routing
The highest-impact optimization. Use expensive models only when you need them.
Tiered Routing Architecture
```
User Request --> [Router] --> Complexity Assessment
                                     |
                     ┌───────────────┼───────────────┐
                     │               │               │
                 [Tier 1]        [Tier 2]        [Tier 3]
              Haiku / GPT-4o    Sonnet 3.5     Opus / GPT-4
                   mini                            full
               $0.001/task      $0.01/task     $0.10/task
               Simple Q&A        Analysis       Complex
                                                reasoning
```
Router Implementation Options
Keyword/heuristic router (cheapest, fastest):
```python
def route_request(user_input: str) -> str:
    simple_patterns = ["what is", "how do i", "tell me about"]
    complex_signals = ["compare", "analyze", "debug", "why did", "design"]
    text = user_input.lower()  # lowercase once; patterns must be lowercase too
    if any(p in text for p in simple_patterns) and len(user_input) < 100:
        return "haiku"
    elif any(s in text for s in complex_signals):
        return "opus"
    return "sonnet"  # default middle tier
```
LLM-based router (more accurate, adds cost): Use a cheap model (Haiku) to classify the request complexity before routing. The router call costs $0.0002 but saves $0.08 by avoiding unnecessary Opus calls. Worth it when >30% of requests are simple.
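A minimal sketch of that pattern, with the classifier injected as a client object (model name and prompt are illustrative assumptions):

```python
# Sketch of an LLM-based router: one cheap classification call decides
# which tier handles the real request. Model name is illustrative.
def llm_route(client, user_input: str) -> str:
    resp = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        system=("Classify the request complexity. Reply with exactly one "
                "word: simple, moderate, or complex."),
        messages=[{"role": "user", "content": user_input}],
    )
    label = resp.content[0].text.strip().lower()
    # Fall back to the middle tier on any unexpected label.
    return {"simple": "haiku", "moderate": "sonnet", "complex": "opus"}.get(label, "sonnet")
```

Falling back to the middle tier on an unparseable label keeps a flaky classifier from silently routing everything to the most expensive model.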
Learned router: Train a small classifier on historical (request, best_model) pairs. Lowest latency, no LLM cost for routing. Requires enough data to train.
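To show the shape of a learned router without pulling in an ML library, here is a toy nearest-centroid classifier over bag-of-words features. In production you would train a real classifier (logistic regression, a small transformer) on your historical pairs; everything below is a self-contained illustration:

```python
from collections import Counter

# Toy "learned" router: nearest-centroid over bag-of-words features,
# fit on historical (request, best_model) pairs. Illustration only.
class CentroidRouter:
    def __init__(self):
        self.centroids: dict[str, Counter] = {}

    def fit(self, examples: list[tuple[str, str]]) -> None:
        for text, model in examples:
            self.centroids.setdefault(model, Counter()).update(text.lower().split())

    def route(self, text: str) -> str:
        words = Counter(text.lower().split())
        def overlap(centroid: Counter) -> int:
            return sum(min(words[w], centroid[w]) for w in words)
        return max(self.centroids, key=lambda m: overlap(self.centroids[m]))

router = CentroidRouter()
router.fit([
    ("what is our refund policy", "haiku"),
    ("tell me about shipping", "haiku"),
    ("analyze last quarter revenue trends and design a forecast", "opus"),
    ("compare these two architectures and debug the failure", "opus"),
])
```

Because routing is a local computation, it adds no LLM cost and essentially no latency, as the paragraph above notes.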
When to Use Which Model
| Task Type | Recommended Model | Typical Cost |
|---|---|---|
| FAQ / simple lookup | Haiku / GPT-4o mini | $0.0005-0.002 |
| Summarization | Sonnet / GPT-4o | $0.005-0.02 |
| Tool selection & use | Sonnet | $0.005-0.02 |
| Complex reasoning | Opus / o1 / Gemini Pro | $0.05-0.20 |
| Code generation | Sonnet (good enough for most) | $0.01-0.05 |
| Routing/classification | Haiku / GPT-4o mini | $0.0002-0.001 |
Strategy 2: Caching
Prompt Caching (Anthropic)
Anthropic’s prompt caching stores frequently-sent prefixes server-side. Subsequent requests with the same prefix skip re-processing those tokens, reducing cost by 90% on cached portions and reducing latency.
```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,  # 3000 tokens, cached after first call
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_input}]
)
```
Impact: If your system prompt + tool definitions are 3000 tokens and you make 1000 calls/hour, prompt caching saves ~2.7M input tokens/hour. At Sonnet pricing ($3/M input), that’s ~$8/hour saved.
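The arithmetic behind that estimate is easy to sanity-check (the 90% discount on cached tokens is the assumption carried over from above):

```python
# Sanity-check the prompt-caching savings estimate.
cached_prefix_tokens = 3000
calls_per_hour = 1000
cache_discount = 0.90        # cached tokens cost ~10% of the normal input price
sonnet_input_per_m = 3.00    # $ per million input tokens

tokens_saved = cached_prefix_tokens * calls_per_hour * cache_discount
dollars_saved = tokens_saved / 1e6 * sonnet_input_per_m
print(tokens_saved, dollars_saved)
```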
Semantic Caching
Cache agent responses for semantically similar inputs. When a new request is similar enough to a cached one (cosine similarity > 0.95), return the cached response.
```python
# Pseudocode
embedding = embed(user_input)
cached = vector_db.search(embedding, threshold=0.95)
if cached:
    return cached.response  # Free, instant
else:
    response = agent.run(user_input)
    vector_db.store(embedding, response)
    return response
```
Best for: FAQ-heavy workloads, repeated queries. Not suitable for personalized or time-sensitive responses.
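For small workloads the "vector DB" can be an in-memory list. A minimal sketch, with the embedding function injected so the cache stays provider-agnostic (a real deployment would use an actual embedding model and an ANN index):

```python
import math

# Minimal in-memory semantic cache (sketch): linear scan + cosine
# similarity. `embed` is whatever embedding function you already use.
class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, text: str):
        v = self.embed(text)
        best = max(self.entries, key=lambda e: self._cosine(v, e[0]), default=None)
        if best and self._cosine(v, best[0]) >= self.threshold:
            return best[1]  # cache hit: free, instant
        return None

    def put(self, text: str, response: str) -> None:
        self.entries.append((self.embed(text), response))
```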
Tool Result Caching
Cache the results of deterministic tool calls. If get_product_details(sku="ABC123") returned the same result 5 minutes ago, reuse it.
```python
@cache(ttl=300)  # 5-minute TTL
def get_product_details(sku: str) -> dict:
    return db.query("SELECT * FROM products WHERE sku = ?", sku)
```
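The `@cache(ttl=...)` decorator above is not in the standard library (`functools` has no TTL variant), so here is one minimal way it could be implemented, with no size bound and no thread safety:

```python
import time
import functools

# Minimal TTL cache decorator (sketch): unbounded, not thread-safe.
def cache(ttl: float):
    def decorator(fn):
        store: dict = {}
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            hit = store.get(key)
            if hit is not None and time.monotonic() - hit[0] < ttl:
                return hit[1]  # fresh entry: skip the real call
            value = fn(*args, **kwargs)
            store[key] = (time.monotonic(), value)
            return value
        return wrapper
    return decorator
```

For production use, a library with eviction (e.g. an LRU + TTL cache) or an external store like Redis is the safer choice.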
Strategy 3: Token Optimization
Reduce System Prompt Size
Every token in your system prompt is sent with every request. Trim ruthlessly.
- Remove examples that don’t improve output quality (test this with evals)
- Use concise instructions (“Be brief” not “Please try to keep your responses concise and to the point while still being helpful”)
- Move rarely-needed context to retrievable tools instead of the system prompt
Conversation History Compression
Don’t send the full history every turn. Strategies:
- Sliding window: Last N messages only
- Summarize-and-drop: After 10 turns, summarize the first 8 into a paragraph, keep last 2 verbatim
- Selective inclusion: Only include messages relevant to the current subtask
```python
def compress_history(messages, max_tokens=4000):
    recent = messages[-4:]  # Always keep last 4 messages verbatim
    older = messages[:-4]
    if count_tokens(older) > max_tokens:
        summary = llm.summarize(older)  # One cheap summarization call
        return [{"role": "system", "content": f"Prior context: {summary}"}] + recent
    return messages
```
Structured Output
Request structured output (JSON) instead of free-text when you’re parsing the response programmatically anyway. Structured output is typically shorter and cheaper.
Strategy 4: Batch Processing
For async workloads (overnight agents, bulk processing), use batch APIs:
- Anthropic Batch API: 50% cost reduction, results within 24 hours
- OpenAI Batch API: 50% cost reduction, results within 24 hours
```python
# Anthropic batch (conceptual)
batch = client.batches.create(
    requests=[
        {"custom_id": "task-1", "params": {"model": "claude-sonnet-4-20250514", ...}},
        {"custom_id": "task-2", "params": {"model": "claude-sonnet-4-20250514", ...}},
    ]
)
# Poll for results -- typically completes in 1-6 hours
```
Perfect for: Side venture agents that run overnight. Queue up 100 research tasks, batch them, pay half price.
Strategy 5: Budget Caps & Monitoring
Per-Task Budget Caps
Set a maximum spend per agent task. Kill the task if it exceeds the cap.
```python
class BudgetTracker:
    def __init__(self, max_cost_usd: float = 0.50):
        self.max_cost = max_cost_usd
        self.current_cost = 0.0

    def track(self, input_tokens: int, output_tokens: int, model: str):
        cost = calculate_cost(input_tokens, output_tokens, model)
        self.current_cost += cost
        if self.current_cost > self.max_cost:
            raise BudgetExceededError(
                f"Task exceeded ${self.max_cost} budget (spent ${self.current_cost:.4f})"
            )
```
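The `calculate_cost` helper and `BudgetExceededError` referenced above are left undefined; one possible implementation uses a static price table (the prices below are illustrative placeholders, not a current rate card):

```python
# Possible implementations of the helpers BudgetTracker relies on.
# Prices are illustrative ($ per million tokens) -- look up current rates.
PRICES = {
    "haiku":  {"input": 0.25,  "output": 1.25},
    "sonnet": {"input": 3.00,  "output": 15.00},
    "opus":   {"input": 15.00, "output": 75.00},
}

class BudgetExceededError(RuntimeError):
    """Raised when a task exceeds its per-task spend cap."""

def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    p = PRICES[model]
    return input_tokens / 1e6 * p["input"] + output_tokens / 1e6 * p["output"]
```

Keeping the price table in config rather than code makes it easier to update when providers change pricing.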
Organization-Level Monitoring
Track costs at multiple granularities:
```
Dashboard:
├── Total spend: $X,XXX/month
├── Per-agent breakdown:
│   ├── Support agent:  $XXX (avg $0.03/task)
│   ├── Research agent: $XXX (avg $0.12/task)
│   └── Coding agent:   $XXX (avg $0.08/task)
├── Per-model breakdown:
│   ├── Opus:   $XXX (15% of calls, 60% of cost)
│   ├── Sonnet: $XXX (70% of calls, 35% of cost)
│   └── Haiku:  $XXX (15% of calls,  5% of cost)
└── Anomaly alerts: tasks > $1.00, daily spend > $XXX
```
Cost Per Task Benchmarks
Establish baselines and track drift:
| Task Type | Target Cost | Alert Threshold |
|---|---|---|
| Simple Q&A | $0.005 | $0.02 |
| Tool-assisted lookup | $0.02 | $0.08 |
| Multi-step research | $0.10 | $0.30 |
| Code generation + review | $0.08 | $0.25 |
| Full agent workflow (multi-agent) | $0.25 | $0.75 |
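The alert thresholds in the table translate directly into a per-task check; a minimal sketch (task-type keys are assumptions for illustration):

```python
# Alert when an observed task cost crosses its threshold from the
# benchmarks table. Task-type keys here are illustrative.
ALERT_THRESHOLDS = {
    "simple_qa":    0.02,
    "tool_lookup":  0.08,
    "research":     0.30,
    "codegen":      0.25,
    "multi_agent":  0.75,
}

def over_threshold(task_type: str, observed_cost: float) -> bool:
    """True if this task should fire an anomaly alert."""
    return observed_cost > ALERT_THRESHOLDS[task_type]
```

Tracking the rolling average against the *target* column separately catches slow drift that per-task alerts miss.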
Strategy 6: Architecture-Level Optimizations
Replace LLM Calls with Deterministic Logic
If the agent always calls the same tool for a specific input pattern, replace that LLM call with a rule:
```python
import re

# Before: the LLM decides to call get_order_status every time.
# After: pattern-match and skip the LLM entirely.
if re.search(r"(where is|track|status of) my order", user_input, re.IGNORECASE):
    order_id = extract_order_id(user_input)
    return get_order_status(order_id)  # No LLM call needed
```
Fine-Tuning for Specialized Tasks
Fine-tune a small model on your specific task. A fine-tuned GPT-4o mini can match Sonnet quality on narrow tasks at 1/10th the cost. Worth it when you have 1000+ examples and a stable task definition.
Reduce Tool Definitions
Each tool definition costs tokens. If your agent has 20 tools but typically uses 3, dynamically select which tool definitions to include based on the conversation context.
```python
def select_relevant_tools(user_input: str, all_tools: list) -> list:
    # Use embeddings or keywords to pick the 5 most relevant tools
    return rank_tools_by_relevance(user_input, all_tools)[:5]
```
Cost Optimization Decision Tree
```
Is the task simple/repetitive?
├── Yes --> Can you use rules instead of an LLM?
│   ├── Yes --> Deterministic logic (free)
│   └── No  --> Haiku/GPT-4o mini + semantic cache
└── No --> Is it batch/async?
    ├── Yes --> Batch API (50% savings)
    └── No  --> Model routing (Sonnet default, Opus for complex)
                + prompt caching
                + conversation compression
                + budget cap
```
Practical Savings Estimates
Starting from a naive implementation (Opus for everything, full history, no caching):
| Optimization | Savings | Cumulative |
|---|---|---|
| Model routing (70% Sonnet, 15% Haiku, 15% Opus) | 40-60% | 50% |
| Prompt caching | 15-25% | 60% |
| Conversation compression | 10-20% | 70% |
| Semantic caching (for repeated queries) | 5-15% | 75% |
| Batch API (for async work) | 10-15% | 80% |
| Tool definition pruning | 3-5% | 82% |
A system spending $10K/month naively can often be brought to $2-3K/month with these optimizations stacked.
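The cumulative column stacks multiplicatively: each optimization removes a fraction of whatever spend remains. Using rough midpoint fractions for each row (an assumption, since real savings overlap):

```python
# Stack the optimizations multiplicatively on a $10K/month naive spend.
# Per-step fractions are rough midpoints of the table's savings ranges.
savings = [0.50, 0.20, 0.17, 0.12, 0.15, 0.08]

remaining = 10_000.0  # naive monthly spend
for s in savings:
    remaining *= (1 - s)  # each step trims the remaining spend
print(round(remaining))
```

The result lands in the $2-3K/month range quoted above, which is why the cumulative savings flatten out: each new optimization applies to a smaller base.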