Cost Control & Optimization

A single agentic workflow can burn through $5 of API credits in one task if you're not careful -- and at enterprise scale with thousands of daily tasks, that's the difference between a viable product and a budget crisis.

Why Agent Costs Explode

Agents are expensive because they make multiple LLM calls per task. A simple chatbot: 1 call. An agent that researches, plans, executes 3 tool calls, and synthesizes: 5-8 calls. A multi-agent system with supervisor + 3 specialists: 10-20 calls. Each call sends the full conversation history, so token usage grows quadratically with conversation length.

Cost anatomy of a single agent task:

  • System prompt: 500-2000 tokens (sent every call)
  • Conversation history: grows with each turn
  • Tool definitions: 200-1000 tokens per tool (sent every call)
  • Tool results: varies wildly (a web search result could be 5000 tokens)
  • Agent reasoning: 500-2000 tokens output per turn

A 5-turn agent interaction with Claude Sonnet might cost $0.02-0.10. Sounds cheap until you multiply by 50,000 daily users.
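To see that growth concretely, here is a small sketch (with hypothetical token counts) of how billed input tokens accumulate when the full history is resent every turn:

```python
def total_input_tokens(system_tokens: int, turn_tokens: int, turns: int) -> int:
    """Total input tokens billed across a conversation where each turn
    adds `turn_tokens` of new content and resends the full history."""
    total = 0
    history = 0
    for _ in range(turns):
        history += turn_tokens          # history grows linearly per turn
        total += system_tokens + history  # but is re-billed every turn
    return total

# 5 turns: 12,500 input tokens. 10 turns: 37,500 -- 3x the cost
# for 2x the turns, which is the quadratic growth in action.
print(total_input_tokens(1000, 500, 5), total_input_tokens(1000, 500, 10))
```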


Strategy 1: Model Routing

The highest-impact optimization. Use expensive models only when you need them.

Tiered Routing Architecture

User Request --> [Router] --> Complexity Assessment
                                |
                    ┌───────────┼───────────┐
                    │           │           │
               [Tier 1]    [Tier 2]    [Tier 3]
              Haiku/GPT-4o  Sonnet    Opus/GPT-4
              mini          3.5       full
              $0.001/task   $0.01     $0.10
              Simple Q&A    Analysis  Complex
                                      reasoning

Router Implementation Options

Keyword/heuristic router (cheapest, fastest):

def route_request(user_input: str) -> str:
    simple_patterns = ["what is", "how do I", "tell me about"]
    complex_signals = ["compare", "analyze", "debug", "why did", "design"]
    
    if any(p in user_input.lower() for p in simple_patterns) and len(user_input) < 100:
        return "haiku"
    elif any(s in user_input.lower() for s in complex_signals):
        return "opus"
    return "sonnet"  # default middle tier

LLM-based router (more accurate, adds cost): Use a cheap model (Haiku) to classify the request complexity before routing. The router call costs $0.0002 but saves $0.08 by avoiding unnecessary Opus calls. Worth it when >30% of requests are simple.
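A minimal sketch of the LLM-based router, with the cheap classification call abstracted behind a `classify` callable (an assumption: in practice this would wrap a Haiku request that returns one word):

```python
def llm_route(user_input: str, classify) -> str:
    """Route a request using a cheap classifier model.

    classify: callable that sends the prompt to a small model (e.g. Haiku)
    and returns "simple", "moderate", or "complex" (hypothetical wrapper).
    """
    label = classify(
        "Classify the complexity of this request as simple, moderate, "
        f"or complex. Reply with exactly one word.\n\nRequest: {user_input}"
    )
    tiers = {"simple": "haiku", "moderate": "sonnet", "complex": "opus"}
    # Fall back to the middle tier if the classifier says anything unexpected
    return tiers.get(label.strip().lower(), "sonnet")
```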

Learned router: Train a small classifier on historical (request, best_model) pairs. Lowest latency, no LLM cost for routing. Requires enough data to train.
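As a toy illustration of the learned approach, a nearest-neighbor router over bag-of-words features (a sketch only; a real learned router would be a proper classifier trained on production traffic):

```python
class LearnedRouter:
    """Toy nearest-neighbor router over historical (request, best_model) pairs."""

    def __init__(self, examples):
        # examples: list of (request_text, best_model) from past traffic
        self.examples = [(set(t.lower().split()), m) for t, m in examples]

    def route(self, user_input: str, default: str = "sonnet") -> str:
        words = set(user_input.lower().split())
        best, best_score = default, 0.0
        for feats, model in self.examples:
            # Jaccard similarity between word sets
            score = len(words & feats) / (len(words | feats) or 1)
            if score > best_score:
                best, best_score = model, score
        return best
```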

When to Use Which Model

| Task Type | Recommended Model | Typical Cost |
|-----------|-------------------|--------------|
| FAQ / simple lookup | Haiku / GPT-4o mini | $0.0005-0.002 |
| Summarization | Sonnet / GPT-4o | $0.005-0.02 |
| Tool selection & use | Sonnet | $0.005-0.02 |
| Complex reasoning | Opus / o1 / Gemini Pro | $0.05-0.20 |
| Code generation | Sonnet (good enough for most) | $0.01-0.05 |
| Routing/classification | Haiku / GPT-4o mini | $0.0002-0.001 |

Strategy 2: Caching

Prompt Caching (Anthropic)

Anthropic’s prompt caching stores frequently-sent prefixes server-side. Subsequent requests with the same prefix skip re-processing those tokens, reducing cost by 90% on cached portions and reducing latency.

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,  # 3000 tokens, cached after first call
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_input}]
)

Impact: If your system prompt + tool definitions are 3000 tokens and you make 1000 calls/hour, prompt caching saves ~2.7M input tokens/hour. At Sonnet pricing ($3/M input), that’s ~$8/hour saved.
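A quick sanity check of that arithmetic:

```python
system_tokens = 3000        # cached system prompt + tool definitions
calls_per_hour = 1000
cache_discount = 0.90       # cached portions cost ~90% less
sonnet_input_price = 3.0    # USD per million input tokens

saved_tokens = system_tokens * calls_per_hour * cache_discount  # 2.7M tokens/hour
saved_usd = saved_tokens / 1_000_000 * sonnet_input_price
print(f"~${saved_usd:.2f}/hour saved")  # ~$8.10/hour saved
```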

Semantic Caching

Cache agent responses for semantically similar inputs. When a new request is similar enough to a cached one (cosine similarity > 0.95), return the cached response.

# Pseudocode
embedding = embed(user_input)
cached = vector_db.search(embedding, threshold=0.95)
if cached:
    return cached.response  # Free, instant
else:
    response = agent.run(user_input)
    vector_db.store(embedding, response)
    return response

Best for: FAQ-heavy workloads, repeated queries. Not suitable for personalized or time-sensitive responses.
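A runnable version of the pseudocode above, using an in-memory store and a pluggable `embed` function (assumptions: a production system would use a real embedding model and a vector database rather than a Python list):

```python
import math

class SemanticCache:
    """Minimal in-memory semantic cache (sketch)."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: text -> list[float]
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, text: str):
        """Return a cached response if a similar query exists, else None."""
        v = self.embed(text)
        best = max(self.entries, key=lambda e: self._cosine(v, e[0]), default=None)
        if best is not None and self._cosine(v, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, text: str, response: str):
        self.entries.append((self.embed(text), response))
```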

Tool Result Caching

Cache the results of deterministic tool calls. If get_product_details(sku="ABC123") returned the same result 5 minutes ago, reuse it.

@cache(ttl=300)  # 5-minute TTL
def get_product_details(sku: str) -> dict:
    return db.query("SELECT * FROM products WHERE sku = ?", sku)
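The `@cache(ttl=...)` decorator above is illustrative; a minimal implementation might look like this (a sketch: not thread-safe, and it assumes hashable arguments):

```python
import functools
import time

def cache(ttl: float):
    """Minimal TTL cache decorator."""
    def decorator(fn):
        store = {}  # key -> (result, timestamp)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            hit = store.get(key)
            if hit is not None and time.monotonic() - hit[1] < ttl:
                return hit[0]  # fresh cached result, no re-execution
            result = fn(*args, **kwargs)
            store[key] = (result, time.monotonic())
            return result
        return wrapper
    return decorator
```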

Strategy 3: Token Optimization

Reduce System Prompt Size

Every token in your system prompt is sent with every request. Trim ruthlessly.

  • Remove examples that don’t improve output quality (test this with evals)
  • Use concise instructions (“Be brief” not “Please try to keep your responses concise and to the point while still being helpful”)
  • Move rarely-needed context to retrievable tools instead of the system prompt

Conversation History Compression

Don’t send the full history every turn. Strategies:

  1. Sliding window: Last N messages only
  2. Summarize-and-drop: After 10 turns, summarize the first 8 into a paragraph, keep last 2 verbatim
  3. Selective inclusion: Only include messages relevant to the current subtask
def compress_history(messages, max_tokens=4000):
    recent = messages[-4:]  # Always keep the last 4 messages verbatim
    older = messages[:-4]
    if count_tokens(older) > max_tokens:  # count_tokens: tokenizer-based estimate
        summary = llm.summarize(older)  # One cheap summarization call (e.g. Haiku)
        return [{"role": "system", "content": f"Prior context: {summary}"}] + recent
    return messages

Structured Output

Request structured output (JSON) instead of free-text when you’re parsing the response programmatically anyway. Structured output is typically shorter and cheaper.


Strategy 4: Batch Processing

For async workloads (overnight agents, bulk processing), use batch APIs:

  • Anthropic Batch API: 50% cost reduction, results within 24 hours
  • OpenAI Batch API: 50% cost reduction, results within 24 hours
# Anthropic batch (conceptual)
batch = client.batches.create(
    requests=[
        {"custom_id": "task-1", "params": {"model": "claude-sonnet-4-20250514", ...}},
        {"custom_id": "task-2", "params": {"model": "claude-sonnet-4-20250514", ...}},
    ]
)
# Poll for results -- typically completes in 1-6 hours

Perfect for: Side venture agents that run overnight. Queue up 100 research tasks, batch them, pay half price.


Strategy 5: Budget Caps & Monitoring

Per-Task Budget Caps

Set a maximum spend per agent task. Kill the task if it exceeds the cap.

class BudgetExceededError(Exception):
    pass

class BudgetTracker:
    def __init__(self, max_cost_usd: float = 0.50):
        self.max_cost = max_cost_usd
        self.current_cost = 0.0
    
    def track(self, input_tokens: int, output_tokens: int, model: str):
        # calculate_cost: maps token counts to dollars via a pricing table
        cost = calculate_cost(input_tokens, output_tokens, model)
        self.current_cost += cost
        if self.current_cost > self.max_cost:
            raise BudgetExceededError(
                f"Task exceeded ${self.max_cost} budget (spent ${self.current_cost:.4f})"
            )
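The `calculate_cost` helper is assumed above; a sketch using per-million-token prices (illustrative numbers, check your provider's current pricing before relying on them):

```python
# Assumed per-million-token prices in USD (input, output); verify against
# your provider's current price sheet.
PRICES = {
    "opus":   {"input": 15.0, "output": 75.0},
    "sonnet": {"input": 3.0,  "output": 15.0},
    "haiku":  {"input": 0.8,  "output": 4.0},
}

def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Dollar cost of one call, given token counts and a model tier."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```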

Organization-Level Monitoring

Track costs at multiple granularities:

Dashboard:
├── Total spend: $X,XXX/month
├── Per-agent breakdown:
│   ├── Support agent: $XXX (avg $0.03/task)
│   ├── Research agent: $XXX (avg $0.12/task)
│   └── Coding agent: $XXX (avg $0.08/task)
├── Per-model breakdown:
│   ├── Opus: $XXX (15% of calls, 60% of cost)
│   ├── Sonnet: $XXX (70% of calls, 35% of cost)
│   └── Haiku: $XXX (15% of calls, 5% of cost)
└── Anomaly alerts: tasks > $1.00, daily spend > $XXX

Cost Per Task Benchmarks

Establish baselines and track drift:

| Task Type | Target Cost | Alert Threshold |
|-----------|-------------|-----------------|
| Simple Q&A | $0.005 | $0.02 |
| Tool-assisted lookup | $0.02 | $0.08 |
| Multi-step research | $0.10 | $0.30 |
| Code generation + review | $0.08 | $0.25 |
| Full agent workflow (multi-agent) | $0.25 | $0.75 |
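Those baselines can be enforced in code; a sketch with hypothetical task-type keys:

```python
# (target, alert_threshold) in USD per task, from the benchmarks above;
# the keys are hypothetical identifiers for each task type.
BENCHMARKS = {
    "simple_qa":      (0.005, 0.02),
    "tool_lookup":    (0.02,  0.08),
    "research":       (0.10,  0.30),
    "codegen_review": (0.08,  0.25),
    "multi_agent":    (0.25,  0.75),
}

def cost_alert(task_type: str, actual_cost: float):
    """Return None if within target, "warn" if drifting, "alert" if over threshold."""
    target, threshold = BENCHMARKS[task_type]
    if actual_cost > threshold:
        return "alert"
    if actual_cost > target:
        return "warn"
    return None
```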

Strategy 6: Architecture-Level Optimizations

Replace LLM Calls with Deterministic Logic

If the agent always calls the same tool for a specific input pattern, replace that LLM call with a rule:

import re

# Before: the LLM decides to call get_order_status on every request.
# After: pattern-match inside the request handler and skip the LLM.
if re.match(r"(where is|track|status of) my order", user_input):
    order_id = extract_order_id(user_input)  # helper: pull the ID from the text
    return get_order_status(order_id)  # No LLM call needed

Fine-Tuning for Specialized Tasks

Fine-tune a small model on your specific task. A fine-tuned GPT-4o mini can match Sonnet quality on narrow tasks at 1/10th the cost. Worth it when you have 1000+ examples and a stable task definition.

Reduce Tool Definitions

Each tool definition costs tokens. If your agent has 20 tools but typically uses 3, dynamically select which tool definitions to include based on the conversation context.

def select_relevant_tools(user_input: str, all_tools: list) -> list:
    # Use embeddings or keywords to pick top 5 most relevant tools
    return rank_tools_by_relevance(user_input, all_tools)[:5]
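The `rank_tools_by_relevance` helper is assumed above; a toy keyword-overlap version (embeddings would do better in practice, and each tool is assumed to be a dict with `name` and `description` keys):

```python
def rank_tools_by_relevance(user_input: str, tools: list) -> list:
    """Rank tools by word overlap between the request and each description."""
    words = set(user_input.lower().split())

    def overlap(tool) -> int:
        return len(words & set(tool["description"].lower().split()))

    return sorted(tools, key=overlap, reverse=True)
```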

Cost Optimization Decision Tree

Is the task simple/repetitive?
├── Yes --> Can you use rules instead of LLM? 
│           ├── Yes --> Deterministic logic (free)
│           └── No  --> Haiku/GPT-4o mini + semantic cache
└── No  --> Is it batch/async?
            ├── Yes --> Batch API (50% savings)
            └── No  --> Model routing (Sonnet default, Opus for complex)
                        + prompt caching
                        + conversation compression
                        + budget cap

Practical Savings Estimates

Starting from a naive implementation (Opus for everything, full history, no caching):

| Optimization | Savings | Cumulative |
|--------------|---------|------------|
| Model routing (70% Sonnet, 15% Haiku, 15% Opus) | 40-60% | 50% |
| Prompt caching | 15-25% | 60% |
| Conversation compression | 10-20% | 70% |
| Semantic caching (for repeated queries) | 5-15% | 75% |
| Batch API (for async work) | 10-15% | 80% |
| Tool definition pruning | 3-5% | 82% |

A system spending $10K/month naively can often be brought to $2-3K/month with these optimizations stacked.


This post is licensed under CC BY 4.0 by the author.