AI Observability and Monitoring
Observability for AI systems goes beyond traditional APM -- you need to track token usage, latency distributions, hallucination rates, drift, cost per request, and agent decision traces alongside standard infra metrics.
Why AI Observability Is Different
Traditional observability (logs, metrics, traces) covers infrastructure health. AI systems add layers that standard tools do not address:
| Traditional | AI-Specific |
|---|---|
| HTTP latency | Time-to-first-token, total generation time |
| Error rates | Hallucination rates, refusal rates |
| Request volume | Token consumption (input/output) |
| CPU/memory | GPU utilization, VRAM usage |
| API costs | Cost per query, cost per agent task |
| Request traces | Agent reasoning traces, tool call chains |
| Security (network) | Prompt injection, data leakage |
| Testing (pass/fail) | Eval pass rates (probabilistic) |
The Four Layers of AI Observability
AI observability operates at four distinct layers. Each requires different tools and metrics:
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Business │
│ Task completion rate, user satisfaction, ROI per agent │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Agent │
│ Reasoning traces, tool call chains, session replays, │
│ guardrail triggers, multi-agent orchestration │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: Model │
│ Latency (TTFT, P95), token throughput, cost per call, │
│ error rates, model version tracking │
├─────────────────────────────────────────────────────────────┤
│ Layer 1: Infrastructure │
│ GPU utilization, pod health, request rates, network, │
│ memory, standard Kubernetes metrics │
└─────────────────────────────────────────────────────────────┘
Most teams start at Layer 1 (infrastructure) and Layer 2 (model). The real value is at Layer 3 (agent) and Layer 4 (business). This section of the vault covers all four layers, with this file as the entry point.
Key Metrics to Track
Model Performance (Layer 2)
- Latency: Time-to-first-token (TTFT), inter-token latency, total response time. Track P50, P95, P99 (see the sketch after this list).
- Throughput: Requests/sec, tokens/sec per model
- Error rates: API failures, timeouts, rate limit hits
- Quality: Hallucination rate, factual accuracy (via automated evals), user feedback scores
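As a minimal sketch of how the latency metrics above can be instrumented, the following records TTFT and total response time as OpenTelemetry histograms; P50/P95/P99 are then derived from the histogram by the metrics backend. The `stream_completion` client and the metric names are illustrative assumptions, not a standard.

```python
# Sketch: record TTFT and total response time as OTel histograms.
# `stream_completion` is a hypothetical streaming LLM client; metric
# names are illustrative, not part of any convention.
import time
from opentelemetry import metrics

meter = metrics.get_meter("ai.observability")
ttft_hist = meter.create_histogram("llm.time_to_first_token", unit="s")
total_hist = meter.create_histogram("llm.total_response_time", unit="s")

def timed_stream(stream_completion, prompt, model):
    attrs = {"model": model}
    start = time.monotonic()
    first_token_seen = False
    for chunk in stream_completion(prompt, model=model):
        if not first_token_seen:
            ttft_hist.record(time.monotonic() - start, attributes=attrs)
            first_token_seen = True
        yield chunk
    total_hist.record(time.monotonic() - start, attributes=attrs)
```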
Cost & Usage (Layer 2)
- Token consumption: Input vs output tokens per request, per user, per agent
- Cost per request: Model cost + infra cost, attributed per agent and task type (see the sketch after this list)
- Budget burn rate: Daily/weekly spend vs budget cap
- Model utilization: Which models handle what % of traffic
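A rough sketch of per-request cost attribution from token counts; the per-1K-token rates and model names below are placeholders, not real prices, and in practice they would come from your provider's price list or a config store.

```python
# Sketch: attribute model cost per request from token counts.
# Rates and model names are PLACEHOLDERS -- substitute real pricing.
PRICE_PER_1K_TOKENS = {
    # model: (input_rate_usd, output_rate_usd)
    "model-a": (0.005, 0.015),
    "model-b": (0.00125, 0.005),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICE_PER_1K_TOKENS[model]
    return (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate

# Example: tag the result with agent and task type to feed budget burn-rate tracking
print(request_cost("model-a", input_tokens=1200, output_tokens=350))
```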
Agent-Specific (Layer 3)
- Tool call success rate: Per tool, per agent
- Agent task completion: End-to-end success rate for multi-step agent tasks
- Decision traces: Full reasoning chain for debugging and audit
- Loop detection: Agents stuck in retry loops or circular reasoning (see the sketch after this list)
- Guardrail trigger rate: Per guardrail type, per agent
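One way to implement the loop detection above is a simple heuristic over the agent's tool-call history, as sketched below; the thresholds and the (tool, arguments) signature are assumptions to tune per agent.

```python
# Sketch: flag an agent that repeats the same tool call with identical
# arguments, or burns through a step budget without finishing.
# Thresholds are assumptions -- tune per agent and task type.
from collections import Counter

MAX_IDENTICAL_CALLS = 3
MAX_STEPS = 20

def is_looping(tool_calls: list[tuple[str, str]]) -> bool:
    """tool_calls: (tool_name, serialized_args) pairs in execution order."""
    if len(tool_calls) > MAX_STEPS:
        return True
    return any(n > MAX_IDENTICAL_CALLS for n in Counter(tool_calls).values())

# Example: four identical search calls trip the detector
calls = [("search", '{"q": "refund policy"}')] * 4
print(is_looping(calls))  # True
```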
Data & Drift (Layer 4)
- Input drift: Distribution shift in user queries over time
- Output drift: Changes in model response patterns
- Embedding drift: Vector space shifts in RAG retrieval quality (see the sketch after this list)
- Eval regression: Weekly production eval pass rate vs baseline
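As a rough illustration of embedding drift, the sketch below compares the centroid of a baseline window of query embeddings against the current window using cosine distance; the threshold and window sizes are assumptions to calibrate against your own traffic.

```python
# Sketch: embedding drift as cosine distance between window centroids.
# Threshold and window sizes are assumptions; calibrate on real traffic.
import numpy as np

def embedding_drift(baseline: np.ndarray, current: np.ndarray) -> float:
    """baseline, current: (n_queries, dim) arrays of query embeddings."""
    b, c = baseline.mean(axis=0), current.mean(axis=0)
    cosine = float(np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c)))
    return 1.0 - cosine  # 0 = no directional shift; larger = more drift

# Toy usage: random vectors stand in for last week's vs this week's embeddings
rng = np.random.default_rng(0)
baseline = rng.normal(size=(500, 768))
current = rng.normal(loc=0.2, size=(500, 768))
if embedding_drift(baseline, current) > 0.1:  # assumed threshold
    print("Embedding drift detected -- review RAG retrieval quality")
```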
Recommended Stack Architecture
For an enterprise AI platform on GCP with a gateway architecture:
┌──────────────────────────────────────────────────────────────────┐
│ User Request │
│ ┌──────────┐ ┌────────────┐ ┌──────────┐ ┌───────────────┐ │
│ │ API GW │─>│ Agent GW │─>│ Agent │─>│ LLM GW │ │
│ │ (Kong) │ │ (routing) │ │ (ADK) │ │ (Kong AI/ │ │
│ │ │ │ │ │ │ │ LiteLLM) │ │
│ └────┬─────┘ └────┬───────┘ └────┬─────┘ └───────┬───────┘ │
│ │ │ │ │ │
│ └──────────────┴───────────────┴────────────────┘ │
│ │ │
│ OTel spans (W3C traceparent) │
└──────────────────────────────┼───────────────────────────────────┘
│
┌────────v────────┐
│ OTel Collector │
└───┬─────────┬───┘
│ │
┌───────────v──┐ ┌───v───────────┐
│ Cloud Trace │ │ Langfuse │
│ Cloud Monitor│ │ (self-hosted) │
│ Cloud Logging│ │ Evals, prompts│
└──────┬───────┘ └───────┬───────┘
│ │
┌──────v──────────────────v──────┐
│ Looker / Grafana Dashboards │
└────────────────────────────────┘
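A minimal sketch of wiring one hop in this chain (e.g. the agent gateway) to the shared OTel Collector with the OpenTelemetry Python SDK; the collector endpoint and service name are assumptions, and W3C traceparent propagation is the SDK default, so spans from the API gateway, agent, and LLM gateway join the same trace. The Collector's own pipeline then fans data out to Cloud Trace/Monitoring and Langfuse as in the diagram.

```python
# Sketch: export spans from one service to the shared OTel Collector.
# Endpoint and service name are assumptions for this environment.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "agent-gateway"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.gateway")
with tracer.start_as_current_span("route_request") as span:
    span.set_attribute("agent.id", "support-agent")  # illustrative attribute
```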
Alerting Strategy
| Alert | Threshold | Action |
|---|---|---|
| TTFT > 5s (P95) | 3 consecutive minutes | Page on-call, check model provider status |
| Hallucination rate > 10% | Rolling 1-hour window | Trigger eval pipeline, consider model rollback |
| Daily spend > budget cap | 80% of daily budget | Throttle non-critical requests, alert engineering |
| Tool call failure rate > 20% | Rolling 15-minute window | Check tool API health, failover if available |
| Agent loop detected | > 5 iterations without progress | Kill agent task, log for investigation |
| Guardrail trigger spike | > 2x baseline (1-hour window) | Investigate injection attack or quality regression |
| Eval regression detected | > 2% drop in weekly production evals | Investigate model or data drift |
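The budget-cap row above can be implemented as a periodic burn-rate check along the lines of the sketch below; the budget value is an assumption and the alerting/throttling hooks are stubs standing in for real paging and gateway integrations.

```python
# Sketch: 80% daily-budget alert with throttling, per the table above.
# Budget value is assumed; hooks are stubs for real paging/gateway calls.
DAILY_BUDGET_USD = 500.0

def alert_engineering(message: str) -> None:
    print(f"[ALERT] {message}")  # stand-in for Slack/PagerDuty

def throttle_noncritical_requests() -> None:
    print("[ACTION] throttling non-critical request classes")  # stand-in for gateway policy

def check_budget(spend_today_usd: float) -> None:
    burn = spend_today_usd / DAILY_BUDGET_USD
    if burn >= 0.8:
        alert_engineering(f"LLM spend at {burn:.0%} of daily budget cap")
        throttle_noncritical_requests()

check_budget(spend_today_usd=412.50)  # e.g. read from a billing export on a schedule
```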
Production Checklist
- Instrument all LLM calls with OTel GenAI semantic conventions
- Track token usage per request, per user, per agent
- Set cost budgets with automated throttling at 80% threshold
- Log agent traces with full reasoning chains for debugging
- Monitor retrieval quality if using RAG (precision@k, relevance scores)
- Set up drift detection on input distributions (weekly review minimum)
- Dashboard: Real-time cost, latency, error rate, quality metrics
- Alerting: PagerDuty/Slack for latency spikes, cost overruns, quality drops
- Eval regression monitoring: Weekly production eval runs with baseline comparison
- OTel GenAI conventions: Use standardized gen_ai.* attributes for vendor-neutral telemetry (see the sketch below)
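A short sketch of the last checklist item: tagging an LLM-call span with gen_ai.* attributes. The attribute names below follow the OTel GenAI semantic conventions, which are still evolving, so verify against the semconv version you adopt; the tracer is assumed to be configured as in the exporter sketch earlier.

```python
# Sketch: annotate an LLM-call span with OTel GenAI (gen_ai.*) attributes.
# Attribute names follow the GenAI semantic conventions; check the semconv
# version you target, as the spec is still evolving.
from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

with tracer.start_as_current_span("chat gemini-1.5-pro") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "vertex_ai")
    span.set_attribute("gen_ai.request.model", "gemini-1.5-pro")
    # ... call the model here ...
    span.set_attribute("gen_ai.usage.input_tokens", 1200)
    span.set_attribute("gen_ai.usage.output_tokens", 350)
```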