AI Observability and Monitoring

Observability for AI systems goes beyond traditional APM -- you need to track token usage, latency distributions, hallucination rates, drift, cost per request, and agent decision traces alongside standard infra metrics.


Why AI Observability Is Different

Traditional observability (logs, metrics, traces) covers infrastructure health. AI systems add layers that standard tools do not address:

Traditional           AI-Specific
HTTP latency          Time-to-first-token, total generation time
Error rates           Hallucination rates, refusal rates
Request volume        Token consumption (input/output)
CPU/memory            GPU utilization, VRAM usage
API costs             Cost per query, cost per agent task
Request traces        Agent reasoning traces, tool call chains
Security (network)    Prompt injection, data leakage
Testing (pass/fail)   Eval pass rates (probabilistic)

The Four Layers of AI Observability

AI observability operates at four distinct layers. Each requires different tools and metrics:

┌─────────────────────────────────────────────────────────────┐
│  Layer 4: Business                                          │
│  Task completion rate, user satisfaction, ROI per agent      │
├─────────────────────────────────────────────────────────────┤
│  Layer 3: Agent                                             │
│  Reasoning traces, tool call chains, session replays,       │
│  guardrail triggers, multi-agent orchestration              │
├─────────────────────────────────────────────────────────────┤
│  Layer 2: Model                                             │
│  Latency (TTFT, P95), token throughput, cost per call,      │
│  error rates, model version tracking                        │
├─────────────────────────────────────────────────────────────┤
│  Layer 1: Infrastructure                                    │
│  GPU utilization, pod health, request rates, network,       │
│  memory, standard Kubernetes metrics                        │
└─────────────────────────────────────────────────────────────┘

Most teams start at Layer 1 (infrastructure) and Layer 2 (model). The real value is at Layer 3 (agent) and Layer 4 (business). The rest of this post covers all four layers.


Key Metrics to Track

Model Performance (Layer 2)

  • Latency: Time-to-first-token (TTFT), inter-token latency, total response time. Track P50, P95, P99.
  • Throughput: Requests/sec, tokens/sec per model
  • Error rates: API failures, timeouts, rate limit hits
  • Quality: Hallucination rate, factual accuracy (via automated evals), user feedback scores
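Averages hide tail latency, so the percentiles above matter more than the mean. A minimal sketch of computing P50/P95/P99 from raw TTFT samples; the function name and the rounded-index method are illustrative, not from any particular monitoring library:

```python
def percentile(samples, p):
    """Percentile via index rounded on a 0..n-1 scale (p in [0, 100])."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Hypothetical time-to-first-token samples in milliseconds.
ttft_ms = [120, 95, 340, 2100, 180, 150, 4800, 210, 160, 130]
for p in (50, 95, 99):
    print(f"P{p} TTFT: {percentile(ttft_ms, p)} ms")
```

Note how a single slow request drags P95 and P99 far above P50, which is exactly why tail percentiles belong on the dashboard.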

Cost & Usage (Layer 2)

  • Token consumption: Input vs output tokens per request, per user, per agent
  • Cost per request: Model cost + infra cost, attributed per agent and task type
  • Budget burn rate: Daily/weekly spend vs budget cap
  • Model utilization: Which models handle what % of traffic
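Cost attribution reduces to token counts times per-model rates. A sketch of the arithmetic; the model names and per-million-token prices below are placeholders, not real provider rates:

```python
# Hypothetical USD prices per 1M tokens; substitute your provider's rates.
PRICE_PER_MTOK = {
    "model-a": {"input": 3.00, "output": 15.00},
    "model-b": {"input": 0.25, "output": 1.25},
}

def request_cost(model, input_tokens, output_tokens):
    """Model cost for one request; infra cost would be added on top."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("model-a", input_tokens=1_200, output_tokens=400)
print(f"${cost:.6f}")  # output tokens dominate at typical price ratios
```

Tagging each computed cost with user, agent, and task-type labels is what makes the per-agent attribution above possible.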

Agent-Specific (Layer 3)

  • Tool call success rate: Per tool, per agent
  • Agent task completion: End-to-end success rate for multi-step agent tasks
  • Decision traces: Full reasoning chain for debugging and audit
  • Loop detection: Agents stuck in retry loops or circular reasoning
  • Guardrail trigger rate: Per guardrail type, per agent
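Loop detection can be as simple as fingerprinting recent tool calls. A sketch assuming a `(tool, args)` signature and a repeat threshold; production systems may compare full reasoning states instead:

```python
from collections import deque

class LoopDetector:
    """Flags an agent repeating the same tool call within a sliding window."""
    def __init__(self, max_repeats=5, window=10):
        self.window = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, tool, args):
        """Record one tool call; return True if a loop is suspected."""
        sig = (tool, tuple(sorted(args.items())))
        self.window.append(sig)
        return self.window.count(sig) >= self.max_repeats

det = LoopDetector(max_repeats=3)
calls = [("search", {"q": "foo"})] * 3
flags = [det.observe(tool, args) for tool, args in calls]
print(flags)  # the third identical call trips the detector
```

The sliding window keeps memory bounded and lets legitimate repeated calls spaced far apart pass without triggering.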

Data & Drift (Layer 4)

  • Input drift: Distribution shift in user queries over time
  • Output drift: Changes in model response patterns
  • Embedding drift: Vector space shifts in RAG retrieval quality
  • Eval regression: Weekly production eval pass rate vs baseline
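One common way to quantify input drift is the Population Stability Index (PSI) over a bucketed feature such as query length. The bins and the conventional 0.2 alert threshold below are rules of thumb, not from this post:

```python
import math

def psi(expected, actual):
    """PSI between two histograms (bucket counts over the same bins)."""
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        # Small floor avoids log(0) for empty buckets.
        e_pct = max(e / e_total, 1e-6)
        a_pct = max(a / a_total, 1e-6)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [400, 300, 200, 100]   # e.g. last month's query-length histogram
current  = [100, 200, 300, 400]   # this week's histogram
print(f"PSI = {psi(baseline, current):.3f}")  # > 0.2 commonly read as drift
```

The same computation applies to output-length or embedding-cluster histograms, so one function covers the input, output, and embedding drift signals listed above.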

For an enterprise AI platform on GCP with a gateway architecture, a reference observability pipeline looks like this:

┌──────────────────────────────────────────────────────────────────┐
│  User Request                                                    │
│  ┌──────────┐  ┌────────────┐  ┌──────────┐  ┌───────────────┐  │
│  │ API GW   │─>│ Agent GW   │─>│ Agent    │─>│ LLM GW        │  │
│  │ (Kong)   │  │ (routing)  │  │ (ADK)    │  │ (Kong AI/     │  │
│  │          │  │            │  │          │  │  LiteLLM)     │  │
│  └────┬─────┘  └────┬───────┘  └────┬─────┘  └───────┬───────┘  │
│       │              │               │                │          │
│       └──────────────┴───────────────┴────────────────┘          │
│                              │                                   │
│                    OTel spans (W3C traceparent)                   │
└──────────────────────────────┼───────────────────────────────────┘
                               │
                      ┌────────v────────┐
                      │  OTel Collector │
                      └───┬─────────┬───┘
                          │         │
              ┌───────────v──┐  ┌───v───────────┐
              │ Cloud Trace  │  │ Langfuse      │
              │ Cloud Monitor│  │ (self-hosted) │
              │ Cloud Logging│  │ Evals, prompts│
              └──────┬───────┘  └───────┬───────┘
                     │                  │
              ┌──────v──────────────────v──────┐
              │  Looker / Grafana Dashboards   │
              └────────────────────────────────┘

Alerting Strategy

Alert Threshold Action
TTFT > 5s (P95) 3 consecutive minutes Page on-call, check model provider status
Hallucination rate > 10% Rolling 1-hour window Trigger eval pipeline, consider model rollback
Daily spend > budget cap 80% of daily budget Throttle non-critical requests, alert engineering
Tool call failure rate > 20% Rolling 15-minute window Check tool API health, failover if available
Agent loop detected > 5 iterations without progress Kill agent task, log for investigation
Guardrail trigger spike > 2x baseline (1-hour window) Investigate injection attack or quality regression
Eval regression detected > 2% drop in weekly production evals Investigate model or data drift
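The rolling-window alerts above reduce to a rate check over recent events. A sketch for the hallucination alert; the 10% threshold and 1-hour window come from the table, while the event format is illustrative:

```python
def hallucination_alert(events, now, window_s=3600, threshold=0.10):
    """events: list of (timestamp, flagged: bool) per-response eval results."""
    recent = [flagged for ts, flagged in events if now - ts <= window_s]
    if not recent:
        return False  # no data in the window; don't page on silence
    rate = sum(recent) / len(recent)
    return rate > threshold

now = 10_000
# One eval result per minute over the last half hour; every 5th is flagged.
events = [(now - i * 60, i % 5 == 0) for i in range(30)]
print(hallucination_alert(events, now))  # 20% flagged rate exceeds 10%
```

The same shape, with different window and threshold parameters, covers the tool-call-failure and guardrail-spike rows.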

Production Checklist

  1. Instrument all LLM calls with OTel GenAI semantic conventions
  2. Track token usage per request, per user, per agent
  3. Set cost budgets with automated throttling at 80% threshold
  4. Log agent traces with full reasoning chains for debugging
  5. Monitor retrieval quality if using RAG (precision@k, relevance scores)
  6. Set up drift detection on input distributions (weekly review minimum)
  7. Dashboard: Real-time cost, latency, error rate, quality metrics
  8. Alerting: PagerDuty/Slack for latency spikes, cost overruns, quality drops
  9. Eval regression monitoring: Weekly production eval runs with baseline comparison
  10. OTel GenAI conventions: Use standardized gen_ai.* attributes for vendor-neutral telemetry
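For checklist items 1 and 10, a sketch of the `gen_ai.*` attributes an instrumented LLM call might carry. The attribute names follow the OTel GenAI semantic conventions, which are still evolving, so verify against the current spec; the plain dict here is a stand-in for a real OpenTelemetry span:

```python
def genai_span_attributes(model, system, input_tokens, output_tokens):
    """Build gen_ai.* span attributes for one LLM call (illustrative helper)."""
    return {
        "gen_ai.system": system,                     # e.g. "openai", "vertex_ai"
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = genai_span_attributes("gemini-1.5-pro", "vertex_ai", 1_200, 400)
print(attrs["gen_ai.usage.input_tokens"])
```

Using the standardized keys rather than ad-hoc names is what keeps the telemetry portable across Cloud Trace, Langfuse, and any other OTel-compatible backend in the diagram above.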

This post is licensed under CC BY 4.0 by the author.