AI Observability and Monitoring
Observability for AI systems goes beyond traditional APM -- you need to track token usage, latency distributions, hallucination rates, drift, cost per request, and agent decision traces alongside standard infra metrics.
Why AI Observability Is Different
Traditional observability (logs, metrics, traces) covers infrastructure health. AI systems add layers that standard tools do not address:
| Traditional | AI-Specific |
|---|---|
| HTTP latency | Time-to-first-token, total generation time |
| Error rates | Hallucination rates, refusal rates |
| Request volume | Token consumption (input/output) |
| CPU/memory | GPU utilization, VRAM usage |
| API costs | Cost per query, cost per agent task |
| Request traces | Agent reasoning traces, tool call chains |
| Security (network) | Prompt injection, data leakage |
| Testing (pass/fail) | Eval pass rates (probabilistic) |
The Four Layers of AI Observability
AI observability operates at four distinct layers. Each requires different tools and metrics:
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Business │
│ Task completion rate, user satisfaction, ROI per agent │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Agent │
│ Reasoning traces, tool call chains, session replays, │
│ guardrail triggers, multi-agent orchestration │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: Model │
│ Latency (TTFT, P95), token throughput, cost per call, │
│ error rates, model version tracking │
├─────────────────────────────────────────────────────────────┤
│ Layer 1: Infrastructure │
│ GPU utilization, pod health, request rates, network, │
│ memory, standard Kubernetes metrics │
└─────────────────────────────────────────────────────────────┘
Most teams start at Layer 1 (infrastructure) and Layer 2 (model). The real value is at Layer 3 (agent) and Layer 4 (business). This section of the vault covers all four layers, with this file as the entry point.
Key Metrics to Track
Model Performance (Layer 2)
- Latency: Time-to-first-token (TTFT), inter-token latency, total response time. Track P50, P95, P99 (see the sketch after this list).
- Throughput: Requests/sec, tokens/sec per model
- Error rates: API failures, timeouts, rate limit hits
- Quality: Hallucination rate, factual accuracy (via automated evals), user feedback scores
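As a minimal sketch of how the latency metrics above can be instrumented, the following records TTFT and total response time as OpenTelemetry histograms; P50/P95/P99 are then derived from the histogram by the metrics backend. The `stream_completion` client and the metric names are illustrative assumptions, not a standard.

```python
# Sketch: record TTFT and total response time as OTel histograms.
# `stream_completion` is a hypothetical streaming LLM client; metric
# names are illustrative, not part of any convention.
import time
from opentelemetry import metrics

meter = metrics.get_meter("ai.observability")
ttft_hist = meter.create_histogram("llm.time_to_first_token", unit="s")
total_hist = meter.create_histogram("llm.total_response_time", unit="s")

def timed_stream(stream_completion, prompt, model):
    attrs = {"model": model}
    start = time.monotonic()
    first_token_seen = False
    for chunk in stream_completion(prompt, model=model):
        if not first_token_seen:
            ttft_hist.record(time.monotonic() - start, attributes=attrs)
            first_token_seen = True
        yield chunk
    total_hist.record(time.monotonic() - start, attributes=attrs)
```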
Cost & Usage (Layer 2)
- Token consumption: Input vs output tokens per request, per user, per agent
- Cost per request: Model cost + infra cost, attributed per agent and task type (see the sketch after this list)
- Budget burn rate: Daily/weekly spend vs budget cap
- Model utilization: Which models handle what % of traffic
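A rough sketch of per-request cost attribution from token counts; the per-1K-token rates and model names below are placeholders, not real prices, and in practice they would come from your provider's price list or a config store.

```python
# Sketch: attribute model cost per request from token counts.
# Rates and model names are PLACEHOLDERS -- substitute real pricing.
PRICE_PER_1K_TOKENS = {
    # model: (input_rate_usd, output_rate_usd)
    "model-a": (0.005, 0.015),
    "model-b": (0.00125, 0.005),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICE_PER_1K_TOKENS[model]
    return (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate

# Example: tag the result with agent and task type to feed budget burn-rate tracking
print(request_cost("model-a", input_tokens=1200, output_tokens=350))
```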
Agent-Specific (Layer 3)
- Tool call success rate: Per tool, per agent
- Agent task completion: End-to-end success rate for multi-step agent tasks
- Decision traces: Full reasoning chain for debugging and audit
- Loop detection: Agents stuck in retry loops or circular reasoning (see the sketch after this list)
- Guardrail trigger rate: Per guardrail type, per agent
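One way to implement the loop detection above is a simple heuristic over the agent's tool-call history, as sketched below; the thresholds and the (tool, arguments) signature are assumptions to tune per agent.

```python
# Sketch: flag an agent that repeats the same tool call with identical
# arguments, or burns through a step budget without finishing.
# Thresholds are assumptions -- tune per agent and task type.
from collections import Counter

MAX_IDENTICAL_CALLS = 3
MAX_STEPS = 20

def is_looping(tool_calls: list[tuple[str, str]]) -> bool:
    """tool_calls: (tool_name, serialized_args) pairs in execution order."""
    if len(tool_calls) > MAX_STEPS:
        return True
    return any(n > MAX_IDENTICAL_CALLS for n in Counter(tool_calls).values())

# Example: four identical search calls trip the detector
calls = [("search", '{"q": "refund policy"}')] * 4
print(is_looping(calls))  # True
```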
Data & Drift (Layer 4)
- Input drift: Distribution shift in user queries over time
- Output drift: Changes in model response patterns
- Embedding drift: Vector space shifts in RAG retrieval quality (see the sketch after this list)
- Eval regression: Weekly production eval pass rate vs baseline
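As a rough illustration of embedding drift, the sketch below compares the centroid of a baseline window of query embeddings against the current window using cosine distance; the threshold and window sizes are assumptions to calibrate against your own traffic.

```python
# Sketch: embedding drift as cosine distance between window centroids.
# Threshold and window sizes are assumptions; calibrate on real traffic.
import numpy as np

def embedding_drift(baseline: np.ndarray, current: np.ndarray) -> float:
    """baseline, current: (n_queries, dim) arrays of query embeddings."""
    b, c = baseline.mean(axis=0), current.mean(axis=0)
    cosine = float(np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c)))
    return 1.0 - cosine  # 0 = no directional shift; larger = more drift

# Toy usage: random vectors stand in for last week's vs this week's embeddings
rng = np.random.default_rng(0)
baseline = rng.normal(size=(500, 768))
current = rng.normal(loc=0.2, size=(500, 768))
if embedding_drift(baseline, current) > 0.1:  # assumed threshold
    print("Embedding drift detected -- review RAG retrieval quality")
```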
Recommended Stack Architecture
For an enterprise AI platform on GCP with a gateway architecture:
┌──────────────────────────────────────────────────────────────────┐
│ User Request │
│ ┌──────────┐ ┌────────────┐ ┌──────────┐ ┌───────────────┐ │
│ │ API GW │─>│ Agent GW │─>│ Agent │─>│ LLM GW │ │
│ │ (Kong) │ │ (routing) │ │ (ADK) │ │ (Kong AI/ │ │
│ │ │ │ │ │ │ │ LiteLLM) │ │
│ └────┬─────┘ └────┬───────┘ └────┬─────┘ └───────┬───────┘ │
│ │ │ │ │ │
│ └──────────────┴───────────────┴────────────────┘ │
│ │ │
│ OTel spans (W3C traceparent) │
└──────────────────────────────┼───────────────────────────────────┘
│
┌────────v────────┐
│ OTel Collector │
└───┬─────────┬───┘
│ │
┌───────────v──┐ ┌───v───────────┐
│ Cloud Trace │ │ Langfuse │
│ Cloud Monitor│ │ (self-hosted) │
│ Cloud Logging│ │ Evals, prompts│
└──────┬───────┘ └───────┬───────┘
│ │
┌──────v──────────────────v──────┐
│ Looker / Grafana Dashboards │
└────────────────────────────────┘
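A minimal sketch of wiring one hop in this chain (e.g. the agent gateway) to the shared OTel Collector with the OpenTelemetry Python SDK; the collector endpoint and service name are assumptions, and W3C traceparent propagation is the SDK default, so spans from the API gateway, agent, and LLM gateway join the same trace. The Collector's own pipeline then fans data out to Cloud Trace/Monitoring and Langfuse as in the diagram.

```python
# Sketch: export spans from one service to the shared OTel Collector.
# Endpoint and service name are assumptions for this environment.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "agent-gateway"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.gateway")
with tracer.start_as_current_span("route_request") as span:
    span.set_attribute("agent.id", "support-agent")  # illustrative attribute
```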
Alerting Strategy
| Alert | Threshold | Action |
|---|---|---|
| TTFT > 5s (P95) | 3 consecutive minutes | Page on-call, check model provider status |
| Hallucination rate > 10% | Rolling 1-hour window | Trigger eval pipeline, consider model rollback |
| Daily spend > budget cap | 80% of daily budget | Throttle non-critical requests, alert engineering |
| Tool call failure rate > 20% | Rolling 15-minute window | Check tool API health, failover if available |
| Agent loop detected | > 5 iterations without progress | Kill agent task, log for investigation |
| Guardrail trigger spike | > 2x baseline (1-hour window) | Investigate injection attack or quality regression |
| Eval regression detected | > 2% drop in weekly production evals | Investigate model or data drift |
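The budget-cap row above can be implemented as a periodic burn-rate check along the lines of the sketch below; the budget value is an assumption and the alerting/throttling hooks are stubs standing in for real paging and gateway integrations.

```python
# Sketch: 80% daily-budget alert with throttling, per the table above.
# Budget value is assumed; hooks are stubs for real paging/gateway calls.
DAILY_BUDGET_USD = 500.0

def alert_engineering(message: str) -> None:
    print(f"[ALERT] {message}")  # stand-in for Slack/PagerDuty

def throttle_noncritical_requests() -> None:
    print("[ACTION] throttling non-critical request classes")  # stand-in for gateway policy

def check_budget(spend_today_usd: float) -> None:
    burn = spend_today_usd / DAILY_BUDGET_USD
    if burn >= 0.8:
        alert_engineering(f"LLM spend at {burn:.0%} of daily budget cap")
        throttle_noncritical_requests()

check_budget(spend_today_usd=412.50)  # e.g. read from a billing export on a schedule
```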
Production Checklist
- Instrument all LLM calls with OTel GenAI semantic conventions
- Track token usage per request, per user, per agent
- Set cost budgets with automated throttling at 80% threshold
- Log agent traces with full reasoning chains for debugging
- Monitor retrieval quality if using RAG (precision@k, relevance scores)
- Set up drift detection on input distributions (weekly review minimum)
- Dashboard: Real-time cost, latency, error rate, quality metrics
- Alerting: PagerDuty/Slack for latency spikes, cost overruns, quality drops
- Eval regression monitoring: Weekly production eval runs with baseline comparison
- OTel GenAI conventions: Use standardized gen_ai.* attributes for vendor-neutral telemetry (see the sketch below)
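A short sketch of the last checklist item: tagging an LLM-call span with gen_ai.* attributes. The attribute names below follow the OTel GenAI semantic conventions, which are still evolving, so verify against the semconv version you adopt; the tracer is assumed to be configured as in the exporter sketch earlier.

```python
# Sketch: annotate an LLM-call span with OTel GenAI (gen_ai.*) attributes.
# Attribute names follow the GenAI semantic conventions; check the semconv
# version you target, as the spec is still evolving.
from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

with tracer.start_as_current_span("chat gemini-1.5-pro") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "vertex_ai")
    span.set_attribute("gen_ai.request.model", "gemini-1.5-pro")
    # ... call the model here ...
    span.set_attribute("gen_ai.usage.input_tokens", 1200)
    span.set_attribute("gen_ai.usage.output_tokens", 350)
```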