AI Tracing and OpenTelemetry
Standard distributed tracing tells you which services a request touched and how long each took. AI tracing must also capture what model was called, how many tokens were consumed, what the agent was thinking, which tools it invoked, and how much it cost — all within the same trace.
Why Standard Tracing Is Not Enough for AI
A traditional HTTP span captures: method, path, status code, latency. An AI system needs all of that plus:
| Standard Span Attribute | AI-Specific Attribute Needed |
|---|---|
| `http.method`, `http.route` | `gen_ai.operation.name` (chat, embeddings, tool_call) |
| `http.status_code` | `gen_ai.response.finish_reasons` (stop, length, tool_calls) |
| Duration (ms) | Time-to-first-token, inter-token latency, total generation time |
| Request size (bytes) | `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens` |
| Service name | `gen_ai.system` (openai, anthropic, vertex_ai), `gen_ai.request.model` |
| Error message | Hallucination detected, guardrail triggered, tool call failed |
| Cost (not tracked) | Cost per span (tokens × pricing), cost per agent task |
Without these attributes, you cannot answer basic operational questions: “Which agent is the most expensive?”, “Which model version increased latency?”, “Why did the agent call the same tool 5 times?”
OpenTelemetry GenAI Semantic Conventions
The OTel GenAI SIG (Special Interest Group) defines standardized attributes for AI telemetry. As of early 2026, there are two parallel convention tracks:
1. GenAI Client Spans (Stable)
Cover individual LLM API calls — the lowest level of AI tracing. These are production-ready and supported by major vendors (Datadog OTel v1.37+, Grafana Tempo).
Key attributes:
| Attribute | Type | Description |
|---|---|---|
| `gen_ai.system` | string | Provider identifier: openai, anthropic, vertex_ai, azure.ai.inference |
| `gen_ai.operation.name` | string | Operation type: chat, text_completion, embeddings |
| `gen_ai.request.model` | string | Model requested: gemini-2.0-flash, claude-sonnet-4-20250514 |
| `gen_ai.response.model` | string | Model actually used (may differ from requested) |
| `gen_ai.request.temperature` | float | Sampling temperature |
| `gen_ai.request.max_tokens` | int | Max output tokens requested |
| `gen_ai.usage.input_tokens` | int | Prompt tokens consumed |
| `gen_ai.usage.output_tokens` | int | Completion tokens generated |
| `gen_ai.response.finish_reasons` | string[] | Why generation stopped: `["stop"]`, `["tool_calls"]` |
Span naming convention: `{gen_ai.operation.name} {gen_ai.request.model}` — e.g., `chat gemini-2.0-flash`.
Events within spans:
- `gen_ai.content.prompt` — captures the full prompt (opt-in, disabled by default for security)
- `gen_ai.content.completion` — captures the full response (opt-in)
- `gen_ai.tool.message` — captures tool call requests and results
2. GenAI Agent Spans (Experimental)
Cover higher-level agent concepts. Two sub-tracks are under development:
Agent Application Convention (more mature, based on Google’s AI agent white paper):
| Attribute | Type | Description |
|---|---|---|
| `gen_ai.agent.name` | string | Human-readable agent name |
| `gen_ai.agent.id` | string | Unique identifier |
| `gen_ai.agent.description` | string | Free-form description of agent purpose |
| `gen_ai.agent.version` | string | Agent version |
| `gen_ai.conversation.id` | string | Conversation/session identifier |
Agent Framework Convention (in progress, aiming to unify ADK, LangGraph, CrewAI, AutoGen, IBM Bee):
Defines spans for task, action, tool_call, and agent with relationships between them. This track is working toward a standard where any agent framework emits compatible traces, making it possible to compare agent behavior across frameworks.
Provider-Specific Extensions
The OTel GenAI SIG also defines technology-specific conventions for:
- OpenAI — function calling, structured outputs
- Anthropic — thinking blocks, tool use
- Azure AI Inference — deployment-specific attributes
- AWS Bedrock — guardrail IDs, model ARNs
Two Instrumentation Paths
There are two fundamentally different ways to get OTel-compatible AI traces:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Built-in (framework-native) | The agent framework emits OTel spans directly. ADK, LangGraph, and some LLM SDKs have native OTel support. | Zero additional dependencies, deepest span coverage, maintained by framework team | Locked to framework’s span design, may not cover all attributes you need |
| External instrumentation | A library wraps LLM API calls and emits OTel spans. OpenLLMetry, Langtrace, and vendor SDKs (Langfuse, LangSmith) work this way. | Framework-agnostic, add to any codebase, often richer attributes | Additional dependency, may conflict with framework-native tracing, overhead varies |
Built-in Examples
ADK (Google Agent Development Kit):
```python
from google.adk.telemetry import google_cloud

# Enable Cloud Trace export
exporters = google_cloud.get_gcp_exporters(enable_cloud_tracing=True)

# Or via env vars:
# GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true
# OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
```
LangGraph:
```python
from langchain.callbacks.tracers import LangChainTracer

# Or via LangSmith environment variables:
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=...
```
External Instrumentation Examples
OpenLLMetry (Traceloop):
```python
from traceloop.sdk import Traceloop

Traceloop.init()  # Auto-instruments OpenAI, Anthropic, Cohere, etc.
```
Langfuse:
```python
from langfuse.openai import openai  # Drop-in replacement, auto-traces

# Or use the decorator for custom spans:
from langfuse.decorators import observe

@observe()
def my_agent_step(input: str) -> str:
    ...
```
Which Path to Choose
- ADK on GCP — use built-in tracing (Cloud Trace), complement with Langfuse for evals and prompt management
- LangGraph/LangChain — use LangSmith for the tightest integration, or OpenLLMetry for vendor-neutral OTel
- Custom agent framework — use OpenLLMetry or Langfuse SDK for immediate instrumentation without framework coupling
- Multi-framework enterprise — use OpenLLMetry as the common layer, export to a single backend (Langfuse, Jaeger, Grafana Tempo)
Multi-Layer Trace Design
In an enterprise AI platform with gateway architecture, a single user request passes through multiple layers. Each layer should contribute spans to a single distributed trace:
```
User Request (traceparent: 00-{trace_id}-{span_id}-01)
│
├── [API Gateway] ── Span: api_gateway (auth, rate limit)
│     │  W3C traceparent propagated →
│     │
│     ├── [Agent Gateway] ── Span: agent_routing (capability match, version select)
│     │     │  W3C traceparent propagated →
│     │     │
│     │     ├── [Agent Service] ── Span: agent_task "resolve_support_ticket"
│     │     │     │   gen_ai.agent.name = "support-agent-v2"
│     │     │     │
│     │     │     ├── Span: chat gemini-2.0-flash (reasoning step 1)
│     │     │     │     gen_ai.usage.input_tokens = 1200
│     │     │     │     gen_ai.usage.output_tokens = 350
│     │     │     │
│     │     │     ├── Span: tool_call search_knowledge_base
│     │     │     │     tool.name = "search_knowledge_base"
│     │     │     │     tool.status = "success"
│     │     │     │     duration = 450ms
│     │     │     │
│     │     │     ├── [LLM Gateway] ── Span: llm_routing (model select, fallback)
│     │     │     │     │
│     │     │     │     └── Span: chat gemini-2.0-flash (reasoning step 2)
│     │     │     │           gen_ai.usage.input_tokens = 2100
│     │     │     │           gen_ai.usage.output_tokens = 600
│     │     │     │
│     │     │     └── Span: guardrail_check (output validation)
│     │     │           guardrail.result = "pass"
│     │     │
│     │     └── [Output Guardrails] ── Span: output_filter (PII scrub)
│
└── Total: 2.8s, 4250 tokens, $0.012, SUCCESS
```
Key principles:
- W3C traceparent propagation at every boundary — gateways, agent services, tool APIs, LLM providers
- One trace per user request — not per LLM call. The trace should show the complete journey from API gateway to final response
- Semantic attributes at every span — use GenAI conventions on LLM spans, standard HTTP conventions on gateway spans, custom attributes on tool spans
- Cost attribution per span — calculate cost from `input_tokens * input_price + output_tokens * output_price` at each LLM span
Span Design Patterns
Span Types and Their Attributes
| Span Type | Key Attributes | Notes |
|---|---|---|
| Agent Task | `gen_ai.agent.name`, `gen_ai.agent.version`, `gen_ai.conversation.id`, task description | Root span for agent work. All child spans roll up cost/tokens here. |
| LLM Call | `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.*`, `gen_ai.response.finish_reasons` | One span per LLM API call. Include TTFT if streaming. |
| Tool Call | `tool.name`, `tool.status`, input/output (opt-in), duration | Track success rate and latency per tool. |
| Guardrail Check | `guardrail.type` (input/output), `guardrail.result` (pass/block/modify), `guardrail.name` | Every guardrail invocation should be a span for audit. |
| Retrieval (RAG) | `retrieval.source`, `retrieval.num_results`, `retrieval.relevance_score` | Track retrieval quality alongside generation quality. |
Streaming Response Handling
Streaming LLM responses require special span design:
- Start the span when the API call begins
- Record TTFT (time-to-first-token) as a span event or attribute
- End the span only when the full response is received (or the stream closes)
- Token counts are available only after stream completion — set `gen_ai.usage.*` attributes at span close
- Finish reasons are in the final stream chunk — capture them
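The steps above can be sketched without any SDK dependency; here a plain dict stands in for `span.set_attribute`, a list stands in for span events, and the chunk format is invented:

```python
import time

def consume_stream(chunks, span_attrs, span_events):
    """Consume a streaming response, recording TTFT and usage at close.

    `chunks` is a hypothetical iterable of dicts like
    {"text": "...", "tokens": 3, "finish_reason": "stop"}.
    """
    start = time.monotonic()
    saw_first_token = False
    output_tokens = 0
    finish_reasons = []
    for chunk in chunks:
        if not saw_first_token:
            saw_first_token = True
            # TTFT is recorded as an event; the span stays open.
            span_events.append(("first_token", time.monotonic() - start))
        output_tokens += chunk.get("tokens", 0)
        if "finish_reason" in chunk:  # present only in the final chunk
            finish_reasons = [chunk["finish_reason"]]
    # Usage is only known once the stream completes: set it at span close.
    span_attrs["gen_ai.usage.output_tokens"] = output_tokens
    span_attrs["gen_ai.response.finish_reasons"] = finish_reasons

attrs, events = {}, []
consume_stream(
    [{"tokens": 3}, {"tokens": 5}, {"tokens": 2, "finish_reason": "stop"}],
    attrs,
    events,
)
```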
Thinking vs Acting Phases
For agents with explicit reasoning (e.g., Claude’s extended thinking, chain-of-thought), distinguish:
- Thinking spans: Internal reasoning. High token count, no external side effects. May contain sensitive reasoning that should not be logged in production.
- Acting spans: Tool calls, API requests, state mutations. Lower token count but real-world impact. Must be logged for audit.
Cost Attribution
Calculate cost per span at trace ingestion time:
```
span_cost = (input_tokens * model_input_price) + (output_tokens * model_output_price)
```
Maintain a pricing table (updated when model pricing changes) and attach gen_ai.cost as a custom attribute. Roll up costs from child spans to the agent task root span for per-task cost tracking.
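A sketch of the pricing table and rollup, with placeholder per-million-token prices (real values must come from your provider's current price list):

```python
# Placeholder prices in dollars per million tokens; NOT real pricing.
PRICING = {
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM span: tokens times per-token price."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def task_cost(llm_spans) -> float:
    """Roll child LLM span costs up to the agent task root span."""
    return sum(
        span_cost(s["model"], s["input_tokens"], s["output_tokens"])
        for s in llm_spans
    )
```

Attaching the result as a `gen_ai.cost` attribute at ingestion time, rather than at query time, keeps cost queries cheap and makes the value stable even after the pricing table changes.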
Sampling Strategies
At scale, tracing 100% of requests is expensive. Recommended approach:
| Strategy | When to Use |
|---|---|
| 100% tracing | Development, staging, low-traffic agents |
| Head-based sampling (10-50%) | High-traffic production agents where you need representative coverage |
| Tail-based sampling | Keep all traces with errors, high cost, high latency, or guardrail triggers. Sample the rest. |
| Always trace | Agent tasks with financial impact, compliance-sensitive workflows, incidents |
Tail-based sampling is the best fit for AI workloads because the interesting traces (failures, expensive runs, guardrail triggers) are exactly the ones you want to keep at 100%.
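In practice the decision usually lives in a collector (for example the tail-sampling processor in the OpenTelemetry Collector), but the rule itself is simple. A sketch over a completed-trace summary, with invented field names and thresholds:

```python
def keep_trace(summary: dict,
               cost_threshold: float = 0.01,
               latency_ms_threshold: int = 5000) -> bool:
    """Tail-based sampling decision: keep every 'interesting' trace.

    `summary` is a hypothetical per-trace rollup computed after the
    trace completes (error flag, total cost, duration, guardrail hits).
    """
    return (
        summary.get("error", False)
        or summary.get("guardrail_triggered", False)
        or summary.get("cost", 0.0) >= cost_threshold
        or summary.get("duration_ms", 0) >= latency_ms_threshold
    )
```

Everything this function rejects can still be head-sampled at a low rate to preserve representative coverage of normal traffic.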
Agent Tracing Pattern – Full Example
```
Trace: resolve_support_ticket (trace_id: abc123)
│
├── Span: agent_task
│     gen_ai.agent.name: "support-agent"
│     gen_ai.agent.version: "2.1.0"
│     gen_ai.conversation.id: "conv_789"
│     agent.task.description: "Resolve ticket #4521: customer cannot reset password"
│
│   ├── Span: chat gemini-2.0-flash [0ms–180ms]
│   │     gen_ai.system: "vertex_ai"
│   │     gen_ai.request.model: "gemini-2.0-flash"
│   │     gen_ai.usage.input_tokens: 850
│   │     gen_ai.usage.output_tokens: 120
│   │     gen_ai.response.finish_reasons: ["tool_calls"]
│   │     gen_ai.cost: $0.0004
│   │
│   ├── Span: tool_call lookup_customer [180ms–420ms]
│   │     tool.name: "lookup_customer"
│   │     tool.status: "success"
│   │     tool.input: {"email": "user@example.com"} (opt-in)
│   │     tool.output.size_bytes: 2048
│   │
│   ├── Span: tool_call search_knowledge_base [420ms–870ms]
│   │     tool.name: "search_knowledge_base"
│   │     tool.status: "success"
│   │     retrieval.num_results: 3
│   │     retrieval.relevance_score: 0.87
│   │
│   ├── Span: chat gemini-2.0-flash [870ms–1400ms]
│   │     gen_ai.usage.input_tokens: 2100
│   │     gen_ai.usage.output_tokens: 450
│   │     gen_ai.response.finish_reasons: ["tool_calls"]
│   │     gen_ai.cost: $0.0012
│   │
│   ├── Span: tool_call reset_password [1400ms–1800ms]
│   │     tool.name: "reset_password"
│   │     tool.status: "success"
│   │
│   ├── Span: guardrail_check output_validation [1800ms–1850ms]
│   │     guardrail.type: "output"
│   │     guardrail.name: "pii_filter"
│   │     guardrail.result: "pass"
│   │
│   ├── Span: chat gemini-2.0-flash [1850ms–2200ms]
│   │     gen_ai.usage.input_tokens: 2800
│   │     gen_ai.usage.output_tokens: 280
│   │     gen_ai.response.finish_reasons: ["stop"]
│   │     gen_ai.cost: $0.0009
│   │
│   └── Summary:
│         total_duration: 2200ms
│         total_tokens: 6600 (5750 input + 850 output)
│         total_cost: $0.0025
│         tool_calls: 3 (3 success, 0 failure)
│         llm_calls: 3
│         result: SUCCESS
```
This trace gives you everything needed to debug, optimize, and audit the agent’s behavior.
Trace Storage and Querying
Backend Options
| Backend | Strengths | AI-Specific Features |
|---|---|---|
| Cloud Trace (GCP) | Native GCP integration, auto-correlated with Cloud Logging/Monitoring | Built-in ADK span support, Vertex AI integration |
| Jaeger | Mature, self-hosted, OTel-native | None built-in; relies on custom attributes |
| Grafana Tempo | Scalable, cost-effective (object storage backend), integrates with Grafana dashboards | Starting to support GenAI semantic conventions |
| Langfuse | Purpose-built for LLM traces, includes eval and prompt management | Native GenAI support, cost tracking, session replay, eval integration |
| Datadog | Full APM platform with LLM Observability add-on | OTel GenAI convention support (v1.37+), LLM cost tracking |
Useful Trace Queries
These are the queries you will run most often:
| Question | Query Pattern |
|---|---|
| “Which traces cost the most?” | Sort by gen_ai.cost descending, filter last 24h |
| “Which traces are slowest?” | Sort by duration, filter by gen_ai.agent.name |
| “Which tool calls are failing?” | Filter tool.status = "error", group by tool.name |
| “Show me guardrail triggers” | Filter guardrail.result != "pass", group by guardrail.name |
| “Agent stuck in a loop?” | Count LLM call spans per trace, alert if > threshold |
| “What did the agent do for request X?” | Look up by gen_ai.conversation.id or trace_id |
| “Model version regression?” | Compare latency/cost/quality metrics before/after gen_ai.request.model change |
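The loop-detection query, for instance, can be approximated over exported span data; a sketch assuming spans arrive as plain dicts keyed by `trace_id` and `gen_ai.operation.name` (the field layout is illustrative, not a real exporter format):

```python
from collections import Counter

def traces_with_possible_loops(spans, max_llm_calls: int = 5):
    """Flag traces whose LLM-call span count exceeds a threshold."""
    llm_calls = Counter(
        s["trace_id"]
        for s in spans
        if s.get("gen_ai.operation.name") == "chat"
    )
    return sorted(t for t, n in llm_calls.items() if n > max_llm_calls)

# Example: trace "abc" made 7 chat calls, trace "def" made 2.
spans = (
    [{"trace_id": "abc", "gen_ai.operation.name": "chat"}] * 7
    + [{"trace_id": "def", "gen_ai.operation.name": "chat"}] * 2
)
```

In a real backend you would express the same count-and-threshold logic in the backend's query language and attach an alert to it rather than scanning raw spans in application code.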