AI Tracing and OpenTelemetry

Standard distributed tracing tells you which services a request touched and how long each took. AI tracing must also capture what model was called, how many tokens were consumed, what the agent was thinking, which tools it invoked, and how much it cost — all within the same trace.


Why Standard Tracing Is Not Enough for AI

A traditional HTTP span captures: method, path, status code, latency. An AI system needs all of that plus:

| Standard Span Attribute | AI-Specific Attribute Needed |
|---|---|
| http.method, http.route | gen_ai.operation.name (chat, embeddings, tool_call) |
| http.status_code | gen_ai.response.finish_reasons (stop, length, tool_calls) |
| Duration (ms) | Time-to-first-token, inter-token latency, total generation time |
| Request size (bytes) | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens |
| Service name | gen_ai.system (openai, anthropic, vertex_ai), gen_ai.request.model |
| Error message | Hallucination detected, guardrail triggered, tool call failed |
| Cost (not tracked) | Cost per span (tokens × pricing), cost per agent task |

Without these attributes, you cannot answer basic operational questions: “Which agent is the most expensive?”, “Which model version increased latency?”, “Why did the agent call the same tool 5 times?”


OpenTelemetry GenAI Semantic Conventions

The OTel GenAI SIG (Special Interest Group) defines standardized attributes for AI telemetry. As of early 2026, there are two parallel convention tracks:

1. GenAI Client Spans (Stable)

Cover individual LLM API calls — the lowest level of AI tracing. These are production-ready and supported by major vendors (Datadog with OTel semantic conventions v1.37+, Grafana Tempo).

Key attributes:

| Attribute | Type | Description |
|---|---|---|
| gen_ai.system | string | Provider identifier: openai, anthropic, vertex_ai, azure.ai.inference |
| gen_ai.operation.name | string | Operation type: chat, text_completion, embeddings |
| gen_ai.request.model | string | Model requested: gemini-2.0-flash, claude-sonnet-4-20250514 |
| gen_ai.response.model | string | Model actually used (may differ from requested) |
| gen_ai.request.temperature | float | Sampling temperature |
| gen_ai.request.max_tokens | int | Max output tokens requested |
| gen_ai.usage.input_tokens | int | Prompt tokens consumed |
| gen_ai.usage.output_tokens | int | Completion tokens generated |
| gen_ai.response.finish_reasons | string[] | Why generation stopped: ["stop"], ["tool_calls"] |

Span naming convention: {gen_ai.operation.name} {gen_ai.request.model} — e.g., chat gemini-2.0-flash.

Events within spans:

  • gen_ai.content.prompt — captures the full prompt (opt-in, disabled by default for security)
  • gen_ai.content.completion — captures the full response (opt-in)
  • gen_ai.tool.message — captures tool call requests and results

2. GenAI Agent Spans (Experimental)

Cover higher-level agent concepts. Two sub-tracks are under development:

Agent Application Convention (more mature, based on Google’s AI agent white paper):

| Attribute | Type | Description |
|---|---|---|
| gen_ai.agent.name | string | Human-readable agent name |
| gen_ai.agent.id | string | Unique identifier |
| gen_ai.agent.description | string | Free-form description of agent purpose |
| gen_ai.agent.version | string | Agent version |
| gen_ai.conversation.id | string | Conversation/session identifier |

Agent Framework Convention (in progress, aiming to unify ADK, LangGraph, CrewAI, AutoGen, IBM Bee):

Defines spans for task, action, tool_call, and agent with relationships between them. This track is working toward a standard where any agent framework emits compatible traces, making it possible to compare agent behavior across frameworks.

Provider-Specific Extensions

The OTel GenAI SIG also defines technology-specific conventions for:

  • OpenAI — function calling, structured outputs
  • Anthropic — thinking blocks, tool use
  • Azure AI Inference — deployment-specific attributes
  • AWS Bedrock — guardrail IDs, model ARNs

Two Instrumentation Paths

There are two fundamentally different ways to get OTel-compatible AI traces:

| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Built-in (framework-native) | The agent framework emits OTel spans directly. ADK, LangGraph, and some LLM SDKs have native OTel support. | Zero additional dependencies, deepest span coverage, maintained by framework team | Locked to framework’s span design, may not cover all attributes you need |
| External instrumentation | A library wraps LLM API calls and emits OTel spans. OpenLLMetry, Langtrace, and vendor SDKs (Langfuse, LangSmith) work this way. | Framework-agnostic, add to any codebase, often richer attributes | Additional dependency, may conflict with framework-native tracing, overhead varies |

Built-in Examples

ADK (Google Agent Development Kit):

from google.adk.telemetry import google_cloud

# Enable Cloud Trace export
exporters = google_cloud.get_gcp_exporters(enable_cloud_tracing=True)
# Or via env vars:
# GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true
# OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true

LangGraph:

from langchain.callbacks.tracers import LangChainTracer
# Or via LangSmith environment variables
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=...

External Instrumentation Examples

OpenLLMetry (Traceloop):

from traceloop.sdk import Traceloop
Traceloop.init()  # Auto-instruments OpenAI, Anthropic, Cohere, etc.

Langfuse:

from langfuse.openai import openai  # Drop-in replacement, auto-traces
# Or use the decorator for custom spans:
from langfuse.decorators import observe

@observe()
def my_agent_step(input: str) -> str:
    ...

Which Path to Choose

  • ADK on GCP — use built-in tracing (Cloud Trace), complement with Langfuse for evals and prompt management
  • LangGraph/LangChain — use LangSmith for the tightest integration, or OpenLLMetry for vendor-neutral OTel
  • Custom agent framework — use OpenLLMetry or Langfuse SDK for immediate instrumentation without framework coupling
  • Multi-framework enterprise — use OpenLLMetry as the common layer, export to a single backend (Langfuse, Jaeger, Grafana Tempo)

Multi-Layer Trace Design

In an enterprise AI platform with gateway architecture, a single user request passes through multiple layers. Each layer should contribute spans to a single distributed trace:

User Request (traceparent: 00-{trace_id}-{span_id}-01)
│
├── [API Gateway] ── Span: api_gateway (auth, rate limit)
│   │                W3C traceparent propagated →
│   │
│   ├── [Agent Gateway] ── Span: agent_routing (capability match, version select)
│   │   │                  W3C traceparent propagated →
│   │   │
│   │   ├── [Agent Service] ── Span: agent_task "resolve_support_ticket"
│   │   │   │                  gen_ai.agent.name = "support-agent-v2"
│   │   │   │
│   │   │   ├── Span: chat gemini-2.0-flash (reasoning step 1)
│   │   │   │   gen_ai.usage.input_tokens = 1200
│   │   │   │   gen_ai.usage.output_tokens = 350
│   │   │   │
│   │   │   ├── Span: tool_call search_knowledge_base
│   │   │   │   tool.name = "search_knowledge_base"
│   │   │   │   tool.status = "success"
│   │   │   │   duration = 450ms
│   │   │   │
│   │   │   ├── [LLM Gateway] ── Span: llm_routing (model select, fallback)
│   │   │   │   │
│   │   │   │   └── Span: chat gemini-2.0-flash (reasoning step 2)
│   │   │   │       gen_ai.usage.input_tokens = 2100
│   │   │   │       gen_ai.usage.output_tokens = 600
│   │   │   │
│   │   │   └── Span: guardrail_check (output validation)
│   │   │       guardrail.result = "pass"
│   │   │
│   │   └── [Output Guardrails] ── Span: output_filter (PII scrub)
│
└── Total: 2.8s, 4250 tokens, $0.012, SUCCESS

Key principles:

  1. W3C traceparent propagation at every boundary — gateways, agent services, tool APIs, LLM providers
  2. One trace per user request — not per LLM call. The trace should show the complete journey from API gateway to final response
  3. Semantic attributes at every span — use GenAI conventions on LLM spans, standard HTTP conventions on gateway spans, custom attributes on tool spans
  4. Cost attribution per span — calculate cost from input_tokens * input_price + output_tokens * output_price at each LLM span

Span Design Patterns

Span Types and Their Attributes

| Span Type | Key Attributes | Notes |
|---|---|---|
| Agent Task | gen_ai.agent.name, gen_ai.agent.version, gen_ai.conversation.id, task description | Root span for agent work. All child spans roll up cost/tokens here. |
| LLM Call | gen_ai.system, gen_ai.request.model, gen_ai.usage.*, gen_ai.response.finish_reasons | One span per LLM API call. Include TTFT if streaming. |
| Tool Call | tool.name, tool.status, input/output (opt-in), duration | Track success rate and latency per tool. |
| Guardrail Check | guardrail.type (input/output), guardrail.result (pass/block/modify), guardrail.name | Every guardrail invocation should be a span for audit. |
| Retrieval (RAG) | retrieval.source, retrieval.num_results, retrieval.relevance_score | Track retrieval quality alongside generation quality. |
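A tool-call span of this shape can be captured with a small decorator. This is a framework-agnostic sketch: SPANS stands in for a real span exporter, and the attribute names follow the table above:

```python
import functools
import time

SPANS: list[dict] = []  # stand-in for a span exporter

def tool_span(name: str):
    """Decorator sketch: record tool.* attributes for every invocation."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attrs = {"tool.name": name}
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                attrs["tool.status"] = "success"
                return result
            except Exception:
                attrs["tool.status"] = "error"
                raise
            finally:
                # Emitted whether the tool succeeded or raised.
                attrs["duration_ms"] = (time.monotonic() - start) * 1000
                SPANS.append(attrs)
        return wrapper
    return decorator

@tool_span("search_knowledge_base")
def search_knowledge_base(query: str) -> list[str]:
    return [f"doc about {query}"]  # hypothetical tool body
```

Recording the error path in the same place as the success path is what makes the "which tool calls are failing?" query possible later.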

Streaming Response Handling

Streaming LLM responses require special span design:

  1. Start the span when the API call begins
  2. Record TTFT (time-to-first-token) as a span event or attribute
  3. End the span only when the full response is received (or the stream closes)
  4. Token counts are available only after stream completion — set gen_ai.usage.* attributes at span close
  5. Finish reasons are in the final stream chunk — capture them
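The steps above can be sketched as a generator wrapper; the attrs dict and the ttft_ms key stand in for real span attributes and events:

```python
import time
from typing import Iterable, Iterator

def traced_stream(chunks: Iterable[str], attrs: dict) -> Iterator[str]:
    """Wrap a token stream; record TTFT at the first chunk, usage at close."""
    start = time.monotonic()
    first = True
    count = 0
    for chunk in chunks:
        if first:
            # TTFT is known as soon as the first chunk arrives.
            attrs["ttft_ms"] = (time.monotonic() - start) * 1000
            first = False
        count += 1
        yield chunk
    # Usage and finish reason are only known once the stream closes.
    attrs["gen_ai.usage.output_tokens"] = count
    attrs["gen_ai.response.finish_reasons"] = ["stop"]
```

Here each chunk is counted as one token for simplicity; a real integration would read the usage block from the provider's final stream chunk instead.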

Thinking vs Acting Phases

For agents with explicit reasoning (e.g., Claude’s extended thinking, chain-of-thought), distinguish:

  • Thinking spans: Internal reasoning. High token count, no external side effects. May contain sensitive reasoning that should not be logged in production.
  • Acting spans: Tool calls, API requests, state mutations. Lower token count but real-world impact. Must be logged for audit.

Cost Attribution

Calculate cost per span at trace ingestion time:

span_cost = (input_tokens * model_input_price) + (output_tokens * model_output_price)

Maintain a pricing table (updated when model pricing changes) and attach gen_ai.cost as a custom attribute. Roll up costs from child spans to the agent task root span for per-task cost tracking.
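A minimal sketch, with made-up prices (real prices belong in a maintained pricing table):

```python
# Illustrative prices per million tokens -- NOT real pricing.
PRICING = {
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM span, in dollars."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def task_cost(llm_spans: list[dict]) -> float:
    """Roll child LLM span costs up to the agent task root span."""
    return sum(span_cost(s["model"], s["input_tokens"], s["output_tokens"])
               for s in llm_spans)
```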

Sampling Strategies

At scale, tracing 100% of requests is expensive. Recommended approach:

| Strategy | When to Use |
|---|---|
| 100% tracing | Development, staging, low-traffic agents |
| Head-based sampling (10-50%) | High-traffic production agents where you need representative coverage |
| Tail-based sampling | Keep all traces with errors, high cost, high latency, or guardrail triggers. Sample the rest. |
| Always trace | Agent tasks with financial impact, compliance-sensitive workflows, incidents |

Tail-based sampling is the best fit for AI workloads because the interesting traces (failures, expensive runs, guardrail triggers) are exactly the ones you want to keep at 100%.
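Head-based sampling can be made deterministic by comparing the trace ID against a bound, the same idea behind OTel's TraceIdRatioBased sampler. A pure-Python sketch (tail-based sampling, by contrast, typically runs in the collector after spans arrive):

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-based sampling on the trace ID.

    Every service that computes this on the same 128-bit trace ID
    reaches the same keep/drop decision without coordination.
    """
    bound = int(ratio * (1 << 64))
    # Use the low 64 bits of the trace ID as a uniform value.
    return int(trace_id_hex[-16:], 16) < bound
```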


Agent Tracing Pattern – Full Example

Trace: resolve_support_ticket (trace_id: abc123)
│
├── Span: agent_task
│   gen_ai.agent.name: "support-agent"
│   gen_ai.agent.version: "2.1.0"
│   gen_ai.conversation.id: "conv_789"
│   agent.task.description: "Resolve ticket #4521: customer cannot reset password"
│
│   ├── Span: chat gemini-2.0-flash                         [0ms–180ms]
│   │   gen_ai.system: "vertex_ai"
│   │   gen_ai.request.model: "gemini-2.0-flash"
│   │   gen_ai.usage.input_tokens: 850
│   │   gen_ai.usage.output_tokens: 120
│   │   gen_ai.response.finish_reasons: ["tool_calls"]
│   │   gen_ai.cost: $0.0004
│   │
│   ├── Span: tool_call lookup_customer                      [180ms–420ms]
│   │   tool.name: "lookup_customer"
│   │   tool.status: "success"
│   │   tool.input: {"email": "user@example.com"}  (opt-in)
│   │   tool.output.size_bytes: 2048
│   │
│   ├── Span: tool_call search_knowledge_base                [420ms–870ms]
│   │   tool.name: "search_knowledge_base"
│   │   tool.status: "success"
│   │   retrieval.num_results: 3
│   │   retrieval.relevance_score: 0.87
│   │
│   ├── Span: chat gemini-2.0-flash                         [870ms–1400ms]
│   │   gen_ai.usage.input_tokens: 2100
│   │   gen_ai.usage.output_tokens: 450
│   │   gen_ai.response.finish_reasons: ["tool_calls"]
│   │   gen_ai.cost: $0.0012
│   │
│   ├── Span: tool_call reset_password                       [1400ms–1800ms]
│   │   tool.name: "reset_password"
│   │   tool.status: "success"
│   │
│   ├── Span: guardrail_check output_validation              [1800ms–1850ms]
│   │   guardrail.type: "output"
│   │   guardrail.name: "pii_filter"
│   │   guardrail.result: "pass"
│   │
│   ├── Span: chat gemini-2.0-flash                         [1850ms–2200ms]
│   │   gen_ai.usage.input_tokens: 2800
│   │   gen_ai.usage.output_tokens: 280
│   │   gen_ai.response.finish_reasons: ["stop"]
│   │   gen_ai.cost: $0.0009
│   │
│   └── Summary:
│       total_duration: 2200ms
│       total_tokens: 6600 (5750 input + 850 output)
│       total_cost: $0.0025
│       tool_calls: 3 (3 success, 0 failure)
│       llm_calls: 3
│       result: SUCCESS

This trace gives you everything needed to debug, optimize, and audit the agent’s behavior.


Trace Storage and Querying

Backend Options

| Backend | Strengths | AI-Specific Features |
|---|---|---|
| Cloud Trace (GCP) | Native GCP integration, auto-correlated with Cloud Logging/Monitoring | Built-in ADK span support, Vertex AI integration |
| Jaeger | Mature, self-hosted, OTel-native | None built-in; relies on custom attributes |
| Grafana Tempo | Scalable, cost-effective (object storage backend), integrates with Grafana dashboards | Starting to support GenAI semantic conventions |
| Langfuse | Purpose-built for LLM traces, includes eval and prompt management | Native GenAI support, cost tracking, session replay, eval integration |
| Datadog | Full APM platform with LLM Observability add-on | OTel GenAI convention support (v1.37+), LLM cost tracking |

Useful Trace Queries

These are the queries you will run most often:

| Question | Query Pattern |
|---|---|
| “Which traces cost the most?” | Sort by gen_ai.cost descending, filter last 24h |
| “Which traces are slowest?” | Sort by duration, filter by gen_ai.agent.name |
| “Which tool calls are failing?” | Filter tool.status = "error", group by tool.name |
| “Show me guardrail triggers” | Filter guardrail.result != "pass", group by guardrail.name |
| “Agent stuck in a loop?” | Count LLM call spans per trace, alert if > threshold |
| “What did the agent do for request X?” | Look up by gen_ai.conversation.id or trace_id |
| “Model version regression?” | Compare latency/cost/quality metrics before/after gen_ai.request.model change |
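The loop-detection query can also run offline over exported spans. A sketch, assuming spans are dicts with trace_id and name fields and that LLM spans follow the "{operation} {model}" naming convention:

```python
from collections import Counter

def flag_looping_traces(spans: list[dict], threshold: int = 10) -> list[str]:
    """Flag traces whose LLM-call span count exceeds a threshold,
    a cheap offline version of the 'agent stuck in a loop?' query."""
    llm_calls = Counter(
        s["trace_id"] for s in spans
        if s["name"].startswith(("chat ", "text_completion "))
    )
    return [trace_id for trace_id, n in llm_calls.items() if n > threshold]
```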


This post is licensed under CC BY 4.0 by the author.