AI Tracing and OpenTelemetry

Standard distributed tracing tells you which services a request touched and how long each took. AI tracing must also capture what model was called, how many tokens were consumed, what the agent was thinking, which tools it invoked, and how much it cost — all within the same trace.


Why Standard Tracing Is Not Enough for AI

A traditional HTTP span captures: method, path, status code, latency. An AI system needs all of that plus:

| Standard Span Attribute | AI-Specific Attribute Needed |
|---|---|
| http.method, http.route | gen_ai.operation.name (chat, embeddings, tool_call) |
| http.status_code | gen_ai.response.finish_reasons (stop, length, tool_calls) |
| Duration (ms) | Time-to-first-token, inter-token latency, total generation time |
| Request size (bytes) | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens |
| Service name | gen_ai.system (openai, anthropic, vertex_ai), gen_ai.request.model |
| Error message | Hallucination detected, guardrail triggered, tool call failed |
| Cost (not tracked) | Cost per span (tokens × pricing), cost per agent task |

Without these attributes, you cannot answer basic operational questions: “Which agent is the most expensive?”, “Which model version increased latency?”, “Why did the agent call the same tool 5 times?”


OpenTelemetry GenAI Semantic Conventions

The OTel GenAI SIG (Special Interest Group) defines standardized attributes for AI telemetry. As of early 2026, there are two parallel convention tracks:

1. GenAI Client Spans (Stable)

Cover individual LLM API calls — the lowest level of AI tracing. These are production-ready and supported by major vendors (Datadog with OTel semantic conventions v1.37+, Grafana Tempo).

Key attributes:

| Attribute | Type | Description |
|---|---|---|
| gen_ai.system | string | Provider identifier: openai, anthropic, vertex_ai, azure.ai.inference |
| gen_ai.operation.name | string | Operation type: chat, text_completion, embeddings |
| gen_ai.request.model | string | Model requested: gemini-2.0-flash, claude-sonnet-4-20250514 |
| gen_ai.response.model | string | Model actually used (may differ from requested) |
| gen_ai.request.temperature | float | Sampling temperature |
| gen_ai.request.max_tokens | int | Max output tokens requested |
| gen_ai.usage.input_tokens | int | Prompt tokens consumed |
| gen_ai.usage.output_tokens | int | Completion tokens generated |
| gen_ai.response.finish_reasons | string[] | Why generation stopped: ["stop"], ["tool_calls"] |

Span naming convention: {gen_ai.operation.name} {gen_ai.request.model} — e.g., chat gemini-2.0-flash.

Events within spans:

  • gen_ai.content.prompt — captures the full prompt (opt-in, disabled by default for security)
  • gen_ai.content.completion — captures the full response (opt-in)
  • gen_ai.tool.message — captures tool call requests and results

2. GenAI Agent Spans (Experimental)

Cover higher-level agent concepts. Two sub-tracks are under development:

Agent Application Convention (more mature, based on Google’s AI agent white paper):

| Attribute | Type | Description |
|---|---|---|
| gen_ai.agent.name | string | Human-readable agent name |
| gen_ai.agent.id | string | Unique identifier |
| gen_ai.agent.description | string | Free-form description of agent purpose |
| gen_ai.agent.version | string | Agent version |
| gen_ai.conversation.id | string | Conversation/session identifier |

Agent Framework Convention (in progress, aiming to unify ADK, LangGraph, CrewAI, AutoGen, IBM Bee):

Defines spans for task, action, tool_call, and agent with relationships between them. This track is working toward a standard where any agent framework emits compatible traces, making it possible to compare agent behavior across frameworks.

Provider-Specific Extensions

The OTel GenAI SIG also defines technology-specific conventions for:

  • OpenAI — function calling, structured outputs
  • Anthropic — thinking blocks, tool use
  • Azure AI Inference — deployment-specific attributes
  • AWS Bedrock — guardrail IDs, model ARNs

Two Instrumentation Paths

There are two fundamentally different ways to get OTel-compatible AI traces:

| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Built-in (framework-native) | The agent framework emits OTel spans directly. ADK, LangGraph, and some LLM SDKs have native OTel support. | Zero additional dependencies, deepest span coverage, maintained by framework team | Locked to framework’s span design, may not cover all attributes you need |
| External instrumentation | A library wraps LLM API calls and emits OTel spans. OpenLLMetry, Langtrace, and vendor SDKs (Langfuse, LangSmith) work this way. | Framework-agnostic, add to any codebase, often richer attributes | Additional dependency, may conflict with framework-native tracing, overhead varies |

Built-in Examples

ADK (Google Agent Development Kit):

from google.adk.telemetry import google_cloud

# Enable Cloud Trace export
exporters = google_cloud.get_gcp_exporters(enable_cloud_tracing=True)
# Or via env vars:
# GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true
# OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true

LangGraph:

from langchain.callbacks.tracers import LangChainTracer
# Or via LangSmith environment variables
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=...

External Instrumentation Examples

OpenLLMetry (Traceloop):

from traceloop.sdk import Traceloop
Traceloop.init()  # Auto-instruments OpenAI, Anthropic, Cohere, etc.

Langfuse:

from langfuse.openai import openai  # Drop-in replacement, auto-traces
# Or use the decorator for custom spans:
from langfuse.decorators import observe

@observe()
def my_agent_step(input: str) -> str:
    ...

Which Path to Choose

  • ADK on GCP — use built-in tracing (Cloud Trace), complement with Langfuse for evals and prompt management
  • LangGraph/LangChain — use LangSmith for the tightest integration, or OpenLLMetry for vendor-neutral OTel
  • Custom agent framework — use OpenLLMetry or Langfuse SDK for immediate instrumentation without framework coupling
  • Multi-framework enterprise — use OpenLLMetry as the common layer, export to a single backend (Langfuse, Jaeger, Grafana Tempo)

Multi-Layer Trace Design

In an enterprise AI platform with gateway architecture, a single user request passes through multiple layers. Each layer should contribute spans to a single distributed trace:

User Request (traceparent: 00-{trace_id}-{span_id}-01)
│
├── [API Gateway] ── Span: api_gateway (auth, rate limit)
│   │                W3C traceparent propagated →
│   │
│   ├── [Agent Gateway] ── Span: agent_routing (capability match, version select)
│   │   │                  W3C traceparent propagated →
│   │   │
│   │   ├── [Agent Service] ── Span: agent_task "resolve_support_ticket"
│   │   │   │                  gen_ai.agent.name = "support-agent-v2"
│   │   │   │
│   │   │   ├── Span: chat gemini-2.0-flash (reasoning step 1)
│   │   │   │   gen_ai.usage.input_tokens = 1200
│   │   │   │   gen_ai.usage.output_tokens = 350
│   │   │   │
│   │   │   ├── Span: tool_call search_knowledge_base
│   │   │   │   tool.name = "search_knowledge_base"
│   │   │   │   tool.status = "success"
│   │   │   │   duration = 450ms
│   │   │   │
│   │   │   ├── [LLM Gateway] ── Span: llm_routing (model select, fallback)
│   │   │   │   │
│   │   │   │   └── Span: chat gemini-2.0-flash (reasoning step 2)
│   │   │   │       gen_ai.usage.input_tokens = 2100
│   │   │   │       gen_ai.usage.output_tokens = 600
│   │   │   │
│   │   │   └── Span: guardrail_check (output validation)
│   │   │       guardrail.result = "pass"
│   │   │
│   │   └── [Output Guardrails] ── Span: output_filter (PII scrub)
│
└── Total: 2.8s, 4250 tokens, $0.012, SUCCESS

Key principles:

  1. W3C traceparent propagation at every boundary — gateways, agent services, tool APIs, LLM providers
  2. One trace per user request — not per LLM call. The trace should show the complete journey from API gateway to final response
  3. Semantic attributes at every span — use GenAI conventions on LLM spans, standard HTTP conventions on gateway spans, custom attributes on tool spans
  4. Cost attribution per span — calculate cost from input_tokens * input_price + output_tokens * output_price at each LLM span

Span Design Patterns

Span Types and Their Attributes

| Span Type | Key Attributes | Notes |
|---|---|---|
| Agent Task | gen_ai.agent.name, gen_ai.agent.version, gen_ai.conversation.id, task description | Root span for agent work. All child spans roll up cost/tokens here. |
| LLM Call | gen_ai.system, gen_ai.request.model, gen_ai.usage.*, gen_ai.response.finish_reasons | One span per LLM API call. Include TTFT if streaming. |
| Tool Call | tool.name, tool.status, input/output (opt-in), duration | Track success rate and latency per tool. |
| Guardrail Check | guardrail.type (input/output), guardrail.result (pass/block/modify), guardrail.name | Every guardrail invocation should be a span for audit. |
| Retrieval (RAG) | retrieval.source, retrieval.num_results, retrieval.relevance_score | Track retrieval quality alongside generation quality. |
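A tool-call span of this shape can be captured with a small decorator. This is a framework-agnostic sketch: SPANS stands in for a real span exporter, and the attribute names follow the table above:

```python
import functools
import time

SPANS: list[dict] = []  # stand-in for a span exporter

def tool_span(name: str):
    """Decorator sketch: record tool.* attributes for every invocation."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attrs = {"tool.name": name}
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                attrs["tool.status"] = "success"
                return result
            except Exception:
                attrs["tool.status"] = "error"
                raise
            finally:
                # Emitted whether the tool succeeded or raised.
                attrs["duration_ms"] = (time.monotonic() - start) * 1000
                SPANS.append(attrs)
        return wrapper
    return decorator

@tool_span("search_knowledge_base")
def search_knowledge_base(query: str) -> list[str]:
    return [f"doc about {query}"]  # hypothetical tool body
```

Recording the error path in the same place as the success path is what makes the "which tool calls are failing?" query possible later.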

Streaming Response Handling

Streaming LLM responses require special span design:

  1. Start the span when the API call begins
  2. Record TTFT (time-to-first-token) as a span event or attribute
  3. End the span only when the full response is received (or the stream closes)
  4. Token counts are available only after stream completion — set gen_ai.usage.* attributes at span close
  5. Finish reasons are in the final stream chunk — capture them
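The steps above can be sketched as a generator wrapper; the attrs dict and the ttft_ms key stand in for real span attributes and events:

```python
import time
from typing import Iterable, Iterator

def traced_stream(chunks: Iterable[str], attrs: dict) -> Iterator[str]:
    """Wrap a token stream; record TTFT at the first chunk, usage at close."""
    start = time.monotonic()
    first = True
    count = 0
    for chunk in chunks:
        if first:
            # TTFT is known as soon as the first chunk arrives.
            attrs["ttft_ms"] = (time.monotonic() - start) * 1000
            first = False
        count += 1
        yield chunk
    # Usage and finish reason are only known once the stream closes.
    attrs["gen_ai.usage.output_tokens"] = count
    attrs["gen_ai.response.finish_reasons"] = ["stop"]
```

Here each chunk is counted as one token for simplicity; a real integration would read the usage block from the provider's final stream chunk instead.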

Thinking vs Acting Phases

For agents with explicit reasoning (e.g., Claude’s extended thinking, chain-of-thought), distinguish:

  • Thinking spans: Internal reasoning. High token count, no external side effects. May contain sensitive reasoning that should not be logged in production.
  • Acting spans: Tool calls, API requests, state mutations. Lower token count but real-world impact. Must be logged for audit.

Cost Attribution

Calculate cost per span at trace ingestion time:

span_cost = (input_tokens * model_input_price) + (output_tokens * model_output_price)

Maintain a pricing table (updated when model pricing changes) and attach gen_ai.cost as a custom attribute. Roll up costs from child spans to the agent task root span for per-task cost tracking.
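A minimal sketch, with made-up prices (real prices belong in a maintained pricing table):

```python
# Illustrative prices per million tokens -- NOT real pricing.
PRICING = {
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM span, in dollars."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def task_cost(llm_spans: list[dict]) -> float:
    """Roll child LLM span costs up to the agent task root span."""
    return sum(span_cost(s["model"], s["input_tokens"], s["output_tokens"])
               for s in llm_spans)
```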

Sampling Strategies

At scale, tracing 100% of requests is expensive. Recommended approach:

| Strategy | When to Use |
|---|---|
| 100% tracing | Development, staging, low-traffic agents |
| Head-based sampling (10-50%) | High-traffic production agents where you need representative coverage |
| Tail-based sampling | Keep all traces with errors, high cost, high latency, or guardrail triggers. Sample the rest. |
| Always trace | Agent tasks with financial impact, compliance-sensitive workflows, incidents |

Tail-based sampling is the best fit for AI workloads because the interesting traces (failures, expensive runs, guardrail triggers) are exactly the ones you want to keep at 100%.
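Head-based sampling can be made deterministic by comparing the trace ID against a bound, the same idea behind OTel's TraceIdRatioBased sampler. A pure-Python sketch (tail-based sampling, by contrast, typically runs in the collector after spans arrive):

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-based sampling on the trace ID.

    Every service that computes this on the same 128-bit trace ID
    reaches the same keep/drop decision without coordination.
    """
    bound = int(ratio * (1 << 64))
    # Use the low 64 bits of the trace ID as a uniform value.
    return int(trace_id_hex[-16:], 16) < bound
```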


Agent Tracing Pattern – Full Example

Trace: resolve_support_ticket (trace_id: abc123)
│
├── Span: agent_task
│   gen_ai.agent.name: "support-agent"
│   gen_ai.agent.version: "2.1.0"
│   gen_ai.conversation.id: "conv_789"
│   agent.task.description: "Resolve ticket #4521: customer cannot reset password"
│
│   ├── Span: chat gemini-2.0-flash                         [0ms–180ms]
│   │   gen_ai.system: "vertex_ai"
│   │   gen_ai.request.model: "gemini-2.0-flash"
│   │   gen_ai.usage.input_tokens: 850
│   │   gen_ai.usage.output_tokens: 120
│   │   gen_ai.response.finish_reasons: ["tool_calls"]
│   │   gen_ai.cost: $0.0004
│   │
│   ├── Span: tool_call lookup_customer                      [180ms–420ms]
│   │   tool.name: "lookup_customer"
│   │   tool.status: "success"
│   │   tool.input: {"email": "user@example.com"}  (opt-in)
│   │   tool.output.size_bytes: 2048
│   │
│   ├── Span: tool_call search_knowledge_base                [420ms–870ms]
│   │   tool.name: "search_knowledge_base"
│   │   tool.status: "success"
│   │   retrieval.num_results: 3
│   │   retrieval.relevance_score: 0.87
│   │
│   ├── Span: chat gemini-2.0-flash                         [870ms–1400ms]
│   │   gen_ai.usage.input_tokens: 2100
│   │   gen_ai.usage.output_tokens: 450
│   │   gen_ai.response.finish_reasons: ["tool_calls"]
│   │   gen_ai.cost: $0.0012
│   │
│   ├── Span: tool_call reset_password                       [1400ms–1800ms]
│   │   tool.name: "reset_password"
│   │   tool.status: "success"
│   │
│   ├── Span: guardrail_check output_validation              [1800ms–1850ms]
│   │   guardrail.type: "output"
│   │   guardrail.name: "pii_filter"
│   │   guardrail.result: "pass"
│   │
│   ├── Span: chat gemini-2.0-flash                         [1850ms–2200ms]
│   │   gen_ai.usage.input_tokens: 2800
│   │   gen_ai.usage.output_tokens: 280
│   │   gen_ai.response.finish_reasons: ["stop"]
│   │   gen_ai.cost: $0.0009
│   │
│   └── Summary:
│       total_duration: 2200ms
│       total_tokens: 6600 (5750 input + 850 output)
│       total_cost: $0.0025
│       tool_calls: 3 (3 success, 0 failure)
│       llm_calls: 3
│       result: SUCCESS

This trace gives you everything needed to debug, optimize, and audit the agent’s behavior.


Trace Storage and Querying

Backend Options

| Backend | Strengths | AI-Specific Features |
|---|---|---|
| Cloud Trace (GCP) | Native GCP integration, auto-correlated with Cloud Logging/Monitoring | Built-in ADK span support, Vertex AI integration |
| Jaeger | Mature, self-hosted, OTel-native | None built-in; relies on custom attributes |
| Grafana Tempo | Scalable, cost-effective (object storage backend), integrates with Grafana dashboards | Starting to support GenAI semantic conventions |
| Langfuse | Purpose-built for LLM traces, includes eval and prompt management | Native GenAI support, cost tracking, session replay, eval integration |
| Datadog | Full APM platform with LLM Observability add-on | OTel GenAI convention support (v1.37+), LLM cost tracking |

Useful Trace Queries

These are the queries you will run most often:

| Question | Query Pattern |
|---|---|
| “Which traces cost the most?” | Sort by gen_ai.cost descending, filter last 24h |
| “Which traces are slowest?” | Sort by duration, filter by gen_ai.agent.name |
| “Which tool calls are failing?” | Filter tool.status = "error", group by tool.name |
| “Show me guardrail triggers” | Filter guardrail.result != "pass", group by guardrail.name |
| “Agent stuck in a loop?” | Count LLM call spans per trace, alert if > threshold |
| “What did the agent do for request X?” | Look up by gen_ai.conversation.id or trace_id |
| “Model version regression?” | Compare latency/cost/quality metrics before/after gen_ai.request.model change |
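The loop-detection query can also run offline over exported spans. A sketch, assuming spans are dicts with trace_id and name fields and that LLM spans follow the "{operation} {model}" naming convention:

```python
from collections import Counter

def flag_looping_traces(spans: list[dict], threshold: int = 10) -> list[str]:
    """Flag traces whose LLM-call span count exceeds a threshold,
    a cheap offline version of the 'agent stuck in a loop?' query."""
    llm_calls = Counter(
        s["trace_id"] for s in spans
        if s["name"].startswith(("chat ", "text_completion "))
    )
    return [trace_id for trace_id, n in llm_calls.items() if n > threshold]
```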


This post is licensed under CC BY 4.0 by the author.