AgentOps -- The Discipline

AgentOps is to AI agents what DevOps is to services — the operational discipline for deploying, monitoring, governing, and continuously improving autonomous AI agents in production. Where MLOps stops at the model boundary, AgentOps extends to the full execution surface: reasoning traces, tool invocations, cost attribution, session management, and behavioral guardrails.


What Is AgentOps?

An agent is not a model. It is a system that uses models to take actions in the world — querying databases, calling APIs, spawning sub-agents, writing files, triggering workflows. Each of these actions can fail, cost money, leak data, or produce incorrect results. Managing this complexity requires a distinct operational discipline.

AgentOps encompasses the practices, tooling, and governance controls required to:

  1. Instrument agent behavior at every layer (LLM calls, tool invocations, reasoning steps, session state)
  2. Monitor production agents in real time (latency, cost, quality, safety)
  3. Debug non-deterministic failures using full session replays and trace analysis
  4. Evaluate agent quality continuously (not just at deploy time)
  5. Govern agent behavior with guardrails, access controls, and audit trails

AgentOps is not a tool or a product — it is a discipline that integrates with your existing DevOps, MLOps, SecOps, and FinOps practices.


How AgentOps Relates to Existing Disciplines

AgentOps does not replace existing operational disciplines — it inherits from each and adds agent-specific concerns:

| Discipline | What AgentOps Inherits | What AgentOps Adds |
| --- | --- | --- |
| DevOps | CI/CD pipelines, infrastructure-as-code, monitoring, incident response | Eval gates in CI/CD, prompt versioning, non-deterministic test strategies |
| MLOps | Model versioning, drift detection, A/B testing, experiment tracking | Agent workflow versioning (not just model weights), tool chain monitoring, multi-agent orchestration |
| SecOps | Threat detection, access control, audit logging, compliance | Prompt injection detection, data leak prevention, tool access governance, AI-specific incident runbooks |
| FinOps | Cloud cost management, resource optimization, chargeback | Token-level cost attribution, cost-per-task tracking, model routing for cost optimization, budget caps per agent |

The key insight: MLOps manages the model lifecycle (training, versioning, deployment, monitoring). AgentOps manages the agent lifecycle — the system built on top of models. An agent workflow version includes prompts, tool definitions, guardrail configs, orchestration logic, and model selections. Changing any of these can cause regressions, and all must be tracked together.
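One way to make "tracked together" concrete is to hash all of these artifacts into a single workflow version, so that a prompt tweak or guardrail change produces a new version just as a model swap would. A minimal sketch (the field names and model string are illustrative, not tied to any specific framework):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentWorkflowVersion:
    """One deployable unit: everything that can cause a regression."""
    prompt: str
    model: str
    tools: tuple          # tool definitions exposed to the agent
    guardrails: tuple     # guardrail config identifiers
    orchestration: str    # e.g. "single-agent", "planner-executor"

    def version_hash(self) -> str:
        # Hash the full config so any change yields a new version.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = AgentWorkflowVersion(
    prompt="You are a support agent...",
    model="large-model-v1",
    tools=("search_kb", "create_ticket"),
    guardrails=("pii_filter",),
    orchestration="single-agent",
)
# Changing only the prompt produces a different version hash,
# even though the model weights are identical.
v2 = AgentWorkflowVersion(
    prompt="You are a support agent. Always cite sources...",
    model="large-model-v1",
    tools=("search_kb", "create_ticket"),
    guardrails=("pii_filter",),
    orchestration="single-agent",
)
assert v1.version_hash() != v2.version_hash()
```

Storing the hash on every trace and eval result then lets you attribute regressions to the exact configuration that caused them.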


The 5-Stage AgentOps Lifecycle

Agent operations follow a continuous lifecycle. Each stage feeds into the next:

                    ┌─────────────────────┐
              ┌────>│  1. Design &        │────┐
              │     │     Prototyping     │    │
              │     └─────────────────────┘    v
┌─────────────┴───────────┐    ┌───────────────────────────┐
│  5. Feedback &          │    │  2. Deployment &          │
│     Iteration           │    │     Orchestration         │
└─────────────────────────┘    └───────────────┬───────────┘
              ^                                │
              │                                v
┌─────────────┴───────────┐    ┌───────────────────────────┐
│  4. Testing &           │<───│  3. Observability &       │
│     Validation          │    │     Monitoring            │
└─────────────────────────┘    └───────────────────────────┘

Stage 1: Design & Prototyping

Define agent capabilities, tool access, guardrails, and success criteria. Build eval datasets before writing the first prompt. Select models and frameworks (ADK, LangGraph, CrewAI). This is where you define what “working correctly” means — without this, monitoring is meaningless.

Stage 2: Deployment & Orchestration

Deploy agents to production infrastructure (Vertex AI Agent Engine, Cloud Run, Kubernetes). Configure routing, scaling, versioning (canary/blue-green). Set up multi-agent orchestration and handoff protocols. This stage inherits directly from DevOps but adds prompt versioning and agent workflow versioning as first-class deployment artifacts.
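One implementation detail worth sketching: canary routing for agents should pin an entire session to one workflow version, because switching versions mid-session can corrupt conversational state. A hash-based split does this deterministically (the function name and percentage are illustrative):

```python
import hashlib

def pick_version(session_id: str, canary_version: str,
                 stable_version: str, canary_pct: int = 5) -> str:
    """Deterministically route a session to canary or stable.

    Hashing the session id (rather than choosing randomly per request)
    keeps a session pinned to one workflow version for its lifetime.
    """
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return canary_version if bucket < canary_pct else stable_version

# The same session always lands on the same version.
assert pick_version("sess-42", "v2", "v1") == pick_version("sess-42", "v2", "v1")
```

The same bucketing scheme also gives you clean cohorts for comparing eval and cost metrics between the canary and stable versions.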

Stage 3: Observability & Monitoring

Instrument every layer: LLM calls (tokens, latency, cost), tool invocations (success/failure, duration), reasoning traces (full decision chain), and session state. This is the core of what most people think of as “AgentOps” — but it is only one stage.
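As a rough sketch of what "instrument every layer" means in practice, the snippet below records nested spans with attribute names modeled on the OpenTelemetry GenAI semantic conventions. A real deployment would use the OTel SDK and an exporter; the in-memory list and the `tool.success` attribute here are stand-ins:

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for an OTel exporter

@contextmanager
def span(name: str, **attributes):
    """Record one operation with GenAI-semconv-style attribute names."""
    record = {"name": name, "attributes": attributes, "start": time.time()}
    try:
        yield record["attributes"]
    finally:
        record["duration_s"] = time.time() - record["start"]
        SPANS.append(record)

# One agent turn is several spans, not just the model call.
with span("invoke_agent", **{"gen_ai.operation.name": "invoke_agent"}):
    with span("chat large-model", **{"gen_ai.request.model": "large-model"}) as attrs:
        attrs["gen_ai.usage.input_tokens"] = 812   # set after the response
        attrs["gen_ai.usage.output_tokens"] = 143
    with span("execute_tool search_kb", **{"gen_ai.tool.name": "search_kb"}) as attrs:
        attrs["tool.success"] = True  # hypothetical attribute; record failures too

# Inner spans finish first, so they appear before the parent.
assert [s["name"] for s in SPANS] == [
    "chat large-model", "execute_tool search_kb", "invoke_agent"]
```

With this structure in place, session replay (step 4 of the operational flow below) is just rendering the span tree for one session.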

Stage 4: Testing & Validation

Run eval suites against production traffic samples. Detect regressions in quality, safety, and cost. Compare baseline metrics across versions. This stage bridges offline evals with production monitoring.
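A CI eval gate can be as simple as comparing the candidate version's pass rate against the stored baseline for the current production version. A minimal sketch (the regression threshold is illustrative):

```python
def eval_gate(results: list[bool], baseline_pass_rate: float,
              max_regression: float = 0.02) -> bool:
    """Fail the pipeline if the pass rate drops more than
    `max_regression` below the previous version's baseline."""
    pass_rate = sum(results) / len(results)
    return pass_rate >= baseline_pass_rate - max_regression

# 92 of 100 cases pass; baseline was 0.95 -> more than 2 points lost.
assert eval_gate([True] * 92 + [False] * 8, baseline_pass_rate=0.95) is False
assert eval_gate([True] * 94 + [False] * 6, baseline_pass_rate=0.95) is True
```

Gating on a delta against the baseline, rather than an absolute threshold, keeps the gate meaningful as the eval suite grows with production failure cases.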

Stage 5: Feedback & Iteration

Collect human feedback, analyze guardrail trigger patterns, review cost trends, and feed insights back into Stage 1. Update prompts, add eval cases from production failures, tune guardrail thresholds. This is the learning loop that makes agents improve over time rather than silently degrading.
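Part of this loop can be automated: sessions flagged by negative feedback or guardrail triggers become candidate eval cases for the next pass through Stage 1. A sketch, assuming a hypothetical session record shape (the keys are illustrative):

```python
def failures_to_eval_cases(sessions):
    """Turn flagged production sessions into regression-eval candidates.

    Each session is a dict with hypothetical keys: 'input', 'output',
    'feedback' ('up'/'down'), and 'guardrail_triggered'.
    """
    cases = []
    for s in sessions:
        if s["feedback"] == "down" or s["guardrail_triggered"]:
            cases.append({
                "input": s["input"],
                "bad_output": s["output"],  # what the agent must NOT repeat
                "source": "production",
            })
    return cases

sessions = [
    {"input": "refund order 12", "output": "Refund issued.",
     "feedback": "up", "guardrail_triggered": False},
    {"input": "export all users", "output": "Here is the full dump...",
     "feedback": "up", "guardrail_triggered": True},
]
assert len(failures_to_eval_cases(sessions)) == 1
```

Candidates would typically still pass through human review before landing in the eval dataset, but the triage itself need not be manual.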


The 8-Step Operational Flow

Within Stage 3 (Observability), the day-to-day operational flow follows these steps:

| Step | What | How |
| --- | --- | --- |
| 1. Instrumentation | Embed telemetry in agent code | OTel SDKs, Langfuse/LangSmith SDKs, framework-native tracing (ADK `enable_tracing=True`) |
| 2. Data Collection | Aggregate telemetry centrally | OTel Collector, Cloud Logging, Langfuse ingestion API |
| 3. Dashboards | Visualize key metrics | Grafana, Looker, Langfuse dashboard, LangSmith UI |
| 4. Session Replays | Inspect individual agent runs | Full trace view: every LLM call, tool call, reasoning step, input/output |
| 5. Alerting | Detect anomalies automatically | Cost spikes, latency degradation, quality drops, guardrail trigger surges |
| 6. Reporting | Summarize trends for stakeholders | Weekly quality reports, cost breakdowns per team/agent, SLO compliance |
| 7. Feedback Loop | Route insights to improvement | Flag failing cases for eval dataset, adjust guardrail thresholds, update prompts |
| 8. Scalability | Ensure ops scale with agent fleet | Sampling strategies, tiered storage, automated triage, multi-tenant isolation |
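Step 5 deserves a concrete example: a simple cost-spike alert compares the latest hour's spend for one agent against its trailing window. Real systems would use proper anomaly detection; this z-score check (window and threshold are illustrative) shows the shape:

```python
from statistics import mean, stdev

def cost_spike(hourly_costs, threshold_sigma=3.0):
    """Flag the latest hour if its spend is more than `threshold_sigma`
    standard deviations above the trailing window's mean."""
    *window, latest = hourly_costs
    mu, sigma = mean(window), stdev(window)
    return latest > mu + threshold_sigma * sigma

history = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 19.7]  # $/hour for one agent
assert cost_spike(history) is True
```

The same pattern applies to latency, guardrail trigger rates, and eval-sampled quality scores; only the input series changes.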

AgentOps Maturity Model

Most organizations are at Level 1 or 2. The goal is to reach Level 3 for production agents and Level 4 for business-critical agents.

| Level | Name | Characteristics | Tooling |
| --- | --- | --- | --- |
| L1 | Ad-hoc | Agents in production with basic logging. No structured tracing. Debug by reading logs manually. No evals in CI. Cost tracked at cloud billing level only. | Print statements, basic Cloud Logging |
| L2 | Reactive | Structured tracing with LLM-specific tools. Alerts on hard failures (errors, timeouts). Manual investigation on quality issues. Eval suites exist but run manually. | Langfuse or LangSmith, basic dashboards, manual eval runs |
| L3 | Proactive | Full OTel-based tracing with GenAI semantic conventions. Automated eval regression detection in CI/CD. Drift monitoring. Cost attribution per agent/task. Guardrail trigger monitoring with automated alerts. | OTel + Langfuse, CI eval gates, Grafana/Looker dashboards, automated alerting |
| L4 | Optimized | Continuous production eval loops. Automated quality feedback into prompt tuning. Model routing optimized by cost/quality tradeoff. Predictive alerting (detect degradation before users notice). Full audit trail for compliance. | Full AgentOps platform, ML-driven anomaly detection, automated prompt optimization pipeline |

Self-Assessment Questions

  • Can you replay any production agent session with full trace detail? (L2+)
  • Do your CI/CD pipelines gate on eval pass rates? (L3+)
  • Do you detect quality degradation before users report it? (L4)
  • Can you attribute cost to individual agents, teams, and tasks? (L3+)
  • Do production failures automatically generate new eval test cases? (L4)

Market Context

The AgentOps tooling market is growing rapidly as enterprise AI agent deployments scale:

  • 2025: $1.8B market size
  • 2026: $2.6B projected (45% CAGR through 2034)
  • Fastest-growing segment: AI Security & Guardrails (~51% CAGR through 2028)
  • Gartner prediction: 40% of enterprise applications will integrate task-specific AI agents by end of 2026 (up from <5% in 2025)
  • LangChain 2026 State of Agent Engineering: 89% of teams have implemented observability for agents, but only 52% have adopted evals — a gap this vault’s Observability section aims to close

Key market events:

  • Jan 2026: ClickHouse acquired Langfuse as part of a $400M Series D, signaling that AI observability is becoming core data infrastructure, not a niche developer tool
  • 2025-2026: OpenTelemetry GenAI SIG finalized agent application semantic conventions, establishing vendor-neutral standards for AI telemetry

Anti-Patterns

  • Instrumenting only the model. The model call is one span in a 10-span agent trace. If you only trace LLM calls, you miss tool failures, orchestration delays, and session state bugs.
  • Treating agent monitoring as APM. Standard APM dashboards do not show token usage, reasoning quality, guardrail triggers, or cost-per-task. Bolting AI metrics onto your existing Datadog dashboard is not AgentOps.
  • Skipping evals because “monitoring is enough.” Monitoring tells you something went wrong. Evals tell you what “wrong” means and whether it is getting worse. You need both.
  • No cost attribution. If you cannot answer “how much does Agent X cost per task?”, you cannot optimize, budget, or make build-vs-buy decisions about agent capabilities.
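On the last point, token-level cost attribution is mostly bookkeeping: tag every LLM call with agent and task identifiers, then roll spend up against a price table. A sketch with hypothetical model names and price figures (check your provider's actual rate sheet):

```python
from collections import defaultdict

# Hypothetical prices, $ per 1M (input, output) tokens.
PRICES = {"large": (2.50, 10.00), "small": (0.10, 0.40)}

def cost_per_task(llm_calls):
    """Roll token-level LLM call records up to cost per (agent, task)."""
    totals = defaultdict(float)
    for c in llm_calls:
        in_price, out_price = PRICES[c["model"]]
        cost = (c["input_tokens"] * in_price
                + c["output_tokens"] * out_price) / 1e6
        totals[(c["agent"], c["task_id"])] += cost
    return dict(totals)

calls = [
    {"agent": "support", "task_id": "t1", "model": "large",
     "input_tokens": 1200, "output_tokens": 300},
    {"agent": "support", "task_id": "t1", "model": "small",
     "input_tokens": 4000, "output_tokens": 500},
]
costs = cost_per_task(calls)
# (1200*2.50 + 300*10.00 + 4000*0.10 + 500*0.40) / 1e6 = $0.0066
```

Once costs are keyed by (agent, task), budget caps, chargeback, and cost-aware model routing all become straightforward aggregations.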

This post is licensed under CC BY 4.0 by the author.