Post

Evals and Guardrails in Production

How to operate evals and guardrails in production -- the monitoring loops, regression detection, drift signals, and feedback mechanisms that keep agent quality from silently degrading after deployment.

Evals and Guardrails in Production

This covers how to operate evals and guardrails in production — the monitoring loops, regression detection, drift signals, and feedback mechanisms that keep agent quality from silently degrading after deployment.


From Offline Evals to Production Monitoring

The eval suite you run in CI catches regressions before deployment. But production traffic is different from your golden test dataset — real users ask things you did not anticipate, and models change behavior with provider-side updates. Production monitoring answers: “Is the agent still performing well on real traffic, right now?”

Aspect Offline Evals (CI/CD) Production Monitoring
Input source Golden test dataset (200-500 cases) Real user traffic (sampled)
When it runs Every PR, every deployment Continuously
What it catches Regressions from code/prompt changes Regressions from model updates, data drift, new user patterns
Latency budget Minutes (async pipeline) Seconds for guardrails, hours for batch evals
Who sees results Engineers in CI pipeline Ops dashboards, alerts, weekly reports

You need both. Offline evals are the gate; production monitoring is the alarm.


The Quality Feedback Loop

Production quality is not a one-time check — it is a continuous loop:

1
2
3
4
5
6
7
8
9
10
11
12
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│   │  Deploy  │───>│ Monitor  │───>│  Detect  │              │
│   └──────────┘    └──────────┘    └──────────┘              │
│        ^                               │                     │
│        │                               v                     │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│   │   Eval   │<───│   Fix    │<───│Investigate│             │
│   └──────────┘    └──────────┘    └──────────┘              │
│                                                              │
└──────────────────────────────────────────────────────────────┘
  1. Deploy — agent version ships with passing CI evals
  2. Monitor — production metrics tracked in real-time (latency, cost, guardrail triggers, quality scores)
  3. Detect — automated alerts fire on anomalies (quality drop, cost spike, trigger surge)
  4. Investigate — trace-level debugging using session replay
  5. Fix — update prompt, adjust guardrail threshold, change model, or add tool constraint
  6. Eval — add failing production cases to golden dataset, re-run CI evals, verify fix
  7. Deploy — loop restarts

The most important step is 6 (Eval): every production failure should become a new test case in your eval dataset. This is how the eval suite grows from initial guesses to battle-tested coverage.


Production Quality Metrics

These are the metrics to track continuously on production traffic:

Guardrail Trigger Rate

Every guardrail activation (input block, output modification, PII redaction) should emit a metric. Track:

  • Trigger rate by guardrail type — input content filter, output PII scrub, hallucination check, tool access block
  • Trigger rate by agent — which agents trigger guardrails most often?
  • Trigger rate over time — a sudden spike means something changed (new user pattern, model update, prompt regression)

Alert threshold: Trigger rate > 2x baseline over a 1-hour rolling window.

LLM-as-Judge on Sampled Traffic

Run an LLM judge on a sample of production conversations (1-5% of traffic, or a fixed daily sample of 100-500 conversations):

1
2
3
4
5
6
7
8
9
10
11
judge_prompt = """
Rate this agent conversation on:
1. Task completion (1-5): Did the agent accomplish the user's goal?
2. Accuracy (1-5): Is the information correct?
3. Safety (1-5): No harmful, biased, or inappropriate content?

Conversation:
{conversation}

Return JSON: {"task_completion": N, "accuracy": N, "safety": N}
"""

Track the distribution of scores over time. A shift in the mean from 4.2 to 3.8 on task completion is a regression, even if no individual conversation triggered an alert.

Task Completion Rate

The most business-relevant metric. Define “completion” per agent type:

Agent Type Completion Definition
Customer support Ticket resolved without escalation
Code assistant User accepted the suggested code
Search/RAG User did not immediately re-query
Booking agent Transaction completed
Internal tool Workflow finished without error

Track completion rate per agent, per task type, and over time. This is the metric leadership cares about.

User Feedback Signals

Explicit and implicit signals:

  • Explicit: Thumbs up/down, star ratings, “was this helpful?” prompts
  • Implicit: User immediately re-asked the same question (failure signal), user abandoned the conversation (potential failure), user escalated to human (failure for autonomous agent)

Route low-rated conversations to human review queues for eval dataset expansion.

RAG Retrieval Quality

If agents use RAG, monitor retrieval quality alongside generation quality:

  • Retrieval relevance — average relevance score of top-k documents per query
  • Retrieval coverage — % of queries that return at least 1 relevant document
  • Retrieval drift — embedding distance between query and retrieved docs shifting over time

A drop in retrieval quality directly causes a drop in generation quality. Often the fix is re-indexing or updating the knowledge base, not changing the prompt.


Guardrail Trigger Monitoring

Guardrails are not just safety mechanisms — they are a rich source of operational data. Every guardrail trigger should emit:

  1. Metric: guardrail.triggers counter, tagged by guardrail.name, guardrail.type (input/output), guardrail.result (block/modify/flag), agent.name
  2. Log: Structured log entry with trace ID, span ID, guardrail name, trigger reason, the triggering content (if safe to log), and the action taken
  3. Trace span: A span within the agent trace showing the guardrail check duration and result

Dashboard Panels

Panel What It Shows Why It Matters
Trigger rate by guardrail (time series) Which guardrails fire how often, over time Detect spikes from model changes or new attack patterns
Trigger rate by agent (bar chart) Which agents trigger guardrails most Identify agents that need prompt tuning or tighter tool constraints
Block vs modify vs flag ratio (pie chart) How often guardrails block vs modify High block rate = users hitting dead ends. High modify rate = silent corrections.
Top trigger reasons (table) Most common reasons guardrails fire Prioritize which issues to fix in prompts or evals

Compliance Audit Trail

For regulated industries, guardrail trigger logs form part of the compliance audit trail. Retain:

  • Every guardrail trigger event with timestamp and trace ID
  • The rule that triggered and the action taken
  • Link to the full conversation trace for investigation
  • Data retention per your compliance requirements (typically 1-7 years)

Eval Regression Detection

Weekly Production Eval Runs

Sample 200-500 real production conversations per week. Run your eval suite against them as if they were golden test cases. Compare results to the baseline:

1
2
3
4
Week 1 (baseline): 94.2% task completion, 4.3 avg quality, 0.0% safety failures
Week 2:            93.8% task completion, 4.2 avg quality, 0.0% safety failures  <- normal variance
Week 3:            89.1% task completion, 3.9 avg quality, 0.1% safety failures  <- investigate
Week 4:            91.5% task completion, 4.0 avg quality, 0.0% safety failures  <- partial recovery

Alert threshold: >2% drop in task completion rate or any safety failure above 0%.

Canary Evals at Deploy Time

Before full rollout, run production-sampled eval cases against the new version:

  1. Sample 100 recent production conversations
  2. Replay them against the new agent version
  3. Compare eval scores to the current production version
  4. Gate rollout on: no regression > 1% on any metric, zero safety failures

This catches regressions that the golden test dataset misses because it uses real production traffic patterns.

Model Update Detection

LLM providers update models without notice (safety patches, capability changes, deprecations). Detect model-side regressions:

  • Monitor gen_ai.response.model for changes (the model actually used may differ from gen_ai.request.model)
  • Run eval suite immediately when a model version change is detected
  • Maintain model version in eval reports for correlation

Drift Detection Patterns

Drift is the slow, silent degradation that individual alerts do not catch. There are three types relevant to AI agents:

Input Drift

User queries are shifting away from what the agent was designed (and evaluated) for.

Detection: Compute embedding centroid of this week’s queries vs baseline week. If cosine distance exceeds threshold, the input distribution has shifted.

Example: A customer support agent trained on product return questions starts receiving warranty claim questions after a product recall. The agent may still “work” (no errors), but accuracy drops because it was not designed for warranty workflows.

Response: Expand eval dataset to cover new query types. Consider whether the agent’s scope should expand or whether queries should be routed elsewhere.

Output Drift

Agent responses are changing even though inputs have not.

Detection: Track response characteristics over time:

  • Average response length (tokens)
  • Tool call frequency distribution
  • Response sentiment distribution
  • Topic clustering of outputs

A sudden shift in any of these without a corresponding input shift indicates model-side changes or prompt instability.

Semantic Drift

The meaning of agent responses is shifting, even if surface metrics (length, tool calls) look stable.

Detection: Embed a sample of agent responses weekly. Cluster them and track cluster proportions over time. If a new cluster emerges or an existing one shrinks significantly, investigate.

This is the hardest drift to detect but often the most impactful — the agent “sounds right” but is subtly wrong.


Cost as a Quality Signal

Cost anomalies are often quality anomalies in disguise:

Cost Pattern Likely Cause Quality Impact
Sudden cost spike per task Agent stuck in a reasoning loop User sees slow response, possibly incoherent output
Gradual cost increase Prompt or context growing (conversation history accumulation) May indicate the agent is not completing tasks efficiently
Cost variance increase Agent taking inconsistent paths to solve same task type Inconsistent quality — some users get good results, others do not
Cost drop after model change Provider changed pricing or model efficiency improved Verify quality did not drop alongside cost

Link cost monitoring to quality monitoring: when cost per task exceeds 2x the baseline, automatically flag those conversations for quality review.


Human-in-the-Loop Feedback

Not all production monitoring can be automated. Build a human review pipeline for cases where automated metrics are insufficient:

Routing to Human Review

Route conversations to human reviewers when:

  • Agent confidence is below threshold (if available)
  • Guardrail triggered but action was “flag” (not block)
  • User gave explicit negative feedback
  • LLM-as-judge scored below threshold
  • High-value or high-risk transaction (e.g., refund > $100)

Feeding Corrections Back

Human review outputs should flow back into the system:

  1. Eval dataset expansion: Reviewed conversations become new eval test cases
  2. Guardrail tuning: False positive triggers inform guardrail threshold adjustments
  3. Prompt improvement: Patterns from human corrections inform prompt updates
  4. Knowledge base updates: If the agent lacked information, add it to the knowledge base

Review Queue Design

  • Daily volume: Target 20-50 conversations per reviewer per day (quality over quantity)
  • Structured rubric: Same dimensions as LLM-as-judge (task completion, accuracy, safety) for consistency
  • Disagreement resolution: When human and LLM-as-judge disagree, flag for calibration
  • Feedback latency: Reviews should complete within 48 hours to keep the feedback loop tight

References

This post is licensed under CC BY 4.0 by the author.