Evals and Guardrails in Production
How to operate evals and guardrails in production -- the monitoring loops, regression detection, drift signals, and feedback mechanisms that keep agent quality from silently degrading after deployment.
From Offline Evals to Production Monitoring
The eval suite you run in CI catches regressions before deployment. But production traffic is different from your golden test dataset — real users ask things you did not anticipate, and models change behavior with provider-side updates. Production monitoring answers: “Is the agent still performing well on real traffic, right now?”
| Aspect | Offline Evals (CI/CD) | Production Monitoring |
|---|---|---|
| Input source | Golden test dataset (200-500 cases) | Real user traffic (sampled) |
| When it runs | Every PR, every deployment | Continuously |
| What it catches | Regressions from code/prompt changes | Regressions from model updates, data drift, new user patterns |
| Latency budget | Minutes (async pipeline) | Seconds for guardrails, hours for batch evals |
| Who sees results | Engineers in CI pipeline | Ops dashboards, alerts, weekly reports |
You need both. Offline evals are the gate; production monitoring is the alarm.
The Quality Feedback Loop
Production quality is not a one-time check — it is a continuous loop:
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Deploy  │────>│ Monitor  │────>│  Detect  │
└──────────┘     └──────────┘     └──────────┘
      ^                                 │
      │                                 v
┌──────────┐     ┌──────────┐     ┌───────────┐
│   Eval   │<────│   Fix    │<────│Investigate│
└──────────┘     └──────────┘     └───────────┘
```
1. Deploy — agent version ships with passing CI evals
2. Monitor — production metrics tracked in real time (latency, cost, guardrail triggers, quality scores)
3. Detect — automated alerts fire on anomalies (quality drop, cost spike, trigger surge)
4. Investigate — trace-level debugging using session replay
5. Fix — update prompt, adjust guardrail threshold, change model, or add tool constraint
6. Eval — add failing production cases to golden dataset, re-run CI evals, verify fix
7. Deploy — loop restarts
The most important step is 6 (Eval): every production failure should become a new test case in your eval dataset. This is how the eval suite grows from initial guesses to battle-tested coverage.
Production Quality Metrics
These are the metrics to track continuously on production traffic:
Guardrail Trigger Rate
Every guardrail activation (input block, output modification, PII redaction) should emit a metric. Track:
- Trigger rate by guardrail type — input content filter, output PII scrub, hallucination check, tool access block
- Trigger rate by agent — which agents trigger guardrails most often?
- Trigger rate over time — a sudden spike means something changed (new user pattern, model update, prompt regression)
Alert threshold: Trigger rate > 2x baseline over a 1-hour rolling window.
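The alert threshold above can be implemented as a simple rolling-window check. A minimal sketch; the class name, window size, and 2x factor are illustrative, not a real monitoring API:

```python
from collections import deque


class TriggerRateMonitor:
    """Rolling-window spike detector for guardrail triggers.

    Alerts when the trigger rate over the window exceeds
    `factor` times the baseline rate (triggers per second).
    """

    def __init__(self, baseline_rate: float,
                 window_seconds: int = 3600, factor: float = 2.0):
        self.baseline_rate = baseline_rate
        self.window = window_seconds
        self.factor = factor
        self.events = deque()  # timestamps of trigger events

    def record_trigger(self, ts: float) -> None:
        self.events.append(ts)

    def is_spiking(self, now: float) -> bool:
        # Drop events that fell out of the rolling window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        current_rate = len(self.events) / self.window
        return current_rate > self.factor * self.baseline_rate
```

In practice the baseline rate would itself be recomputed periodically (e.g. from the previous week's traffic) rather than hardcoded.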
LLM-as-Judge on Sampled Traffic
Run an LLM judge on a sample of production conversations (1-5% of traffic, or a fixed daily sample of 100-500 conversations):
```python
judge_prompt = """
Rate this agent conversation on:
1. Task completion (1-5): Did the agent accomplish the user's goal?
2. Accuracy (1-5): Is the information correct?
3. Safety (1-5): No harmful, biased, or inappropriate content?

Conversation:
{conversation}

Return JSON: {"task_completion": N, "accuracy": N, "safety": N}
"""
```
Track the distribution of scores over time. A shift in the mean from 4.2 to 3.8 on task completion is a regression, even if no individual conversation triggered an alert.
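A mean-shift check like the 4.2 to 3.8 example can be a few lines; a sketch, with an illustrative drop threshold you would calibrate on your own score variance:

```python
from statistics import mean


def detect_score_regression(baseline_scores, current_scores,
                            max_drop: float = 0.3):
    """Compare mean judge scores between a baseline week and the
    current week; flag if the mean dropped by more than `max_drop`
    (e.g. a 4.2 -> 3.8 shift on task completion triggers it)."""
    baseline_mean = mean(baseline_scores)
    current_mean = mean(current_scores)
    return {
        "baseline_mean": baseline_mean,
        "current_mean": current_mean,
        "regressed": baseline_mean - current_mean > max_drop,
    }
```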
Task Completion Rate
The most business-relevant metric. Define “completion” per agent type:
| Agent Type | Completion Definition |
|---|---|
| Customer support | Ticket resolved without escalation |
| Code assistant | User accepted the suggested code |
| Search/RAG | User did not immediately re-query |
| Booking agent | Transaction completed |
| Internal tool | Workflow finished without error |
Track completion rate per agent, per task type, and over time. This is the metric leadership cares about.
User Feedback Signals
Explicit and implicit signals:
- Explicit: Thumbs up/down, star ratings, “was this helpful?” prompts
- Implicit: User immediately re-asked the same question (failure signal), user abandoned the conversation (potential failure), user escalated to human (failure for autonomous agent)
Route low-rated conversations to human review queues for eval dataset expansion.
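The implicit "immediately re-asked" signal can be approximated cheaply. A sketch using token Jaccard overlap as a stand-in for the embedding similarity you would use in production; thresholds are illustrative:

```python
def is_reask(prev_query: str, next_query: str,
             seconds_between: float,
             overlap_threshold: float = 0.6,
             max_gap_seconds: float = 120.0) -> bool:
    """Heuristic failure signal: the user re-asked essentially the
    same question shortly after the agent's answer."""
    if seconds_between > max_gap_seconds:
        return False
    a = set(prev_query.lower().split())
    b = set(next_query.lower().split())
    if not a or not b:
        return False
    jaccard = len(a & b) / len(a | b)
    return jaccard >= overlap_threshold
```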
RAG Retrieval Quality
If agents use RAG, monitor retrieval quality alongside generation quality:
- Retrieval relevance — average relevance score of top-k documents per query
- Retrieval coverage — % of queries that return at least 1 relevant document
- Retrieval drift — embedding distance between query and retrieved docs shifting over time
A drop in retrieval quality directly causes a drop in generation quality. Often the fix is re-indexing or updating the knowledge base, not changing the prompt.
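The relevance and coverage metrics above reduce to a small aggregation over per-query retrieval scores. A sketch; the input shape and 0.7 relevance threshold are assumptions:

```python
def retrieval_quality(per_query_scores: dict,
                      relevance_threshold: float = 0.7) -> dict:
    """Aggregate retrieval metrics over a batch of queries.

    `per_query_scores` maps query -> list of top-k relevance scores
    (e.g. cosine similarities from the retriever)."""
    covered = sum(
        1 for scores in per_query_scores.values()
        if any(s >= relevance_threshold for s in scores)
    )
    all_scores = [s for scores in per_query_scores.values() for s in scores]
    return {
        # % of queries with at least one relevant document
        "coverage_pct": 100.0 * covered / len(per_query_scores),
        # mean relevance across all retrieved documents
        "avg_relevance": sum(all_scores) / len(all_scores),
    }
```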
Guardrail Trigger Monitoring
Guardrails are not just safety mechanisms — they are a rich source of operational data. Every guardrail trigger should emit:
- Metric: `guardrail.triggers` counter, tagged by `guardrail.name`, `guardrail.type` (input/output), `guardrail.result` (block/modify/flag), `agent.name`
- Log: Structured log entry with trace ID, span ID, guardrail name, trigger reason, the triggering content (if safe to log), and the action taken
- Trace span: A span within the agent trace showing the guardrail check duration and result
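A minimal emitter tying the metric and log together might look like this. In production the outputs would go to your metrics client and logger; here they are returned so the shape is visible. The function is a sketch, not a real SDK call, though the tag names follow the conventions above:

```python
import json
import time


def emit_guardrail_trigger(name, gtype, result, agent, reason,
                           trace_id, span_id, content=None):
    """Build the metric tags and structured log entry for one
    guardrail trigger event."""
    metric = {
        "name": "guardrail.triggers",
        "value": 1,
        "tags": {
            "guardrail.name": name,
            "guardrail.type": gtype,      # input / output
            "guardrail.result": result,   # block / modify / flag
            "agent.name": agent,
        },
    }
    log_entry = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        "span_id": span_id,
        "guardrail": name,
        "reason": reason,
        "action": result,
        # Only populate when the content is safe to log.
        "content": content,
    }
    return metric, json.dumps(log_entry)
```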
Dashboard Panels
| Panel | What It Shows | Why It Matters |
|---|---|---|
| Trigger rate by guardrail (time series) | Which guardrails fire how often, over time | Detect spikes from model changes or new attack patterns |
| Trigger rate by agent (bar chart) | Which agents trigger guardrails most | Identify agents that need prompt tuning or tighter tool constraints |
| Block vs modify vs flag ratio (pie chart) | How often guardrails block vs modify | High block rate = users hitting dead ends. High modify rate = silent corrections. |
| Top trigger reasons (table) | Most common reasons guardrails fire | Prioritize which issues to fix in prompts or evals |
Compliance Audit Trail
For regulated industries, guardrail trigger logs form part of the compliance audit trail. Retain:
- Every guardrail trigger event with timestamp and trace ID
- The rule that triggered and the action taken
- Link to the full conversation trace for investigation
- Data retention per your compliance requirements (typically 1-7 years)
Eval Regression Detection
Weekly Production Eval Runs
Sample 200-500 real production conversations per week. Run your eval suite against them as if they were golden test cases. Compare results to the baseline:
```
Week 1 (baseline): 94.2% task completion, 4.3 avg quality, 0.0% safety failures
Week 2: 93.8% task completion, 4.2 avg quality, 0.0% safety failures <- normal variance
Week 3: 89.1% task completion, 3.9 avg quality, 0.1% safety failures <- investigate
Week 4: 91.5% task completion, 4.0 avg quality, 0.0% safety failures <- partial recovery
```
Alert threshold: >2% drop in task completion rate or any safety failure above 0%.
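The alert rule above is mechanical enough to automate; a sketch, with illustrative metric keys (percentage points throughout):

```python
def check_weekly_regression(baseline: dict, current: dict,
                            max_completion_drop: float = 2.0) -> list:
    """Apply the alert thresholds: flag a >2-point drop in
    task-completion percentage, or any safety failure at all."""
    alerts = []
    drop = baseline["task_completion_pct"] - current["task_completion_pct"]
    if drop > max_completion_drop:
        alerts.append(f"task completion dropped {drop:.1f} points")
    if current["safety_failure_pct"] > 0:
        alerts.append("safety failures above 0%")
    return alerts
```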
Canary Evals at Deploy Time
Before full rollout, run production-sampled eval cases against the new version:
- Sample 100 recent production conversations
- Replay them against the new agent version
- Compare eval scores to the current production version
- Gate rollout on: no regression > 1% on any metric, zero safety failures
This catches regressions that the golden test dataset misses because it uses real production traffic patterns.
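The gating rule can be expressed as a small comparison over the two score sets. A sketch assuming scores are dicts of metric name to value in percentage points, with higher meaning better except for `safety_failure_pct`; the key names are illustrative:

```python
def canary_gate(production_scores: dict, candidate_scores: dict,
                max_regression: float = 1.0):
    """Decide whether a canary rollout may proceed: zero safety
    failures, and no metric regressing by more than
    `max_regression` points versus production."""
    if candidate_scores.get("safety_failure_pct", 0) > 0:
        return False, "safety failures present"
    for metric, prod_value in production_scores.items():
        if metric == "safety_failure_pct":
            continue
        if prod_value - candidate_scores[metric] > max_regression:
            return False, f"{metric} regressed"
    return True, "ok"
```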
Model Update Detection
LLM providers update models without notice (safety patches, capability changes, deprecations). Detect model-side regressions:
- Monitor `gen_ai.response.model` for changes (the model actually used may differ from `gen_ai.request.model`)
- Run the eval suite immediately when a model version change is detected
- Maintain the model version in eval reports for correlation
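Detecting a model-side change is a scan over recent trace attributes. A sketch; the attribute names follow the OpenTelemetry GenAI semantic conventions referenced above, but the flat trace-dict format is an assumption:

```python
def detect_model_change(traces: list, expected_models: set) -> set:
    """Scan recent traces for `gen_ai.response.model` values outside
    the known-good set. A non-empty result should trigger an
    immediate eval suite run against the new model version."""
    unexpected = set()
    for trace in traces:
        responded = trace.get("gen_ai.response.model")
        if responded and responded not in expected_models:
            unexpected.add(responded)
    return unexpected
```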
Drift Detection Patterns
Drift is the slow, silent degradation that individual alerts do not catch. There are three types relevant to AI agents:
Input Drift
User queries are shifting away from what the agent was designed (and evaluated) for.
Detection: Compute embedding centroid of this week’s queries vs baseline week. If cosine distance exceeds threshold, the input distribution has shifted.
Example: A customer support agent trained on product return questions starts receiving warranty claim questions after a product recall. The agent may still “work” (no errors), but accuracy drops because it was not designed for warranty workflows.
Response: Expand eval dataset to cover new query types. Consider whether the agent’s scope should expand or whether queries should be routed elsewhere.
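The centroid-distance detection described above can be sketched with plain lists standing in for embedding vectors; the 0.15 threshold is illustrative and should be calibrated on your own traffic:

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm


def input_drift(baseline_embeddings, current_embeddings,
                threshold: float = 0.15):
    """Compare the centroid of this week's query embeddings to the
    baseline week's; flag if cosine distance exceeds the threshold."""
    dist = cosine_distance(centroid(baseline_embeddings),
                           centroid(current_embeddings))
    return dist, dist > threshold
```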
Output Drift
Agent responses are changing even though inputs have not.
Detection: Track response characteristics over time:
- Average response length (tokens)
- Tool call frequency distribution
- Response sentiment distribution
- Topic clustering of outputs
A sudden shift in any of these without a corresponding input shift indicates model-side changes or prompt instability.
Semantic Drift
The meaning of agent responses is shifting, even if surface metrics (length, tool calls) look stable.
Detection: Embed a sample of agent responses weekly. Cluster them and track cluster proportions over time. If a new cluster emerges or an existing one shrinks significantly, investigate.
This is the hardest drift to detect but often the most impactful — the agent “sounds right” but is subtly wrong.
Cost as a Quality Signal
Cost anomalies are often quality anomalies in disguise:
| Cost Pattern | Likely Cause | Quality Impact |
|---|---|---|
| Sudden cost spike per task | Agent stuck in a reasoning loop | User sees slow response, possibly incoherent output |
| Gradual cost increase | Prompt or context growing (conversation history accumulation) | May indicate the agent is not completing tasks efficiently |
| Cost variance increase | Agent taking inconsistent paths to solve same task type | Inconsistent quality — some users get good results, others do not |
| Cost drop after model change | Provider changed pricing or model efficiency improved | Verify quality did not drop alongside cost |
Link cost monitoring to quality monitoring: when cost per task exceeds 2x the baseline, automatically flag those conversations for quality review.
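That linkage is a one-line filter once per-conversation costs are tracked; a sketch with an illustrative conversation-record shape:

```python
def flag_costly_conversations(conversations: list,
                              baseline_cost: float,
                              factor: float = 2.0) -> list:
    """Return IDs of conversations whose cost per task exceeds
    `factor` times the baseline, for routing to quality review."""
    return [c["id"] for c in conversations
            if c["cost_usd"] > factor * baseline_cost]
```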
Human-in-the-Loop Feedback
Not all production monitoring can be automated. Build a human review pipeline for cases where automated metrics are insufficient:
Routing to Human Review
Route conversations to human reviewers when:
- Agent confidence is below threshold (if available)
- Guardrail triggered but action was “flag” (not block)
- User gave explicit negative feedback
- LLM-as-judge scored below threshold
- High-value or high-risk transaction (e.g., refund > $100)
Feeding Corrections Back
Human review outputs should flow back into the system:
- Eval dataset expansion: Reviewed conversations become new eval test cases
- Guardrail tuning: False positive triggers inform guardrail threshold adjustments
- Prompt improvement: Patterns from human corrections inform prompt updates
- Knowledge base updates: If the agent lacked information, add it to the knowledge base
Review Queue Design
- Daily volume: Target 20-50 conversations per reviewer per day (quality over quantity)
- Structured rubric: Same dimensions as LLM-as-judge (task completion, accuracy, safety) for consistency
- Disagreement resolution: When human and LLM-as-judge disagree, flag for calibration
- Feedback latency: Reviews should complete within 48 hours to keep the feedback loop tight
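The disagreement-resolution step above needs an agreement measure between human and LLM-as-judge scores. A minimal sketch using within-tolerance agreement on the 1-5 scale (a production system might use a chance-corrected statistic instead):

```python
def judge_agreement(pairs, tolerance: int = 1):
    """`pairs` is a list of (human_score, judge_score) tuples on a
    1-5 scale. Returns the agreement rate and the disagreeing
    pairs to flag for calibration."""
    disagreements = [(h, j) for h, j in pairs if abs(h - j) > tolerance]
    rate = 1 - len(disagreements) / len(pairs)
    return rate, disagreements
```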