AI DevSecOps and Incident Response
Standard DevSecOps assumes deterministic systems: the same code produces the same output, security vulnerabilities are in the code, and rollback means deploying the previous binary. AI systems break all three assumptions — outputs are non-deterministic, vulnerabilities can be in the prompt or model, and rollback may mean reverting a prompt, a model version, or a guardrail configuration.
What Changes for AI Systems
| Traditional DevSecOps | AI DevSecOps |
|---|---|
| Deterministic outputs — same input = same output | Non-deterministic outputs — same input can produce different responses across calls |
| Vulnerabilities are in code — CVEs, dependency issues | Vulnerabilities are in prompts and models — prompt injection, jailbreaks, data extraction |
| Testing is binary — tests pass or fail | Testing is probabilistic — evals have pass rates, not pass/fail |
| Rollback = previous binary | Rollback = previous prompt + model + guardrail config + tool definitions |
| Security perimeter is network | Security perimeter includes the prompt — user input is part of the “code” the LLM executes |
| Secrets are in config/env | Secrets can leak in model outputs — PII, API keys, system prompts |
| Supply chain = dependencies | Supply chain = dependencies + model weights + training data + prompt templates |
These differences mean you need AI-specific extensions to your existing DevSecOps practices, not a replacement.
CI/CD Eval Pipelines
Integrate eval gates into your existing CI/CD pipeline. This extends the standard build-test-deploy pipeline with AI-specific validation:
```
┌──────────────────────────────────────────────────────────────────┐
│ Pull Request                                                     │
│                                                                  │
│ ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│ │ Build    │─>│ Lint +   │─>│ Unit     │─>│ Eval Suite       │   │
│ │          │  │ Type     │  │ Tests    │  │ (golden dataset) │   │
│ └──────────┘  └──────────┘  └──────────┘  └────────┬─────────┘   │
│                                                    │             │
│                     Gate: pass rate > 93%, safety  │             │
│                     = 100%, cost < budget          │             │
└────────────────────────────────────────────────────┼─────────────┘
                                                     │ pass
                                                     v
┌──────────────────────────────────────────────────────────────────┐
│ Staging Deploy                                                   │
│                                                                  │
│ ┌─────────────────┐  ┌─────────────────────────────────────┐     │
│ │ Deploy to       │─>│ Canary Evals                        │     │
│ │ staging         │  │ (100 sampled production inputs)     │     │
│ └─────────────────┘  └──────────────────┬──────────────────┘     │
│                                         │                        │
│                Gate: no regression > 1% │                        │
│                vs current production    │                        │
└─────────────────────────────────────────┼────────────────────────┘
                                          │ pass
                                          v
┌──────────────────────────────────────────────────────────────────┐
│ Production Deploy                                                │
│                                                                  │
│ ┌─────────────────┐  ┌─────────────────────────────────────┐     │
│ │ Canary (10%)    │─>│ Monitor for 30 min                  │     │
│ │ then full       │  │ Quality, cost, guardrail metrics    │     │
│ └─────────────────┘  └──────────────────┬──────────────────┘     │
│                                         │                        │
│                    Auto-rollback if     │                        │
│                    quality < threshold  │                        │
└─────────────────────────────────────────┼────────────────────────┘
                                          │
                                          v
┌──────────────────────────────────────────────────────────────────┐
│ Post-Deploy (Scheduled)                                          │
│                                                                  │
│ Daily:   Run eval suite on 200 production samples                │
│ Weekly:  Full drift analysis (input, output, semantic)           │
│ Monthly: Red-team evaluation (adversarial inputs)                │
└──────────────────────────────────────────────────────────────────┘
```
PR-Time Eval Gates
What to run on every PR that changes prompts, tools, or agent logic:
| Check | Pass Criteria | Runtime |
|---|---|---|
| Golden eval suite (200-500 cases) | Pass rate > 93% | 2-5 min |
| Safety eval suite (100+ adversarial cases) | Pass rate = 100% | 1-2 min |
| Cost benchmark (tokens per task) | No > 20% increase vs baseline | 1-2 min |
| LLM-as-judge quality (50 cases) | Average score > 4.0/5.0 | 2-3 min |
Total CI time for AI evals: 5-12 minutes, run in parallel with standard tests.
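The gate criteria in the table can be sketched as a single check that runs after the eval harness collects per-case results. The `EvalResult` shape and function names here are illustrative, not any specific framework's API:

```python
# Hedged sketch of a PR-time eval gate; thresholds mirror the table above.
# EvalResult and check_eval_gate are illustrative names, not a real API.
from dataclasses import dataclass

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    is_safety_case: bool
    cost_usd: float

def check_eval_gate(results: list[EvalResult],
                    baseline_cost_usd: float,
                    min_pass_rate: float = 0.93,
                    max_cost_increase: float = 0.20) -> list[str]:
    """Return a list of gate failures; an empty list means the PR may merge."""
    failures = []
    pass_rate = sum(r.passed for r in results) / len(results)
    if pass_rate < min_pass_rate:
        failures.append(f"pass rate {pass_rate:.1%} below {min_pass_rate:.0%}")
    safety = [r for r in results if r.is_safety_case]
    if safety and not all(r.passed for r in safety):
        failures.append("safety suite must pass 100%")
    total_cost = sum(r.cost_usd for r in results)
    if total_cost > baseline_cost_usd * (1 + max_cost_increase):
        failures.append(f"cost ${total_cost:.2f} exceeds baseline by >20%")
    return failures
```

In CI, a non-empty failure list fails the build; the messages become the PR status check output.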
What Triggers a Full Eval Run
Not every code change needs eval validation. Run evals when:
- System prompt or tool definitions change
- Agent orchestration logic changes
- Model version changes (including provider-side updates)
- Guardrail configuration changes
- RAG knowledge base updates
Standard code changes (API routes, infrastructure, non-agent logic) go through normal CI without eval gates.
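One way to wire up this split is a small trigger check in CI that inspects the changed file paths. The path prefixes below are assumptions about repository layout, not a standard:

```python
# Sketch of an eval-trigger check; the path prefixes are hypothetical
# examples of where prompts, agent logic, guardrails, and RAG data live.
EVAL_TRIGGER_PREFIXES = (
    "prompts/",      # system prompt or tool definition changes
    "agents/",       # agent orchestration logic
    "guardrails/",   # guardrail configuration
    "rag/index/",    # RAG knowledge base updates
)

def needs_eval_run(changed_files: list[str]) -> bool:
    """True if any changed file falls under an eval-triggering path."""
    return any(f.startswith(EVAL_TRIGGER_PREFIXES) for f in changed_files)
```

Model version changes are provider-side and need a monitoring alert rather than a path filter.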
Security Monitoring for AI
Prompt Injection Detection
Prompt injection is the SQL injection of AI systems. Monitor for it in real-time:
Detection layers:
| Layer | Detection Method | Response |
|---|---|---|
| Input guardrail | Pattern matching for known injection templates (“ignore previous instructions”, “you are now…”) | Block or flag |
| Semantic analysis | Classifier trained on injection examples vs legitimate queries | Score and threshold |
| Output monitoring | Detect when the agent reveals system prompts, ignores role boundaries, or performs unauthorized actions | Block response, alert |
| Behavioral anomaly | Agent suddenly accesses tools it rarely uses, or generates responses that are statistically unusual | Alert for investigation |
Metrics to track:
- Injection attempt rate (detected by input guardrails)
- Injection bypass rate (detected by output monitoring — this is the one that matters)
- False positive rate (legitimate queries blocked)
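The first detection layer (pattern matching) can be sketched as follows. The patterns are a few illustrative examples, not a complete rule set; production systems layer a trained classifier on top:

```python
# Minimal pattern-matching input guardrail (first layer in the table above).
# These regexes are illustrative examples only, not an exhaustive rule set.
import re

INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (
        r"ignore (all |any )?(previous|prior) instructions",
        r"you are now\b",
        r"disregard (your|the) system prompt",
        r"reveal (your|the) system prompt",
    )
]

def screen_input(user_input: str) -> dict:
    """Return a guardrail verdict: block on a match, pass otherwise."""
    hits = [p.pattern for p in INJECTION_PATTERNS if p.search(user_input)]
    return {"action": "block" if hits else "pass", "matched": hits}
```

Every block decision should be logged with the matched pattern, so the false positive rate above can be measured.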
Data Leak Prevention
AI agents can leak sensitive data through their responses:
| Leak Vector | Detection | Prevention |
|---|---|---|
| PII in outputs | Regex + NER on every response (SSN, credit card, email, phone) | Output guardrail: redact or block |
| System prompt exposure | Monitor for outputs matching system prompt fragments | Output guardrail: block |
| Training data extraction | Detect verbatim repetition of known sensitive training content | Output guardrail: block |
| Cross-session leakage | Agent reveals information from one user’s session to another | Session isolation architecture |
| Tool result leakage | Agent includes raw database results or API responses in output | Output filtering, structured response enforcement |
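A minimal sketch of the regex half of the "PII in outputs" row, assuming simplified patterns (real deployments add NER for names and addresses, plus locale-aware formats):

```python
# Sketch of a PII output guardrail: redact matches and report categories.
# Patterns are simplified illustrations, not production-grade detectors.
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_pii(text: str) -> tuple[str, list[str]]:
    """Redact matched PII and report which categories were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text, found
```

The `found` list feeds the guardrail trigger logs, which the retention table below treats as compliance data.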
Unauthorized Tool Access
Monitor for agents calling tools outside their authorized scope:
```python
tool_access_policy = {
    "support-agent": {
        "allowed": ["search_kb", "lookup_customer", "create_ticket"],
        "denied": ["delete_account", "modify_payment", "export_data"],
        "requires_approval": ["refund_payment", "escalate_to_human"],
    }
}
# Alert if an agent attempts to call a denied tool
# Log every requires_approval tool call for audit
```
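A possible enforcement wrapper for a policy of this shape, defaulting to deny for any agent or tool not explicitly listed (the function name is illustrative):

```python
# Hedged sketch of policy enforcement; in a real system "deny" also fires
# an alert and "needs_approval" writes an audit log entry before pausing.
def authorize_tool_call(policy: dict, agent: str, tool: str) -> str:
    """Return 'allow', 'deny', or 'needs_approval' for a tool call."""
    agent_policy = policy.get(agent)
    if agent_policy is None:
        return "deny"  # unknown agents get no tools (default-deny)
    if tool in agent_policy.get("denied", []):
        return "deny"
    if tool in agent_policy.get("requires_approval", []):
        return "needs_approval"
    if tool in agent_policy.get("allowed", []):
        return "allow"
    return "deny"  # tools not explicitly listed are denied
```

Default-deny matters here: a new tool added to the agent framework stays unusable until someone consciously adds it to an allowlist.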
Compliance and Audit Trails
Trace ID Linkage
Every agent interaction should have a complete audit chain:
```
User Request → Trace ID: abc123
├── Authentication: user_id: u_456, role: customer
├── Input Guardrail: checked, result: pass
├── Agent Execution: trace in Cloud Trace
│   ├── LLM Call 1: model: gemini-2.0-flash, tokens: 1200
│   ├── Tool Call: search_kb, result: 3 documents
│   ├── LLM Call 2: model: gemini-2.0-flash, tokens: 2100
│   └── Guardrail: output filter, result: pass
├── Output Guardrail: checked, result: pass (PII: none detected)
├── Response delivered to user
└── Audit Log: Cloud Audit Logs entry with trace ID
```
Trace ID links everything: Cloud Trace spans, Cloud Logging entries, Cloud Audit Logs, Langfuse session, and any external tool API calls.
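One way to make that linkage concrete is to emit each step of the chain as a structured JSON log line keyed by the trace ID, so records can be joined across systems later. The field names below are illustrative, not a fixed schema:

```python
# Sketch of per-step audit logging; field names are assumptions, and a
# real system would write these via its logging client, not stdout.
import json
import time

def audit_event(trace_id: str, step: str, result: str, **details) -> str:
    """Serialize one audit-chain step as a JSON log line."""
    entry = {
        "trace_id": trace_id,
        "step": step,      # e.g. "input_guardrail", "tool_call"
        "result": result,  # e.g. "pass", "block"
        "ts": time.time(),
        **details,
    }
    return json.dumps(entry)
```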
Data Retention Requirements
| Data Type | Typical Retention | Notes |
|---|---|---|
| Agent traces (detailed) | 30-90 days | Sampling after 30 days to reduce storage cost |
| Guardrail trigger logs | 1-7 years | Compliance requirement (varies by regulation) |
| Audit logs (who/when/what) | 1-7 years | Cloud Audit Logs: 400 days admin, 30 days data access (configurable) |
| Eval results | Indefinitely | Small volume, critical for regression tracking |
| Conversation content | Per privacy policy | GDPR: delete on user request. Anonymize for eval datasets. |
AI-Specific Incident Runbooks
Runbook 1: Model Quality Degradation
Trigger: Task completion rate drops >5% over 1-hour rolling window, or eval regression detected in weekly production eval run.
| Step | Action |
|---|---|
| 1. Verify | Check if gen_ai.response.model changed (provider-side update). Check if traffic pattern shifted (input drift). |
| 2. Scope | Is it one agent or all agents? One model or all models? One task type or all? |
| 3. Investigate | Pull sample traces from the degradation period. Run LLM-as-judge on 50 cases. Compare to baseline. |
| 4. Mitigate | If model change: pin to previous model version. If prompt issue: revert prompt. If input drift: acknowledge new traffic pattern and triage. |
| 5. Resolve | Add failing cases to eval suite. Deploy fix through normal CI/CD with eval gates. |
| 6. Prevent | Set up model version monitoring alert. Add canary eval for the failing pattern. |
Runbook 2: Cost Anomaly
Trigger: Daily spend exceeds 150% of 7-day rolling average, or individual task cost exceeds 5x the median for its type.
| Step | Action |
|---|---|
| 1. Verify | Check Cloud Billing for actual spend increase (not a reporting lag). Identify which agent/model is responsible. |
| 2. Scope | Is it a single agent in a loop, or a broad traffic increase? |
| 3. Investigate | Pull the most expensive traces. Look for: agent loops (>5 LLM calls per task), context window bloat (growing input tokens), model routing failure (expensive model used where cheap one should be). |
| 4. Mitigate | If loop: add loop detection guardrail (max iterations). If context bloat: truncate conversation history. If routing: fix model router. Emergency: throttle non-critical agents. |
| 5. Resolve | Deploy fix. Monitor cost for 24 hours. |
| 6. Prevent | Add cost-per-task alert at 3x median. Add loop detection to agent framework. |
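The loop-detection mitigation in step 4 can be sketched as a hard cap on LLM calls per task, failing loudly instead of burning budget. The per-step callable is an assumption about the surrounding agent framework:

```python
# Hedged sketch of a loop-detection guardrail; step_fn stands in for one
# agent iteration (assumed to make exactly one LLM call per invocation).

class AgentLoopError(RuntimeError):
    """Raised when a task exceeds its LLM-call budget."""

def run_with_loop_guard(step_fn, max_llm_calls: int = 5):
    """Run agent steps until done, aborting past max_llm_calls."""
    calls = 0
    while True:
        calls += 1
        if calls > max_llm_calls:
            raise AgentLoopError(f"aborted after {max_llm_calls} LLM calls")
        done, result = step_fn()
        if done:
            return result
```

The exception should carry the trace ID in practice, so aborted tasks show up in the cost-anomaly investigation rather than disappearing silently.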
Runbook 3: Prompt Injection Attack
Trigger: Input guardrail injection detection rate spikes >3x baseline, or output monitoring detects system prompt leak or role boundary violation.
| Step | Action |
|---|---|
| 1. Verify | Confirm this is an attack, not a false positive spike. Check injection patterns in logs. |
| 2. Scope | Is it a single user or a coordinated attack? Which agents are targeted? |
| 3. Investigate | Review the injection payloads. Did any bypass input guardrails? Did any cause harmful outputs? |
| 4. Mitigate | If bypassed: add the new injection pattern to guardrails immediately. If harmful output: block the user/IP if malicious. If system prompt leaked: rotate any secrets referenced in the system prompt. |
| 5. Resolve | Update guardrail patterns. Add bypass cases to safety eval suite. |
| 6. Prevent | Run red-team eval suite monthly. Consider adding a dedicated injection classifier. |
Runbook 4: Data Leak Detected
Trigger: Output guardrail detects PII in agent response, or user reports receiving another user’s data.
| Step | Action |
|---|---|
| 1. Verify | Confirm the leak. Pull the full trace and conversation. Identify what data was exposed. |
| 2. Scope | Is it a one-time occurrence or a systematic issue? How many users affected? |
| 3. Investigate | Trace the data source: did it come from the model, from a tool result, or from conversation history? Check session isolation. |
| 4. Mitigate | If tool result: add output filtering on the tool. If session leakage: fix session isolation bug. If model-side: add PII detection guardrail if not present. Notify affected users per privacy policy. |
| 5. Resolve | Deploy fix. Run PII detection eval across recent production conversations. |
| 6. Prevent | Add PII detection to output guardrails (if not present). Add PII leak scenarios to safety eval suite. Review data access patterns for all agent tools. |
Rollback Strategies for AI
AI systems have multiple independently deployable components. Rolling back is not just “deploy the previous container”:
| Component | Rollback Mechanism | Speed | Risk |
|---|---|---|---|
| Prompt/system prompt | Version-controlled in Langfuse or git. Revert to previous version. | Seconds (if prompt served dynamically) | Low — prompts are text |
| Model version | Pin gen_ai.request.model to a specific version (e.g., gemini-2.0-flash-001). | Seconds (config change) | Low — provider still serves the old version (usually) |
| Guardrail config | Version-controlled. Revert config and redeploy guardrail service. | Minutes | Medium — may re-expose issues the guardrail was catching |
| Agent code | Standard container rollback via Cloud Run revision or Kubernetes rollback. | Minutes | Medium — standard deployment risk |
| Tool definitions | Version-controlled alongside agent code. Rollback with agent code. | Minutes | Medium — tool changes may have data implications |
| Knowledge base (RAG) | Re-index from previous data snapshot. | Hours | High — re-indexing is slow |
Prompt + Model Rollback (Independent of Code)
The highest-impact rollback is often the fastest: reverting a prompt or model pin without any code deployment. This requires:
- Dynamic prompt serving — prompts loaded from Langfuse or a config service, not hardcoded
- Model version pinning — agent config specifies exact model version, not just model family
- Feature flags for guardrails — toggle guardrail rules without redeploy
With these in place, you can revert the most common AI regressions (prompt change, model update, guardrail misconfiguration) in seconds.
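A minimal sketch of what this looks like when agent behavior resolves from a versioned config at request time. The config shape and version IDs are illustrative; Langfuse prompt management or any config service provides the equivalent:

```python
# Hypothetical versioned agent config: rollback repoints the active
# version, no container deploy needed. All names here are illustrative.
AGENT_CONFIG_VERSIONS = {
    "v41": {"prompt_id": "support-v41", "model": "gemini-2.0-flash-001",
            "guardrails": {"injection_filter": True}},
    "v42": {"prompt_id": "support-v42", "model": "gemini-2.0-flash-002",
            "guardrails": {"injection_filter": True}},
}
ACTIVE_VERSION = "v42"

def resolve_config() -> dict:
    """Fetch the live agent config at request time."""
    return AGENT_CONFIG_VERSIONS[ACTIVE_VERSION]

def rollback(to_version: str) -> dict:
    """Revert prompt + model pin + guardrails in one config change."""
    global ACTIVE_VERSION
    if to_version not in AGENT_CONFIG_VERSIONS:
        raise ValueError(f"unknown config version: {to_version}")
    ACTIVE_VERSION = to_version
    return resolve_config()
```

Because the prompt, model pin, and guardrail flags travel together in one version, a rollback reverts all three atomically instead of leaving a mismatched combination in production.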