# On-Call & Incident Management
On-call is a system design problem, not a people problem. If on-call burns people out, the answer is not "hire more resilient engineers" -- it is to fix the system that pages them at 3 AM for things that could wait until morning.
## On-Call Design

### Principles
- **On-call should be sustainable.** If an engineer dreads their on-call rotation, the system is broken. Target: fewer than 2 pages per on-call shift, and most of those during business hours.
- **You build it, you run it.** The team that writes the code is on-call for it. This creates the incentive to build reliable systems. If a separate ops team handles on-call, developers have no feedback loop from production failures.
- **Compensation is non-negotiable.** On-call is extra work outside normal hours. Compensate it — either financially or with time off. Not compensating on-call creates resentment and retention problems.
- **Alerting should be actionable.** Every page must require human intervention. If the alert can be handled by automation, automate it. If it can wait until morning, it is not a page — it is a ticket.
### Rotation Models
| Model | Structure | Pros | Cons |
|---|---|---|---|
| Weekly rotation | One person on-call for 7 days | Simple, predictable | Long shifts cause fatigue; bad week = burnout |
| Weekday/Weekend split | One person Mon-Fri, another Sat-Sun | Weekends are distinct; compensate weekend separately | More handoffs, more coordination |
| Follow-the-sun | Teams in different time zones cover their business hours | No night pages for anyone | Requires distributed teams; handoff quality matters |
| Primary/Secondary | Primary gets paged first; secondary is backup | Reduces single-point-of-failure risk | Secondary often does nothing — needs clear escalation trigger |
### For a 16-Person Org
With 16 engineers across 4 disciplines (FE, BE, QA, Data), on-call design depends on what is being monitored:
**If you own backend services and infrastructure:**
- On-call rotation among backend engineers (6-8 people, assuming ~half are BE)
- 2-week rotation with primary/secondary
- Frontend engineers are not on the backend on-call rotation (different failure modes)
- QA and data engineers participate in on-call only if they own production data pipelines
**If you own user-facing applications:**
- Include frontend engineers for UI-specific incidents (broken checkout, rendering failures)
- Backend engineers for API and service failures
- Pair a junior with a senior for their first rotation (shadow on-call)
Rotation math: With 8 engineers in the rotation, each person is on-call ~6 weeks/year (1 week every 8 weeks). That is manageable. With 4 engineers, each is on-call 13 weeks/year — borderline unsustainable.
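The rotation arithmetic above can be sketched in a few lines (function name is ours, for illustration). Note that shift length does not change the annual total: each engineer covers 1/rotation_size of the year regardless of how the weeks are chunked.

```python
def weeks_on_call_per_year(rotation_size: int) -> float:
    """Weeks per year each engineer spends as primary on-call."""
    if rotation_size < 1:
        raise ValueError("rotation needs at least one engineer")
    return 52 / rotation_size

weeks_on_call_per_year(8)  # 6.5 weeks/year: manageable
weeks_on_call_per_year(4)  # 13.0 weeks/year: borderline unsustainable
```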
## Severity Levels
Define severity levels before you need them. During an incident is the wrong time to debate whether this is “really a SEV-1.”
| Level | Criteria | Response Time | Who is Involved | Example |
|---|---|---|---|---|
| SEV-1 | Complete service outage, data loss risk, or security breach | < 15 minutes | On-call + EM + incident commander + comms | Production database down, payment processing failed |
| SEV-2 | Significant degradation affecting many users | < 30 minutes | On-call + senior engineer | Latency spike (p99 > 5s), partial feature outage |
| SEV-3 | Minor issue, workaround available, limited user impact | < 4 hours (business hours) | On-call | One user flow broken, non-critical service degraded |
| SEV-4 | Cosmetic issue, no user impact, fix in next sprint | Next business day | Ticket in backlog | UI alignment bug, log noise |
Key rule: It is always safe to escalate. It is never safe to under-classify. If you are debating between SEV-2 and SEV-3, treat it as SEV-2.
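One way to make the table and the key rule mechanical is to encode them as data — a hedged sketch, with all names illustrative rather than any standard incident-tooling API:

```python
SEVERITIES = ["SEV-1", "SEV-2", "SEV-3", "SEV-4"]  # most to least severe

def classify(outage=False, data_loss=False, security_breach=False,
             widespread_degradation=False, workaround_exists=False) -> str:
    """Map the criteria from the severity table to a level."""
    if outage or data_loss or security_breach:
        return "SEV-1"
    if widespread_degradation:
        return "SEV-2"
    if workaround_exists:
        return "SEV-3"
    return "SEV-4"

def when_in_doubt(*candidates: str) -> str:
    """Debating between two levels? Treat it as the more severe one."""
    return min(candidates, key=SEVERITIES.index)
```

For example, `when_in_doubt("SEV-2", "SEV-3")` resolves to `"SEV-2"`, matching the key rule.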
## Escalation Policies

### Escalation Path

```
Page fires (PagerDuty/OpsGenie)
        ↓
Primary on-call acknowledges (15 min SLA)
        ↓  (not acknowledged or needs help)
Secondary on-call paged
        ↓  (SEV-1 or not resolved in 30 min)
Engineering Manager + Incident Commander activated
        ↓  (customer-facing or SEV-1 > 1 hour)
Director/VP + Communications team
```
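Writing the path down as explicit data keeps the triggers out of mid-incident debate. A minimal sketch (the structure is ours, not any specific tool's configuration format):

```python
# Each step: (who is paged, the trigger that advances to the next step).
ESCALATION_STEPS = [
    ("primary on-call",              "page fires; must ack within 15 min"),
    ("secondary on-call",            "primary did not ack, or needs help"),
    ("EM + incident commander",      "SEV-1, or not resolved in 30 min"),
    ("director/VP + communications", "customer-facing, or SEV-1 > 1 hour"),
]

def next_responder(current: str) -> str:
    """Who gets paged when the current step's trigger fires."""
    names = [name for name, _ in ESCALATION_STEPS]
    i = names.index(current)
    return names[min(i + 1, len(names) - 1)]  # top of the chain absorbs
```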
### Escalation Principles
- Escalation is not failure. It is the system working. Train people to escalate early rather than struggle alone.
- Clear trigger criteria. “If the incident is not resolved within 30 minutes, escalate” — not “use your judgment.”
- EM as escalation point, not first responder. The EM should be notified for SEV-1 and SEV-2, but should not be debugging. The EM’s role is coordination, communication, and removing blockers.
- External communication. Who tells the customer? Who updates the status page? Define this before the incident. During a SEV-1, the on-call engineer should be debugging, not writing customer emails.
## Incident Commander Role
For SEV-1 and major SEV-2 incidents, an incident commander (IC) coordinates the response. This is a skill, not a seniority level.
### IC Responsibilities
| Responsibility | What It Looks Like |
|---|---|
| Coordinate | Assign roles (debugging, communication, documentation). Prevent duplicate work. |
| Communicate | Regular updates to stakeholders (every 15-30 min for SEV-1). Status page updates. |
| Decide | When to escalate, when to rollback, when to declare resolved. |
| Document | Ensure timeline is being captured in real-time (shared doc or incident channel). |
| Protect the team | Shield debuggers from interruptions. Funnel all questions through the IC. |
### What the IC Does NOT Do
- Debug the problem. The IC coordinates; others debug. If the IC starts debugging, coordination stops.
- Make technical decisions alone. The IC decides when to act but defers technical choices to the subject matter expert.
- Take blame. The IC is a role, not a position of responsibility for the failure.
### Building IC Skills in a 16-Person Org
- Rotate the IC role — do not always give it to the most senior person. Junior engineers learn incident management by doing it.
- Run game days — simulate incidents (tabletop exercises) quarterly. Walk through: “The payment service is returning 500s. What do you do?” Practice escalation, communication, and coordination.
- Post-incident IC debrief — after each incident, the IC gets feedback on their coordination (separate from the postmortem).
## SRE Principles for Engineering Teams
You do not need a dedicated SRE team to apply SRE principles. These ideas are applicable at any scale.
### Key SRE Concepts
| Concept | Definition | Application for a 16-Person Org |
|---|---|---|
| Error budgets | Allowable unreliability = 100% - SLO | When budget is exhausted, shift focus from features to reliability |
| Toil | Manual, repetitive, automatable work that scales linearly with service size | Track toil hours; invest when toil exceeds 30% of any engineer’s time |
| SLIs/SLOs | Measurable indicators and targets for service reliability | Define 2-3 SLIs per service (availability, latency, correctness) |
| Eliminating toil | Automate operational work; invest in self-healing systems | Automate the top 3 manual tasks this quarter |
| Blameless culture | Focus on systems, not individuals, when things fail | Postmortem every SEV-1/2; share findings broadly |
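The error-budget arithmetic in the table is simple enough to sketch directly: allowable unreliability is 100% minus the SLO, so a 99.9% availability SLO over a 30-day month allows roughly 43.2 minutes of downtime. (Helper names below are ours, for illustration.)

```python
def error_budget_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Total downtime the SLO permits over the period."""
    return (1.0 - slo) * period_minutes

def budget_remaining(slo: float, downtime_minutes: float) -> float:
    """Positive: budget left. Negative: shift focus from features to reliability."""
    return error_budget_minutes(slo) - downtime_minutes
```

For instance, `budget_remaining(0.999, 50)` is negative: 50 minutes of downtime in a month has blown a 99.9% SLO, so the error-budget policy kicks in.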
### Toil — The Silent Killer
Toil is work that:
- Is manual (a human does it)
- Is repetitive (the same task each time)
- Is automatable (a machine could do it)
- Scales with service growth (more users = more toil)
- Has no enduring value (the work does not improve the system)
Examples of toil:
- Manually restarting services when they crash
- Manually running database migrations
- Manually rotating credentials or certificates
- Manually reviewing and approving every deployment
- Manually generating reports from production data
The 30% rule: If any engineer spends more than 30% of their time on toil, they will burn out and leave. Track toil explicitly — have engineers tag their time (or estimate weekly) on operational toil versus project work.
Toil reduction as a project: Treat the top toil items as engineering projects with ROI analysis:
- Current cost: 2 hours/week x 52 weeks = 104 hours/year
- Automation cost: 40 hours to build
- Payback period: 20 weeks
- Annual savings after payback: 104 hours
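That payback arithmetic generalizes into a small helper worth running on every toil candidate (a sketch; the 2 hours/week and 40-hour figures above are the example's, not benchmarks):

```python
def toil_roi(toil_hours_per_week: float, automation_build_hours: float) -> dict:
    """Payback analysis for automating a recurring manual task."""
    annual_toil_hours = toil_hours_per_week * 52
    payback_weeks = automation_build_hours / toil_hours_per_week
    return {
        "annual_toil_hours": annual_toil_hours,          # 104 for the example
        "payback_weeks": payback_weeks,                  # 20 for the example
        "annual_savings_after_payback": annual_toil_hours,
    }
```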
## Runbooks

### Why Runbooks Matter
At 3 AM, the on-call engineer is tired, stressed, and possibly unfamiliar with the failing service. A runbook is the difference between a 15-minute resolution and a 3-hour investigation.
### Runbook Template

```markdown
# Runbook: [Service/Alert Name]

## Overview
What does this service do? Who owns it? What depends on it?

## Common Alerts

### Alert: [Alert Name]
**Severity:** SEV-2
**What it means:** [Plain English explanation]
**Likely causes:**
1. [Cause A] — [How to verify]
2. [Cause B] — [How to verify]
3. [Cause C] — [How to verify]

**Resolution steps:**
1. Check [dashboard link] for current status
2. If [condition A]: run `command` or restart service via [link]
3. If [condition B]: scale up via [process]
4. If none of the above works: escalate to [person/team]

**Rollback procedure:**
1. Revert to last known good deployment: `command`
2. Verify service health: [dashboard link]

## Dependencies
- Upstream: [services this depends on]
- Downstream: [services that depend on this]

## Contact
- Primary: @team-channel
- Escalation: @engineering-manager
```
### Runbook Practices
- Store runbooks next to the code — in the repo, not in a wiki. They are more likely to be updated when the code changes.
- Link alerts to runbooks — the page notification should include a direct link to the relevant runbook section. PagerDuty and OpsGenie support this.
- Test runbooks during game days — if the runbook is wrong or unclear, you want to find out during a drill, not at 3 AM.
- Update after every incident — if the runbook was missing information during an incident, fix it immediately as a postmortem action item.
- Keep them concise — a runbook is a decision tree, not a textbook. If it takes more than 5 minutes to read, it is too long.
## Alerting Design

### The Alert Pyramid

```
      /  Page  /      Immediate human action needed (SEV-1, SEV-2)
     /--------/
    / Ticket /        Needs attention during business hours
   /--------/
  / Dashboard /       Useful context, no action needed now
 /___________/
 Logs only            Background information for debugging
```
### Alerting Principles
| Principle | Description |
|---|---|
| Every page must be actionable | If the response is “wait and see,” it should not be a page |
| Alert on symptoms, not causes | Alert on “error rate > 5%” not “disk usage > 80%” — the former is user-facing, the latter might not be |
| Tune relentlessly | A noisy pager trains people to ignore alerts. Review alert frequency monthly. |
| Aggregate before paging | One page for “service is degraded” not five pages for individual instances |
| Include context in the alert | Dashboard link, runbook link, recent changes, affected users |
### Alert Fatigue
Alert fatigue is the #1 reason on-call fails. When engineers get 20+ pages per shift, they stop taking alerts seriously.
Measurement: Track pages per on-call shift per week. Targets:
- Good: 0-2 pages per shift
- Acceptable: 3-5 pages per shift
- Action required: more than 5 pages per shift — dedicate engineering time to reducing alerts
Common fixes:
- Delete alerts that have not fired in 6 months (they are probably wrong)
- Combine related alerts into one (reduce noise, not signal)
- Auto-remediate common issues (service restart, cache clear) and alert only if auto-remediation fails
- Increase thresholds for non-critical alerts
- Route informational alerts to Slack, not PagerDuty
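The thresholds above can be turned into a monthly-review helper — a sketch, with the function name ours for illustration:

```python
def paging_load_status(pages_per_shift: float) -> str:
    """Classify on-call paging load per the alert-fatigue thresholds."""
    if pages_per_shift <= 2:
        return "good"
    if pages_per_shift <= 5:
        return "acceptable"
    return "action required: dedicate engineering time to alert reduction"
```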
## Game Days and Chaos Engineering

### Game Days (Tabletop Exercises)
A game day is a structured simulation of an incident. No production systems are harmed.
Format (60-90 minutes):
- Setup (10 min): Describe the scenario. “It is 2 PM Tuesday. The checkout API starts returning 500 errors for 30% of requests. You are on-call.”
- Response (30-40 min): The team walks through their response. What do you check first? Who do you page? How do you communicate?
- Curveballs (15 min): Add complexity. “The database looks fine, but the cache is returning stale data.” “The customer success team is forwarding angry tweets.”
- Debrief (15 min): What went well? What was unclear? What would we do differently?
Frequency: Quarterly. Rotate who plays incident commander.
### Chaos Engineering (Production)
At the 16-person scale, full chaos engineering (Netflix Chaos Monkey) is usually premature. But targeted resilience testing is valuable:
- Kill a non-critical pod in staging and verify the system recovers
- Inject latency on a dependency call and verify timeouts and circuit breakers work
- Revoke a credential and verify the service fails gracefully with a clear error message
- Simulate a database failover and verify the application reconnects
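As an illustration of what the latency-injection test above is verifying, here is a toy circuit breaker that opens after repeated failures and then fails fast. All names are hypothetical; a real service would use a library or its RPC framework's built-in breaker.

```python
class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failures = 0
        self.failure_threshold = failure_threshold

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast instead of waiting")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result

def flaky_dependency():
    raise TimeoutError("injected fault: dependency too slow")

breaker = CircuitBreaker()
for _ in range(3):
    try:
        breaker.call(flaky_dependency)
    except TimeoutError:
        pass
# breaker.is_open is now True: further calls fail fast with RuntimeError
```

The resilience test passes if, after the injected faults, callers get an immediate error rather than hanging on the slow dependency.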
## Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Hero on-call | One senior engineer handles all incidents because others cannot | Pair juniors with seniors for on-call training; invest in runbooks |
| Page and pray | Alert fires, nobody responds for 30 minutes | Clear escalation policy with SLAs; PagerDuty auto-escalation |
| Firefighting culture | Team spends 50%+ time on incidents; features never ship | Error budget policy; dedicated reliability sprint when SLO is breached |
| Postmortem graveyard | Postmortems written but action items never completed | Track action items in sprint backlog; report completion rate monthly |
| Alert noise | 10+ pages per shift; on-call engineer ignores most | Monthly alert review; delete or tune any alert that did not require action |
| On-call without compensation | Engineers are on-call “because it is part of the job” | Pay on-call stipend or provide time off; this is a labor issue, not optional |
| Siloed knowledge | Only one person can debug service X | Cross-training, runbooks, pair on-call, mandatory knowledge transfer |
## Real-World Application

### Google SRE
Google’s SRE model sets the gold standard:
- SRE teams cap toil at 50% — if toil exceeds this, the team can refuse to take on new services
- On-call load is capped: max 2 events per 12-hour shift, max 25% of time on on-call duties
- Postmortems are mandatory for any event that consumed error budget
- SRE teams can “hand back” a service to the development team if reliability requirements are not met
At 16 engineers, you will not have a separate SRE team, but you can apply the same principles: toil tracking with caps, an error budget policy, and mandatory postmortems.
### PagerDuty’s Incident Response Framework
PagerDuty publishes their incident response documentation as open source. Key elements:
- Severity definitions with clear criteria (not “use judgment”)
- Incident commander rotation among all senior engineers
- Communication templates for each severity level
- Post-incident review within 48 hours for SEV-1/2
### Atlassian’s Incident Management
Atlassian (Jira, Confluence) publishes their incident management handbook:
- Incidents are managed in a dedicated Slack channel (#incident-NNNN)
- Incident commander, communications lead, and technical lead are separate roles
- Status page updates are required every 30 minutes during SEV-1
- Postmortem reviews are tracked as Jira tickets with the same SLAs as production bugs
### Netflix Chaos Engineering
Netflix’s approach:
- Chaos Monkey: Randomly kills instances in production during business hours
- Chaos Kong: Simulates entire region failures
- FIT (Failure Injection Testing): Injects failures at specific points in the request path
- The philosophy: “We do not trust systems that have not been tested by failure”
Netflix can do this because their architecture is designed for it (stateless services, automated failover, regional redundancy). Do not adopt chaos engineering before your architecture supports it.
## References

- Beyer, B. et al. (2016). *Site Reliability Engineering: How Google Runs Production Systems*.
- Beyer, B. et al. (2018). *The Site Reliability Workbook: Practical Ways to Implement SRE*.
- Blank-Edelman, D. (2018). *Seeking SRE: Conversations About Running Production Systems at Scale*.
- Kim, G. et al. (2016). *The DevOps Handbook* — incident management and feedback loops.
- Allspaw, J. (2015). "Trade-Offs Under Pressure" — Velocity conference talk on incident response.
- PagerDuty Incident Response Documentation — response.pagerduty.com
- Atlassian Incident Management — atlassian.com/incident-management
- Google SRE Book — sre.google/sre-book
- Netflix Chaos Engineering — netflix.github.io/chaosmonkey
- Jones, N. — "Incident Analysis and Postmortems" (SREcon talks).
- Allspaw, J. (2012). "Blameless Postmortems and a Just Culture" — Velocity conference talk.
- Dekker, S. (2014). *The Field Guide to Understanding Human Error* — systems thinking about failure.