# On-Call & Incident Management
On-call is a system design problem, not a people problem. If on-call burns people out, the answer is not "hire more resilient engineers" -- it is to fix the system that pages them at 3 AM for things that could wait until morning.
## On-Call Design

### Principles
- **On-call should be sustainable.** If an engineer dreads their on-call rotation, the system is broken. Target: fewer than 2 pages per on-call shift, and most of those during business hours.
- **You build it, you run it.** The team that writes the code is on-call for it. This creates the incentive to build reliable systems. If a separate ops team handles on-call, developers have no feedback loop from production failures.
- **Compensation is non-negotiable.** On-call is extra work outside normal hours. Compensate it — either financially or with time off. Not compensating on-call creates resentment and retention problems.
- **Alerting should be actionable.** Every page must require human intervention. If the alert can be handled by automation, automate it. If it can wait until morning, it is not a page — it is a ticket.
### Rotation Models
| Model | Structure | Pros | Cons |
|---|---|---|---|
| Weekly rotation | One person on-call for 7 days | Simple, predictable | Long shifts cause fatigue; bad week = burnout |
| Weekday/Weekend split | One person Mon-Fri, another Sat-Sun | Weekends are distinct; compensate weekend separately | More handoffs, more coordination |
| Follow-the-sun | Teams in different time zones cover their business hours | No night pages for anyone | Requires distributed teams; handoff quality matters |
| Primary/Secondary | Primary gets paged first; secondary is backup | Reduces single-point-of-failure risk | Secondary often does nothing — needs clear escalation trigger |
### For a 16-Person Org
With 16 engineers across 4 disciplines (FE, BE, QA, Data), on-call design depends on what is being monitored:
**If you own backend services and infrastructure:**
- On-call rotation among backend engineers (6-8 people, assuming ~half are BE)
- 2-week rotation with primary/secondary
- Frontend engineers are not on the backend on-call rotation (different failure modes)
- QA and data engineers participate in on-call only if they own production data pipelines
**If you own user-facing applications:**
- Include frontend engineers for UI-specific incidents (broken checkout, rendering failures)
- Backend engineers for API and service failures
- Pair a junior with a senior for their first rotation (shadow on-call)
Rotation math: With 8 engineers in the rotation, each person is on-call ~6 weeks/year (1 week every 8 weeks). That is manageable. With 4 engineers, each is on-call 13 weeks/year — borderline unsustainable.
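The rotation arithmetic above can be sketched in a few lines (function name is ours, for illustration). Note that shift length does not change the annual total: each engineer covers 1/rotation_size of the year regardless of how the weeks are chunked.

```python
def weeks_on_call_per_year(rotation_size: int) -> float:
    """Weeks per year each engineer spends as primary on-call."""
    if rotation_size < 1:
        raise ValueError("rotation needs at least one engineer")
    return 52 / rotation_size

weeks_on_call_per_year(8)  # 6.5 weeks/year: manageable
weeks_on_call_per_year(4)  # 13.0 weeks/year: borderline unsustainable
```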
## Severity Levels
Define severity levels before you need them. During an incident is the wrong time to debate whether this is “really a SEV-1.”
| Level | Criteria | Response Time | Who is Involved | Example |
|---|---|---|---|---|
| SEV-1 | Complete service outage, data loss risk, or security breach | < 15 minutes | On-call + EM + incident commander + comms | Production database down, payment processing failed |
| SEV-2 | Significant degradation affecting many users | < 30 minutes | On-call + senior engineer | Latency spike (p99 > 5s), partial feature outage |
| SEV-3 | Minor issue, workaround available, limited user impact | < 4 hours (business hours) | On-call | One user flow broken, non-critical service degraded |
| SEV-4 | Cosmetic issue, no user impact, fix in next sprint | Next business day | Ticket in backlog | UI alignment bug, log noise |
Key rule: It is always safe to escalate. It is never safe to under-classify. If you are debating between SEV-2 and SEV-3, treat it as SEV-2.
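One way to make the table and the key rule mechanical is to encode them as data — a hedged sketch, with all names illustrative rather than any standard incident-tooling API:

```python
SEVERITIES = ["SEV-1", "SEV-2", "SEV-3", "SEV-4"]  # most to least severe

def classify(outage=False, data_loss=False, security_breach=False,
             widespread_degradation=False, workaround_exists=False) -> str:
    """Map the criteria from the severity table to a level."""
    if outage or data_loss or security_breach:
        return "SEV-1"
    if widespread_degradation:
        return "SEV-2"
    if workaround_exists:
        return "SEV-3"
    return "SEV-4"

def when_in_doubt(*candidates: str) -> str:
    """Debating between two levels? Treat it as the more severe one."""
    return min(candidates, key=SEVERITIES.index)
```

For example, `when_in_doubt("SEV-2", "SEV-3")` resolves to `"SEV-2"`, matching the key rule.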
## Escalation Policies

### Escalation Path

```
Page fires (PagerDuty/OpsGenie)
        ↓
Primary on-call acknowledges (15 min SLA)
        ↓  (not acknowledged or needs help)
Secondary on-call paged
        ↓  (SEV-1 or not resolved in 30 min)
Engineering Manager + Incident Commander activated
        ↓  (customer-facing or SEV-1 > 1 hour)
Director/VP + Communications team
```
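Writing the path down as explicit data keeps the triggers out of mid-incident debate. A minimal sketch (the structure is ours, not any specific tool's configuration format):

```python
# Each step: (who is paged, the trigger that advances to the next step).
ESCALATION_STEPS = [
    ("primary on-call",              "page fires; must ack within 15 min"),
    ("secondary on-call",            "primary did not ack, or needs help"),
    ("EM + incident commander",      "SEV-1, or not resolved in 30 min"),
    ("director/VP + communications", "customer-facing, or SEV-1 > 1 hour"),
]

def next_responder(current: str) -> str:
    """Who gets paged when the current step's trigger fires."""
    names = [name for name, _ in ESCALATION_STEPS]
    i = names.index(current)
    return names[min(i + 1, len(names) - 1)]  # top of the chain absorbs
```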
### Escalation Principles
- Escalation is not failure. It is the system working. Train people to escalate early rather than struggle alone.
- Clear trigger criteria. “If the incident is not resolved within 30 minutes, escalate” — not “use your judgment.”
- EM as escalation point, not first responder. The EM should be notified for SEV-1 and SEV-2, but should not be debugging. The EM’s role is coordination, communication, and removing blockers.
- External communication. Who tells the customer? Who updates the status page? Define this before the incident. During a SEV-1, the on-call engineer should be debugging, not writing customer emails.
## Incident Commander Role
For SEV-1 and major SEV-2 incidents, an incident commander (IC) coordinates the response. This is a skill, not a seniority level.
### IC Responsibilities
| Responsibility | What It Looks Like |
|---|---|
| Coordinate | Assign roles (debugging, communication, documentation). Prevent duplicate work. |
| Communicate | Regular updates to stakeholders (every 15-30 min for SEV-1). Status page updates. |
| Decide | When to escalate, when to rollback, when to declare resolved. |
| Document | Ensure timeline is being captured in real-time (shared doc or incident channel). |
| Protect the team | Shield debuggers from interruptions. Funnel all questions through the IC. |
### What the IC Does NOT Do
- Debug the problem. The IC coordinates; others debug. If the IC starts debugging, coordination stops.
- Make technical decisions alone. The IC decides when to act but defers technical choices to the subject matter expert.
- Take blame. The IC is a role, not a position of responsibility for the failure.
### Building IC Skills in a 16-Person Org
- Rotate the IC role — do not always give it to the most senior person. Junior engineers learn incident management by doing it.
- Run game days — simulate incidents (tabletop exercises) quarterly. Walk through: “The payment service is returning 500s. What do you do?” Practice escalation, communication, and coordination.
- Post-incident IC debrief — after each incident, the IC gets feedback on their coordination (separate from the postmortem).
## SRE Principles for Engineering Teams
You do not need a dedicated SRE team to apply SRE principles. These ideas are applicable at any scale.
### Key SRE Concepts
| Concept | Definition | Application for a 16-Person Org |
|---|---|---|
| Error budgets | Allowable unreliability = 100% - SLO | When budget is exhausted, shift focus from features to reliability |
| Toil | Manual, repetitive, automatable work that scales linearly with service size | Track toil hours; invest when toil exceeds 30% of any engineer’s time |
| SLIs/SLOs | Measurable indicators and targets for service reliability | Define 2-3 SLIs per service (availability, latency, correctness) |
| Eliminating toil | Automate operational work; invest in self-healing systems | Automate the top 3 manual tasks this quarter |
| Blameless culture | Focus on systems, not individuals, when things fail | Postmortem every SEV-1/2; share findings broadly |
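The error-budget arithmetic in the table is simple enough to sketch directly: allowable unreliability is 100% minus the SLO, so a 99.9% availability SLO over a 30-day month allows roughly 43.2 minutes of downtime. (Helper names below are ours, for illustration.)

```python
def error_budget_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Total downtime the SLO permits over the period."""
    return (1.0 - slo) * period_minutes

def budget_remaining(slo: float, downtime_minutes: float) -> float:
    """Positive: budget left. Negative: shift focus from features to reliability."""
    return error_budget_minutes(slo) - downtime_minutes
```

For instance, `budget_remaining(0.999, 50)` is negative: 50 minutes of downtime in a month has blown a 99.9% SLO, so the error-budget policy kicks in.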
### Toil — The Silent Killer
Toil is work that:
- Is manual (a human does it)
- Is repetitive (the same task each time)
- Is automatable (a machine could do it)
- Scales with service growth (more users = more toil)
- Has no enduring value (the work does not improve the system)
Examples of toil:
- Manually restarting services when they crash
- Manually running database migrations
- Manually rotating credentials or certificates
- Manually reviewing and approving every deployment
- Manually generating reports from production data
The 30% rule: If any engineer spends more than 30% of their time on toil, they will burn out and leave. Track toil explicitly — have engineers tag their time (or estimate weekly) on operational toil versus project work.
Toil reduction as a project: Treat the top toil items as engineering projects with ROI analysis:
- Current cost: 2 hours/week x 52 weeks = 104 hours/year
- Automation cost: 40 hours to build
- Payback period: 20 weeks
- Annual savings after payback: 104 hours
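That payback arithmetic generalizes into a small helper worth running on every toil candidate (a sketch; the 2 hours/week and 40-hour figures above are the example's, not benchmarks):

```python
def toil_roi(toil_hours_per_week: float, automation_build_hours: float) -> dict:
    """Payback analysis for automating a recurring manual task."""
    annual_toil_hours = toil_hours_per_week * 52
    payback_weeks = automation_build_hours / toil_hours_per_week
    return {
        "annual_toil_hours": annual_toil_hours,          # 104 for the example
        "payback_weeks": payback_weeks,                  # 20 for the example
        "annual_savings_after_payback": annual_toil_hours,
    }
```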
## Runbooks

### Why Runbooks Matter
At 3 AM, the on-call engineer is tired, stressed, and possibly unfamiliar with the failing service. A runbook is the difference between a 15-minute resolution and a 3-hour investigation.
### Runbook Template

```markdown
# Runbook: [Service/Alert Name]

## Overview
What does this service do? Who owns it? What depends on it?

## Common Alerts

### Alert: [Alert Name]
**Severity:** SEV-2
**What it means:** [Plain English explanation]
**Likely causes:**
1. [Cause A] — [How to verify]
2. [Cause B] — [How to verify]
3. [Cause C] — [How to verify]

**Resolution steps:**
1. Check [dashboard link] for current status
2. If [condition A]: run `command` or restart service via [link]
3. If [condition B]: scale up via [process]
4. If none of the above works: escalate to [person/team]

**Rollback procedure:**
1. Revert to last known good deployment: `command`
2. Verify service health: [dashboard link]

## Dependencies
- Upstream: [services this depends on]
- Downstream: [services that depend on this]

## Contact
- Primary: @team-channel
- Escalation: @engineering-manager
```
### Runbook Practices
- Store runbooks next to the code — in the repo, not in a wiki. They are more likely to be updated when the code changes.
- Link alerts to runbooks — the page notification should include a direct link to the relevant runbook section. PagerDuty and OpsGenie support this.
- Test runbooks during game days — if the runbook is wrong or unclear, you want to find out during a drill, not at 3 AM.
- Update after every incident — if the runbook was missing information during an incident, fix it immediately as a postmortem action item.
- Keep them concise — a runbook is a decision tree, not a textbook. If it takes more than 5 minutes to read, it is too long.
## Alerting Design

### The Alert Pyramid

```
      /  Page  /      Immediate human action needed (SEV-1, SEV-2)
     /--------/
    / Ticket /        Needs attention during business hours
   /--------/
  / Dashboard /       Useful context, no action needed now
 /___________/
 Logs only            Background information for debugging
```
### Alerting Principles
| Principle | Description |
|---|---|
| Every page must be actionable | If the response is “wait and see,” it should not be a page |
| Alert on symptoms, not causes | Alert on “error rate > 5%” not “disk usage > 80%” — the former is user-facing, the latter might not be |
| Tune relentlessly | A noisy pager trains people to ignore alerts. Review alert frequency monthly. |
| Aggregate before paging | One page for “service is degraded” not five pages for individual instances |
| Include context in the alert | Dashboard link, runbook link, recent changes, affected users |
### Alert Fatigue
Alert fatigue is the #1 reason on-call fails. When engineers get 20+ pages per shift, they stop taking alerts seriously.
Measurement: Track pages per on-call shift per week. Targets:
- Good: 0-2 pages per shift
- Acceptable: 3-5 pages per shift
- Action required: more than 5 pages per shift — dedicate engineering time to reducing alerts
Common fixes:
- Delete alerts that have not fired in 6 months (they are probably wrong)
- Combine related alerts into one (reduce noise, not signal)
- Auto-remediate common issues (service restart, cache clear) and alert only if auto-remediation fails
- Increase thresholds for non-critical alerts
- Route informational alerts to Slack, not PagerDuty
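The thresholds above can be turned into a monthly-review helper — a sketch, with the function name ours for illustration:

```python
def paging_load_status(pages_per_shift: float) -> str:
    """Classify on-call paging load per the alert-fatigue thresholds."""
    if pages_per_shift <= 2:
        return "good"
    if pages_per_shift <= 5:
        return "acceptable"
    return "action required: dedicate engineering time to alert reduction"
```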
## Game Days and Chaos Engineering

### Game Days (Tabletop Exercises)
A game day is a structured simulation of an incident. No production systems are harmed.
Format (60-90 minutes):
- Setup (10 min): Describe the scenario. “It is 2 PM Tuesday. The checkout API starts returning 500 errors for 30% of requests. You are on-call.”
- Response (30-40 min): The team walks through their response. What do you check first? Who do you page? How do you communicate?
- Curveballs (15 min): Add complexity. “The database looks fine, but the cache is returning stale data.” “The customer success team is forwarding angry tweets.”
- Debrief (15 min): What went well? What was unclear? What would we do differently?
Frequency: Quarterly. Rotate who plays incident commander.
### Chaos Engineering (Production)
At the 16-person scale, full chaos engineering (Netflix Chaos Monkey) is usually premature. But targeted resilience testing is valuable:
- Kill a non-critical pod in staging and verify the system recovers
- Inject latency on a dependency call and verify timeouts and circuit breakers work
- Revoke a credential and verify the service fails gracefully with a clear error message
- Simulate a database failover and verify the application reconnects
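As an illustration of what the latency-injection test above is verifying, here is a toy circuit breaker that opens after repeated failures and then fails fast. All names are hypothetical; a real service would use a library or its RPC framework's built-in breaker.

```python
class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failures = 0
        self.failure_threshold = failure_threshold

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast instead of waiting")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result

def flaky_dependency():
    raise TimeoutError("injected fault: dependency too slow")

breaker = CircuitBreaker()
for _ in range(3):
    try:
        breaker.call(flaky_dependency)
    except TimeoutError:
        pass
# breaker.is_open is now True: further calls fail fast with RuntimeError
```

The resilience test passes if, after the injected faults, callers get an immediate error rather than hanging on the slow dependency.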
## Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Hero on-call | One senior engineer handles all incidents because others cannot | Pair juniors with seniors for on-call training; invest in runbooks |
| Page and pray | Alert fires, nobody responds for 30 minutes | Clear escalation policy with SLAs; PagerDuty auto-escalation |
| Firefighting culture | Team spends 50%+ time on incidents; features never ship | Error budget policy; dedicated reliability sprint when SLO is breached |
| Postmortem graveyard | Postmortems written but action items never completed | Track action items in sprint backlog; report completion rate monthly |
| Alert noise | 10+ pages per shift; on-call engineer ignores most | Monthly alert review; delete or tune any alert that did not require action |
| On-call without compensation | Engineers are on-call “because it is part of the job” | Pay on-call stipend or provide time off; this is a labor issue, not optional |
| Siloed knowledge | Only one person can debug service X | Cross-training, runbooks, pair on-call, mandatory knowledge transfer |
## Real-World Application

### Google SRE
Google’s SRE model sets the gold standard:
- SRE teams cap toil at 50% — if toil exceeds this, the team can refuse to take on new services
- On-call load is capped: max 2 events per 12-hour shift, max 25% of time on on-call duties
- Postmortems are mandatory for any event that consumed error budget
- SRE teams can “hand back” a service to the development team if reliability requirements are not met
At 16 engineers, you will not have a separate SRE team, but you can apply the same principles: toil tracking with caps, an error budget policy, and mandatory postmortems.
### PagerDuty’s Incident Response Framework
PagerDuty publishes their incident response documentation as open source. Key elements:
- Severity definitions with clear criteria (not “use judgment”)
- Incident commander rotation among all senior engineers
- Communication templates for each severity level
- Post-incident review within 48 hours for SEV-1/2
### Atlassian’s Incident Management
Atlassian (Jira, Confluence) publishes their incident management handbook:
- Incidents are managed in a dedicated Slack channel (#incident-NNNN)
- Incident commander, communications lead, and technical lead are separate roles
- Status page updates are required every 30 minutes during SEV-1
- Postmortem reviews are tracked as Jira tickets with the same SLAs as production bugs
### Netflix Chaos Engineering
Netflix’s approach:
- Chaos Monkey: Randomly kills instances in production during business hours
- Chaos Kong: Simulates entire region failures
- FIT (Failure Injection Testing): Injects failures at specific points in the request path
- The philosophy: “We do not trust systems that have not been tested by failure”
Netflix can do this because their architecture is designed for it (stateless services, automated failover, regional redundancy). Do not adopt chaos engineering before your architecture supports it.
## References

- Beyer, B. et al. (2016). *Site Reliability Engineering: How Google Runs Production Systems*.
- Beyer, B. et al. (2018). *The Site Reliability Workbook: Practical Ways to Implement SRE*.
- Blank-Edelman, D. (2018). *Seeking SRE: Conversations About Running Production Systems at Scale*.
- Kim, G. et al. (2016). *The DevOps Handbook* — incident management and feedback loops.
- Allspaw, J. (2015). "Trade-Offs Under Pressure" — Velocity conference talk on incident response.
- PagerDuty Incident Response Documentation — response.pagerduty.com
- Atlassian Incident Management — atlassian.com/incident-management
- Google SRE Book — sre.google/sre-book
- Netflix Chaos Engineering — netflix.github.io/chaosmonkey
- Jones, N. — "Incident Analysis and Postmortems" (SREcon talks).
- Allspaw, J. (2012). "Blameless Postmortems and a Just Culture" — Velocity conference talk.
- Dekker, S. (2014). *The Field Guide to Understanding Human Error* — systems thinking about failure.