Incident response
Read this before your first on-call shift. The runbook lives in the platform repo at
runbooks/incident.md; this page is the policy and the mental model.
You will not be punished for declaring an incident that turns out not to be one. A false positive costs little; a false negative means missed pages, frustrated customers, and a silent status page.
Severity ladder
| Severity | Examples | Who pages, who joins |
|---|---|---|
| SEV-1 | Region down. Payment provider down across multiple tenants. Data breach. | All platform on-call + EM on-call + founder on-call |
| SEV-2 | One service unhealthy, customer-visible degradation, no data loss. | Service owner + platform on-call |
| SEV-3 | Internal degradation, no customer impact. | Service owner only |
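The ladder above can be read as a lookup from severity to the rotations that get paged. Here is a minimal Python sketch of that lookup. The rotation labels (`service-owner`, `platform-on-call`, `em-on-call`, `founder-on-call`) and the function name are hypothetical, chosen for illustration and not taken from the real paging configuration.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Hypothetical rotation labels mirroring the "Who pages, who joins" column.
PAGING_POLICY = {
    Severity.SEV1: ["platform-on-call", "em-on-call", "founder-on-call"],
    Severity.SEV2: ["service-owner", "platform-on-call"],
    Severity.SEV3: ["service-owner"],
}


def rotations_to_page(sev: Severity) -> list:
    """Return the rotations paged for a given severity (a copy, safe to mutate)."""
    return list(PAGING_POLICY[sev])


print(rotations_to_page(Severity.SEV2))  # ['service-owner', 'platform-on-call']
```

Keeping the mapping in one table makes it easy to check against the written policy.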
Start the incident
zephyr incident open --sev 2 --summary "Checkout p99 elevated in eu-frankfurt"
The CLI creates the Slack channel, the status page entry, and the Linear ticket. It pages the rotations listed for that severity. It pins the runbook in the channel.
During the incident
Three roles, even for a small SEV-2:
- Incident commander (IC) — owns the timeline, the decisions, the channel. Not necessarily the most technical person; the calmest one.
- Comms — owns the status page, the customer email, the after-action announcement. Often the EM.
- Investigator(s) — the people in the code. Usually two; one drives, one rubber-ducks.
Status updates land in the channel and on the status page at least every 15 minutes, even if the update is "still investigating." Silence is worse than bad news.
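The 15-minute cadence is simple clock arithmetic, and Comms may find it useful to make the deadline explicit. This is a small Python sketch, not part of the CLI. The function names are hypothetical and assume the time of the last update is known.

```python
from datetime import datetime, timedelta

# Maximum gap between status updates, per the policy above.
UPDATE_INTERVAL = timedelta(minutes=15)


def next_update_due(last_update: datetime) -> datetime:
    """Latest time by which the next status update should be posted."""
    return last_update + UPDATE_INTERVAL


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once more than 15 minutes have passed since the last update."""
    return now > next_update_due(last_update)


last = datetime(2024, 1, 1, 12, 0)
print(next_update_due(last))                               # 2024-01-01 12:15:00
print(update_overdue(last, datetime(2024, 1, 1, 12, 20)))  # True
```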
After the incident
A blameless post-mortem is due within 72 hours, drafted by the IC and reviewed by the team. The template
lives at runbooks/post-mortem-template.md. The finished post-mortem goes in the platform repo, not on the intranet:
post-mortems are searchable engineering knowledge, not internal politics.
The post-mortem closes with one or more action items, each owned by a named person and ticketed in Linear with a "post-mortem" label. If action items repeatedly fail to ship, escalate that pattern.
Customer communication
For SEV-1 and SEV-2 we send a customer email within 24 hours of resolution. The template is in
comms/incident-email-template.md. Tone: factual, brief, no marketing language. Apologise once.
Out of hours
The on-call rotation crosses time zones so there is always someone awake. If the page rings and nobody answers within 5 minutes, the escalation flow pages the next person in the rotation. If the second person doesn't answer within 5 more minutes, the EM on-call is paged. If the EM doesn't answer, the founder on-call is paged. The chain is unbroken for a reason.
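The escalation chain can be sketched as an ordered list of steps, each with a wait before the next person is paged. This Python sketch is illustrative only and is not the paging tool's actual configuration. The step labels and function names are hypothetical. The policy gives no figure for how long the EM on-call has to answer, so that wait is an explicit parameter the caller must supply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    target: str
    # Minutes to wait for an acknowledgement before escalating; None marks the last step.
    wait_minutes: Optional[int]


def build_chain(em_wait_minutes: int) -> List[Step]:
    """Build the chain; em_wait_minutes is an assumption, not stated in the policy."""
    return [
        Step("primary on-call", 5),
        Step("next in rotation", 5),
        Step("EM on-call", em_wait_minutes),
        Step("founder on-call", None),
    ]


def paged_so_far(chain: List[Step], minutes_unacknowledged: int) -> List[str]:
    """Everyone who has been paged after this many minutes without an ack."""
    paged = []
    elapsed = 0
    for step in chain:
        if minutes_unacknowledged < elapsed:
            break
        paged.append(step.target)
        if step.wait_minutes is None:
            break
        elapsed += step.wait_minutes
    return paged


print(paged_so_far(build_chain(em_wait_minutes=5), 12))
# ['primary on-call', 'next in rotation', 'EM on-call']
```

Because the last step has no wait, the chain always ends with a named person.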