Changes
Comparing empty → 160dc49.
| @@ -1,0 +1,65 @@ | ||
| 1 | +--- | |
| 2 | +title: Incident response | |
| 3 | +sort: 3 | |
| 4 | +tags: [policy, oncall] | |
| 5 | +--- | |
| 6 | + | |
| 7 | +# Incident response | |
| 8 | + | |
| 9 | +Read this before your first on-call shift. The runbook lives in the platform repo at | |
| 10 | +`runbooks/incident.md`; this page is the policy and the mental model. | |
| 11 | + | |
| 12 | +> **You will not be punished for declaring an incident that turned out not to be one.** The cost | |
| 13 | +> of false positives is small; the cost of false negatives is missed pages, frustrated customers, | |
| 14 | +> and unanswered status pages. | |
| 15 | + | |
| 16 | +## Severity ladder | |
| 17 | + | |
| 18 | +| Severity | Examples | Who pages, who joins | | |
| 19 | +|----------|--------------------------------------------------------------------------|--------------------------------------------| | |
| 20 | +| SEV-1 | Region down. Payment provider down across multiple tenants. Data breach. | All platform on-call + EM on-call + founder on-call | | |
| 21 | +| SEV-2 | One service unhealthy, customer-visible degradation, no data loss. | Service owner + platform on-call | | |
| 22 | +| SEV-3 | Internal degradation, no customer impact. | Service owner only | | |
| 23 | + | |
| 24 | +## Start the incident | |
| 25 | + | |
| 26 | +```bash | |
| 27 | +zephyr incident open --sev 2 --summary "Checkout p99 elevated in eu-frankfurt" | |
| 28 | +``` | |
| 29 | + | |
| 30 | +The CLI creates the Slack channel, the status page entry, and the Linear ticket. It pages the | |
| 31 | +right rotation. It pins the runbook in the channel. | |
| 32 | + | |
| 33 | +## During the incident | |
| 34 | + | |
| 35 | +Three roles, even for a small SEV-2: | |
| 36 | + | |
| 37 | +- **Incident commander (IC)** — owns the timeline, the decisions, the channel. Not necessarily | |
| 38 | + the most technical person; the calmest one. | |
| 39 | +- **Comms** — owns the status page, the customer email, the after-action announcement. Often the | |
| 40 | + EM. | |
| 41 | +- **Investigator(s)** — the people in the code. Usually two; one drives, one rubber-ducks. | |
| 42 | + | |
| 43 | +Status updates land in the channel and on the status page at least every 15 minutes, even if the | |
| 44 | +update is "still investigating." Silence is worse than bad news. | |
| 45 | + | |
| 46 | +## After the incident | |
| 47 | + | |
| 48 | +A blameless post-mortem within 72 hours, drafted by the IC, reviewed by the team. The template | |
| 49 | +lives at `runbooks/post-mortem-template.md`. It goes in the platform repo, not on the intranet — | |
| 50 | +post-mortems are searchable engineering knowledge, not internal politics. | |
| 51 | + | |
| 52 | +The post-mortem closes with one or more **action items**, each owned by a named person and | |
| 53 | +ticketed in Linear with a "post-mortem" label. If action items repeatedly fail to ship, escalate that pattern. | |
| 54 | + | |
| 55 | +## Customer communication | |
| 56 | + | |
| 57 | +For SEV-1 and SEV-2 we send a customer email within 24 hours of resolution. The template is in | |
| 58 | +`comms/incident-email-template.md`. Tone: factual, brief, no marketing language. Apologise once. | |
| 59 | + | |
| 60 | +## Out of hours | |
| 61 | + | |
| 62 | +The on-call rotation crosses time zones so there is always someone awake. If the page rings and | |
| 63 | +nobody answers within 5 minutes, the escalation flow pages the next person in the rotation. If | |
| 64 | +the second person doesn't answer within 5 more minutes, the EM on-call is paged. If the EM doesn't | |
| 65 | +answer, the founder on-call is paged. The chain is unbroken for a reason. | |
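
The out-of-hours escalation chain described above can be sketched as a small lookup, which may help when checking the paging configuration against the policy. This is an illustrative sketch only, not the actual paging tool's logic. The names `Step`, `CHAIN`, and `paged_so_far` are hypothetical, "primary on-call" is a label chosen here for whoever receives the first page, and the wait before the founder on-call is paged is an assumed 5 minutes because the policy does not state one.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    role: str
    # Minutes to wait for an acknowledgement before escalating further.
    # None marks the end of the chain.
    wait_minutes: Optional[int]


# ASSUMPTION: the policy gives no timeout for the EM on-call step;
# 5 minutes is used here only to mirror the two earlier steps.
ASSUMED_EM_WAIT_MINUTES = 5

# Order and the first two 5-minute waits come from the policy text.
CHAIN = [
    Step("primary on-call", 5),  # hypothetical label for the first person paged
    Step("next in rotation", 5),
    Step("EM on-call", ASSUMED_EM_WAIT_MINUTES),
    Step("founder on-call", None),
]


def paged_so_far(minutes_unacknowledged: int) -> list[str]:
    """Return every role that has been paged after this many minutes without an ack."""
    paged: list[str] = []
    threshold = 0  # minute mark at which the current step gets paged
    for step in CHAIN:
        if minutes_unacknowledged < threshold:
            break
        paged.append(step.role)
        if step.wait_minutes is None:
            break
        threshold += step.wait_minutes
    return paged
```

Under these assumptions, `paged_so_far(7)` lists the first responder and the next person in the rotation, and any value of 15 or more lists the full chain through the founder on-call. If the real EM timeout differs, only `ASSUMED_EM_WAIT_MINUTES` needs to change.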