Changes

Comparing empty to 160dc49.

@@ -1,0 +1,65 @@
1+---
2+title: Incident response
3+sort: 3
4+tags: [policy, oncall]
5+---
6+
7+# Incident response
8+
9+Read this before your first on-call shift. The runbook lives in the platform repo at
10+`runbooks/incident.md`; this page is the policy and the mental model.
11+
12+> **You will not be punished for declaring an incident that turned out not to be one.** The cost
13+> of false positives is small; the cost of false negatives is missed pages, frustrated customers,
14+> and status pages that never get updated.
15+
16+## Severity ladder
17+
18+| Severity | Examples | Who pages, who joins |
19+|----------|--------------------------------------------------------------------------|--------------------------------------------|
20+| SEV-1 | Region down. Payment provider down across multiple tenants. Data breach. | All platform on-call + EM on-call + founder on-call |
21+| SEV-2 | One service unhealthy, customer-visible degradation, no data loss. | Service owner + platform on-call |
22+| SEV-3 | Internal degradation, no customer impact. | Service owner only |
23+
24+## Start the incident
25+
26+```bash
27+zephyr incident open --sev 2 --summary "Checkout p99 elevated in eu-frankfurt"
28+```
29+
30+The CLI creates the Slack channel, the status page entry, and the Linear ticket. It pages the
31+rotation for the declared severity and pins the runbook in the channel.
32+
33+## During the incident
34+
35+Three roles, even for a small SEV-2:
36+
37+- **Incident commander (IC)** — owns the timeline, the decisions, the channel. Not necessarily
38+ the most technical person; the calmest one.
39+- **Comms** — owns the status page, the customer email, the after-action announcement. Often the
40+ EM.
41+- **Investigator(s)** — the people in the code. Usually two; one drives, one rubber-ducks.
42+
43+Status updates land in the channel and on the status page at least every 15 minutes, even if the
44+update is "still investigating." Silence is worse than bad news.
45+
46+## After the incident
47+
48+A blameless post-mortem is due within 72 hours, drafted by the IC and reviewed by the team. The
49+template lives at `runbooks/post-mortem-template.md`. The finished post-mortem goes in the platform
50+repo, not on the intranet: post-mortems are searchable engineering knowledge, not internal politics.
51+
52+The post-mortem closes with one or more **action items**, each owned by a named person and
53+ticketed in Linear with a "post-mortem" label. If action items repeatedly go unshipped, escalate that pattern.
54+
55+## Customer communication
56+
57+For SEV-1 and SEV-2 we send a customer email within 24 hours of resolution. The template is in
58+`comms/incident-email-template.md`. Tone: factual, brief, no marketing language. Apologise once.
59+
60+## Out of hours
61+
62+The on-call rotation crosses time zones so there is always someone awake. If a page goes
63+unacknowledged for 5 minutes, the escalation flow pages the next person in the rotation. If the
64+second person doesn't acknowledge within a further 5 minutes, the EM on-call is paged. If the EM
65+doesn't acknowledge, the founder on-call is paged. The chain is unbroken for a reason.
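
The out-of-hours chain above can be read as a simple timeline. The sketch below is only an illustration of that reading, not the paging system's actual configuration: the label `primary on-call`, the function name `paged_so_far`, and the `em_timeout_min` parameter are assumptions. In particular, the page does not say how long the EM on-call has before the founder on-call is paged, so that window is a parameter with a placeholder default.

```python
from typing import List


def paged_so_far(minutes_unacked: float, em_timeout_min: float = 5) -> List[str]:
    """Return everyone who has been paged after `minutes_unacked` minutes
    without an acknowledgement.

    The order and the two 5-minute windows come from the policy text.
    `em_timeout_min` is an ASSUMPTION: the policy gives no number for how
    long the EM on-call has before the founder on-call is paged.
    """
    chain = [
        ("primary on-call", 0),                    # first page (label is hypothetical)
        ("next person in rotation", 5),            # after 5 minutes with no answer
        ("EM on-call", 10),                        # after 5 more minutes
        ("founder on-call", 10 + em_timeout_min),  # assumed window, see docstring
    ]
    return [who for who, paged_at in chain if minutes_unacked >= paged_at]


if __name__ == "__main__":
    for minute in (0, 5, 10, 15):
        print(minute, paged_so_far(minute))
```

Keeping the chain as data rather than nested conditionals makes it easy to compare against the policy text line by line when either one changes.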