---
title: Incident response
sort: 3
tags: [policy, oncall]
---

# Incident response

Read this before your first on-call shift. The runbook lives in the platform repo at
`runbooks/incident.md`; this page is the policy and the mental model.

> **You will not be punished for declaring an incident that turned out not to be one.** The cost
> of false positives is small; the cost of false negatives is missed pages, frustrated customers,
> and stale status pages.

## Severity ladder

| Severity | Examples                                                                 | Who pages, who joins                                |
|----------|--------------------------------------------------------------------------|-----------------------------------------------------|
| SEV-1    | Region down. Payment provider down across multiple tenants. Data breach. | All platform on-call + EM on-call + founder on-call |
| SEV-2    | One service unhealthy, customer-visible degradation, no data loss.       | Service owner + platform on-call                    |
| SEV-3    | Internal degradation, no customer impact.                                | Service owner only                                  |
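
As a reading aid, the ladder can be treated as a lookup from severity to the roles that get paged. The sketch below is only an illustration of the table: the `Severity` enum, the `PAGING` mapping, and the role strings are hypothetical names, not identifiers from the `zephyr` CLI or any paging tool.

```python
from __future__ import annotations

from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Roles paged per severity, transcribed from the table above.
# The strings are illustrative labels, not real rotation names.
PAGING = {
    Severity.SEV1: ["all platform on-call", "EM on-call", "founder on-call"],
    Severity.SEV2: ["service owner", "platform on-call"],
    Severity.SEV3: ["service owner"],
}


def who_to_page(sev: Severity) -> list[str]:
    """Return a copy of the roles paged for the given severity."""
    return list(PAGING[sev])
```

For example, `who_to_page(Severity.SEV3)` returns only the service owner, matching the last row of the table.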

## Start the incident

```bash
zephyr incident open --sev 2 --summary "Checkout p99 elevated in eu-frankfurt"
```

The CLI creates the Slack channel, the status page entry, and the Linear ticket. It pages the
right rotation. It pins the runbook in the channel.

## During the incident

Three roles, even for a small SEV-2:

- **Incident commander (IC)** — owns the timeline, the decisions, the channel. Not necessarily
  the most technical person; the calmest one.
- **Comms** — owns the status page, the customer email, the after-action announcement. Often the
  EM.
- **Investigator(s)** — the people in the code. Usually two; one drives, one rubber-ducks.

Status updates land in the channel and on the status page at least every 15 minutes, even if the
update is "still investigating." Silence is worse than bad news.
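
The 15-minute cadence is easy to check mechanically. The following is a minimal sketch, assuming the IC or a bot tracks the timestamp of the last update; `next_update_due` and `update_overdue` are hypothetical helpers, not part of the runbook or the CLI.

```python
from datetime import datetime, timedelta

# Maximum gap between status updates, taken from the policy above.
UPDATE_INTERVAL = timedelta(minutes=15)


def next_update_due(last_update: datetime) -> datetime:
    """Latest time by which the next status update should be posted."""
    return last_update + UPDATE_INTERVAL


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once more than 15 minutes have passed since the last update."""
    return now > next_update_due(last_update)
```

An update posted at 12:00 makes the next one due by 12:15; at 12:16 with nothing new posted, `update_overdue` returns `True`.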

## After the incident

A blameless post-mortem is due within 72 hours, drafted by the IC and reviewed by the team. The template
lives at `runbooks/post-mortem-template.md`. It goes in the platform repo, not on the intranet —
post-mortems are searchable engineering knowledge, not internal politics.

The post-mortem closes with one or more **action items**, each owned by a named person and
ticketed in Linear with a "post-mortem" label. If action items repeatedly go unshipped, escalate that pattern.

## Customer communication

For SEV-1 and SEV-2 we send a customer email within 24 hours of resolution. The template is in
`comms/incident-email-template.md`. Tone: factual, brief, no marketing language. Apologise once.

## Out of hours

The on-call rotation crosses time zones so there is always someone awake. If the page rings and
nobody answers within 5 minutes, the escalation flow pages the next person in the rotation. If
the second person doesn't answer within 5 more minutes, the EM on-call is paged. If they don't answer,
the founder on-call is paged. The chain is unbroken for a reason.
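
The escalation chain can be sketched as a timeline of offsets from the first page. This is an illustration, not the actual escalation configuration: the role labels and function names are hypothetical, and the wait before escalating from the EM on-call to the founder on-call is not stated above, so the sketch assumes the same 5 minutes.

```python
from __future__ import annotations

from datetime import timedelta

# Acknowledgement window per step, from the policy above.
# ASSUMPTION: the EM -> founder step also waits 5 minutes; the policy
# does not give a number for that step.
ACK_TIMEOUT = timedelta(minutes=5)

# Order of escalation as described above; labels are illustrative.
ESCALATION_CHAIN = [
    "primary on-call",
    "next in rotation",
    "EM on-call",
    "founder on-call",
]


def paged_at(step: int) -> timedelta:
    """Offset from the first page at which the given step is paged."""
    return ACK_TIMEOUT * step


def who_is_paged(elapsed: timedelta) -> list[str]:
    """Everyone paged so far, assuming nobody has acknowledged yet."""
    return [
        role
        for step, role in enumerate(ESCALATION_CHAIN)
        if paged_at(step) <= elapsed
    ]
```

Under these assumptions, an unacknowledged page reaches the EM on-call 10 minutes after the first page and the founder on-call at 15 minutes.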
