Incident response

Read this before your first on-call shift. The runbook lives in the platform repo at runbooks/incident.md; this page is the policy and the mental model.

You will not be punished for declaring an incident that turned out not to be one. A false positive costs little; a false negative costs missed pages, frustrated customers, and a stale status page.

Severity ladder

Severity | Examples | Who pages, who joins
SEV-1 | Region down. Payment provider down across multiple tenants. Data breach. | All platform on-call + EM on-call + founder on-call
SEV-2 | One service unhealthy, customer-visible degradation, no data loss. | Service owner + platform on-call
SEV-3 | Internal degradation, no customer impact. | Service owner only
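
The ladder can be read as a lookup from severity to the rotations that get paged. A minimal sketch in Python, assuming made-up rotation identifiers (platform-oncall-all, platform-oncall, em-oncall, founder-oncall, service-owner); the real rotation names live in the paging tool and are not given on this page.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Rotation names are hypothetical placeholders, not real paging-tool identifiers.
PAGING_POLICY = {
    Severity.SEV1: ("platform-oncall-all", "em-oncall", "founder-oncall"),
    Severity.SEV2: ("service-owner", "platform-oncall"),
    Severity.SEV3: ("service-owner",),
}


def rotations_to_page(severity: Severity) -> tuple[str, ...]:
    """Return the rotations paged for a severity, following the ladder."""
    return PAGING_POLICY[severity]
```

Keeping the policy in one table rather than in branching logic makes it easy to compare against the ladder line by line.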

Start the incident

zephyr incident open --sev 2 --summary "Checkout p99 elevated in eu-frankfurt"

The CLI creates the Slack channel, the status page entry, and the Linear ticket. It pages the right rotation. It pins the runbook in the channel.

During the incident

Three roles, even for a small SEV-2:

  • Incident commander (IC) — owns the timeline, the decisions, the channel. Not necessarily the most technical person; the calmest one.
  • Comms — owns the status page, the customer email, the after-action announcement. Often the EM.
  • Investigator(s) — the people in the code. Usually two; one drives, one rubber-ducks.

Post a status update in the channel and on the status page at least every 15 minutes, even if the update is "still investigating." Silence is worse than bad news.
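
The 15-minute cadence is simple to check mechanically. The sketch below is illustrative only: the function names are hypothetical and nothing here is part of the zephyr CLI. It assumes timezone-aware timestamps for the last update.

```python
from datetime import datetime, timedelta, timezone

# The 15-minute interval comes from the policy; everything else is illustrative.
UPDATE_INTERVAL = timedelta(minutes=15)


def next_update_due(last_update: datetime) -> datetime:
    """Latest time by which the next status update should be posted."""
    return last_update + UPDATE_INTERVAL


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True when more than 15 minutes have passed since the last update."""
    return now > next_update_due(last_update)
```

A reminder bot for Comms could call update_overdue on a timer and nudge the channel when it returns True.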

After the incident

A blameless post-mortem within 72 hours, drafted by the IC, reviewed by the team. The template lives at runbooks/post-mortem-template.md. The finished post-mortem goes in the platform repo, not on the intranet: post-mortems are searchable engineering knowledge, not internal politics.

The post-mortem closes with one or more action items, each owned by a named person and ticketed in Linear with a "post-mortem" label. If action items repeatedly fail to ship, escalate that pattern.
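
The action-item rules (at least one item, a named owner, a ticket, the "post-mortem" label) can be expressed as a small review checklist. This is a sketch under assumptions: the ActionItem shape and the ticket ID format are hypothetical, and it does not call the Linear API.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    title: str
    owner: str        # a named person, not a team
    ticket_id: str    # Linear ticket reference; the format is an assumption
    labels: list[str] = field(default_factory=list)


def post_mortem_problems(items: list[ActionItem]) -> list[str]:
    """List every way the action items fall short of the policy."""
    problems = []
    if not items:
        problems.append("post-mortem has no action items")
    for item in items:
        if not item.owner.strip():
            problems.append(f"'{item.title}' has no named owner")
        if not item.ticket_id.strip():
            problems.append(f"'{item.title}' is not ticketed")
        if "post-mortem" not in item.labels:
            problems.append(f"'{item.title}' is missing the post-mortem label")
    return problems
```

An empty result means the action items meet the stated rules; it says nothing about whether they are the right actions, which remains the team's call in review.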

Customer communication

For SEV-1 and SEV-2 we send a customer email within 24 hours of resolution. The template is in comms/incident-email-template.md. Tone: factual, brief, no marketing language. Apologise once.
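
As a worked example of the email rule, the sketch below computes the deadline from the resolution time. The function name is hypothetical, and the severity is passed as a plain integer (1, 2, or 3) to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# 24 hours after resolution, per the policy.
EMAIL_WINDOW = timedelta(hours=24)


def customer_email_deadline(sev: int, resolved_at: datetime) -> Optional[datetime]:
    """Return when the customer email is due, or None if no email is required."""
    if sev in (1, 2):
        return resolved_at + EMAIL_WINDOW
    return None  # SEV-3 has no customer impact, so no email is sent
```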

Out of hours

The on-call rotation crosses time zones so there is always someone awake. If the page rings and nobody answers within 5 minutes, the escalation flow pages the next person in the rotation. If the second person doesn't answer within 5 more minutes, the EM on-call is paged. If they don't answer, the founder on-call is paged. The chain is unbroken for a reason.
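
The escalation chain can be sketched as a function of how long a page has gone unacknowledged. Two assumptions are baked in and labeled in the code: the step names are hypothetical labels, and the EM on-call gets the same 5-minute window as the first two steps, which this page does not actually state.

```python
ACK_TIMEOUT_MINUTES = 5

# Order follows the policy; the names are hypothetical labels.
ESCALATION_CHAIN = [
    "primary-oncall",
    "next-in-rotation",
    "em-oncall",
    "founder-oncall",
]


def who_is_paged(minutes_unacknowledged: float) -> list[str]:
    """Return everyone paged so far if nobody has acknowledged yet.

    Assumes every step, including the EM on-call, waits 5 minutes
    before escalating; the policy only states this for the first two.
    """
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    steps = int(minutes_unacknowledged // ACK_TIMEOUT_MINUTES)
    last = min(steps, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: last + 1]
```

Under these assumptions, a page that nobody acknowledges reaches the founder on-call at the 15-minute mark and stops there, since the chain has no further step.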