Runbooks
Step-by-step procedures for specific operational scenarios. Add a new runbook when you encounter and resolve a new class of problem.
Available runbooks
Production incidents
- Platform is down — placeholder
- Database CPU spike — placeholder
- Sync is failing for all users — placeholder
- Auth is broken — placeholder
Tenant issues
- Tenant cannot log in — placeholder
- Tenant data appears incorrect — placeholder
- Tenant offboarding (deletion) — placeholder
Routine operations
- Deploy a hotfix — placeholder
- Rotate a leaked credential — placeholder
- Restore from backup — placeholder
- Add a new tenant
Runbook template
When writing a new runbook, follow this template:
# [Runbook] [Problem name]
## When to use this runbook
Specific symptoms that indicate this runbook applies.
## Severity
P0 / P1 / P2 / P3
## Pre-checks (30 seconds)
Quick checks to confirm the problem is what you think it is.
## Immediate mitigation (5 minutes)
Steps to stop the bleeding: restore service first, even if the root cause is still unknown.
## Root cause investigation (30+ minutes)
How to find the underlying cause.
## Resolution
How to fix the problem permanently so it does not recur.
## Post-incident
- Update the Linear ticket
- Write a post-mortem if the incident was P0 or P1
- Update this runbook with new learnings
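
Separately from the template itself, a new runbook can be checked against it before review. The following is a minimal sketch, not an existing tool in this repository: the function name `missing_sections` and the `REQUIRED_SECTIONS` list are hypothetical, and the sketch assumes runbooks are markdown files that use the `## ` headings shown in the template.

```python
import re

# Section headings copied from the runbook template, in order.
# Time hints such as "(30 seconds)" are left out so they stay optional.
REQUIRED_SECTIONS = [
    "When to use this runbook",
    "Severity",
    "Pre-checks",
    "Immediate mitigation",
    "Root cause investigation",
    "Resolution",
    "Post-incident",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return the template sections that have no matching '## ' heading."""
    # Collect the text of every level-2 heading in the runbook.
    headings = re.findall(
        r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
    )
    missing = []
    for section in REQUIRED_SECTIONS:
        # Prefix match, so "Pre-checks (30 seconds)" satisfies "Pre-checks".
        if not any(heading.startswith(section) for heading in headings):
            missing.append(section)
    return missing


if __name__ == "__main__":
    # Hypothetical, incomplete draft used only to demonstrate the check.
    draft = """# [Runbook] Database CPU spike
## When to use this runbook
## Severity
## Pre-checks (30 seconds)
## Resolution
"""
    print(missing_sections(draft))
```

Run against the incomplete draft above, the script reports the mitigation, investigation, and post-incident sections as missing. The check covers headings only; whether each section has useful content is still up to the reviewer.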