Operations
This section covers running the platform in production. Read it if you are on call or do any production work.
In this section
- Deployment — how code goes from PR to production
- Environments — dev, staging, production differences
- Monitoring & Alerts — CloudWatch dashboards, alert routing
- Runbooks — step-by-step responses to specific incidents
- Backup & Restore — RDS backups, point-in-time recovery
- Incident Response — what to do when things break
- Tenant Onboarding — provisioning a new customer
On-call quick reference
| Severity | Response time | Examples |
|---|---|---|
| P0 — Critical | 15 min | Platform down, data loss, security breach |
| P1 — High | 1 hour | Module broken, sync failing, auth issues |
| P2 — Medium | Next business day | Performance degradation, minor feature broken |
| P3 — Low | Next sprint | UX issues, cosmetic bugs |
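
If alerting or ticketing scripts need these targets, the two time-boxed rows can be encoded directly. The following is a minimal sketch, not part of the platform: the names `RESPONSE_TARGETS` and `response_deadline` are hypothetical, and P2 and P3 are left as `None` because "next business day" and "next sprint" depend on a calendar that this page does not define.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response targets copied from the severity table above.
# P2 and P3 are calendar-based rather than fixed durations,
# so they are deliberately left unset (assumption of this sketch).
RESPONSE_TARGETS = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": None,
    "P3": None,
}


def response_deadline(severity: str, opened_at: datetime) -> Optional[datetime]:
    """Return when a first response is due, or None if the target is not a fixed duration."""
    target = RESPONSE_TARGETS[severity]  # raises KeyError for unknown severities
    if target is None:
        return None
    return opened_at + target
```

For example, `response_deadline("P0", opened_at)` returns a time 15 minutes after `opened_at`, while `response_deadline("P2", opened_at)` returns `None` and leaves scheduling to whoever triages the issue.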
If you're new to on-call, read Incident Response first.