Best Practice: Maintaining Runbooks and Operational Documentation
Context
Runbooks are the institutional memory of your operations organization. Without runbooks, on-call capability depends on individual people. With good runbooks, a junior engineer can resolve an incident without calling the senior engineer at 3am.
Target State
Complete operational documentation:
- Runbooks for all known failure scenarios and routine tasks
- Every paging alert has a runbook URL
- Runbooks are versioned, reviewed, and discoverable
- Operational Debt Register catalogs all known toil sources
- Quarterly review ensures currency
Technical Implementation
Step 1: Use the Runbook Template
# Runbook: Payment Service – High 5xx Error Rate
**Created:** 2025-03-18
**Last reviewed:** 2025-03-18
**Author:** payment-team
**Applies to alert:** `payment-service-5xx-error-rate`
## Overview
This runbook describes the response to an increase in the 5xx error rate
in the Payment Service above the SLO threshold.
**Impact:** Users receive error messages during the payment process.
**SLO:** < 0.1% error rate over 5 minutes.
## Diagnosis Steps
### 1. Determine the scope
```bash
# 5xx responses per minute over the last 5 minutes (a count, not a rate)
# Requires GNU date for the -d option
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=<lb-arn-suffix> \
  --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Sum
```
### 2. Analyze error logs
```bash
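# CloudWatch Logs Insights query; start-query returns only a query ID
QUERY_ID=$(aws logs start-query \
  --log-group-name "/aws/ecs/payment-service" \
  --start-time "$(date -d '15 minutes ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, message, error.type, error.message, trace_id
    | filter level = "ERROR"
    | sort @timestamp desc
    | limit 50' \
  --query 'queryId' --output text)
# Fetch the results; repeat until "status" is "Complete"
aws logs get-query-results --query-id "$QUERY_ID"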
```
### 3. Check recent deployments
```bash
# ECS Service Events
aws ecs describe-services \
--cluster payment-production \
--services payment-service \
--query 'services[0].events[:5]'
```
## Remediation Steps
### Scenario A: Database connection errors
**Detection:** Logs show `Connection refused` or `Connection pool exhausted`
1. Check RDS instance status: AWS Console → RDS → payment-service-db
2. Check connection pool limits: CloudWatch Metric `DatabaseConnections`
3. If limit reached: Emergency scale-up via runbook [Database Emergency]
4. If RDS is unstable: Failover to read replica
### Scenario B: Failed deployment
**Detection:** Errors started shortly after deployment timestamp
1. Trigger canary rollback:
```bash
aws deploy stop-deployment --deployment-id <id> --auto-rollback-enabled
```
2. If no canary deployment: deploy previous task definition:
```bash
aws ecs update-service \
--cluster payment-production \
--service payment-service \
--task-definition payment-service:<previous-version>
```
## Escalation
| Elapsed time | Trigger | Escalate to |
|-----------|--------|---------|
| +5 minutes | No progress | Senior Engineer: @senior-payment |
| +15 minutes | No resolution | Tech Lead: @payment-lead |
| +30 minutes | Production outage | CTO: @cto-oncall |
## Related Runbooks
- [Database Emergency Runbook](database-emergency.md)
- [Canary Rollback Runbook](canary-rollback.md)
- [Postmortem Template](../incident-templates/postmortem-template.md)
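A template only helps if runbooks keep following it. The sketch below is not part of the template; it shows one way a review or CI step might check a runbook's Markdown text for the section headings and metadata fields the template uses. The function name `lint_runbook` and the decision to treat all four sections and four fields as mandatory are assumptions made for illustration.

```python
import re

# Sections and metadata fields taken from the template; treating all of them
# as mandatory is an assumption of this sketch.
REQUIRED_SECTIONS = ["Overview", "Diagnosis Steps", "Remediation Steps", "Escalation"]
REQUIRED_FIELDS = ["Created", "Last reviewed", "Author", "Applies to alert"]


def lint_runbook(text: str) -> list[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    # Collect second-level headings only ("## Overview", not "### 1. ...").
    headings = set(re.findall(r"^## +(.+?)\s*$", text, flags=re.MULTILINE))
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    for field in REQUIRED_FIELDS:
        # Metadata lines look like: **Created:** 2025-03-18
        pattern = rf"^\*\*{re.escape(field)}:\*\* *\S"
        if not re.search(pattern, text, flags=re.MULTILINE):
            problems.append(f"missing metadata field: {field}")
    return problems
```

Calling `lint_runbook(Path("5xx-errors.md").read_text())` in a pre-merge check would flag runbooks that drift from the template before they reach on-call engineers.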
Step 2: Organize Runbooks in Version Control
docs/
├── runbooks/
│   ├── README.md                  # Runbook index with links and coverage status
│   ├── payment-service/
│   │   ├── 5xx-errors.md          # Linked to: payment-service-5xx-error-rate alert
│   │   ├── high-latency.md        # Linked to: payment-service-p99-latency alert
│   │   ├── database-full.md
│   │   └── deployment.md          # Deployment and rollback procedure
│   ├── infrastructure/
│   │   ├── terraform-drift.md
│   │   └── certificate-renewal.md
│   └── incident-templates/
│       └── postmortem-template.md
└── ops-debt-register.yml          # Operational Debt Register
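An index that is maintained by hand tends to fall behind the directory it describes. As a minimal sketch, assuming the layout shown here, the link list in `README.md` could be generated from the directory contents. The function name `build_index` and the output format are illustrative assumptions, and the coverage status column is deliberately left out.

```python
from pathlib import Path


def build_index(runbook_root: Path) -> str:
    """Render a Markdown index of all runbooks below runbook_root, grouped by directory."""
    lines = ["# Runbook Index", ""]
    # Sort directories and files so the generated index is stable across runs.
    for directory in sorted(p for p in runbook_root.iterdir() if p.is_dir()):
        runbooks = sorted(directory.glob("*.md"))
        if not runbooks:
            continue  # skip directories that contain no runbooks yet
        lines.append(f"## {directory.name}")
        for runbook in runbooks:
            link = runbook.relative_to(runbook_root).as_posix()
            lines.append(f"- [{runbook.stem}]({link})")
        lines.append("")
    return "\n".join(lines)
```

Running `build_index(Path("docs/runbooks"))` in CI and comparing the result with the committed `README.md` is one way to detect runbooks that were added without being indexed.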
Step 3: Ensure Alert-Runbook Linking
# prometheus-alerts.yaml – runbook URL as mandatory annotation
- alert: PaymentHighErrorRate
  # Ratio of 5xx responses to all responses over 5 minutes, not the raw 5xx request rate
  expr: |
    sum(rate(http_requests_total{service="payment",code=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="payment"}[5m])) > 0.01
  annotations:
    runbook_url: "https://wiki.company.com/runbooks/payment-service/5xx-errors"
    # Alternative: link directly to the Markdown file in the repository:
    # runbook_url: "https://github.com/myorg/docs/blob/main/runbooks/payment-service/5xx-errors.md"
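The `runbook_url` annotation is only mandatory in practice if something rejects rules that lack it. The following sketch shows one possible CI check. It assumes the rules file has already been parsed into Python dictionaries (for example with a YAML library, not shown here); the function name `alerts_missing_runbook` and the `PaymentHighLatency` rule are hypothetical.

```python
def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alerting rules without a non-empty runbook_url annotation."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rules use a "record" key and never page anyone
        url = (rule.get("annotations") or {}).get("runbook_url", "")
        if not str(url).strip():
            missing.append(rule["alert"])
    return missing


# Example input, shaped like the parsed "rules" list of one rule group.
rules = [
    {
        "alert": "PaymentHighErrorRate",
        "annotations": {
            "runbook_url": "https://wiki.company.com/runbooks/payment-service/5xx-errors"
        },
    },
    {"alert": "PaymentHighLatency", "annotations": {}},  # hypothetical rule without a link
    {"record": "job:http_requests:rate5m", "expr": "sum by (job) (rate(http_requests_total[5m]))"},
]

print(alerts_missing_runbook(rules))  # prints ['PaymentHighLatency']
```

Failing the pipeline whenever the returned list is non-empty keeps alert-runbook linkage at 100% without relying on reviewers to notice a missing annotation.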
Step 4: Create the Operational Debt Register
# ops-debt-register.yml
# Operational Debt Register – last reviewed: 2025-03-18
# Review cadence: quarterly (Jan, Apr, Jul, Oct)
entries:
  - id: OPS-DEBT-001
    title: "Database password rotation still manual"
    category: manual-process
    severity: high
    toil_hours_per_week: 2.0
    owner: "@platform-team"
    created_date: "2025-01-15"
    target_resolution_date: "2025-06-30"
    status: in_progress
    description: >
      Database passwords are rotated monthly manually via AWS Secrets Manager Console.
      No automatic rotation Lambda configured.
    remediation_plan: >
      Configure AWS Secrets Manager Automatic Rotation with Lambda function.
      Terraform code already exists in feature branch.
    links:
      - "https://jira.company.com/PLAT-234"
  - id: OPS-DEBT-002
    title: "Monitoring dashboard not defined via IaC"
    category: missing-automation
    severity: medium
    toil_hours_per_week: 0.5
    owner: "@payment-team"
    created_date: "2025-02-01"
    target_resolution_date: "2025-05-30"
    status: open
    description: >
      CloudWatch dashboards for Payment Service were created manually in the console.
      On stack destroy/create they must be manually recreated.
    remediation_plan: >
      Define dashboards as aws_cloudwatch_dashboard Terraform resources.
  - id: OPS-DEBT-003
    title: "Runbooks for background jobs missing"
    category: missing-runbook
    severity: medium
    toil_hours_per_week: 1.5
    owner: "@payment-team"
    created_date: "2025-03-01"
    target_resolution_date: "2025-04-30"
    status: open
    description: >
      SQS-based background job processing has no runbooks.
      On queue overflow or dead letter queue, engineers operate in production
      without documentation.
    remediation_plan: >
      Create runbooks for the 3 most common job failure scenarios.
      Configure alerts for DLQ overflows with runbook URLs.
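Because the register is structured data, the quarterly review can start from a computed summary rather than a manual read-through. The sketch below assumes the entries have already been loaded into dictionaries with the field names used in the register. The function name `summarize_register` and the `resolved` status value are assumptions; the register itself only shows `open` and `in_progress`.

```python
from datetime import date


def summarize_register(entries: list[dict], today: date) -> dict:
    """Total weekly toil of unresolved entries and the IDs past their target date."""
    # "resolved" is an assumed status value for entries that are done.
    open_entries = [e for e in entries if e["status"] != "resolved"]
    overdue = [
        e["id"]
        for e in open_entries
        if date.fromisoformat(e["target_resolution_date"]) < today
    ]
    return {
        "open_count": len(open_entries),
        "toil_hours_per_week": sum(e["toil_hours_per_week"] for e in open_entries),
        "overdue_ids": overdue,
    }


# The three register entries, reduced to the fields used here.
entries = [
    {"id": "OPS-DEBT-001", "status": "in_progress", "toil_hours_per_week": 2.0,
     "target_resolution_date": "2025-06-30"},
    {"id": "OPS-DEBT-002", "status": "open", "toil_hours_per_week": 0.5,
     "target_resolution_date": "2025-05-30"},
    {"id": "OPS-DEBT-003", "status": "open", "toil_hours_per_week": 1.5,
     "target_resolution_date": "2025-04-30"},
]

summary = summarize_register(entries, today=date(2025, 5, 15))
print(summary)
```

With a review date of 2025-05-15, the summary reports three open entries, 4.0 toil hours per week, and OPS-DEBT-003 as overdue.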
Common Anti-Patterns
| Anti-Pattern | Problem |
|---|---|
| Runbooks in Confluence without a review process | Become outdated quickly; rarely updated; incorrect information can worsen an incident |
| Runbooks not linked to alerts | On-call engineer must search for the runbook while the incident is active |
| Runbooks only for catastrophic scenarios | Routine tasks (certificate renewal, scaling) are missing; knowledge stays in people’s heads |
| Operational Debt Register discussed in meetings but not documented | Visibility is lost; the same debt items are repeatedly rediscussed |
| No sprint budget for debt reduction | Register grows; nothing gets reduced; team becomes resigned |
Metrics
- Runbook coverage: % of critical services with runbooks (target: >= 90%)
- Runbook freshness: % of runbooks reviewed within the last 90 days (target: >= 80%)
- Alert-runbook linkage: % of paging alerts with runbook URL (target: 100%)
- Operational debt: Number of open entries in the register, total toil hours/week
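The freshness and linkage metrics are simple ratios, so they can be computed from data the runbooks and alert rules already contain. The sketch below is illustrative only: the function names, the input shapes (a mapping of runbook path to last-reviewed date, and of alert name to runbook URL), and the rounding are assumptions.

```python
from datetime import date, timedelta


def percentage(part: int, total: int) -> float:
    """Percentage rounded to one decimal; an empty population counts as 0%."""
    return round(100 * part / total, 1) if total else 0.0


def runbook_freshness(last_reviewed: dict[str, date], today: date,
                      max_age_days: int = 90) -> float:
    """Share of runbooks whose last review is at most max_age_days old."""
    cutoff = today - timedelta(days=max_age_days)
    fresh = sum(1 for reviewed in last_reviewed.values() if reviewed >= cutoff)
    return percentage(fresh, len(last_reviewed))


def alert_linkage(paging_alerts: dict[str, str]) -> float:
    """Share of paging alerts (name -> runbook URL, empty if none) that have a URL."""
    linked = sum(1 for url in paging_alerts.values() if url)
    return percentage(linked, len(paging_alerts))
```

The last-reviewed dates could come from the `**Last reviewed:**` line of each runbook and the URLs from the `runbook_url` annotation of each alert rule, so both metrics can be tracked without a separate inventory.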
Maturity Levels
| Level | Characteristics |
|---|---|
| Level 1 | No runbooks. Knowledge in engineers' heads. No debt register. |
| Level 2 | Runbooks for top-3 incidents. Not linked. No formal debt register. |
| Level 3 | All alerts linked to runbooks. Quarterly review. Debt register version-controlled. |
| Level 4 | Coverage metrics tracked. Runbooks reviewed after incidents. Sprint budget for debt. |
| Level 5 | Self-service automation for critical runbook steps. Debt trend positive (reduction > accumulation). |