Best Practice: Maintaining Runbooks and Operational Documentation

Context

Runbooks are the institutional memory of an operations organization. Without them, on-call capability depends on whichever individuals happen to know the system. With good runbooks, a junior engineer can resolve an incident without calling a senior engineer at 3 a.m.

Related Controls

Target State

Complete operational documentation:

Runbooks for all known failure scenarios and routine tasks
Every paging alert has a runbook URL
Runbooks are versioned, reviewed, and discoverable
Operational Debt Register catalogs all known toil sources
Quarterly review ensures currency

Technical Implementation

Step 1: Use the Runbook Template

# Runbook: Payment Service – High 5xx Error Rate

**Created:** 2025-03-18
**Last reviewed:** 2025-03-18
**Author:** payment-team
**Applies to alert:** `payment-service-5xx-error-rate`

## Overview

This runbook describes the response to an increase in the 5xx error rate
in the Payment Service above the SLO threshold.

**Impact:** Users receive error messages during the payment process.
**SLO:** < 0.1% error rate over 5 minutes.

## Diagnosis Steps

### 1. Determine the scope
```bash
# 5xx responses returned by targets, summed per minute, for the last 5 minutes
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=<lb-arn-suffix> \
  --start-time $(date -u -d "5 minutes ago" +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Sum
```

### 2. Analyze error logs
```bash
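# CloudWatch Logs Insights query. start-query returns a queryId;
# fetch the results with: aws logs get-query-results --query-id <query-id>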
aws logs start-query \
  --log-group-name "/aws/ecs/payment-service" \
  --start-time $(date -d "15 minutes ago" +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, message, error.type, error.message, trace_id
    | filter level = "ERROR"
    | sort @timestamp desc
    | limit 50'
```

### 3. Check recent deployments
```bash
# ECS Service Events
aws ecs describe-services \
  --cluster payment-production \
  --services payment-service \
  --query 'services[0].events[:5]'
```

## Remediation Steps

### Scenario A: Database connection errors
**Detection:** Logs show `Connection refused` or `Connection pool exhausted`
1. Check RDS instance status: AWS Console → RDS → payment-service-db
2. Check connection pool limits: CloudWatch Metric `DatabaseConnections`
3. If limit reached: Emergency scale-up via runbook [Database Emergency]
4. If RDS is unstable: Failover to read replica

### Scenario B: Failed deployment
**Detection:** Errors started shortly after deployment timestamp
1. Trigger canary rollback:
   ```bash
   aws deploy stop-deployment --deployment-id <id> --auto-rollback-enabled
   ```
2. If no canary deployment: deploy previous task definition:
   ```bash
   aws ecs update-service \
     --cluster payment-production \
     --service payment-service \
     --task-definition payment-service:<previous-version>
   ```

## Escalation

| Elapsed time | Condition | Escalate to |
|--------------|-----------|-------------|
| +5 minutes | No progress | Senior Engineer: @senior-payment |
| +15 minutes | No resolution | Tech Lead: @payment-lead |
| +30 minutes | Production outage | CTO: @cto-oncall |

## Related Runbooks

- [Database Emergency Runbook](database-emergency.md)
- [Canary Rollback Runbook](canary-rollback.md)
- [Postmortem Template](../incident-templates/postmortem-template.md)
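
A template only helps if runbooks actually follow it. The following is a minimal sketch of a structure check that could run in CI, assuming runbooks keep the header fields and second-level headings shown in the template above. The function name `lint_runbook` and the choice of which sections count as mandatory are illustrative assumptions, not part of the template.

```python
import re

# Headings and header fields taken from the template above.
# Treating exactly these as mandatory is an assumption.
REQUIRED_SECTIONS = ["Overview", "Diagnosis Steps", "Remediation Steps", "Escalation"]
REQUIRED_FIELDS = ["Created", "Last reviewed", "Author", "Applies to alert"]


def lint_runbook(text: str) -> list[str]:
    """Return a list of human-readable problems found in one runbook."""
    problems = []
    # Collect level-2 headings such as "## Overview" (deeper headings are ignored).
    sections = set(re.findall(r"^## +(.+?)\s*$", text, flags=re.MULTILINE))
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            problems.append(f"missing section: {name}")
    # Header fields look like "**Created:** 2025-03-18" and must have a value.
    for field in REQUIRED_FIELDS:
        pattern = rf"^\*\*{re.escape(field)}:\*\* *\S"
        if not re.search(pattern, text, flags=re.MULTILINE):
            problems.append(f"missing field: {field}")
    return problems


# Hypothetical, deliberately incomplete runbook used only for illustration.
sample = """# Runbook: Example
**Created:** 2025-03-18
**Author:** payment-team

## Overview
## Escalation
"""

for problem in lint_runbook(sample):
    print(problem)
```

For the incomplete sample this reports the two missing sections and the two missing header fields. A runbook that follows the template in full yields an empty list.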

Step 2: Organize Runbooks in Version Control

docs/
├── runbooks/
│   ├── README.md               # Runbook index with links and coverage status
│   ├── payment-service/
│   │   ├── 5xx-errors.md       # Linked to: payment-service-5xx-error-rate alert
│   │   ├── high-latency.md     # Linked to: payment-service-p99-latency alert
│   │   ├── database-full.md
│   │   └── deployment.md       # Deployment and rollback procedure
│   ├── infrastructure/
│   │   ├── terraform-drift.md
│   │   └── certificate-renewal.md
│   └── incident-templates/
│       └── postmortem-template.md
└── ops-debt-register.yml       # Operational Debt Register
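
The index in `README.md` tends to drift if it is maintained by hand. As a rough sketch, it could be generated from the file paths instead. The function name `build_index` and the output layout are assumptions for illustration. The sketch only produces links and does not attempt the coverage status mentioned in the tree above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_index(paths: list[str]) -> str:
    """Render a Markdown index grouped by the directory each runbook lives in."""
    groups = defaultdict(list)
    for raw in sorted(paths):
        path = PurePosixPath(raw)
        # Only Markdown files are runbooks; the index itself is skipped.
        if path.suffix == ".md" and path.name != "README.md":
            groups[path.parent.as_posix()].append(path)
    lines = ["# Runbook Index", ""]
    for group in sorted(groups):
        lines.append(f"## {group}")
        for path in groups[group]:
            lines.append(f"- [{path.stem}]({path.as_posix()})")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Paths relative to docs/runbooks/, taken from the tree above.
print(build_index([
    "README.md",
    "payment-service/high-latency.md",
    "payment-service/5xx-errors.md",
    "infrastructure/terraform-drift.md",
]))
```

In practice the path list would come from walking `docs/runbooks/` in the repository, and the generated text would be compared against the committed `README.md` in CI.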

Step 3: Ensure Alert-Runbook Linking

# prometheus-alerts.yaml – runbook URL as mandatory field
- alert: PaymentHighErrorRate
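  # Ratio of 5xx responses to all responses over 5 minutes (0.01 = 1%)
  expr: |
    sum(rate(http_requests_total{service="payment",code=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="payment"}[5m])) > 0.01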
  annotations:
    runbook_url: "https://wiki.company.com/runbooks/payment-service/5xx-errors"
    # OR: link directly to the version-controlled Markdown file:
    # runbook_url: "https://github.com/myorg/docs/blob/main/runbooks/payment-service/5xx-errors.md"
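
Making the runbook URL a mandatory field only works if something enforces it. Below is a minimal sketch of such a check, assuming the rule file has already been loaded into Python dictionaries by a YAML parser (not shown). The function name `alerts_missing_runbook` and the second alert, `PaymentHighLatency`, are hypothetical and exist only to show a failing case.

```python
def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alert rules without a usable runbook_url annotation."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rules have no "alert" key and need no runbook
        url = (rule.get("annotations") or {}).get("runbook_url", "")
        # Accept only absolute HTTP(S) links so the pager can open them directly.
        if not str(url).startswith(("http://", "https://")):
            missing.append(rule["alert"])
    return missing


rules = [
    {
        "alert": "PaymentHighErrorRate",
        "annotations": {
            "runbook_url": "https://wiki.company.com/runbooks/payment-service/5xx-errors"
        },
    },
    # Hypothetical alert with no annotations at all.
    {"alert": "PaymentHighLatency"},
]

print(alerts_missing_runbook(rules))  # ['PaymentHighLatency']
```

Run in CI, a non-empty result could fail the build, so an alert without a runbook never reaches production.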

Step 4: Create the Operational Debt Register

# ops-debt-register.yml
# Operational Debt Register – last reviewed: 2025-03-18
# Review cadence: quarterly (Jan, Apr, Jul, Oct)

entries:
  - id: OPS-DEBT-001
    title: "Database password rotation still manual"
    category: manual-process
    severity: high
    toil_hours_per_week: 2.0
    owner: "@platform-team"
    created_date: "2025-01-15"
    target_resolution_date: "2025-06-30"
    status: in_progress
    description: >
      Database passwords are rotated manually every month via the AWS Secrets Manager console.
      No automatic rotation Lambda is configured.
    remediation_plan: >
      Configure AWS Secrets Manager Automatic Rotation with Lambda function.
      Terraform code already exists in feature branch.
    links:
      - "https://jira.company.com/PLAT-234"

  - id: OPS-DEBT-002
    title: "Monitoring dashboard not defined via IaC"
    category: missing-automation
    severity: medium
    toil_hours_per_week: 0.5
    owner: "@payment-team"
    created_date: "2025-02-01"
    target_resolution_date: "2025-05-30"
    status: open
    description: >
      CloudWatch dashboards for Payment Service were created manually in the console.
      On stack destroy/create they must be manually recreated.
    remediation_plan: >
      Define dashboards as aws_cloudwatch_dashboard Terraform resources.

  - id: OPS-DEBT-003
    title: "Runbooks for background jobs missing"
    category: missing-runbook
    severity: medium
    toil_hours_per_week: 1.5
    owner: "@payment-team"
    created_date: "2025-03-01"
    target_resolution_date: "2025-04-30"
    status: open
    description: >
      SQS-based background job processing has no runbooks.
      When a queue backs up or messages land in the dead-letter queue, engineers
      work in production without documentation.
    remediation_plan: >
      Create runbooks for the 3 most common job failure scenarios.
      Configure alerts for DLQ overflows with runbook URLs.
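
A register in this shape can be summarized automatically for the quarterly review. The sketch below assumes the `entries` list has been loaded from `ops-debt-register.yml` into Python dictionaries by a YAML parser (not shown), and that `open` and `in_progress` are the only unresolved statuses. The function name `summarize_debt` and the review date are illustrative.

```python
from datetime import date


def summarize_debt(entries: list[dict], today: date) -> dict:
    """Count unresolved entries, total their weekly toil, and list overdue IDs."""
    # Assumption: any status other than these two means the entry is resolved.
    unresolved = [e for e in entries if e["status"] in ("open", "in_progress")]
    overdue = [
        e["id"]
        for e in unresolved
        if date.fromisoformat(e["target_resolution_date"]) < today
    ]
    return {
        "open_entries": len(unresolved),
        "toil_hours_per_week": sum(e["toil_hours_per_week"] for e in unresolved),
        "overdue": overdue,
    }


# Only the fields needed for the summary, copied from the register above.
entries = [
    {"id": "OPS-DEBT-001", "status": "in_progress", "toil_hours_per_week": 2.0,
     "target_resolution_date": "2025-06-30"},
    {"id": "OPS-DEBT-002", "status": "open", "toil_hours_per_week": 0.5,
     "target_resolution_date": "2025-05-30"},
    {"id": "OPS-DEBT-003", "status": "open", "toil_hours_per_week": 1.5,
     "target_resolution_date": "2025-04-30"},
]

print(summarize_debt(entries, today=date(2025, 5, 15)))
# {'open_entries': 3, 'toil_hours_per_week': 4.0, 'overdue': ['OPS-DEBT-003']}
```

Passing `today` explicitly keeps the function deterministic, which makes it easy to test and to rerun for a past review date.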

Common Anti-Patterns

| Anti-Pattern | Problem |
|--------------|---------|
| Runbooks in Confluence without a review process | Become outdated quickly; rarely updated; incorrect information can worsen an incident |
| Runbooks not linked to alerts | On-call engineer must search for the runbook while the incident is active |
| Runbooks only for catastrophic scenarios | Routine tasks (certificate renewal, scaling) are missing; knowledge stays in people’s heads |
| Operational Debt Register discussed in meetings but not documented | Visibility is lost; the same debt items are repeatedly rediscussed |
| No sprint budget for debt reduction | Register grows; nothing gets reduced; team becomes resigned |

Metrics

Runbook coverage: % of critical services with runbooks (target: >= 90%)
Runbook freshness: % of runbooks reviewed within the last 90 days (target: >= 80%)
Alert-runbook linkage: % of paging alerts with runbook URL (target: 100%)
Operational debt: Number of open entries in the register, total toil hours/week
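
The freshness metric can be derived from the `**Last reviewed:**` header that each runbook carries in the template. The following is a rough sketch, assuming runbook contents are available as strings keyed by file name and that a runbook counts as fresh when its last review is at most 90 days old (one reading of the target above). The function names and sample contents are illustrative.

```python
import re
from datetime import date
from typing import Optional


def last_reviewed(text: str) -> Optional[date]:
    """Extract the date from a '**Last reviewed:** YYYY-MM-DD' header line."""
    match = re.search(
        r"^\*\*Last reviewed:\*\* *(\d{4}-\d{2}-\d{2})", text, flags=re.MULTILINE
    )
    return date.fromisoformat(match.group(1)) if match else None


def freshness_percent(runbooks: dict[str, str], today: date, max_age_days: int = 90) -> float:
    """Percentage of runbooks whose last review is at most max_age_days old."""
    if not runbooks:
        return 0.0
    fresh = 0
    for text in runbooks.values():
        reviewed = last_reviewed(text)
        # Runbooks without a parseable review date count as stale.
        if reviewed is not None and (today - reviewed).days <= max_age_days:
            fresh += 1
    return 100.0 * fresh / len(runbooks)


# Hypothetical runbook contents, reduced to the header line that matters here.
runbooks = {
    "5xx-errors.md": "**Last reviewed:** 2025-03-18\n",
    "high-latency.md": "**Last reviewed:** 2024-11-01\n",
}

print(freshness_percent(runbooks, today=date(2025, 5, 1)))  # 50.0
```

Counting runbooks with a missing or malformed date as stale is a deliberate choice: it makes gaps in the header visible in the metric instead of hiding them.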

Maturity Levels

| Level | Characteristics |
|-------|-----------------|
| Level 1 | No runbooks. Knowledge in engineers' heads. No debt register. |
| Level 2 | Runbooks for top-3 incidents. Not linked. No formal debt register. |
| Level 3 | All alerts linked to runbooks. Quarterly review. Debt register version-controlled. |
| Level 4 | Coverage metrics tracked. Runbooks reviewed after incidents. Sprint budget for debt. |
| Level 5 | Self-service automation for critical runbook steps. Debt trend positive (reduction > accumulation). |
