WAF++

Best Practice: Chaos Engineering & Fault Injection

Context

Chaos Engineering is the discipline of deliberately breaking systems in a controlled way to discover resilience weaknesses before real failures expose them. The term was coined by Netflix with the introduction of Chaos Monkey, but the approach is universally applicable.

Common problems without Chaos Engineering:

  • Circuit breaker configured, but never triggered – unknown whether it works correctly

  • Multi-AZ deployed, but AZ failover never tested – actual failover time unknown

  • Backup recovery available, but recovery time is 4x longer than the RTO target

  • Service claims to support graceful degradation, but failures in optional dependencies bring it down

Target State

  • Quarterly structured chaos experiments with documented hypotheses

  • Experiments start in staging and are gradually extended to production

  • Stop conditions prevent uncontrolled blast radius

  • GameDay events (half-day, annual) for holistic resilience validation

Principles for Chaos Engineering

  1. Hypothesis First: "If X fails, we expect Y within Z seconds"

  2. Small Blast Radius: Start with 10% of instances, not 100%

  3. Staging First: Every experiment first in staging, then gradually in production

  4. Stop Conditions: Automatic abort when SLO alarm fires

  5. Monitoring Before: Dashboards open before the experiment starts

  6. Document Results: Document hypothesis, outcome, and action
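Principles 1, 2, and 4 can be sketched as a tiny runner loop. This is a hypothetical illustration, not part of any chaos tool: `inject_fault`, `rollback`, and `error_rate` are placeholders you would wire to your own tooling (FIS, Chaos Studio, kubectl, ...).

```python
import time

def run_experiment(hypothesis, inject_fault, rollback, error_rate,
                   slo_threshold=0.02, duration_s=300, poll_s=5):
    """Inject a fault, watch the stop condition, and always roll back."""
    print(f"Hypothesis: {hypothesis}")  # principle 1: hypothesis first
    inject_fault()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            rate = error_rate()
            if rate > slo_threshold:       # principle 4: automatic abort
                return {"aborted": True, "peak_error_rate": rate}
            time.sleep(poll_s)
        return {"aborted": False}
    finally:
        rollback()                         # restore even when aborting
```

The blast radius (principle 2) lives inside `inject_fault` itself, e.g. by selecting only a percentage of instances.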

Technical Implementation

Step 1: Experiment Documentation

# chaos-experiments/exp-001-az-failure.yml
id: "EXP-001"
title: "Single AZ Instance Termination"
hypothesis: "Payment API continues serving requests with < 10% error rate when 25% of instances in AZ1 are terminated"
date: "2026-03-15"
team: "payments-team"
environment: "staging"

scope:
  service: "payment-api"
  blast_radius: "25% of instances in eu-west-1a"
  excluded: ["payment-db", "payment-queue"]

stop_conditions:
  - "SLO error rate > 2% for > 2 minutes"
  - "All circuit breakers in OPEN state"

expected_result:
  recovery_time: "< 60 seconds"
  error_rate_peak: "< 5%"
  slo_violation: false

observations:
  start_time: "14:00"
  actual_recovery_time: "42 seconds"
  actual_error_rate_peak: "2.3%"
  slo_violation: false
  issues_found:
    - "Startup probe initialDelaySeconds too short – 3 pods restarted unnecessarily"

actions:
  - id: "ACT-001"
    description: "Increase initialDelaySeconds from 10s to 20s"
    owner: "alice"
    due_date: "2026-03-22"
    status: "completed"
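A pre-flight check that rejects malformed experiment files keeps stop conditions from being skipped by accident. A minimal sketch, assuming the YAML above has already been parsed into a dict (e.g. via yaml.safe_load); the field names follow the exp-001 example, the validator itself is hypothetical:

```python
REQUIRED = ("id", "title", "hypothesis", "environment", "scope",
            "stop_conditions", "expected_result")

def validate_experiment(exp: dict) -> list[str]:
    """Return a list of problems; an empty list means the file may run."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in exp]
    if not exp.get("stop_conditions"):
        problems.append("at least one stop condition is required")
    if exp.get("environment") == "production" \
            and "blast_radius" not in exp.get("scope", {}):
        problems.append("production experiments must declare a blast radius")
    return problems

exp = {
    "id": "EXP-001",
    "title": "Single AZ Instance Termination",
    "hypothesis": "Payment API keeps serving with < 10% error rate",
    "environment": "staging",
    "scope": {"service": "payment-api",
              "blast_radius": "25% of instances in eu-west-1a"},
    "stop_conditions": ["SLO error rate > 2% for > 2 minutes"],
    "expected_result": {"recovery_time": "< 60 seconds"},
}
assert validate_experiment(exp) == []
```

Running such a check in CI for every file under chaos-experiments/ is a cheap way to enforce the documentation principle.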

Step 2: AWS FIS Experiment Template

# terraform/chaos/fis-az-failure.tf

data "aws_iam_policy_document" "fis_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["fis.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "fis" {
  name               = "fis-execution-role"
  assume_role_policy = data.aws_iam_policy_document.fis_assume.json
}

resource "aws_iam_role_policy_attachment" "fis_ec2" {
  role       = aws_iam_role.fis.name
  # NOTE: FullAccess keeps the example short; in practice, scope this down to
  # the ec2:TerminateInstances / ec2:DescribeInstances actions FIS needs.
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2FullAccess"
}

# Stop condition: SLO alarm fires
resource "aws_cloudwatch_metric_alarm" "fis_stop_condition" {
  alarm_name          = "fis-stop-condition-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 50  # 50 errors/min = ~0.8% at 6k req/min
}

# Experiment: terminate 25% of instances in AZ1
resource "aws_fis_experiment_template" "az1_instance_termination" {
  description = "EXP-001: Terminate 25% of instances in AZ1 – validate auto-recovery < 60s"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.fis_stop_condition.arn
  }

  action {
    name      = "terminate-instances-az1"
    # aws:ec2:terminate-instances takes no parameters (startInstancesAfterDuration
    # belongs to aws:ec2:stop-instances); recovery comes from the Auto Scaling
    # group replacing the terminated instances.
    action_id = "aws:ec2:terminate-instances"

    target {
      key   = "Instances"
      value = "payment-api-instances-az1"
    }
  }

  target {
    name           = "payment-api-instances-az1"
    resource_type  = "aws:ec2:instance"
    selection_mode = "PERCENT(25)"

    resource_tag {
      key   = "app"
      value = "payment-api"
    }

    resource_tag {
      key   = "aws:autoscaling:groupName"
      value = "payment-api-asg"
    }

    filter {
      path   = "Placement.AvailabilityZone"
      values = ["eu-west-1a"]
    }
  }

  tags = merge(var.mandatory_tags, {
    "chaos-experiment" = "EXP-001"
    "environment"      = "staging"
  })
}

Step 3: Azure Chaos Studio

resource "azurerm_chaos_studio_experiment" "pod_failure" {
  name                = "payment-api-pod-failure-staging"
  location            = azurerm_resource_group.staging.location
  resource_group_name = azurerm_resource_group.staging.name

  identity {
    type = "SystemAssigned"
  }

  selectors {
    name                    = "aks-selector"
    chaos_studio_target_ids = [azurerm_chaos_studio_target.aks.id]
  }

  steps {
    name = "pod-failure-step"

    branch {
      name = "inject-pod-failure"

      actions {
        urn           = "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2"
        action_type   = "continuous"
        duration      = "PT5M"
        selector_name = "aks-selector"

        parameters = {
          jsonSpec = jsonencode({
            mode   = "one"
            action = "pod-kill"
            selector = {
              namespaces     = ["payment"]
              labelSelectors = { "app" = "payment-api" }
            }
          })
        }
      }
    }
  }
}
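The jsonSpec parameter is a serialized Chaos Mesh PodChaos selector; building it in code makes it easy to review and reuse before wiring it into Terraform. A sketch with a hypothetical helper (field names follow the Chaos Mesh PodChaos spec):

```python
import json

def pod_chaos_spec(namespace: str, app_label: str, mode: str = "one") -> str:
    """Serialize a Chaos Mesh pod-kill spec like the jsonSpec above."""
    spec = {
        "mode": mode,                      # "one", "all", "fixed-percent", ...
        "action": "pod-kill",
        "selector": {
            "namespaces": [namespace],
            "labelSelectors": {"app": app_label},
        },
    }
    return json.dumps(spec)

spec = json.loads(pod_chaos_spec("payment", "payment-api"))
assert spec["action"] == "pod-kill"
assert spec["selector"]["labelSelectors"]["app"] == "payment-api"
```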

Step 4: GameDay Agenda Template

# GameDay: Payment Service Reliability – Q1 2026

**Date:** 2026-03-21
**Duration:** 4 hours (09:00–13:00)
**Team:** payments-team (8 people)
**Moderator:** @alice
**Observer:** @bob (Engineering Manager)

## 09:00 – Briefing (30 min)
- Explain the day's agenda
- Assign roles: Chaos Operator, Incident Commander, Observer
- Open dashboards, measure baseline

## 09:30 – Experiment 1: Single Pod Failure (EXP-001)
- Hypothesis: Service recovers in < 30s
- Tool: kubectl delete pod (one pod)
- Expectation: No SLO violation

## 10:00 – Experiment 2: Database Connection Flood (EXP-002)
- Hypothesis: Circuit Breaker opens at > 80% DB error rate
- Tool: AWS FIS – block Security Group Rule to DB
- Stop Condition: SLO alarm

## 10:30 – Break & Analysis (30 min)

## 11:00 – Experiment 3: AZ-Level Failure (EXP-003)
- Hypothesis: Service recovers in < 60s after 25% instance termination in AZ1
- Tool: AWS FIS Experiment Template az1_instance_termination
- Observe: Auto Scaling, LB Health Checks

## 12:00 – Retrospective (60 min)
- What did we learn?
- Which hypotheses were confirmed / refuted?
- Document action items

## 13:00 – End
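For experiments 1 and 3, the key observation is recovery time. A small tool-agnostic sketch for measuring it, where `probe` is any health check you supply (e.g. an HTTP call against the service); the helper itself is hypothetical:

```python
import time

def measure_recovery(probe, timeout_s: float = 120.0, poll_s: float = 1.0):
    """Poll `probe` (True when healthy) and return seconds until recovery."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if probe():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None  # service did not recover within the timeout (RTO missed)
```

Started right after `kubectl delete pod`, the returned value goes straight into the `actual_recovery_time` field of the experiment file.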

Typical Anti-Patterns

  • Chaos without stop conditions: Experiment runs uncontrolled, SLO budget is fully exhausted

  • Only in staging, never in production: Production-specific configurations are never validated

  • Chaos without documentation: Insights are lost; the same weaknesses have to be rediscovered later

  • Blast radius too large for the first experiment: 100% of instances in AZ1 instead of 25% → real outage

Metrics

  • Experiment Frequency: Number of documented chaos experiments per quarter (target: >= 3)

  • Hypothesis Validation Rate: % of experiments where the hypothesis was confirmed (confirmed and refuted results are both informative)

  • Action Item Closure Rate: % of actions generated from experiments completed within 2 sprints

  • Recovery Time Actual vs. RTO: Actual recovery time in the experiment vs. RTO target
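A hypothetical roll-up of these metrics from the documented experiment records (the dict shape loosely mirrors the exp-001 file: one confirmation flag and a list of actions per experiment):

```python
def quarterly_metrics(experiments: list[dict]) -> dict:
    """Aggregate experiment frequency, validation rate, and action closure."""
    confirmed = sum(1 for e in experiments if e["hypothesis_confirmed"])
    actions = [a for e in experiments for a in e.get("actions", [])]
    closed = sum(1 for a in actions if a["status"] == "completed")
    return {
        "experiment_count": len(experiments),
        "hypothesis_validation_rate": confirmed / len(experiments),
        "action_closure_rate": closed / len(actions) if actions else 1.0,
    }

m = quarterly_metrics([
    {"hypothesis_confirmed": True,
     "actions": [{"status": "completed"}, {"status": "open"}]},
    {"hypothesis_confirmed": False, "actions": []},
])
assert m["experiment_count"] == 2          # target: >= 3 per quarter
assert m["hypothesis_validation_rate"] == 0.5
assert m["action_closure_rate"] == 0.5
```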

Maturity Level

Level 1 – No chaos tests; resilience only known through production incidents
Level 2 – Occasional manual tests (restarting containers) without documentation
Level 3 – Structured experiments with hypotheses and documentation, quarterly
Level 4 – Production chaos with stop conditions; GameDay annually; FIS/Chaos Studio active
Level 5 – Continuous, automated low-blast-radius experiments; ML-based anomaly detection