WAF++

Best Practice: Chaos Engineering & Fault Injection

Context

Chaos Engineering is the discipline of deliberately breaking systems in a controlled way to discover resilience weaknesses before real failures expose them. The term was coined by Netflix with the introduction of Chaos Monkey, but the approach is universally applicable.

Common problems without Chaos Engineering:

  • Circuit breaker configured, but never triggered – unknown whether it works correctly

  • Multi-AZ deployed, but AZ failover never tested – actual failover time unknown

  • Backup recovery available, but recovery time is 4x longer than the RTO target

  • Service claims to support graceful degradation, but failures in optional dependencies bring it down

Target State

  • Quarterly structured chaos experiments with documented hypotheses

  • Experiments start in staging and are gradually extended to production

  • Stop conditions prevent uncontrolled blast radius

  • GameDay events (half-day, annual) for holistic resilience validation

Principles for Chaos Engineering

  1. Hypothesis First: "If X fails, we expect Y within Z seconds"

  2. Small Blast Radius: Start with 10% of instances, not 100%

  3. Staging First: Every experiment first in staging, then gradually in production

  4. Stop Conditions: Automatic abort when SLO alarm fires

  5. Monitoring Before: Dashboards open before the experiment starts

  6. Document Results: Document hypothesis, outcome, and action
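Principles 1, 2, and 4 can be sketched as a tiny runner loop. This is a hypothetical illustration, not part of any chaos tool: `inject_fault`, `rollback`, and `error_rate` are placeholders you would wire to your own tooling (FIS, Chaos Studio, kubectl, ...).

```python
import time

def run_experiment(hypothesis, inject_fault, rollback, error_rate,
                   slo_threshold=0.02, duration_s=300, poll_s=5):
    """Inject a fault, watch the stop condition, and always roll back."""
    print(f"Hypothesis: {hypothesis}")  # principle 1: hypothesis first
    inject_fault()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            rate = error_rate()
            if rate > slo_threshold:       # principle 4: automatic abort
                return {"aborted": True, "peak_error_rate": rate}
            time.sleep(poll_s)
        return {"aborted": False}
    finally:
        rollback()                         # restore even when aborting
```

The blast radius (principle 2) lives inside `inject_fault` itself, e.g. by selecting only a percentage of instances.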

Technical Implementation

Step 1: Experiment Documentation

# chaos-experiments/exp-001-az-failure.yml
id: "EXP-001"
title: "Single AZ Instance Termination"
hypothesis: "Payment API continues serving requests with < 10% error rate when 25% of instances in AZ1 are terminated"
date: "2026-03-15"
team: "payments-team"
environment: "staging"

scope:
  service: "payment-api"
  blast_radius: "25% of instances in eu-west-1a"
  excluded: ["payment-db", "payment-queue"]

stop_conditions:
  - "SLO error rate > 2% for > 2 minutes"
  - "All circuit breakers in OPEN state"

expected_result:
  recovery_time: "< 60 seconds"
  error_rate_peak: "< 5%"
  slo_violation: false

observations:
  start_time: "14:00"
  actual_recovery_time: "42 seconds"
  actual_error_rate_peak: "2.3%"
  slo_violation: false
  issues_found:
    - "Startup probe initialDelaySeconds too short – 3 pods restarted unnecessarily"

actions:
  - id: "ACT-001"
    description: "Increase initialDelaySeconds from 10s to 20s"
    owner: "alice"
    due_date: "2026-03-22"
    status: "completed"
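A pre-flight check that rejects malformed experiment files keeps stop conditions from being skipped by accident. A minimal sketch, assuming the YAML above has already been parsed into a dict (e.g. via yaml.safe_load); the field names follow the exp-001 example, the validator itself is hypothetical:

```python
REQUIRED = ("id", "title", "hypothesis", "environment", "scope",
            "stop_conditions", "expected_result")

def validate_experiment(exp: dict) -> list[str]:
    """Return a list of problems; an empty list means the file may run."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in exp]
    if not exp.get("stop_conditions"):
        problems.append("at least one stop condition is required")
    if exp.get("environment") == "production" \
            and "blast_radius" not in exp.get("scope", {}):
        problems.append("production experiments must declare a blast radius")
    return problems

exp = {
    "id": "EXP-001",
    "title": "Single AZ Instance Termination",
    "hypothesis": "Payment API keeps serving with < 10% error rate",
    "environment": "staging",
    "scope": {"service": "payment-api",
              "blast_radius": "25% of instances in eu-west-1a"},
    "stop_conditions": ["SLO error rate > 2% for > 2 minutes"],
    "expected_result": {"recovery_time": "< 60 seconds"},
}
assert validate_experiment(exp) == []
```

Running such a check in CI for every file under chaos-experiments/ is a cheap way to enforce the documentation principle.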

Step 2: AWS FIS Experiment Template

# terraform/chaos/fis-az-failure.tf

data "aws_iam_policy_document" "fis_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["fis.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "fis" {
  name               = "fis-execution-role"
  assume_role_policy = data.aws_iam_policy_document.fis_assume.json
}

resource "aws_iam_role_policy_attachment" "fis_ec2" {
  role       = aws_iam_role.fis.name
  # NOTE: FullAccess keeps the example short; in practice, scope this down to
  # the ec2:TerminateInstances / ec2:DescribeInstances actions FIS needs.
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2FullAccess"
}

# Stop condition: SLO alarm fires
resource "aws_cloudwatch_metric_alarm" "fis_stop_condition" {
  alarm_name          = "fis-stop-condition-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 50  # 50 errors/min = ~0.8% at 6k req/min
}

# Experiment: terminate 25% of instances in AZ1
resource "aws_fis_experiment_template" "az1_instance_termination" {
  description = "EXP-001: Terminate 25% of instances in AZ1 – validate auto-recovery < 60s"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.fis_stop_condition.arn
  }

  action {
    name      = "terminate-instances-az1"
    # aws:ec2:terminate-instances takes no parameters (startInstancesAfterDuration
    # belongs to aws:ec2:stop-instances); recovery comes from the Auto Scaling
    # group replacing the terminated instances.
    action_id = "aws:ec2:terminate-instances"

    target {
      key   = "Instances"
      value = "payment-api-instances-az1"
    }
  }

  target {
    name           = "payment-api-instances-az1"
    resource_type  = "aws:ec2:instance"
    selection_mode = "PERCENT(25)"

    resource_tag {
      key   = "app"
      value = "payment-api"
    }

    resource_tag {
      key   = "aws:autoscaling:groupName"
      value = "payment-api-asg"
    }

    filter {
      path   = "Placement.AvailabilityZone"
      values = ["eu-west-1a"]
    }
  }

  tags = merge(var.mandatory_tags, {
    "chaos-experiment" = "EXP-001"
    "environment"      = "staging"
  })
}

Step 3: Azure Chaos Studio

resource "azurerm_chaos_studio_experiment" "pod_failure" {
  name                = "payment-api-pod-failure-staging"
  location            = azurerm_resource_group.staging.location
  resource_group_name = azurerm_resource_group.staging.name

  identity {
    type = "SystemAssigned"
  }

  selectors {
    name                    = "aks-selector"
    chaos_studio_target_ids = [azurerm_chaos_studio_target.aks.id]
  }

  steps {
    name = "pod-failure-step"

    branch {
      name = "inject-pod-failure"

      actions {
        urn           = "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2"
        action_type   = "continuous"
        duration      = "PT5M"
        selector_name = "aks-selector"

        parameters = {
          jsonSpec = jsonencode({
            mode   = "one"
            action = "pod-kill"
            selector = {
              namespaces     = ["payment"]
              labelSelectors = { "app" = "payment-api" }
            }
          })
        }
      }
    }
  }
}
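The jsonSpec parameter is a serialized Chaos Mesh PodChaos selector; building it in code makes it easy to review and reuse before wiring it into Terraform. A sketch with a hypothetical helper (field names follow the Chaos Mesh PodChaos spec):

```python
import json

def pod_chaos_spec(namespace: str, app_label: str, mode: str = "one") -> str:
    """Serialize a Chaos Mesh pod-kill spec like the jsonSpec above."""
    spec = {
        "mode": mode,                      # "one", "all", "fixed-percent", ...
        "action": "pod-kill",
        "selector": {
            "namespaces": [namespace],
            "labelSelectors": {"app": app_label},
        },
    }
    return json.dumps(spec)

spec = json.loads(pod_chaos_spec("payment", "payment-api"))
assert spec["action"] == "pod-kill"
assert spec["selector"]["labelSelectors"]["app"] == "payment-api"
```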

Step 4: GameDay Agenda Template

# GameDay: Payment Service Reliability – Q1 2026

**Date:** 2026-03-21
**Duration:** 4 hours (09:00–13:00)
**Team:** payments-team (8 people)
**Moderator:** @alice
**Observer:** @bob (Engineering Manager)

## 09:00 – Briefing (30 min)
- Explain the day's agenda
- Assign roles: Chaos Operator, Incident Commander, Observer
- Open dashboards, measure baseline

## 09:30 – Experiment 1: Single Pod Failure (EXP-001)
- Hypothesis: Service recovers in < 30s
- Tool: kubectl delete pod (one pod)
- Expectation: No SLO violation

## 10:00 – Experiment 2: Database Connection Flood (EXP-002)
- Hypothesis: Circuit Breaker opens at > 80% DB error rate
- Tool: AWS FIS – block Security Group Rule to DB
- Stop Condition: SLO alarm

## 10:30 – Break & Analysis (30 min)

## 11:00 – Experiment 3: AZ-Level Failure (EXP-003)
- Hypothesis: Service recovers in < 60s after 25% instance termination in AZ1
- Tool: AWS FIS Experiment Template az1_instance_termination
- Observe: Auto Scaling, LB Health Checks

## 12:00 – Retrospective (60 min)
- What did we learn?
- Which hypotheses were confirmed / refuted?
- Document action items

## 13:00 – End
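For experiments 1 and 3, the key observation is recovery time. A small tool-agnostic sketch for measuring it, where `probe` is any health check you supply (e.g. an HTTP call against the service); the helper itself is hypothetical:

```python
import time

def measure_recovery(probe, timeout_s: float = 120.0, poll_s: float = 1.0):
    """Poll `probe` (True when healthy) and return seconds until recovery."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if probe():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None  # service did not recover within the timeout (RTO missed)
```

Started right after `kubectl delete pod`, the returned value goes straight into the `actual_recovery_time` field of the experiment file.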

Typical Anti-Patterns

  • Chaos without stop conditions: Experiment runs uncontrolled, SLO budget is fully exhausted

  • Only in staging, never in production: Production-specific configurations are never validated

  • Chaos without documentation: Insights are lost; the same weaknesses have to be rediscovered later

  • Blast radius too large for the first experiment: 100% of instances in AZ1 instead of 25% → real outage

Metrics

  • Experiment Frequency: Number of documented chaos experiments per quarter (target: >= 3)

  • Hypothesis Validation Rate: % of experiments where the hypothesis was confirmed (confirmed and refuted results are both informative)

  • Action Item Closure Rate: % of actions generated from experiments completed within 2 sprints

  • Recovery Time Actual vs. RTO: Actual recovery time in the experiment vs. RTO target
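A hypothetical roll-up of these metrics from the documented experiment records (the dict shape loosely mirrors the exp-001 file: one confirmation flag and a list of actions per experiment):

```python
def quarterly_metrics(experiments: list[dict]) -> dict:
    """Aggregate experiment frequency, validation rate, and action closure."""
    confirmed = sum(1 for e in experiments if e["hypothesis_confirmed"])
    actions = [a for e in experiments for a in e.get("actions", [])]
    closed = sum(1 for a in actions if a["status"] == "completed")
    return {
        "experiment_count": len(experiments),
        "hypothesis_validation_rate": confirmed / len(experiments),
        "action_closure_rate": closed / len(actions) if actions else 1.0,
    }

m = quarterly_metrics([
    {"hypothesis_confirmed": True,
     "actions": [{"status": "completed"}, {"status": "open"}]},
    {"hypothesis_confirmed": False, "actions": []},
])
assert m["experiment_count"] == 2          # target: >= 3 per quarter
assert m["hypothesis_validation_rate"] == 0.5
assert m["action_closure_rate"] == 0.5
```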

Maturity Level

Level 1 – No chaos tests; resilience only known through production incidents
Level 2 – Occasional manual tests (restarting containers) without documentation
Level 3 – Structured experiments with hypotheses and documentation, quarterly
Level 4 – Production chaos with stop conditions; GameDay annually; FIS/Chaos Studio active
Level 5 – Continuous, automated low-blast-radius experiments; ML-based anomaly detection