WAF++

Best Practice: Configure Auto-Scaling

Context

Static capacity for a variable load is both inefficient and risky: too much capacity wastes money, too little degrades performance during load spikes. Auto-scaling closes this gap by adjusting capacity dynamically to actual demand.

Common problems without validated auto-scaling:

  • ASG configured, but scaling thresholds never tested under load – scaling triggers too late

  • Scaling based on CPU utilization, but the service is not CPU-bound

  • min=max=1 – effectively no scaling, despite ASG configuration

  • Scale-in too aggressive: instances are terminated before in-flight connections are drained

Target State

Validated auto-scaling:

  • Correct metrics: Scaling is based on the metric that reflects actual load

  • Tested: Load test demonstrates that scaling triggers within the latency SLO

  • Documented: Scaling limits and behavior are described in the runbook

  • Observed: Scaling events are monitored and anomalies are alerted

Technical Implementation

Step 1: Choose the Right Scaling Metric

Workload Type      Recommended Metric                        Rationale
HTTP API           ALBRequestCountPerTarget or P99 latency   CPU does not correlate directly with request load
Queue Worker       SQS ApproximateNumberOfMessagesVisible    Queue depth is a direct representation of the backlog
WebSocket Server   Active connections (custom metric)        CPU does not correctly reflect connection load
Batch Processor    Custom job-count metric                   Throughput-based, not CPU-based
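
For the WebSocket row above, the active-connection count must first be published as a custom CloudWatch metric before it can drive scaling. A minimal sketch using the AWS CLI; the namespace, metric name, and dimension value are illustrative assumptions, not prescribed names:

```shell
# Publish the current connection count (here: 742, as read from the server)
# periodically, e.g. every 60s from a cron job or sidecar on each instance.
aws cloudwatch put-metric-data \
  --namespace "Custom/WebSocket" \
  --metric-name ActiveConnections \
  --dimensions AutoScalingGroupName=asg-payment-api-production \
  --value 742 \
  --unit Count
```

A target-tracking policy can then reference this metric via a customized metric specification instead of a predefined one.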

Step 2: AWS Auto Scaling Group with Target Tracking

resource "aws_autoscaling_group" "api" {
  name                = "asg-payment-api-${var.environment}"
  min_size            = 2  # At least 2 for redundancy without cold starts
  max_size            = 20
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.api.arn]
  health_check_type   = "ELB"

  # Cooldown between scaling activities. Note: connection draining on
  # scale-in is handled by the target group's deregistration_delay,
  # not by this cooldown.
  default_cooldown = 300

  # Metrics from new instances count toward scaling decisions only
  # after this warm-up period
  default_instance_warmup = 120

  # Rolling replacement when the launch template changes
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
      instance_warmup        = 120
    }
  }

  launch_template {
    id      = aws_launch_template.api.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "payment-api-${var.environment}"
    propagate_at_launch = true
  }
}

# Target Tracking: ALB requests per target
resource "aws_autoscaling_policy" "api_request_tracking" {
  name                   = "payment-api-request-tracking"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.api.arn_suffix}/${aws_lb_target_group.api.arn_suffix}"
    }
    target_value       = 1000.0   # Requests per target per minute – calibrated via load test
    scale_in_cooldown  = 300      # Conservative: wait 5 min before scale-in
    scale_out_cooldown = 60       # Aggressive: scale out quickly under load
  }
}
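
Draining on scale-in happens at the target group, not in the ASG. A sketch of the aws_lb_target_group.api resource referenced above; port, health-check path, and the 30s delay are assumptions to be tuned against typical request duration:

```hcl
resource "aws_lb_target_group" "api" {
  name     = "payment-api-${var.environment}"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  # On scale-in, the ALB stops sending new requests and waits this long
  # for in-flight requests to finish before deregistering the instance.
  # Default is 300s; 30s is usually enough for short-lived API requests.
  deregistration_delay = 30

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```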

Step 3: Kubernetes HPA with Custom Metrics

# k8s/hpa-payment-api.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth  # KEDA-based SQS metric
          selector:
            matchLabels:
              queue: payment-jobs
        target:
          type: AverageValue
          averageValue: "10"  # 1 pod per 10 messages in the queue
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Pods
          value: 5        # Maximum 5 new pods at once
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
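
For an External metric with an AverageValue target, the replica count the HPA converges on follows ceil(currentMetric / target), clamped to minReplicas/maxReplicas. A minimal sketch of that arithmetic (not the controller's full algorithm, which also applies a tolerance and the stabilization windows above):

```javascript
// Desired replicas for an External metric with an AverageValue target,
// clamped to the HPA's minReplicas/maxReplicas.
function desiredReplicas(metricTotal, averageValueTarget, min, max) {
  const raw = Math.ceil(metricTotal / averageValueTarget);
  return Math.min(max, Math.max(min, raw));
}

// 137 messages in payment-jobs, target of 10 per pod -> 14 pods
console.log(desiredReplicas(137, 10, 2, 50)); // 14
// An empty queue still keeps minReplicas warm
console.log(desiredReplicas(0, 10, 2, 50));   // 2
```

This is also why the averageValue of "10" is the real tuning knob: halving it doubles the pod count for the same queue depth.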

Step 4: Validate Auto-Scaling with Load Test

// tests/performance/scaling-validation.js (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

// Gradual increase to 2x expected load
export const options = {
  stages: [
    { duration: '2m', target: 10 },   // Warm-up
    { duration: '5m', target: 50 },   // Normal load
    { duration: '5m', target: 100 },  // Scaling range
    { duration: '3m', target: 200 },  // 2x peak load
    { duration: '2m', target: 10 },   // Cool-down
  ],
  thresholds: {
    http_req_duration: ['p(95)<200', 'p(99)<500'],  // SLO
    errors: ['rate<0.001'],                          // < 0.1% errors
  },
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/v1/payments`);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 200ms': (r) => r.timings.duration < 200,
  });
  errorRate.add(res.status !== 200);
  sleep(0.1);
}

Step 5: Configure Scaling Monitoring

resource "aws_cloudwatch_metric_alarm" "scaling_out" {
  alarm_name          = "payment-api-scaling-out"
  alarm_description   = "ASG scaling out – check if max capacity approached. Runbook: https://wiki/scaling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "GroupDesiredCapacity"
  namespace           = "AWS/AutoScaling"
  period              = 60
  statistic           = "Maximum"
  threshold           = var.max_instances * 0.8  # Alert at 80% of max capacity
  alarm_actions       = [aws_sns_topic.ops.arn]
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.api.name
  }
}
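
GroupDesiredCapacity is only emitted if group metrics collection is enabled on the ASG, which is off by default. A fragment to add to the aws_autoscaling_group.api resource from Step 2 (the exact metric list is an assumption; enable what the dashboards need):

```hcl
# Inside resource "aws_autoscaling_group" "api":
# Group metrics are not published by default – without this,
# a GroupDesiredCapacity alarm never receives data.
enabled_metrics = [
  "GroupDesiredCapacity",
  "GroupInServiceInstances",
  "GroupMaxSize",
]
metrics_granularity = "1Minute"
```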

Common Anti-Patterns

  • Scale-out on CPU: CPU is often not the limiting resource for HTTP APIs. ALB request count or P99 latency are better choices.

  • min = max: Effectively no auto-scaling. Typical for teams that want fixed capacity "to be safe".

  • No instance warmup: New instances are immediately hit with full traffic before they are ready.

  • Untested scaling thresholds: A threshold of 80% CPU sounds reasonable – but under realistic load it may never trigger.

Metrics

  • Scaling events per day (baseline for anomaly detection)

  • Time from scaling trigger to new instance serving traffic (target: < 3 min)

  • Proportion of time at or near max capacity (target: < 5% of the time)

  • P99 latency during scaling events (MUST stay within SLO range)
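
The "proportion of time at or near max capacity" metric can be computed from a series of desired-capacity samples. A sketch, assuming evenly spaced samples and the same 80%-of-max threshold used by the alarm in Step 5:

```javascript
// Fraction of samples where desired capacity is at or above 80% of max.
// Assumes evenly spaced samples (e.g. one per minute from CloudWatch).
function timeNearMaxRatio(desiredSamples, maxSize, threshold = 0.8) {
  const near = desiredSamples.filter((d) => d >= maxSize * threshold).length;
  return near / desiredSamples.length;
}

// One day of hourly samples for an ASG with max_size = 20:
const samples = [2, 2, 3, 4, 6, 9, 14, 17, 18, 16, 12, 8,
                 5, 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2];
console.log(timeNearMaxRatio(samples, 20)); // 0.125 – above the 5% target
```

A result above 0.05 is the signal to raise max_size (or fix the bottleneck) before the next traffic peak.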

Maturity Level

Level 1 – Static capacity; no auto-scaling configuration
Level 2 – ASG configured, default CPU threshold, not tested
Level 3 – Correct metrics, load test validation, documented limits
Level 4 – Predictive scaling, scale-out duration measured within SLO
Level 5 – Autonomous capacity management, ML-based scaling policies