WAF++

Best Practice: Configure Auto-Scaling

Context

Static capacity for a variable load is both inefficient and risky: too much capacity wastes money, too little degrades performance during load spikes. Auto-scaling closes this gap by adjusting capacity dynamically to actual demand.

Common problems without validated auto-scaling:

  • ASG configured, but scaling thresholds never tested under load – scaling triggers too late

  • Scaling based on CPU utilization, but the service is not CPU-bound

  • min=max=1 – effectively no scaling, despite ASG configuration

  • Scale-in too aggressive: instances are terminated before in-flight connections are drained

Target State

Validated auto-scaling:

  • Correct metrics: Scaling is based on the metric that reflects actual load

  • Tested: Load test demonstrates that scaling triggers within the latency SLO

  • Documented: Scaling limits and behavior are described in the runbook

  • Observed: Scaling events are monitored and anomalies are alerted

Technical Implementation

Step 1: Choose the Right Scaling Metric

Workload Type      Recommended Metric                        Rationale
HTTP API           ALBRequestCountPerTarget or P99 latency   CPU does not correlate directly with request load
Queue Worker       SQS ApproximateNumberOfMessagesVisible    Queue depth is a direct representation of the backlog
WebSocket Server   Active connections (custom metric)        CPU does not correctly reflect connection load
Batch Processor    Custom job-count metric                   Throughput-based, not CPU-based
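
For the WebSocket row above, the active-connection count must first be published as a custom CloudWatch metric before it can drive scaling. A minimal sketch using the AWS CLI; the namespace, metric name, and dimension value are illustrative assumptions, not prescribed names:

```shell
# Publish the current connection count (here: 742, as read from the server)
# periodically, e.g. every 60s from a cron job or sidecar on each instance.
aws cloudwatch put-metric-data \
  --namespace "Custom/WebSocket" \
  --metric-name ActiveConnections \
  --dimensions AutoScalingGroupName=asg-payment-api-production \
  --value 742 \
  --unit Count
```

A target-tracking policy can then reference this metric via a customized metric specification instead of a predefined one.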

Step 2: AWS Auto Scaling Group with Target Tracking

resource "aws_autoscaling_group" "api" {
  name                = "asg-payment-api-${var.environment}"
  min_size            = 2  # At least 2 for redundancy without cold starts
  max_size            = 20
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.api.arn]
  health_check_type   = "ELB"

  # Cooldown between scaling activities. Note: connection draining on
  # scale-in is handled by the target group's deregistration_delay,
  # not by this cooldown.
  default_cooldown = 300

  # Metrics from new instances count toward scaling decisions only
  # after this warm-up period
  default_instance_warmup = 120

  # Rolling replacement when the launch template changes
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
      instance_warmup        = 120
    }
  }

  launch_template {
    id      = aws_launch_template.api.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "payment-api-${var.environment}"
    propagate_at_launch = true
  }
}

# Target Tracking: ALB requests per target
resource "aws_autoscaling_policy" "api_request_tracking" {
  name                   = "payment-api-request-tracking"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.api.arn_suffix}/${aws_lb_target_group.api.arn_suffix}"
    }
    target_value       = 1000.0   # Requests per target per minute – calibrated via load test
    scale_in_cooldown  = 300      # Conservative: wait 5 min before scale-in
    scale_out_cooldown = 60       # Aggressive: scale out quickly under load
  }
}
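
Draining on scale-in happens at the target group, not in the ASG. A sketch of the aws_lb_target_group.api resource referenced above; port, health-check path, and the 30s delay are assumptions to be tuned against typical request duration:

```hcl
resource "aws_lb_target_group" "api" {
  name     = "payment-api-${var.environment}"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  # On scale-in, the ALB stops sending new requests and waits this long
  # for in-flight requests to finish before deregistering the instance.
  # Default is 300s; 30s is usually enough for short-lived API requests.
  deregistration_delay = 30

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```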

Step 3: Kubernetes HPA with Custom Metrics

# k8s/hpa-payment-api.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth  # KEDA-based SQS metric
          selector:
            matchLabels:
              queue: payment-jobs
        target:
          type: AverageValue
          averageValue: "10"  # 1 pod per 10 messages in the queue
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Pods
          value: 5        # Maximum 5 new pods at once
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
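
For an External metric with an AverageValue target, the replica count the HPA converges on follows ceil(currentMetric / target), clamped to minReplicas/maxReplicas. A minimal sketch of that arithmetic (not the controller's full algorithm, which also applies a tolerance and the stabilization windows above):

```javascript
// Desired replicas for an External metric with an AverageValue target,
// clamped to the HPA's minReplicas/maxReplicas.
function desiredReplicas(metricTotal, averageValueTarget, min, max) {
  const raw = Math.ceil(metricTotal / averageValueTarget);
  return Math.min(max, Math.max(min, raw));
}

// 137 messages in payment-jobs, target of 10 per pod -> 14 pods
console.log(desiredReplicas(137, 10, 2, 50)); // 14
// An empty queue still keeps minReplicas warm
console.log(desiredReplicas(0, 10, 2, 50));   // 2
```

This is also why the averageValue of "10" is the real tuning knob: halving it doubles the pod count for the same queue depth.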

Step 4: Validate Auto-Scaling with Load Test

// tests/performance/scaling-validation.js (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

// Gradual increase to 2x expected load
export const options = {
  stages: [
    { duration: '2m', target: 10 },   // Warm-up
    { duration: '5m', target: 50 },   // Normal load
    { duration: '5m', target: 100 },  // Scaling range
    { duration: '3m', target: 200 },  // 2x peak load
    { duration: '2m', target: 10 },   // Cool-down
  ],
  thresholds: {
    http_req_duration: ['p(95)<200', 'p(99)<500'],  // SLO
    errors: ['rate<0.001'],                          // < 0.1% errors
  },
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/v1/payments`);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 200ms': (r) => r.timings.duration < 200,
  });
  errorRate.add(res.status !== 200);
  sleep(0.1);
}

Step 5: Configure Scaling Monitoring

resource "aws_cloudwatch_metric_alarm" "scaling_out" {
  alarm_name          = "payment-api-scaling-out"
  alarm_description   = "ASG scaling out – check if max capacity approached. Runbook: https://wiki/scaling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "GroupDesiredCapacity"
  namespace           = "AWS/AutoScaling"
  period              = 60
  statistic           = "Maximum"
  threshold           = var.max_instances * 0.8  # Alert at 80% of max capacity
  alarm_actions       = [aws_sns_topic.ops.arn]
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.api.name
  }
}
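
GroupDesiredCapacity is only emitted if group metrics collection is enabled on the ASG, which is off by default. A fragment to add to the aws_autoscaling_group.api resource from Step 2 (the exact metric list is an assumption; enable what the dashboards need):

```hcl
# Inside resource "aws_autoscaling_group" "api":
# Group metrics are not published by default – without this,
# a GroupDesiredCapacity alarm never receives data.
enabled_metrics = [
  "GroupDesiredCapacity",
  "GroupInServiceInstances",
  "GroupMaxSize",
]
metrics_granularity = "1Minute"
```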

Common Anti-Patterns

  • Scale-out on CPU: CPU is often not the limiting resource for HTTP APIs. ALB request count or P99 latency are better choices.

  • min = max: Effectively no auto-scaling. Typical for teams that want fixed capacity "to be safe".

  • No instance warmup: New instances are immediately hit with full traffic before they are ready.

  • Untested scaling thresholds: A threshold of 80% CPU sounds reasonable – but under realistic load it may never trigger.

Metrics

  • Scaling events per day (baseline for anomaly detection)

  • Time from scaling trigger to new instance serving traffic (target: < 3 min)

  • Proportion of time at or near max capacity (target: < 5% of the time)

  • P99 latency during scaling events (MUST stay within SLO range)
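
The "proportion of time at or near max capacity" metric can be computed from a series of desired-capacity samples. A sketch, assuming evenly spaced samples and the same 80%-of-max threshold used by the alarm in Step 5:

```javascript
// Fraction of samples where desired capacity is at or above 80% of max.
// Assumes evenly spaced samples (e.g. one per minute from CloudWatch).
function timeNearMaxRatio(desiredSamples, maxSize, threshold = 0.8) {
  const near = desiredSamples.filter((d) => d >= maxSize * threshold).length;
  return near / desiredSamples.length;
}

// One day of hourly samples for an ASG with max_size = 20:
const samples = [2, 2, 3, 4, 6, 9, 14, 17, 18, 16, 12, 8,
                 5, 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2];
console.log(timeNearMaxRatio(samples, 20)); // 0.125 – above the 5% target
```

A result above 0.05 is the signal to raise max_size (or fix the bottleneck) before the next traffic peak.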

Maturity Level

Level 1 – Static capacity; no auto-scaling configuration
Level 2 – ASG configured, default CPU threshold, not tested
Level 3 – Correct metrics, load test validation, documented limits
Level 4 – Predictive scaling, scale-out duration measured within SLO
Level 5 – Autonomous capacity management, ML-based scaling policies