# Best Practice: Configure Auto-Scaling

## Context
Static capacity for variable load is inefficient and risky: too much wastes cost, too little degrades performance during load spikes. Auto-scaling adjusts capacity dynamically to match actual demand.
Common problems without validated auto-scaling:

- ASG configured, but scaling thresholds never tested under load – scaling triggers too late
- Scaling based on CPU utilization, but the service is not CPU-bound
- min = max = 1 – effectively no scaling, despite ASG configuration
- Scale-in too aggressive: instances are terminated immediately, before connections are drained
## Related Controls

- WAF-PERF-020 – Auto-Scaling Configured & Tested
- WAF-PERF-080 – Serverless & Managed Services for Variable Load
## Target State

Validated auto-scaling:

- Correct metrics: Scaling is based on the metric that reflects actual load
- Tested: A load test demonstrates that scaling triggers within the latency SLO
- Documented: Scaling limits and behavior are described in the runbook
- Observed: Scaling events are monitored and anomalies are alerted
## Technical Implementation

### Step 1: Choose the Right Scaling Metric
| Workload Type | Recommended Metric | Rationale |
|---|---|---|
| HTTP API | `ALBRequestCountPerTarget` or P99 latency | CPU does not correlate directly with request load |
| Queue Worker | SQS `ApproximateNumberOfMessagesVisible` | Queue depth is a direct representation of backlog |
| WebSocket Server | Active connections (custom metric) | CPU does not correctly reflect connection load |
| Batch Processor | Custom job-count metric | Throughput-based, not CPU-based |
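For the queue-worker row, the mapping from queue depth to capacity can be sketched as a simple calculation. This is an illustrative helper, not an AWS API; the function name, per-worker figure, and min/max defaults are assumptions chosen to match the examples below.

```python
import math

def desired_workers(queue_depth: int, msgs_per_worker: int,
                    min_workers: int = 2, max_workers: int = 50) -> int:
    """One worker per `msgs_per_worker` visible messages,
    clamped to the configured min/max fleet size."""
    needed = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(max_workers, needed))
```

For example, 120 visible messages at 10 messages per worker yields 12 workers; an empty queue still keeps the minimum fleet running.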
### Step 2: AWS Auto Scaling Group with Target Tracking
```hcl
resource "aws_autoscaling_group" "api" {
  name                = "asg-payment-api-${var.environment}"
  min_size            = 2 # At least 2 for redundancy without cold starts
  max_size            = 20
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.api.arn]
  health_check_type   = "ELB"

  # Cooldown between simple-scaling activities; connection draining on
  # scale-in is configured via the target group's deregistration delay
  default_cooldown = 300

  # Warm-up: include new instances in scaling decisions only after 120 s
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
      instance_warmup        = 120
    }
  }

  launch_template {
    id      = aws_launch_template.api.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "payment-api-${var.environment}"
    propagate_at_launch = true
  }
}

# Target tracking: ALB requests per target
resource "aws_autoscaling_policy" "api_request_tracking" {
  name                   = "payment-api-request-tracking"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "TargetTrackingScaling"

  # New instances count toward the metric only after they are warm
  estimated_instance_warmup = 120

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.api.arn_suffix}/${aws_lb_target_group.api.arn_suffix}"
    }
    # Requests per target per minute – determined from load test.
    # EC2 target tracking has no per-direction cooldowns; it scales
    # out aggressively and scales in conservatively by design.
    target_value = 1000.0
  }
}
```
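The `target_value` above should come out of a load test, not a guess. A minimal sketch of that derivation, with hypothetical helper names and illustrative numbers (the headroom factor and fleet-size defaults are assumptions):

```python
import math

def derive_target_value(measured_max_rate: float, headroom: float = 0.8) -> float:
    # Scale out before saturation: set the target below the measured
    # per-instance limit (use the same unit as the scaling metric)
    return measured_max_rate * headroom

def expected_fleet_size(peak_rate: float, target_value: float,
                        min_size: int = 2, max_size: int = 20) -> int:
    # Sanity check: can max_size absorb the expected peak at this target?
    needed = math.ceil(peak_rate / target_value)
    return max(min_size, min(max_size, needed))
```

If the expected fleet size at peak load equals `max_size`, the configured maximum is too tight and should be raised before the first real traffic spike.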
### Step 3: Kubernetes HPA with Custom Metrics
```yaml
# k8s/hpa-payment-api.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth # KEDA-based SQS metric
          selector:
            matchLabels:
              queue: payment-jobs
        target:
          type: AverageValue
          averageValue: "10" # 1 pod per 10 messages in the queue
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately
      policies:
        - type: Pods
          value: 5 # At most 5 new pods at once
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
```

Note that `autoscaling/v2` names the behavior directions `scaleUp` and `scaleDown`; `scaleOut`/`scaleIn` are not valid field names in the HPA spec.
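The HPA's scaling decision for an `AverageValue` target follows the documented formula `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)`. A sketch of that calculation (function name and clamping defaults are illustrative, chosen to match the manifest above):

```python
import math

def hpa_desired_replicas(current_replicas: int, metric_total: float,
                         target_average: float,
                         min_replicas: int = 2, max_replicas: int = 50) -> int:
    """HPA core formula for AverageValue targets; for an external
    total metric it reduces to ceil(metric_total / target_average)."""
    current_average = metric_total / current_replicas
    desired = math.ceil(current_replicas * current_average / target_average)
    return max(min_replicas, min(max_replicas, desired))
```

With the manifest above, a backlog of 100 messages and a target of 10 per pod drives the deployment to 10 replicas, regardless of how many are currently running.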
### Step 4: Validate Auto-Scaling with Load Test
```javascript
// tests/performance/scaling-validation.js (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

// Gradual increase to 2x expected load
export const options = {
  stages: [
    { duration: '2m', target: 10 },  // Warm-up
    { duration: '5m', target: 50 },  // Normal load
    { duration: '5m', target: 100 }, // Scaling range
    { duration: '3m', target: 200 }, // 2x peak load
    { duration: '2m', target: 10 },  // Cool-down
  ],
  thresholds: {
    http_req_duration: ['p(95)<200', 'p(99)<500'], // SLO
    errors: ['rate<0.001'], // < 0.1% errors
  },
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/v1/payments`);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 200ms': (r) => r.timings.duration < 200,
  });
  errorRate.add(res.status !== 200);
  sleep(0.1);
}
```
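The threshold logic the k6 script enforces can also be applied offline, e.g. when analyzing exported latency samples after a run. A rough sketch using a nearest-rank percentile (an assumption for illustration; k6's own percentile estimator differs slightly):

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile – a simple stand-in for k6's estimator
    ordered = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

def meets_slo(durations_ms, p95_limit=200.0, p99_limit=500.0):
    # Mirrors the k6 thresholds: p(95) < 200 and p(99) < 500
    return (percentile(durations_ms, 95) < p95_limit
            and percentile(durations_ms, 99) < p99_limit)
```

This is useful for gating CI on archived test results without re-running the load test itself.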
### Step 5: Configure Scaling Monitoring
```hcl
resource "aws_cloudwatch_metric_alarm" "scaling_out" {
  alarm_name          = "payment-api-scaling-out"
  alarm_description   = "ASG scaling out – check if max capacity approached. Runbook: https://wiki/scaling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "GroupDesiredCapacity"
  namespace           = "AWS/AutoScaling"
  period              = 60
  statistic           = "Maximum"
  threshold           = var.max_instances * 0.8 # Alert at 80% of max capacity
  alarm_actions       = [aws_sns_topic.ops.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.api.name
  }
}
```
## Common Anti-Patterns

- Scale-out on CPU: CPU is often not the limiting resource for HTTP APIs. ALB request count or P99 latency are better choices.
- min = max: Effectively no auto-scaling. Typical for teams that want fixed capacity "to be safe".
- No instance warmup: New instances are immediately hit with full traffic before they are ready.
- Untested scaling thresholds: A threshold of 80% CPU sounds reasonable – but under realistic load it may never trigger.
## Metrics

- Scaling events per day (baseline for anomaly detection)
- Time from scaling trigger to new instance serving traffic (target: < 3 min)
- Proportion of time at or near max capacity (target: < 5% of the time)
- P99 latency during scaling events (MUST stay within SLO range)
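The "proportion of time at or near max capacity" metric can be computed from sampled desired-capacity values. A minimal sketch (function name and the 80% "near" factor are illustrative, chosen to match the alarm threshold above):

```python
def time_near_max_fraction(capacity_samples, max_size, near=0.8):
    # Fraction of samples at or above `near` x max_size
    threshold = max_size * near
    hits = sum(1 for c in capacity_samples if c >= threshold)
    return hits / len(capacity_samples)
```

A result above 0.05 breaches the < 5% target and signals that `max_size` should be raised.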
## Maturity Level

- Level 1 – Static capacity; no auto-scaling configuration
- Level 2 – ASG configured, default CPU threshold, not tested
- Level 3 – Correct metrics, load test validation, documented limits
- Level 4 – Predictive scaling, scale-out duration measured within SLO
- Level 5 – Autonomous capacity management, ML-based scaling policies