Best Practice: Rightsizing & Resource Optimization
Context
Over-provisioning is the most common and often the largest source of cloud waste. Instances that were oversized for "future growth" or "to be safe" incur that extra cost every month – often for years, because nobody reviews the decision.
At the same time, poor rightsizing is rarely malicious: sizing decisions are made early in the design process when usage data is unavailable. Without a structured review cycle, nothing changes.
Related Controls
- WAF-COST-030 – Resource Rightsizing & Idle Detection
- WAF-COST-080 – Commitment & Reserved Capacity Planning
Target State
- All compute resources with a `rightsizing-reviewed` tag (< 90 days old)
- Idle detection configured: resources < 5% CPU over 7 days are automatically identified
- Baseline workloads (>= 70% utilization over 30 days) covered by reservations
- Variable workloads on spot/preemptible instances
Introducing Rightsizing Tags
# Compliant: rightsizing tag with date present
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"

  tags = merge(module.mandatory_tags.tags, {
    rightsizing-reviewed = "2025-03-01"
    rightsizing-result   = "no-change" # no-change | downsize | upsize | pending
    capacity-commitment  = "on-demand" # on-demand | reserved | spot
  })
}

# Non-Compliant: no rightsizing tag
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"

  tags = {
    Name = "app-server"
    # Missing: rightsizing-reviewed – WAF-COST-030 Violation
  }
}
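The 90-day window from the target state can be checked mechanically once the tag value is available. The following is a minimal sketch, not part of the original control: `review_status` is a hypothetical helper name, and it assumes GNU `date` and tag values in `YYYY-MM-DD` form, as in the compliant example above.

```shell
#!/bin/sh
# Hypothetical helper: classify a rightsizing-reviewed tag value by its age.
# Assumes GNU date and ISO dates (YYYY-MM-DD).
review_status() {
  tag_value="$1"
  max_age_days="${2:-90}"   # 90-day window taken from the target state

  # A missing tag is a violation regardless of age.
  if [ -z "$tag_value" ]; then
    echo "missing"
    return 0
  fi

  # Reject values that GNU date cannot parse.
  if ! reviewed_epoch=$(date -u -d "$tag_value" +%s 2>/dev/null); then
    echo "invalid"
    return 0
  fi

  now_epoch=$(date -u +%s)
  age_days=$(( (now_epoch - reviewed_epoch) / 86400 ))

  if [ "$age_days" -lt "$max_age_days" ]; then
    echo "current"
  else
    echo "stale"
  fi
}
```

Calling `review_status "2025-03-01"` prints `current` or `stale` depending on today's date. How the tag value is fetched (CLI, inventory export, policy engine) is left open here.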
Configuring Idle Detection
AWS: CloudWatch Alarm for Idle Instances
resource "aws_cloudwatch_metric_alarm" "idle_instance" {
  for_each = toset(var.monitored_instance_ids)

  alarm_name          = "idle-instance-${each.value}"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 7     # 7 consecutive daily data points
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 86400 # 1 day in seconds
  statistic           = "Average"
  threshold           = 5     # < 5% CPU = idle
  alarm_description   = "Instance ${each.value} appears idle. Review for shutdown/rightsizing."

  dimensions = {
    InstanceId = each.value
  }

  alarm_actions = [aws_sns_topic.finops_alerts.arn]
}

resource "aws_sns_topic" "finops_alerts" {
  name = "finops-rightsizing-alerts"
}

resource "aws_sns_topic_subscription" "finops_email" {
  topic_arn = aws_sns_topic.finops_alerts.arn
  protocol  = "email"
  endpoint  = var.finops_team_email
}
Automated Idle Discovery Script
#!/bin/bash
# scripts/idle-discovery.sh
# Usage: idle-discovery.sh [INSTANCE_ID]
# Requires GNU date, bc, and the AWS CLI.

START_TIME=$(date -d '7 days ago' --iso-8601=seconds)
END_TIME=$(date --iso-8601=seconds)

# Print the 7-day average CPU utilization of one instance ("None" if there is no data)
avg_cpu() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --statistics Average \
    --start-time "$START_TIME" \
    --end-time "$END_TIME" \
    --period 604800 \
    --dimensions Name=InstanceId,Value="$1" \
    --query 'Datapoints[0].Average' \
    --output text 2>/dev/null
}

echo "=== Idle EC2 Instances (< 5% CPU, 7 days) ==="

# Optional: show the average for a single instance passed as the first argument
if [ -n "$1" ]; then
  echo "$1: $(avg_cpu "$1")"
fi

# Find all running instances with < 5% average CPU utilization over 7 days
aws ec2 describe-instances \
  --query 'Reservations[].Instances[?State.Name==`running`].InstanceId' \
  --output text | tr '\t' '\n' | while read -r INSTANCE_ID; do
  [ -z "$INSTANCE_ID" ] && continue
  AVG_CPU=$(avg_cpu "$INSTANCE_ID")
  # Skip instances without datapoints; "None" would break the bc comparison
  if [ -z "$AVG_CPU" ] || [ "$AVG_CPU" = "None" ]; then
    continue
  fi
  if (( $(echo "$AVG_CPU < 5" | bc -l) )); then
    echo "IDLE: $INSTANCE_ID (CPU: ${AVG_CPU}%)"
  fi
done
Rightsizing Rules
| Situation | Recommendation | Approach |
|---|---|---|
| CPU utilization P95 < 20% | Consider downsizing by one step within the instance family | Test the smaller instance in staging; re-measure P95 |
| CPU utilization P95 > 80% | Consider upsizing or introducing auto-scaling | Analyze the utilization pattern: constant load or spikes? |
| Idle (< 5% CPU, 7 days) | Shut down or include in non-prod pause policy | Contact owner; shut down within 14 days |
| Memory utilization < 30% | Replace memory-optimized instance with a standard (general-purpose) one | Only if the workload does not depend on in-memory caching |
| Dev/test running 24/7 | Auto-shutdown outside working hours | Schedule: Mon–Fri 08:00–20:00; off otherwise |
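The CPU-based rows of the rule table can be expressed as a small decision function. This is an illustrative sketch only: `rightsizing_recommendation` is a hypothetical name, the precedence (idle check first) is an assumption, and the memory and dev/test rows are left out because they need additional inputs.

```shell
#!/bin/sh
# Hypothetical helper: map utilization figures to a rightsizing-result value.
# Arguments: $1 = P95 CPU %, $2 = 7-day average CPU %.
# awk does the comparisons so fractional percentages work.
rightsizing_recommendation() {
  awk -v p95="$1" -v avg7d="$2" 'BEGIN {
    if (avg7d < 5)      print "idle"       # shut down or pause (owner contact first)
    else if (p95 < 20)  print "downsize"   # one step smaller, then re-measure
    else if (p95 > 80)  print "upsize"     # or introduce auto-scaling
    else                print "no-change"
  }'
}
```

For example, `rightsizing_recommendation 15 9` prints `downsize`. Apart from `idle`, the outputs reuse the values documented for the `rightsizing-result` tag.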
Auto-Shutdown for Non-Production
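# Non-production auto-shutdown with EventBridge schedules and SSM Automation
# Note: EventBridge cron expressions are evaluated in UTC.
resource "aws_cloudwatch_event_rule" "stop_dev_instances" {
  name                = "stop-dev-instances-evening"
  description         = "Stop development instances outside business hours"
  schedule_expression = "cron(0 20 ? * MON-FRI *)" # 20:00 UTC Mon–Fri
}

resource "aws_cloudwatch_event_rule" "start_dev_instances" {
  name                = "start-dev-instances-morning"
  description         = "Start development instances at beginning of business hours"
  schedule_expression = "cron(0 8 ? * MON-FRI *)" # 08:00 UTC Mon–Fri
}

resource "aws_cloudwatch_event_target" "stop_dev" {
  rule     = aws_cloudwatch_event_rule.stop_dev_instances.name
  arn      = "arn:aws:ssm:${var.region}::automation-definition/AWS-StopEC2Instance"
  role_arn = aws_iam_role.scheduler.arn

  input = jsonencode({
    InstanceId           = aws_instance.dev[*].id
    AutomationAssumeRole = [aws_iam_role.scheduler.arn]
  })
}

# Assumed counterpart to stop_dev: without a target, the start rule triggers nothing
resource "aws_cloudwatch_event_target" "start_dev" {
  rule     = aws_cloudwatch_event_rule.start_dev_instances.name
  arn      = "arn:aws:ssm:${var.region}::automation-definition/AWS-StartEC2Instance"
  role_arn = aws_iam_role.scheduler.arn

  input = jsonencode({
    InstanceId           = aws_instance.dev[*].id
    AutomationAssumeRole = [aws_iam_role.scheduler.arn]
  })
}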
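To put a number on the schedule, the share of the week during which instances are stopped can be computed directly. A minimal sketch, assuming a fixed weekly schedule; `schedule_savings_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Share of the 168-hour week during which an instance is stopped,
# given the number of running days per week and running hours per day.
schedule_savings_pct() {
  awk -v d="$1" -v h="$2" 'BEGIN {
    on_hours = d * h
    printf "%.1f\n", (1 - on_hours / 168) * 100
  }'
}

schedule_savings_pct 5 12   # Mon-Fri, 08:00-20:00 -> prints 64.3
```

With 60 running hours out of 168, roughly 64% of compute hours are avoided. The figure covers instance hours only; attached storage such as EBS volumes continues to be billed while an instance is stopped.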
Reservation Optimization
When to Reserve?
Rule of thumb: resources with >= 70% utilization over 30 days are reservation candidates.
# Tag for commitment tracking
resource "aws_instance" "baseline_app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "c5.2xlarge"

  tags = merge(module.mandatory_tags.tags, {
    rightsizing-reviewed = "2025-03-01"
    capacity-commitment  = "reserved"       # Status: reserved
    commitment-type      = "1yr-no-upfront" # Document the reservation type
    commitment-expiry    = "2026-03-01"     # Expiry date of the reservation
  })
}
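One way to reason about the 70% rule of thumb is a break-even calculation. The sketch below assumes that "utilization" is read as the share of hours an instance actually runs, and the 30% discount in the example is purely illustrative; both are assumptions, not statements from this document. A commitment bills every hour at (1 - discount) times the on-demand rate, so it is cheaper than on-demand once the instance runs for more than (1 - discount) of the period.

```shell
#!/bin/sh
# Hypothetical helpers for the break-even reasoning described above.

# $1 = discount in % -> prints the running-time share (%) at which
# a commitment costs the same as on-demand.
reservation_break_even_pct() {
  awk -v discount="$1" 'BEGIN { printf "%.0f\n", (100 - discount) }'
}

# $1 = share of hours running over 30 days (%), $2 = discount (%)
is_reservation_candidate() {
  awk -v used="$1" -v discount="$2" 'BEGIN {
    if (used >= 100 - discount) print "yes"; else print "no"
  }'
}

reservation_break_even_pct 30      # prints 70
is_reservation_candidate 75 30     # prints yes
```

Under these assumptions a 30% discount gives exactly the 70% threshold; larger discounts lower the break-even point, so the rule of thumb is conservative for them.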
Savings Plans vs. Reserved Instances
| Criterion | Reserved Instances | Savings Plans |
|---|---|---|
| Flexibility | Tied to instance type/region | Tied to a committed spend per hour (more flexible) |
| Discount | Up to 60% (No-Upfront 3yr) | Up to 66% (Compute Savings Plan) |
| Risk with rightsizing | Unused RIs still cost money | Compute Savings Plans keep applying when the instance family or size changes |
| Recommendation | For very stable workloads with a known instance type | For workloads with variable instance types or rightsizing activity |
Metrics
- Idle instance rate: % of instances with < 5% CPU over 7 days (target: < 3%)
- Rightsizing coverage: % of instances with `rightsizing-reviewed` tag < 90 days (target: >= 80%)
- RI utilization: utilization rate of reservations (target: >= 80%)
- Average instance utilization P95: across all compute resources (benchmark: 40–70%)
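The first two metrics can be derived from a simple per-instance export. The sketch below assumes a hypothetical whitespace-separated input format (instance ID, 7-day average CPU in %, days since the `rightsizing-reviewed` date or `-` if untagged); neither the format nor the `fleet_metrics` name comes from this document.

```shell
#!/bin/sh
# Hypothetical input on stdin, one instance per line:
#   <instance-id> <7-day avg CPU %> <days since rightsizing-reviewed, or "-" if untagged>
fleet_metrics() {
  awk '
    NF >= 3 {
      total++
      if ($2 + 0 < 5) idle++                          # idle: < 5% CPU over 7 days
      if ($3 != "-" && $3 + 0 < 90) reviewed++        # reviewed within 90 days
    }
    END {
      if (total == 0) { print "no instances"; exit }
      printf "idle_rate=%.1f%% coverage=%.1f%%\n", 100 * idle / total, 100 * reviewed / total
    }'
}
```

Fed with four instances of which one is idle and two were reviewed within 90 days, it prints `idle_rate=25.0% coverage=50.0%`, which can then be compared against the targets of < 3% and >= 80%.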