Best Practice: Rightsizing & Resource Optimization

Context

Over-provisioning is the most common and often the largest source of cloud waste. Instances that were oversized for "future growth" or "to be safe" incur that extra cost every month, often for years without review.

At the same time, poor rightsizing is rarely malicious: sizing decisions are made early in the design process, when usage data is not yet available. Without a structured review cycle, those early decisions are never revisited.

Related Controls

WAF-COST-030 – Resource Rightsizing & Idle Detection
WAF-COST-080 – Commitment & Reserved Capacity Planning

Target State

All compute resources carry a rightsizing-reviewed tag dated within the last 90 days
Idle detection is configured: resources averaging < 5% CPU over 7 days are identified automatically
Baseline workloads (>= 70% utilization over 30 days) are covered by reservations
Variable workloads run on spot/preemptible instances

Introducing Rightsizing Tags

# Compliant: rightsizing tag with date present
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"

  tags = merge(module.mandatory_tags.tags, {
    rightsizing-reviewed  = "2025-03-01"
    rightsizing-result    = "no-change"  # no-change | downsize | upsize | pending
    capacity-commitment   = "on-demand"  # on-demand | reserved | spot
  })
}

# Non-Compliant: no rightsizing tag
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"
  tags = {
    Name = "app-server"
    # Missing: rightsizing-reviewed – WAF-COST-030 Violation
  }
}
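
The tag is only useful if its date is checked. The following is a minimal sketch of such a check, assuming GNU date and tag values in YYYY-MM-DD format; the function name review_is_current is illustrative and not part of any existing tooling.

```shell
#!/bin/bash
# Sketch: decide whether a rightsizing-reviewed tag value is still current.
# Assumptions: GNU date is available; tag values use the YYYY-MM-DD format.

# review_is_current REVIEW_DATE [TODAY] [MAX_AGE_DAYS]
# Returns 0 if REVIEW_DATE is less than MAX_AGE_DAYS (default 90) before TODAY.
review_is_current() {
  local review_date="$1"
  local today="${2:-$(date -u +%F)}"
  local max_age_days="${3:-90}"
  local review_s today_s age_days

  # A missing or malformed tag value counts as non-compliant
  [[ "$review_date" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] || return 1

  review_s=$(date -u -d "$review_date" +%s 2>/dev/null) || return 1
  today_s=$(date -u -d "$today" +%s 2>/dev/null) || return 1

  age_days=$(( (today_s - review_s) / 86400 ))

  # Dates in the future are rejected as well
  (( age_days >= 0 && age_days < max_age_days ))
}

if review_is_current "2025-03-01" "2025-05-01"; then
  echo "compliant"
else
  echo "review overdue"
fi
```

Passing the reference date explicitly keeps the check reproducible; in a real audit the second argument would simply be omitted so that the current date is used.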

Configuring Idle Detection

AWS: CloudWatch Alarm for Idle Instances

resource "aws_cloudwatch_metric_alarm" "idle_instance" {
  for_each = toset(var.monitored_instance_ids)

  alarm_name          = "idle-instance-${each.value}"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 7    # 7 data points
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 86400  # 1 day in seconds
  statistic           = "Average"
  threshold           = 5      # < 5% CPU = idle
  alarm_description   = "Instance ${each.value} appears idle. Review for shutdown/rightsizing."

  dimensions = {
    InstanceId = each.value
  }

  alarm_actions = [aws_sns_topic.finops_alerts.arn]
}

resource "aws_sns_topic" "finops_alerts" {
  name = "finops-rightsizing-alerts"
}

resource "aws_sns_topic_subscription" "finops_email" {
  topic_arn = aws_sns_topic.finops_alerts.arn
  protocol  = "email"
  endpoint  = var.finops_team_email
}

Automated Idle Discovery Script

#!/bin/bash
# scripts/idle-discovery.sh
# Lists running EC2 instances whose average CPU over the last 7 days is below 5%.
# Requires: AWS CLI, GNU date, bc.

START=$(date -d '7 days ago' --iso-8601=seconds)
END=$(date --iso-8601=seconds)

# Print the 7-day average CPU utilization of one instance ("None" if no data).
avg_cpu() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --statistics Average \
    --start-time "$START" \
    --end-time "$END" \
    --period 604800 \
    --dimensions Name=InstanceId,Value="$1" \
    --query 'Datapoints[0].Average' \
    --output text 2>/dev/null
}

echo "=== Idle EC2 Instances (< 5% CPU, 7 days) ==="

# Check every running instance
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text | tr '\t' '\n' | while read -r INSTANCE_ID; do

  AVG_CPU=$(avg_cpu "$INSTANCE_ID")

  # Skip instances without datapoints (e.g. launched moments ago)
  if [[ -z "$AVG_CPU" || "$AVG_CPU" == "None" ]]; then
    continue
  fi

  if (( $(echo "$AVG_CPU < 5" | bc -l) )); then
    echo "IDLE: $INSTANCE_ID (CPU: ${AVG_CPU}%)"
  fi
done

Rightsizing Rules

Situation	Recommendation	Approach
CPU utilization P95 < 20%	Consider downsizing by one instance size within the family	Test the smaller instance in staging; re-measure P95
CPU utilization P95 > 80%	Consider upsizing or introducing auto-scaling	Analyze the utilization pattern: constant load or spikes?
Idle (< 5% CPU, 7 days)	Shut down or include in the non-prod pause policy	Contact the owner; shut down within 14 days
Memory utilization < 30%	Replace memory-optimized instance with a general-purpose one	Only if the workload does not rely on in-memory caching
Dev/test running 24/7	Auto-shutdown outside working hours	Schedule: Mon–Fri 08:00–20:00; off otherwise
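
The CPU-based rows of this table can be sketched as a small decision function. This is an illustration only: the function name and the "idle" label are assumptions, the other labels mirror the rightsizing-result tag values, and the idle check is given priority over the P95 rules by choice.

```shell
#!/bin/bash
# Sketch: map CPU measurements to a recommendation from the rules table.
# Assumption: both inputs are percentages; awk handles the decimal comparison.

# rightsizing_recommendation AVG_CPU_7D P95_CPU
rightsizing_recommendation() {
  local avg_cpu_7d="$1" p95_cpu="$2"
  awk -v avg="$avg_cpu_7d" -v p95="$p95_cpu" 'BEGIN {
    if (avg + 0 < 5)       print "idle"       # < 5% average CPU over 7 days
    else if (p95 + 0 < 20) print "downsize"   # P95 below 20%
    else if (p95 + 0 > 80) print "upsize"     # P95 above 80% (or add auto-scaling)
    else                   print "no-change"
  }'
}

rightsizing_recommendation 2.1 9
rightsizing_recommendation 12 18
rightsizing_recommendation 60 91
rightsizing_recommendation 35 55
```

The memory and dev/test rows are left out because they depend on the instance family and environment, which a CPU-only function cannot see.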

Auto-Shutdown for Non-Production

# Non-production auto-shutdown with EventBridge schedules and SSM Automation
# Note: EventBridge cron expressions are evaluated in UTC
resource "aws_cloudwatch_event_rule" "stop_dev_instances" {
  name                = "stop-dev-instances-evening"
  description         = "Stop development instances outside business hours"
  schedule_expression = "cron(0 20 ? * MON-FRI *)"  # 20:00 Mon–Fri
}

resource "aws_cloudwatch_event_rule" "start_dev_instances" {
  name                = "start-dev-instances-morning"
  description         = "Start development instances at beginning of business hours"
  schedule_expression = "cron(0 8 ? * MON-FRI *)"  # 8:00 Mon–Fri
}

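resource "aws_cloudwatch_event_target" "stop_dev" {
  rule     = aws_cloudwatch_event_rule.stop_dev_instances.name
  arn      = "arn:aws:ssm:${var.region}::automation-definition/AWS-StopEC2Instance"
  role_arn = aws_iam_role.scheduler.arn

  input = jsonencode({
    InstanceId           = aws_instance.dev[*].id
    AutomationAssumeRole = [aws_iam_role.scheduler.arn]
  })
}

# Matching target so that the morning rule actually starts the instances
resource "aws_cloudwatch_event_target" "start_dev" {
  rule     = aws_cloudwatch_event_rule.start_dev_instances.name
  arn      = "arn:aws:ssm:${var.region}::automation-definition/AWS-StartEC2Instance"
  role_arn = aws_iam_role.scheduler.arn

  input = jsonencode({
    InstanceId           = aws_instance.dev[*].id
    AutomationAssumeRole = [aws_iam_role.scheduler.arn]
  })
}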
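
To get a feel for what such a schedule is worth, the running hours can be worked out with simple arithmetic. The sketch below assumes a 168-hour week and a single start/stop window per working day; the function names are illustrative.

```shell
#!/bin/bash
# Sketch: running hours and runtime reduction for a weekday schedule.

# scheduled_hours_per_week START_HOUR STOP_HOUR DAYS_PER_WEEK
scheduled_hours_per_week() {
  local start_hour="$1" stop_hour="$2" days_per_week="$3"
  # 10# forces base 10 so that values such as "08" are not read as octal
  echo $(( (10#$stop_hour - 10#$start_hour) * days_per_week ))
}

# runtime_reduction_percent RUNNING_HOURS_PER_WEEK
runtime_reduction_percent() {
  local running_hours="$1"
  awk -v h="$running_hours" 'BEGIN { printf "%.1f\n", (168 - h) / 168 * 100 }'
}

hours=$(scheduled_hours_per_week 08 20 5)
echo "Running hours per week: $hours of 168"
echo "Runtime reduction: $(runtime_reduction_percent "$hours")%"
```

For the Mon–Fri 08:00–20:00 schedule this gives 60 of 168 hours, a reduction of roughly 64% in instance hours. Only compute hours are affected; attached storage keeps being billed while an instance is stopped.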

Reservation Optimization

When to Reserve?

Rule of thumb: resources with >= 70% utilization over 30 days are reservation candidates.
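
One way to sanity-check the threshold is a break-even calculation. A reservation is billed for every hour at the discounted rate, so it beats on-demand pricing only when the resource runs for more than (100 - discount)% of the hours. The sketch below rests on that simplification and on reading "utilization" as the share of hours the resource is actually needed; the function names and the 40% discount are illustrative.

```shell
#!/bin/bash
# Sketch: break-even check for a reservation.
# Assumption: "utilization" is the share of hours the resource actually runs.

# reservation_break_even_percent DISCOUNT_PERCENT
# Prints the share of running hours above which a reservation is cheaper.
reservation_break_even_percent() {
  local discount_percent="$1"
  awk -v d="$discount_percent" 'BEGIN { printf "%.1f\n", 100 - d }'
}

# is_reservation_candidate UTILIZATION_PERCENT [THRESHOLD_PERCENT]
# Returns 0 if utilization reaches the threshold (default 70, the rule of thumb).
is_reservation_candidate() {
  local utilization="$1" threshold="${2:-70}"
  awk -v u="$utilization" -v t="$threshold" 'BEGIN { exit !(u + 0 >= t + 0) }'
}

echo "Break-even at 40% discount: $(reservation_break_even_percent 40)% of hours"
is_reservation_candidate 82 && echo "82% utilization: reservation candidate"
```

With a 40% discount the break-even point lies at 60% of hours, so the 70% threshold leaves some margin for workloads that shrink after a rightsizing review.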

# Tag for commitment tracking
resource "aws_instance" "baseline_app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "c5.2xlarge"

  tags = merge(module.mandatory_tags.tags, {
    rightsizing-reviewed = "2025-03-01"
    capacity-commitment  = "reserved"        # Status: reserved
    commitment-type      = "1yr-no-upfront"  # Document the reservation type
    commitment-expiry    = "2026-03-01"      # Expiry date of the reservation
  })
}

Savings Plans vs. Reserved Instances

Criterion	Reserved Instances	Savings Plans
Flexibility	Tied to instance type/region	Tied to spend amount/hour (more flexible)
Discount	Up to 60% (No-Upfront 3yr)	Up to 66% (Compute Savings Plan)
Risk with rightsizing	Unused RIs still cost money	Compute Savings Plans keep applying across instance families, sizes, and regions
Recommendation	For very stable workloads with a known instance type	For workloads with variable instance types or rightsizing activity

Metrics

Idle instance rate: % of instances with < 5% CPU over 7 days (target: < 3%)
Rightsizing coverage: % of instances with rightsizing-reviewed tag < 90 days (target: >= 80%)
RI utilization: share of purchased reservation hours actually consumed (target: >= 80%)
Average instance utilization P95: across all compute resources (benchmark: 40–70%)
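
The first two metrics are simple ratios over the instance inventory. A minimal sketch of the calculation, using made-up fleet counts purely for illustration (the percent helper is hypothetical):

```shell
#!/bin/bash
# Sketch: compute ratio metrics from instance counts.

# percent PART TOTAL -> prints PART/TOTAL as a percentage with one decimal
percent() {
  local part="$1" total="$2"
  if (( total == 0 )); then
    echo "n/a"   # avoid division by zero for an empty inventory
    return 1
  fi
  awk -v p="$part" -v t="$total" 'BEGIN { printf "%.1f\n", p / t * 100 }'
}

# Hypothetical fleet counts, for illustration only
total_instances=120
idle_instances=3
reviewed_instances=96

echo "Idle instance rate:   $(percent "$idle_instances" "$total_instances")% (target: < 3%)"
echo "Rightsizing coverage: $(percent "$reviewed_instances" "$total_instances")% (target: >= 80%)"
```

In practice the counts would come from the idle discovery output and from a tag query over the inventory.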