# Best Practice: Compute Sizing & Instance Selection

## Context
Compute sizing is one of the most common sources of both resource waste and performance problems. Over-provisioned instances waste money; under-provisioned instances cause latency spikes and failures under load.
Common problems without structured sizing:

- Instance types are chosen by gut feeling or "that’s how it’s always been"
- Previous-generation instances (t2, m4, c4) keep running for years without review
- CPU utilization below 5% – a classic sign of unexamined over-provisioning
- No sizing documentation: nobody remembers why a particular instance type was chosen
## Related Controls

- WAF-PERF-010 – Compute Instance Type & Sizing Validated
## Target State

A mature sizing strategy is:

- **Data-driven:** sizing decisions are based on measured CPU/memory/network baselines
- **Documented:** every production resource has a sizing rationale in an ADR or sizing sheet
- **Current:** quarterly reviews; cloud provider upgrade recommendations are tracked
- **Current generation:** all resources use current instance generations
## Technical Implementation

### Step 1: Collect Baseline Data
Before making a sizing decision, 2–4 weeks of metrics must be available:
```bash
# AWS: Query CloudWatch CPU metrics for an EC2 instance
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2026-03-01T00:00:00Z \
  --end-time 2026-03-18T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum \
  --query 'Datapoints[*].[Timestamp,Average,Maximum]' \
  --output table
```
```bash
# GCP: Machine type recommendation
gcloud recommender recommendations list \
  --project=my-project \
  --location=europe-west3-a \
  --recommender=google.compute.instance.MachineTypeRecommender
```
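The raw datapoints from either CLI reduce to the two numbers a sizing decision hinges on: the average and the p95. A minimal sketch in Python; the function name and thresholds are illustrative (they mirror the quarterly checklist in Step 5), not part of any CLI output:

```python
import math

def summarize_cpu(samples: list[float]) -> dict:
    """Reduce hourly CPU samples (percent) to the figures a sizing review uses.

    Thresholds mirror the quarterly checklist: avg < 10% marks a
    rightsizing candidate, p95 > 80% an upgrade candidate.
    """
    if not samples:
        raise ValueError("need at least one datapoint")
    ordered = sorted(samples)
    avg = sum(ordered) / len(ordered)
    # Nearest-rank p95: the smallest sample covering 95% of the distribution.
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]
    if avg < 10:
        status = "rightsizing_candidate"
    elif p95 > 80:
        status = "upgrade_candidate"
    else:
        status = "appropriately_sized"
    return {"cpu_average_pct": round(avg, 1), "cpu_p95_pct": p95, "status": status}

# Example: five hourly averages
# summarize_cpu([18, 22, 45, 12, 30])
# → {'cpu_average_pct': 25.4, 'cpu_p95_pct': 45, 'status': 'appropriately_sized'}
```

Running this over the full 2–4 week window yields the `measured_baseline` figures recorded in the next step.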
### Step 2: Create a Sizing Document
```yaml
# docs/sizing/payment-api.yml
service: "payment-api"
last_reviewed: "2026-03-18"
reviewed_by: "platform-team"

current:
  provider: "aws"
  instance_type: "t3.medium"
  vcpu: 2
  memory_gb: 4

measured_baseline:
  period: "2026-02-15 to 2026-03-15"
  cpu_average_pct: 18
  cpu_p95_pct: 45
  memory_average_gb: 2.1
  memory_max_gb: 2.8
  network_avg_mbps: 12

assessment:
  status: "appropriately_sized"
  rationale: >
    CPU headroom adequate for 2.5x spikes before auto-scaling triggers.
    Memory utilization at 52% of available; sufficient headroom.
  next_review: "2026-06-18"

auto_scaling:
  min_instances: 2
  max_instances: 10
  scale_out_trigger: "ALBRequestCountPerTarget > 800"
```
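Sizing documents in this shape are easy to lint in CI. A small sketch, assuming the required keys from the example above; the 90-day staleness limit is an illustrative default, not part of the control:

```python
from datetime import date

# Keys every sizing document must carry (mirrors the example above).
REQUIRED_KEYS = {"service", "last_reviewed", "current", "measured_baseline", "assessment"}

def lint_sizing_doc(doc: dict, today: date, max_age_days: int = 90) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - doc.keys())]
    reviewed = doc.get("last_reviewed")
    if reviewed is not None:
        age = (today - date.fromisoformat(reviewed)).days
        if age > max_age_days:
            problems.append(f"last review {age} days ago (limit {max_age_days})")
    return problems
```

Run against every file under `docs/sizing/` (parsed with any YAML loader), this catches both undocumented resources and stale reviews before the quarterly cycle does.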
### Step 3: Use Current Generation in Terraform
```hcl
# Compliant: current generation, explicit sizing rationale as comment
locals {
  # t3.medium chosen based on docs/sizing/payment-api.yml
  # CPU avg 18%, P95 45% – 2 vCPU provides sufficient headroom for 2.5x spikes
  instance_type = "t3.medium"
}

resource "aws_launch_template" "app" {
  name_prefix   = "lt-payment-api-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = local.instance_type

  tag_specifications {
    resource_type = "instance"

    tags = {
      workload        = "payment-api"
      sizing-reviewed = "2026-03-18"
      owner           = "platform-team"
    }
  }
}
```
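The current-generation rule is mechanical enough to enforce in a pipeline. A sketch that flags the previous-generation families named in this document; in practice the list would need to track the provider's catalogue rather than being hard-coded:

```python
# Families this document names as previous generation (illustrative, not exhaustive).
PREVIOUS_GEN_FAMILIES = {"t2", "m4", "c4"}

def is_previous_generation(instance_type: str) -> bool:
    """True for types such as 't2.large'; the family precedes the first dot."""
    return instance_type.split(".", 1)[0] in PREVIOUS_GEN_FAMILIES
```

Wired into a `terraform plan` review or a tagging audit, this turns the generation rule from a checklist item into a hard check.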
### Step 4: Use AWS Compute Optimizer
```hcl
# Activate Compute Optimizer enrollment
resource "aws_computeoptimizer_enrollment_status" "main" {
  status = "Active"
}

# Retrieve recommendations via CLI and include in sizing review:
#   aws compute-optimizer get-ec2-instance-recommendations \
#     --filters name=Finding,values=UNDER_PROVISIONED,OVER_PROVISIONED
```
### Step 5: Quarterly Review Process
```yaml
# docs/processes/compute-sizing-review.yml
frequency: "quarterly"
next_review: "2026-06-18"

checklist:
  - Review Compute Optimizer recommendations
  - Identify instances with CPU avg < 10% → rightsizing candidates
  - Identify instances with CPU p95 > 80% → upgrade candidates
  - Identify previous-generation instances (t2, m4, c4) → migration plan
  - Update sizing documents

output:
  - Sizing review report (which changes were made)
  - Jira tickets for rightsizing actions
  - Updated sizing documents in docs/sizing/
```
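The checklist thresholds can be applied automatically to produce the ticket list. A sketch over per-instance baseline records; the field names are illustrative, following the sizing-document format from Step 2:

```python
def review_actions(fleet: list[dict]) -> dict[str, list[str]]:
    """Bucket instances by the quarterly-review checklist rules.

    Each record carries id, instance_type, cpu_average_pct and cpu_p95_pct
    (field names assumed to match the sizing document format).
    """
    actions = {"rightsize": [], "upgrade": [], "migrate_generation": []}
    for inst in fleet:
        # Previous-generation families as named in this document.
        if inst["instance_type"].split(".", 1)[0] in {"t2", "m4", "c4"}:
            actions["migrate_generation"].append(inst["id"])
        if inst["cpu_average_pct"] < 10:
            actions["rightsize"].append(inst["id"])
        elif inst["cpu_p95_pct"] > 80:
            actions["upgrade"].append(inst["id"])
    return actions
```

Each non-empty bucket maps directly to one Jira ticket in the review output.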
## Common Anti-Patterns

- **"We needed more in the past":** historical peaks do not justify permanent over-provisioning; auto-scaling handles peak load.
- **"t2.large was always good enough":** t2 is a previous generation; t3 offers better performance at a lower price.
- **"We’d rather provision more to be safe":** vertical sizing increases are not a substitute for auto-scaling.
- **Sizing from AWS defaults:** default recommendations are not workload-specific.
## Metrics

- Average CPU utilization across all compute resources (target: 20–70%)
- Proportion of resources with CPU avg < 10% (target: < 5% of resources)
- Proportion of previous-generation instances (target: 0%)
- Proportion of resources without sizing documentation (target: 0% for production)
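All four metrics can be computed from the same per-instance records used in the quarterly review. A sketch with illustrative field names (the `sizing_doc` field, holding the document path or `None`, is an assumption):

```python
def fleet_metrics(fleet: list[dict]) -> dict:
    """Compute the four fleet-level sizing metrics (values in percent)."""
    n = len(fleet)
    prev_gen = {"t2", "m4", "c4"}  # previous-generation families per this document
    return {
        "cpu_avg_pct": round(sum(i["cpu_average_pct"] for i in fleet) / n, 1),
        "pct_cpu_under_10": round(100 * sum(i["cpu_average_pct"] < 10 for i in fleet) / n, 1),
        "pct_previous_gen": round(
            100 * sum(i["instance_type"].split(".", 1)[0] in prev_gen for i in fleet) / n, 1
        ),
        "pct_undocumented": round(100 * sum(not i.get("sizing_doc") for i in fleet) / n, 1),
    }
```

Tracking these per quarter makes progress against the targets visible in the sizing review report.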
## Maturity Level

- **Level 1** – No standard; sizing by intuition or historical values
- **Level 2** – Experience-based sizing; occasional reviews
- **Level 3** – Data-driven sizing with documented baselines; quarterly review
- **Level 4** – Compute Optimizer integration; automatic rightsizing tickets
- **Level 5** – ML-based predictive sizing; self-optimizing capacity