Best Practice: Optimize Database Performance
Context
Database performance problems are the most common cause of application latency degradation. A single missing index on a high-traffic table can saturate I/O, exhaust connection pools, and trigger cascading timeouts throughout the entire application.
Typical problems without structured database performance management:
- Full table scans on million-row tables in the critical query path
- Connection pool exhaustion caused by long-running queries that block connections
- Query performance regressions after schema changes silently reaching production
- Performance Insights not active – diagnosis only possible through incident-driven analysis
Related Controls
- WAF-PERF-040 – Database Performance Baseline & Index Strategy
- WAF-PERF-090 – Storage I/O Performance & Throughput Optimization
Target State
Performance-mature database environment:
- Baseline known: P50/P95/P99 execution times for the top 20 queries documented
- Indexes strategic: all high-frequency queries use indexes; no full table scans
- Monitoring active: Performance Insights and the slow query log running continuously
- Connection pools: explicitly dimensioned and protected with PgBouncer/ProxySQL
Technical Implementation
Step 1: Activate Performance Insights (AWS RDS)
resource "aws_db_instance" "main" {
  identifier        = "db-payment-${var.environment}"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  storage_type      = "gp3"
  iops              = 3000

  # Performance Insights
  performance_insights_enabled          = true
  performance_insights_retention_period = 7 # 7 days free; 731 days paid

  # Enhanced Monitoring
  monitoring_interval = 60 # seconds
  monitoring_role_arn = aws_iam_role.rds_monitoring.arn

  # Slow Query Logs
  enabled_cloudwatch_logs_exports = ["postgresql"]
  parameter_group_name            = aws_db_parameter_group.main.name

  # Security
  deletion_protection = true
  skip_final_snapshot = false
  multi_az            = true

  tags = {
    workload    = "payment-api"
    environment = var.environment
  }
}

resource "aws_db_parameter_group" "main" {
  name   = "pg-payment-${var.environment}"
  family = "postgres15"

  # auto_explain must be preloaded; static parameter, applied at reboot
  parameter {
    name         = "shared_preload_libraries"
    value        = "pg_stat_statements,auto_explain"
    apply_method = "pending-reboot"
  }

  # Slow Query Logging
  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # Log queries slower than 1000ms
  }

  parameter {
    name  = "log_lock_waits"
    value = "1" # Log lock waits
  }

  parameter {
    name  = "auto_explain.log_min_duration"
    value = "5000" # Log execution plans for queries > 5s
  }
}
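Once the slow query log is exported to CloudWatch, the entries produced by `log_min_duration_statement` can be parsed mechanically rather than read by hand. A minimal sketch, assuming the standard PostgreSQL log shape `duration: <ms> ms  statement: <sql>` (the helper name and sample line are ours):

```python
import re

# PostgreSQL logs statements above the threshold as:
#   "... LOG:  duration: <ms> ms  statement: <sql>"
SLOW_QUERY_RE = re.compile(r"duration: (?P<ms>[\d.]+) ms\s+statement: (?P<sql>.*)")

def parse_slow_query(line: str):
    """Extract (duration_ms, statement) from a slow-query log line, or None."""
    m = SLOW_QUERY_RE.search(line)
    if not m:
        return None
    return float(m.group("ms")), m.group("sql").strip()

line = ("2026-03-18 09:12:01 UTC:10.0.1.5(4321):app@payment_prod:[991]: LOG:  "
        "duration: 1523.007 ms  statement: SELECT * FROM payments WHERE user_id = 42")
print(parse_slow_query(line))
# → (1523.007, 'SELECT * FROM payments WHERE user_id = 42')
```

Feeding these tuples into a histogram per normalized statement gives a cheap complement to Performance Insights, e.g. for alerting on a rising slow-query count.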
Step 2: Analyze Slow Queries
-- PostgreSQL: top 20 queries by total execution time
-- Requires the pg_stat_statements extension (CREATE EXTENSION IF NOT EXISTS pg_stat_statements;)
SELECT
    left(query, 100) AS query_snippet,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2) AS avg_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows,
    round(100.0 * total_exec_time / sum(total_exec_time) OVER (), 2) AS pct_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Identify missing indexes (sequential scans on large tables)
SELECT
    schemaname,
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    round(seq_tup_read::numeric / NULLIF(seq_scan, 0), 0) AS avg_rows_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
  AND seq_tup_read > 10000 -- Only tables with significant scans
ORDER BY seq_tup_read DESC;
-- Verify index candidates (frequent WHERE columns without an index)
-- Note: EXPLAIN ANALYZE executes the query, so use concrete values, not bind parameters
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM payments WHERE user_id = 42 AND status = 'pending';
-- → If "Seq Scan on payments" appears, an index on (user_id, status) is missing
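The "check the plan for a Seq Scan" step can be automated, for example as a CI gate over `EXPLAIN` text for the hot queries. A minimal sketch (the function name and the sample plan are ours; real `EXPLAIN` output contains the same `Seq Scan on <table>` nodes):

```python
def find_seq_scans(plan_text: str) -> list[str]:
    """Return the relation names that an EXPLAIN (FORMAT TEXT) plan scans sequentially."""
    marker = "Seq Scan on "
    tables = []
    for line in plan_text.splitlines():
        idx = line.find(marker)  # also matches "Parallel Seq Scan on ..."
        if idx != -1:
            # "Seq Scan on payments  (cost=...)" -> "payments"
            tables.append(line[idx + len(marker):].split()[0])
    return tables

plan = """\
Seq Scan on payments  (cost=0.00..1443274.00 rows=310 width=64)
  Filter: ((status = 'pending'::text) AND (user_id = 42))
"""
print(find_seq_scans(plan))
# → ['payments']
```

An empty result for every critical query is exactly the "no full table scans" property from the target state above.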
Step 3: Document the Index Strategy
# docs/database/index-strategy.yml
database: "payment-db"
last_reviewed: "2026-03-18"

tables:
  payments:
    row_count: "~50M"
    indexes:
      - columns: ["id"]
        type: "primary_key"
        rationale: "Primary key lookups"
      - columns: ["user_id"]
        type: "btree"
        rationale: "All payment queries filter by user_id"
        query_volume: "~5000 req/s"
      - columns: ["user_id", "status"]
        type: "btree"
        rationale: "Pending payments list – most common user query"
        query_volume: "~2000 req/s"
        created: "2025-08-15"
        notes: "Without this index: full table scan on 50M rows"
      - columns: ["created_at"]
        type: "btree"
        rationale: "Date range queries for reporting"
        used_by: "reporting-service"
    do_not_add:
      - columns: ["amount"]
        reason: "Never used as filter; would add write overhead"

review_schedule: "quarterly"
next_review: "2026-06-18"
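A documented strategy only stays useful if the review dates are enforced. A minimal sketch of such a check, assuming the `next_review` field shown above (the helper name is ours; a real check would load the YAML file instead of an inline dict):

```python
from datetime import date

def review_is_overdue(strategy: dict, today: date) -> bool:
    """Flag an index strategy whose next_review date has passed."""
    return date.fromisoformat(strategy["next_review"]) < today

# Fields taken from docs/database/index-strategy.yml above
strategy = {
    "last_reviewed": "2026-03-18",
    "review_schedule": "quarterly",
    "next_review": "2026-06-18",
}
print(review_is_overdue(strategy, today=date(2026, 7, 1)))
# → True
```

Run as a scheduled pipeline job, this turns "quarterly review" from an intention into an alert.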
Step 4: Create Index in Migration
-- Concurrent index creation (does not block writes; cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_payments_user_status
ON payments (user_id, status)
WHERE status IN ('pending', 'processing'); -- Partial index for active payments

-- Check index size and usage
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS scans,
    idx_tup_read AS rows_read,
    idx_tup_fetch AS rows_fetched
FROM pg_stat_user_indexes
WHERE relname = 'payments'
ORDER BY idx_scan DESC;
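The partial index above only contains rows matching its `WHERE` predicate, so the planner can use it only for queries whose own predicate implies that condition. A minimal sketch of this semantics (names are ours):

```python
# Mirrors the predicate of idx_payments_user_status:
#   WHERE status IN ('pending', 'processing')
ACTIVE_STATUSES = {"pending", "processing"}

def row_in_partial_index(status: str) -> bool:
    """True if a payments row with this status is present in the partial index."""
    return status in ACTIVE_STATUSES

print(row_in_partial_index("pending"))  # → True
print(row_in_partial_index("settled"))  # → False
```

Practical consequence: a query like `WHERE user_id = 42 AND status = 'settled'` cannot use this index at all, while the common "active payments" queries get a much smaller index than a full `(user_id, status)` one.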
Step 5: Connection Pooling with PgBouncer
# pgbouncer.ini
[databases]
payment_db = host=db-payment.cluster.local port=5432 dbname=payment_prod
[pgbouncer]
pool_mode = transaction # Transaction pooling for stateless APIs; session state (SET, prepared statements, advisory locks) needs care
listen_port = 5432
listen_addr = 0.0.0.0
max_client_conn = 1000 # Maximum client connections
default_pool_size = 20 # Connections to database per pool
min_pool_size = 5 # Minimum connections to keep warm
reserve_pool_size = 5 # Reserve for peaks
server_idle_timeout = 600
client_idle_timeout = 600
log_connections = 1
log_disconnections = 1
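The numbers in this config imply a specific multiplexing ratio, which is worth computing explicitly when sizing. A sketch of the arithmetic, under the assumption of one pool (i.e. one user/database pair) plus the shared reserve (function name is ours):

```python
def pgbouncer_capacity(pools: int, default_pool_size: int,
                       reserve_pool_size: int, max_client_conn: int) -> dict:
    """Server-side connection demand and client multiplexing ratio.

    PgBouncer keeps one pool per (user, database) pair; each pool may open up to
    default_pool_size server connections, plus a shared reserve under load.
    """
    server_conns = pools * default_pool_size + reserve_pool_size
    return {
        "max_server_connections": server_conns,
        "clients_per_server_conn": max_client_conn / server_conns,
    }

# Values from the pgbouncer.ini above: one pool, 20 + 5 connections, 1000 clients
cap = pgbouncer_capacity(pools=1, default_pool_size=20,
                         reserve_pool_size=5, max_client_conn=1000)
print(cap)
# → {'max_server_connections': 25, 'clients_per_server_conn': 40.0}
```

The database then only ever sees 25 connections, regardless of how many app instances connect, and `max_connections` on the server can be sized against that number rather than against the client count.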
Common Anti-Patterns
- Index on the wrong columns: an index on created_at alone, while queries always filter on user_id AND created_at.
- Too many indexes: every additional write operation must update all indexes → write overhead.
- Performance Insights only activated during incidents: too late; historical data is missing.
- max_connections = 100 with 50 app instances: 2 connections per instance – connection pool exhaustion is inevitable.
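The last anti-pattern is pure arithmetic and worth making explicit when reviewing pool settings (helper name is ours):

```python
def connections_per_instance(max_connections: int, app_instances: int) -> int:
    """Direct DB connections available to each app instance without pooling."""
    return max_connections // app_instances

# The anti-pattern above: 100 server connections shared by 50 instances
print(connections_per_instance(100, 50))
# → 2; any app-side pool sized above 2 will exhaust the server
```

This is exactly the situation a connection pooler such as PgBouncer resolves: clients share a small, fixed number of server connections instead of each instance claiming its own slice.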
Metrics
- P50/P95/P99 query execution time for the top 20 queries (document monthly)
- Connection pool utilization (target: < 80% sustained)
- DB CPU utilization (target: < 70% sustained)
- Number of slow queries > 1s per hour (target: declining trend)
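For the P50/P95/P99 metric, the percentiles can be computed from raw latency samples without any monitoring stack. A minimal sketch using the nearest-rank method (sample data and names are ours):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) of raw latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1  # nearest-rank index
    return ordered[k]

# Hypothetical query latencies in milliseconds
latencies_ms = [12, 15, 11, 230, 14, 13, 18, 900, 16, 12]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
# → 14 900
```

Note how a single 900 ms outlier dominates P95 while leaving P50 untouched – the reason the baseline should always record the tail percentiles, not just the mean.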
Maturity Level
Level 1 – No analysis; database performance unknown
Level 2 – Slow query log active; ad-hoc reviews
Level 3 – Performance Insights active; index strategy documented; monthly review
Level 4 – Automatic regression detection; query performance tests in CI
Level 5 – Query SLOs; automatic tuning; capacity forecasting