WAF++

Best Practice: Optimize Database Performance

Context

Database performance problems are the most common cause of application latency degradation. A single missing index on a high-traffic table can saturate I/O, exhaust connection pools, and trigger cascading timeouts throughout the entire application.

Typical problems without structured database performance management:

  • Full table scans on million-row tables in the critical query path

  • Connection pool exhaustion caused by long-running queries that block connections

  • Query performance regressions after schema changes silently reaching production

  • Performance Insights not active – diagnosis only possible through incident-driven analysis

Related WAF++ requirements:

  • WAF-PERF-040 – Database Performance Baseline & Index Strategy

  • WAF-PERF-090 – Storage I/O Performance & Throughput Optimization

Target State

Performance-mature database environment:

  • Baseline known: P50/P95/P99 execution times for the top 20 queries documented

  • Indexes strategic: All high-frequency queries use indexes; no full table scans

  • Monitoring active: Performance Insights and slow query log running continuously

  • Connection pools: Explicitly dimensioned and protected with PgBouncer/ProxySQL

Technical Implementation

Step 1: Activate Performance Insights (AWS RDS)

resource "aws_db_instance" "main" {
  identifier              = "db-payment-${var.environment}"
  engine                  = "postgres"
  engine_version          = "15.4"
  instance_class          = "db.t3.medium"
  allocated_storage       = 100
  storage_type            = "gp3"
  iops                    = 3000

  # Performance Insights
  performance_insights_enabled          = true
  performance_insights_retention_period = 7   # 7 days free; 731 days paid

  # Enhanced Monitoring
  monitoring_interval = 60  # seconds
  monitoring_role_arn = aws_iam_role.rds_monitoring.arn

  # Slow Query Logs
  enabled_cloudwatch_logs_exports = ["postgresql"]
  parameter_group_name            = aws_db_parameter_group.main.name

  # Security
  deletion_protection    = true
  skip_final_snapshot    = false
  multi_az               = true

  tags = {
    workload    = "payment-api"
    environment = var.environment
  }
}

resource "aws_db_parameter_group" "main" {
  name   = "pg-payment-${var.environment}"
  family = "postgres15"

  # Slow Query Logging
  parameter {
    name  = "log_min_duration_statement"
    value = "1000"  # Log queries slower than 1000ms
  }
  parameter {
    name  = "log_lock_waits"
    value = "1"     # Log lock waits
  }
  parameter {
    name  = "auto_explain.log_min_duration"
    value = "5000"  # EXPLAIN ANALYZE for queries > 5s
  }
}
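
The `monitoring_role_arn` above references an IAM role that is not defined in this snippet. A minimal sketch of what it could look like (resource and role names are assumptions), using the AWS-managed AmazonRDSEnhancedMonitoringRole policy:

```hcl
# Hypothetical definition of the role referenced by monitoring_role_arn above.
resource "aws_iam_role" "rds_monitoring" {
  name = "rds-enhanced-monitoring-${var.environment}"

  # Enhanced Monitoring is delivered by the RDS monitoring service principal
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "monitoring.rds.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "rds_monitoring" {
  role       = aws_iam_role.rds_monitoring.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonRDSEnhancedMonitoringRole"
}
```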

Step 2: Analyze Slow Queries

-- PostgreSQL: Top-20 queries by total execution time
SELECT
    left(query, 100) as query_snippet,
    calls,
    round(total_exec_time::numeric, 2) as total_ms,
    round(mean_exec_time::numeric, 2) as avg_ms,
    round(stddev_exec_time::numeric, 2) as stddev_ms,
    rows,
    round(100.0 * total_exec_time / sum(total_exec_time) over(), 2) as pct_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Identify missing indexes (sequential scans on large tables)
SELECT
    schemaname,
    relname as tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    round(seq_tup_read::numeric / NULLIF(seq_scan, 0), 0) as avg_rows_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
  AND seq_tup_read > 10000  -- Only tables with significant scans
ORDER BY seq_tup_read DESC;

-- Best index candidates (frequent WHERE columns without index)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
-- (ANALYZE actually executes the query, so use a concrete literal, not $1)
SELECT * FROM payments WHERE user_id = 42 AND status = 'pending';
-- → If "Seq Scan" appears: an index on (user_id, status) is missing!
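
The "Seq Scan" check can also be scripted; a minimal sketch (illustrative, not an official tool) that extracts sequentially scanned tables from EXPLAIN's text output:

```python
def find_seq_scans(plan_text: str) -> list[str]:
    """Return table names that EXPLAIN text output reports as sequentially scanned."""
    tables = []
    for line in plan_text.splitlines():
        # Plan nodes may be indented and prefixed with "->"
        line = line.strip().lstrip("-> ")
        if line.startswith("Seq Scan on "):
            # Line looks like: "Seq Scan on payments  (cost=... rows=... width=...)"
            tables.append(line[len("Seq Scan on "):].split()[0])
    return tables

plan = """Seq Scan on payments  (cost=0.00..1834.00 rows=112 width=64)
  Filter: ((user_id = 42) AND (status = 'pending'::text))"""
print(find_seq_scans(plan))  # → ['payments']
```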

Step 3: Document the Index Strategy

# docs/database/index-strategy.yml
database: "payment-db"
last_reviewed: "2026-03-18"

tables:
  payments:
    row_count: "~50M"
    indexes:
      - columns: ["id"]
        type: "primary_key"
        rationale: "Primary key lookups"

      - columns: ["user_id"]
        type: "btree"
        rationale: "All payment queries filter by user_id"
        query_volume: "~5000 req/s"

      - columns: ["user_id", "status"]
        type: "btree"
        rationale: "Pending payments list – most common user query"
        query_volume: "~2000 req/s"
        created: "2025-08-15"
        notes: "Without this index: full table scan on 50M rows"

      - columns: ["created_at"]
        type: "btree"
        rationale: "Date range queries for reporting"
        used_by: "reporting-service"

    do_not_add:
      - columns: ["amount"]
        reason: "Never used as filter; would add write overhead"

review_schedule: "quarterly"
next_review: "2026-06-18"
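
A documented strategy only helps if the reviews actually happen. A minimal sketch that flags an overdue review (loading the YAML is stubbed with a dict here; field names follow the file above):

```python
from datetime import date

# In practice this dict would come from yaml.safe_load() on index-strategy.yml.
strategy = {
    "database": "payment-db",
    "last_reviewed": "2026-03-18",
    "next_review": "2026-06-18",
}

def review_overdue(strategy: dict, today: date) -> bool:
    """True if the documented next_review date has already passed."""
    return date.fromisoformat(strategy["next_review"]) < today

print(review_overdue(strategy, date(2026, 7, 1)))  # → True
print(review_overdue(strategy, date(2026, 5, 1)))  # → False
```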

Step 4: Create Index in Migration

-- Concurrent index creation (no table lock in PostgreSQL)
CREATE INDEX CONCURRENTLY idx_payments_user_status
ON payments (user_id, status)
WHERE status IN ('pending', 'processing');  -- Partial index for active payments

-- Check index size and efficiency
SELECT
    indexrelname as indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
    idx_scan as scans,
    idx_tup_read as rows_read,
    idx_tup_fetch as rows_fetched
FROM pg_stat_user_indexes
WHERE relname = 'payments'
ORDER BY idx_scan DESC;
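
The same statistics reveal the opposite problem: indexes that are never scanned but still pay write overhead on every INSERT/UPDATE. A small illustrative sketch (row shape mirrors pg_stat_user_indexes; the sample data is hypothetical):

```python
def unused_indexes(index_stats: list[dict], min_scans: int = 1) -> list[str]:
    """Indexes scanned fewer than min_scans times -- drop candidates,
    since every index must be maintained on each write."""
    return [s["indexrelname"] for s in index_stats if s["idx_scan"] < min_scans]

stats = [  # shape mirrors pg_stat_user_indexes rows; numbers are illustrative
    {"indexrelname": "payments_pkey", "idx_scan": 981234},
    {"indexrelname": "idx_payments_user_status", "idx_scan": 55012},
    {"indexrelname": "idx_payments_amount", "idx_scan": 0},
]
print(unused_indexes(stats))  # → ['idx_payments_amount']
```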

Step 5: Connection Pooling with PgBouncer

# pgbouncer.ini
[databases]
payment_db = host=db-payment.cluster.local port=5432 dbname=payment_prod

[pgbouncer]
pool_mode = transaction     # Transaction pooling for stateless APIs
                            # Note: does not preserve session state (SET, LISTEN,
                            # session-level prepared statements)
listen_port = 5432
listen_addr = 0.0.0.0
max_client_conn = 1000      # Maximum client connections
default_pool_size = 20      # Connections to database per pool
min_pool_size = 5           # Minimum connections to keep warm
reserve_pool_size = 5       # Reserve for peaks
server_idle_timeout = 600
client_idle_timeout = 600
log_connections = 1
log_disconnections = 1

Common Anti-Patterns

  • Index on wrong columns: Index on created_at alone, but queries always filter on user_id AND created_at.

  • Too many indexes: Every additional write operation must update all indexes → write overhead.

  • Performance Insights only activated during incidents: Too late; historical data is missing.

  • max_connections = 100 with 50 app instances: 2 connections per instance – connection pool exhaustion is inevitable.

Metrics

  • P50/P95/P99 query execution time for the top 20 queries (document monthly)

  • Connection pool utilization (target: < 80% sustained)

  • DB CPU utilization (target: < 70% sustained)

  • Number of slow queries > 1s per hour (target: declining trend)
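
The P50/P95/P99 figures can be derived from raw execution times. A minimal sketch using nearest-rank percentiles (the exact method is an assumption -- Performance Insights and APM tools each use their own; the sample durations are hypothetical):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of query durations (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

durations_ms = [12, 15, 14, 13, 250, 16, 11, 14, 900, 15]  # hypothetical samples
for p in (50, 95, 99):
    print(f"P{p}: {percentile(durations_ms, p)} ms")
# → P50: 14 ms / P95: 900 ms / P99: 900 ms -- two outliers dominate the tail
```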

Maturity Level

Level 1 – No analysis; database performance unknown
Level 2 – Slow query log active; ad-hoc reviews
Level 3 – Performance Insights active; index strategy documented; monthly review
Level 4 – Automatic regression detection; query performance tests in CI
Level 5 – Query SLOs; automatic tuning; capacity forecasting