WAF++

Best Practice: Implement a Caching Strategy

Context

Caching is one of the most effective performance optimizations. A well-configured cache can reduce database load by 60–90% and improve P99 latency by an order of magnitude. Without a caching strategy, however, new problems frequently arise: stale data, stampede issues, and hard-to-diagnose inconsistencies.

Typical problems without a caching strategy:

  • Identical database queries are executed hundreds of times per second

  • Cache entries expire simultaneously → thundering herd → database overload

  • Caching mutable data without invalidation logic → stale data

  • In-process cache not shared between instances → no consistent caching during scale-out

Target State

A mature caching strategy:

  • Multi-layer: L1 (in-process), L2 (distributed/Redis), L3 (CDN)

  • Documented: Which data, which TTL, which invalidation logic

  • Measured: Cache hit rates are continuously monitored

  • Secure: No security-relevant decisions are made from the cache

Technical Implementation

Step 1: Caching Strategy Document

# docs/caching-strategy.yml
version: "1.0"
service: "payment-api"

cache_layers:
  l1_application:
    technology: "in-process (python dict + TTL)"
    use_cases:
      - "Static configuration values (TTL: 1h)"
      - "Country-code to currency mappings (TTL: 24h)"
    limitations: "Not shared between instances; cache per pod"

  l2_distributed:
    technology: "AWS ElastiCache Redis (STANDARD_HA)"
    use_cases:
      - "User session data (TTL: 30min)"
      - "Payment method lookups (TTL: 15min)"
      - "Rate limiting counters (TTL: 1min)"
      - "Idempotency keys (TTL: 24h)"
    hit_rate_target: ">= 80%"

  l3_cdn:
    technology: "AWS CloudFront"
    use_cases:
      - "Static assets: JS, CSS, images (TTL: 1 year, cache-busting via hash)"
      - "API responses with Cache-Control header (TTL: per endpoint)"
    hit_rate_target: ">= 95% for static assets"

do_not_cache:
  - "Real-time payment status (mutates frequently)"
  - "Authentication/authorization decisions"
  - "User-specific sensitive financial data"
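The L1 layer declared above (in-process dict + TTL) can be sketched in a few lines of stdlib Python. The class name and keys below are illustrative, not part of the strategy document:

```python
import time
from typing import Any, Optional

class L1Cache:
    """Minimal in-process TTL cache (per pod, not shared between instances)."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[Any, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # Lazily evict expired entries on read
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (value, time.monotonic() + ttl_seconds)

# Usage: country-code -> currency mapping with a 24h TTL, as in the strategy doc
config_cache = L1Cache()
config_cache.set("country:DE:currency", "EUR", ttl_seconds=24 * 3600)
```

Lazy eviction keeps the sketch simple; a production version would also cap entry count to bound memory per pod.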

Step 2: ElastiCache Redis in Terraform

resource "aws_elasticache_subnet_group" "main" {
  name       = "cache-subnet-group-${var.environment}"
  subnet_ids = var.private_subnet_ids
}

resource "aws_elasticache_replication_group" "main" {
  replication_group_id       = "payment-cache-${var.environment}"
  description                = "Redis cache for payment API – see docs/caching-strategy.yml"
  node_type                  = "cache.t3.medium"
  num_cache_clusters         = 2        # Primary + 1 Replica
  automatic_failover_enabled = true
  multi_az_enabled           = true
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_auth_token  # TLS + Auth

  # Snapshot for debugging and warm-up
  snapshot_retention_limit = 1
  snapshot_window          = "03:00-04:00"

  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.cache.id]

  tags = {
    workload    = "payment-api"
    environment = var.environment
  }
}
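Because the replication group enables in-transit encryption plus an auth token, clients must connect over TLS, i.e. with a rediss:// URL. A sketch of assembling it; the environment variable names and database index are assumptions:

```python
import os
from urllib.parse import quote

def build_redis_url(host: str, port: int, auth_token: str) -> str:
    """rediss:// = Redis over TLS, matching transit_encryption_enabled = true."""
    # Percent-encode the token so special characters survive URL parsing
    return f"rediss://:{quote(auth_token, safe='')}@{host}:{port}/0"

REDIS_URL = build_redis_url(
    host=os.environ.get("CACHE_PRIMARY_ENDPOINT", "localhost"),
    port=6379,
    auth_token=os.environ.get("REDIS_AUTH_TOKEN", ""),
)
```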

Step 3: Cache-Aside Pattern in Python

import redis
import json
import time
from typing import Optional, Callable

class CacheManager:
    def __init__(self, redis_url: str, default_ttl: int = 300):
        self.redis = redis.from_url(redis_url, decode_responses=True)
        self.default_ttl = default_ttl

    def get(self, key: str) -> Optional[dict]:
        value = self.redis.get(key)
        return json.loads(value) if value else None

    def set(self, key: str, value: dict, ttl: Optional[int] = None) -> None:
        self.redis.setex(
            key,
            ttl or self.default_ttl,
            json.dumps(value)
        )

    def invalidate(self, pattern: str) -> int:
        """Delete all keys matching the pattern.

        Note: KEYS is O(N) and blocks Redis; prefer SCAN in production.
        """
        keys = self.redis.keys(pattern)
        if keys:
            return self.redis.delete(*keys)
        return 0

    def get_or_compute(self, key: str, compute_fn: Callable[[], dict],
                       ttl: Optional[int] = None) -> dict:
        """Cache-aside with stampede protection via a short-lived lock."""
        # Cache hit
        cached = self.get(key)
        if cached is not None:
            return cached

        # Lock against stampede (nx=True: only set if key does not exist)
        lock_key = f"lock:{key}"
        lock_acquired = self.redis.set(lock_key, "1", nx=True, ex=10)

        if lock_acquired:
            try:
                value = compute_fn()
                self.set(key, value, ttl)
                return value
            finally:
                self.redis.delete(lock_key)
        else:
            # Wait briefly and re-read (another worker is filling the cache)
            time.sleep(0.05)
            return self.get(key) or compute_fn()

# Usage
cache = CacheManager(redis_url=REDIS_URL)

def get_payment_method(user_id: str) -> dict:
    cache_key = f"payment_method:user:{user_id}"
    return cache.get_or_compute(
        key=cache_key,
        compute_fn=lambda: db.query_payment_method(user_id),
        ttl=900  # 15 minute TTL
    )

# Cache invalidation on data mutation
def update_payment_method(user_id: str, data: dict) -> dict:
    result = db.update_payment_method(user_id, data)
    cache.invalidate(f"payment_method:user:{user_id}")  # Invalidate immediately
    return result
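The lock above protects a single hot key; to keep many keys written together from expiring in the same instant (the 3 a.m. scenario under anti-patterns below), a common complement is adding jitter to the TTL. A minimal sketch:

```python
import random

def ttl_with_jitter(base_ttl: int, spread: float = 0.1) -> int:
    """Randomize the TTL by +/- spread so co-written entries don't expire together."""
    delta = int(base_ttl * spread)
    return base_ttl + random.randint(-delta, delta)

# Usage: instead of a fixed 900s TTL for payment method lookups, pass e.g.
#   cache.get_or_compute(key, compute_fn, ttl=ttl_with_jitter(900))
# so expirations spread over 810-990 seconds.
```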

Step 4: CDN Cache Rules in Terraform (CloudFront)

resource "aws_cloudfront_distribution" "api" {
  enabled             = true
  default_root_object = "index.html"

  # Static assets: long-term caching with cache-busting
  ordered_cache_behavior {
    path_pattern     = "/static/*"
    allowed_methods  = ["GET", "HEAD"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "s3-static"

    forwarded_values {
      query_string = false  # No query strings for static assets
      cookies { forward = "none" }
    }

    min_ttl     = 86400       # 1 day minimum
    default_ttl = 31536000    # 1 year default
    max_ttl     = 31536000    # 1 year maximum
    compress    = true
  }

  # API responses: short caching with Cache-Control header
  default_cache_behavior {
    allowed_methods  = ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "api-origin"

    forwarded_values {
      query_string = true
      headers      = ["Authorization", "Accept", "Origin"]  # Vary headers
      cookies { forward = "none" }
    }

    min_ttl     = 0
    default_ttl = 0       # No default caching for APIs
    max_ttl     = 60      # Max 1 minute if Cache-Control: max-age=60
  }
}
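The default behavior honors the origin's Cache-Control header up to max_ttl = 60, so the API origin must emit that header per endpoint. A sketch of such a mapping; the endpoint paths and TTLs are illustrative:

```python
# Per-endpoint cache policy emitted by the API origin; CloudFront caps max-age
# at the distribution's max_ttl (60s). Entries mirror docs/caching-strategy.yml.
CACHE_POLICY = {
    "/v1/currencies": "public, max-age=60",      # Rarely changes, safe to share
    "/v1/payment-status": "no-store",            # do_not_cache: mutates frequently
    "/v1/payment-methods": "private, no-store",  # User-specific sensitive data
}

def cache_control_for(path: str) -> str:
    # Default to no-store so uncategorized endpoints are never cached by the CDN
    return CACHE_POLICY.get(path, "no-store")
```

Defaulting to no-store makes new endpoints opt in to caching explicitly, matching the do_not_cache list in the strategy document.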

Cache Monitoring with CloudWatch

resource "aws_cloudwatch_metric_alarm" "cache_hit_rate_low" {
  alarm_name          = "payment-cache-hit-rate-low"
  alarm_description   = "Cache hit rate < 70% – investigate access patterns. See docs/caching-strategy.yml"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 3
  threshold           = 70

  metric_query {
    id          = "hit_rate"
    label       = "Cache Hit Rate %"
    expression  = "100 * hits / (hits + misses)"
    return_data = true  # The alarm evaluates this expression
  }
  metric_query {
    id = "hits"
    metric {
      namespace   = "AWS/ElastiCache"
      metric_name = "CacheHits"
      period      = 300
      stat        = "Sum"
      dimensions = {
        # CacheHits is a node-level metric; reference a member cluster, not the group
        CacheClusterId = "${aws_elasticache_replication_group.main.id}-001"
      }
    }
  }
  metric_query {
    id = "misses"
    metric {
      namespace   = "AWS/ElastiCache"
      metric_name = "CacheMisses"
      period      = 300
      stat        = "Sum"
      dimensions = {
        CacheClusterId = "${aws_elasticache_replication_group.main.id}-001"
      }
    }
  }

  alarm_actions = [aws_sns_topic.ops.arn]
}
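The alarm's expression is plain arithmetic, so the same check can be reproduced locally against exported metric sums, e.g. when tuning the 70% threshold:

```python
def hit_rate_percent(hits: int, misses: int) -> float:
    """Mirrors the CloudWatch expression 100 * hits / (hits + misses)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

# Example: 8,400 hits and 1,600 misses in a 5-minute window
rate = hit_rate_percent(8400, 1600)   # 84.0 -> above the 70% alarm threshold
alarm = rate < 70                      # False: no alarm fires
```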

Common Anti-Patterns

  • Caching security-relevant decisions: Permissions must never come from the cache. Always check fresh.

  • TTLs too long for mutable data: Price data with 24h TTL – users see outdated prices.

  • No stampede protection: Cache expires at 3 a.m. → all requests recompute simultaneously → database overload.

  • Caching without monitoring: Hit rate never measured → false confidence in cache effectiveness.

Metrics

  • Cache hit rate (target: >= 80% for application cache, >= 95% for CDN static)

  • Cache latency P99 (target: < 1ms for Redis lookups in the same AZ)

  • Number of evictions per hour (high eviction rate = cache too small)

  • Memory utilization (target: < 80% for headroom)

Maturity Level

Level 1 – No cache; all requests to origin
Level 2 – Ad-hoc caching without strategy; arbitrary TTLs
Level 3 – Documented strategy; hit rates >= 80%; CDN active
Level 4 – Automatic cache invalidation; stampede protection
Level 5 – Adaptive TTLs; ML-assisted cache warming