WAF++

Best Practice: Health Checks & Readiness/Liveness Probes

Context

Health checks are the operational feedback mechanism between infrastructure and application. Without correct health checks, a load balancer keeps routing traffic to faulty instances, and Kubernetes cannot detect that a pod is stuck in a faulty state and needs to be restarted.

Common problems without a structured health check practice:

  • Instances with an error in the startup process immediately receive traffic

  • Deadlocked processes run for weeks without being restarted

  • Load balancer routes to unstable backends without feedback to operations

  • Zero-downtime deployments fail because readiness is not configured correctly

Related requirement: WAF-REL-020 – Health Checks & Readiness Probes Configured

Target State

Every service exposes three health endpoints with defined semantics:

  • /health/live – Liveness: Is the process still alive? Failure → Kubernetes restarts

  • /health/ready – Readiness: Can it accept traffic? Failure → no traffic

  • /health/startup – Startup: Has it finished starting? (for slow services)

Technical Implementation

Step 1: Health Check Endpoints in the Application

# Python/FastAPI example
import os

import asyncpg
import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

DATABASE_URL = os.environ["DATABASE_URL"]
REDIS_URL = os.environ["REDIS_URL"]

app = FastAPI()

@app.get("/health/live")
async def liveness():
    """Liveness: process is running. Check external dependencies only if deadlock-relevant."""
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    """Readiness: service can accept traffic. Check critical dependencies."""
    checks = {}
    errors = []

    # Check database connection
    try:
        conn = await asyncpg.connect(dsn=DATABASE_URL)
        await conn.fetchval("SELECT 1")
        await conn.close()
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = "error"
        errors.append(f"database: {e}")

    # Check Redis connection (critical)
    try:
        r = redis.from_url(REDIS_URL)
        await r.ping()
        await r.aclose()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = "error"
        errors.append(f"cache: {e}")

    if errors:
        # A plain dict is not valid Response content; JSONResponse serializes it
        return JSONResponse(
            content={"status": "not_ready", "checks": checks, "errors": errors},
            status_code=503,
        )

    return {"status": "ready", "checks": checks}

Step 2: Kubernetes Probe Configuration

# kubernetes/deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: payment-api:1.5.0
          ports:
            - containerPort: 8080

          # Liveness: detects deadlocks and infinite loops
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15    # Measured: startup ~10s; buffer: 5s
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3        # 3 failures = 30s window before restart

          # Readiness: controls traffic routing
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            successThreshold: 1
            failureThreshold: 3        # 3 failures = 15s without traffic

          # Startup: for slow startup processes (DB migration etc.)
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            failureThreshold: 30       # Max 30 * 10s = 5 minutes startup
            periodSeconds: 10
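The comments in the manifest encode worst-case timing windows; a quick sanity check of that arithmetic (values copied from the probes above):

```python
# Values from the livenessProbe / startupProbe above
liveness_period = 10        # periodSeconds
liveness_failures = 3       # failureThreshold
startup_period = 10
startup_failures = 30

# Worst-case time a deadlocked container keeps running before restart:
# failureThreshold consecutive failures, one probe per period.
restart_window = liveness_period * liveness_failures
print(restart_window)       # 30 seconds

# Maximum startup time tolerated before the pod is killed.
max_startup = startup_period * startup_failures
print(max_startup)          # 300 seconds = 5 minutes
```

Probe timeouts add to these windows in the worst case, so treat them as lower bounds when tuning.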

Step 3: AWS ALB Target Group Health Check (Terraform)

resource "aws_lb_target_group" "api" {
  name             = "payment-api-tg"
  port             = 8080
  protocol         = "HTTP"
  vpc_id           = var.vpc_id
  target_type      = "ip"  # For ECS/EKS

  health_check {
    enabled             = true
    path                = "/health/ready"     # Not "/": checks real readiness
    port                = "traffic-port"
    protocol            = "HTTP"
    interval            = 15                  # More frequent than default 30s
    timeout             = 5
    healthy_threshold   = 2                   # 2 successes → healthy
    unhealthy_threshold = 3                   # 3 failures → unhealthy
    matcher             = "200"               # Exactly 200 OK only
  }

  deregistration_delay = 30  # Graceful shutdown

  tags = var.mandatory_tags
}

Step 4: GCP Cloud Run Health Check

resource "google_cloud_run_v2_service" "api" {
  name     = "payment-api"
  location = var.region

  template {
    containers {
      image = var.container_image
      ports {
        container_port = 8080
      }

      startup_probe {
        http_get {
          path = "/health/startup"
          port = 8080
        }
        initial_delay_seconds = 5
        period_seconds        = 10
        failure_threshold     = 6  # 60s max startup
        timeout_seconds       = 3
      }

      liveness_probe {
        http_get {
          path = "/health/live"
          port = 8080
        }
        initial_delay_seconds = 15
        period_seconds        = 15
        failure_threshold     = 3
        timeout_seconds       = 5
      }
    }
  }
}

Typical Anti-Patterns

  • Liveness checks dependencies: if the DB is briefly unreachable, every pod is restarted simultaneously – a mass outage

  • initialDelaySeconds = 0: Probe fails during startup, pod is restarted endlessly

  • Health check on port 80 instead of app port: Only checks whether HTTP port is open, not whether app responds

  • TCP health check for HTTP service: Checks TCP connection, but not application state

  • Health endpoint always returns 200: Endpoint returns OK even when DB connection fails

Metrics

  • Health Check Pass Rate: % of health checks that succeed (target: > 99.5%)

  • Probe Failure Rate: Number of liveness probe failures per hour (target: < 1)

  • Readiness Recovery Time: Time from pod start until first successful readiness probe

  • Unhealthy Target Duration: Average time a backend spends in the unhealthy state
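For the pass-rate target, the computation is straightforward; a small helper (counter names are hypothetical) that evaluates a measured rate against the 99.5 % threshold:

```python
def health_check_pass_rate(successes: int, total: int) -> float:
    """Return the health check pass rate in percent; 0.0 if no checks ran yet."""
    if total == 0:
        return 0.0
    return 100.0 * successes / total

# Example: 9,980 of 10,000 checks passed in the evaluation window
rate = health_check_pass_rate(9_980, 10_000)
print(round(rate, 2))   # 99.8
print(rate >= 99.5)     # True -> within target
```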

Maturity Level

Level 1 – No health checks, no probe configured
Level 2 – Basic LB health check on "/" configured
Level 3 – ReadinessProbe and LivenessProbe with measured delays; LB checks /health/ready
Level 4 – StartupProbe for slow services; deep health checks with dependency status
Level 5 – Synthetic monitoring validates health check endpoints externally