# Best Practice: Health Checks & Readiness/Liveness Probes

## Context
Health checks are the operational feedback mechanism between infrastructure and application. Without correct health checks, a load balancer keeps routing traffic to faulty instances, and Kubernetes never replaces a pod that is stuck in a broken state.
Common problems without a structured health check practice:

- Instances that fail during startup immediately receive traffic
- Deadlocked processes run for weeks without being restarted
- The load balancer routes to unstable backends without feedback to operations
- Zero-downtime deployments fail because readiness is not configured correctly
## Related Controls

- WAF-REL-020 – Health Checks & Readiness Probes Configured
## Target State

Every service exposes three health endpoints with defined semantics:

- `/health/live` – Liveness: is the process still alive? Failure → Kubernetes restarts the pod
- `/health/ready` – Readiness: can it accept traffic? Failure → no traffic is routed
- `/health/startup` – Startup: has it finished starting? (for slow-starting services)
## Technical Implementation

### Step 1: Health Check Endpoints in the Application
```python
# Python/FastAPI example
import os

import asyncpg
import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

DATABASE_URL = os.environ["DATABASE_URL"]
REDIS_URL = os.environ["REDIS_URL"]

app = FastAPI()


@app.get("/health/live")
async def liveness():
    """Liveness: process is running. Check external dependencies only if deadlock-relevant."""
    return {"status": "alive"}


@app.get("/health/ready")
async def readiness():
    """Readiness: service can accept traffic. Check critical dependencies."""
    checks = {}
    errors = []

    # Check database connection
    try:
        conn = await asyncpg.connect(dsn=DATABASE_URL)
        await conn.fetchval("SELECT 1")
        await conn.close()
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = "error"
        errors.append(f"database: {e}")

    # Check Redis connection (critical)
    try:
        r = redis.from_url(REDIS_URL)
        await r.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = "error"
        errors.append(f"cache: {e}")

    if errors:
        # A plain Response cannot serialize a dict; JSONResponse carries
        # both the JSON body and the 503 status code.
        return JSONResponse(
            content={"status": "not_ready", "checks": checks, "errors": errors},
            status_code=503,
        )
    return {"status": "ready", "checks": checks}
```
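The `/health/startup` endpoint from the target state follows the same pattern. A minimal, framework-agnostic sketch (the flag-based gate is an illustration, not part of the FastAPI example above): a module-level flag is flipped once one-time initialization such as DB migrations has finished, and the endpoint reports 503 until then.

```python
# Sketch of the /health/startup pattern: report 503 until one-time
# initialization has completed, then 200 forever after.

_startup_complete = False


def complete_startup():
    """Call once after one-time initialization (e.g. DB migrations) finishes."""
    global _startup_complete
    _startup_complete = True


def startup_check():
    """Return (status_code, body) for the /health/startup endpoint."""
    if _startup_complete:
        return 200, {"status": "started"}
    return 503, {"status": "starting"}
```

Because the flag only ever flips once, a startup probe that has succeeded never needs to be consulted again, which is exactly how Kubernetes treats `startupProbe`.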
### Step 2: Kubernetes Probe Configuration
```yaml
# kubernetes/deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: payment-api:1.5.0
          ports:
            - containerPort: 8080
          # Liveness: detects deadlocks and infinite loops
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15  # Measured: startup ~10s; buffer: 5s
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3      # 3 failures = 30s window before restart
          # Readiness: controls traffic routing
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            successThreshold: 1
            failureThreshold: 3      # 3 failures = 15s without traffic
          # Startup: for slow startup processes (DB migrations etc.)
          startupProbe:
            httpGet:
              path: /health/live
              port: 8080
            failureThreshold: 30     # Max 30 * 10s = 5 minutes startup
            periodSeconds: 10
```
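The comments in the manifest above all derive from the same arithmetic: the window a probe tolerates is `periodSeconds * failureThreshold`. A small sketch makes the numbers checkable:

```python
# The window a Kubernetes probe tolerates before acting is simply
# periodSeconds * failureThreshold (restart for liveness, traffic
# removal for readiness, container kill for startup).

def probe_window(period_seconds: int, failure_threshold: int) -> int:
    """Seconds of consecutive failures before the probe is considered failed."""
    return period_seconds * failure_threshold


print(probe_window(10, 3))   # liveness above: 30s before restart
print(probe_window(5, 3))    # readiness above: 15s without traffic
print(probe_window(10, 30))  # startup above: 300s = 5 minutes max startup
```

When tuning, change one knob at a time: a shorter `periodSeconds` detects faster but generates more probe load; a higher `failureThreshold` tolerates longer transient blips.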
### Step 3: AWS ALB Target Group Health Check (Terraform)
```hcl
resource "aws_lb_target_group" "api" {
  name        = "payment-api-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip" # For ECS/EKS

  health_check {
    enabled             = true
    path                = "/health/ready" # Not "/": checks real readiness
    port                = "traffic-port"
    protocol            = "HTTP"
    interval            = 15 # More frequent than the default 30s
    timeout             = 5
    healthy_threshold   = 2 # 2 successes → healthy
    unhealthy_threshold = 3 # 3 failures → unhealthy
    matcher             = "200" # Exactly 200 OK only
  }

  deregistration_delay = 30 # Graceful shutdown window

  tags = var.mandatory_tags
}
```
### Step 4: GCP Cloud Run Health Check
```hcl
resource "google_cloud_run_v2_service" "api" {
  name     = "payment-api"
  location = var.region

  template {
    containers {
      image = var.container_image

      ports {
        container_port = 8080
      }

      startup_probe {
        http_get {
          path = "/health/live"
          port = 8080
        }
        initial_delay_seconds = 5
        period_seconds        = 10
        failure_threshold     = 6 # 60s max startup
        timeout_seconds       = 3
      }

      liveness_probe {
        http_get {
          path = "/health/live"
          port = 8080
        }
        initial_delay_seconds = 15
        period_seconds        = 15
        failure_threshold     = 3
        timeout_seconds       = 5
      }
    }
  }
}
```
## Typical Anti-Patterns

- Liveness checks dependencies: if the DB is briefly unreachable, all pods are restarted – a mass failure
- initialDelaySeconds = 0: the probe fails during startup, the pod is restarted endlessly
- Health check on port 80 instead of the app port: only checks whether the HTTP port is open, not whether the app responds
- TCP health check for an HTTP service: verifies the TCP connection, but not application state
- Health endpoint always returns 200: the endpoint reports OK even when the DB connection fails
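The last anti-pattern can be made concrete: a readiness check must map dependency failures to a non-200 status instead of swallowing them. A minimal sketch with an injected dependency check (the `check_db` callable is a hypothetical stand-in for the asyncpg check in Step 1):

```python
def ready_status(check_db) -> int:
    """Return the HTTP status /health/ready should produce.

    check_db is any callable that raises an exception on failure.
    """
    try:
        check_db()
        return 200
    except Exception:
        return 503  # the anti-pattern would return 200 here


# Healthy dependency -> 200
print(ready_status(lambda: None))  # 200


# A failing dependency must surface as 503, not be hidden
def broken_db():
    raise ConnectionError("db unreachable")


print(ready_status(broken_db))  # 503
```

A unit test exercising both branches is cheap insurance against this anti-pattern creeping back in during refactoring.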
## Metrics

- Health Check Pass Rate: % of health checks that succeed (target: > 99.5%)
- Probe Failure Rate: number of liveness probe failures per hour (target: < 1)
- Readiness Recovery Time: time from pod start until the first successful readiness probe
- Unhealthy Target Duration: average time a backend spends in the unhealthy state
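The pass-rate metric falls out of simple probe-result counts; a sketch of the computation and the target comparison:

```python
def pass_rate(successes: int, failures: int) -> float:
    """Health Check Pass Rate in percent; 100.0 when no probes ran yet."""
    total = successes + failures
    return 100.0 * successes / total if total else 100.0


# Example: 9990 of 10000 probes succeeded
rate = pass_rate(9990, 10)
print(rate)          # 99.9
print(rate > 99.5)   # True -> above the target
```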
## Maturity Levels

- Level 1 – No health checks, no probes configured
- Level 2 – Basic LB health check on "/" configured
- Level 3 – ReadinessProbe and LivenessProbe with measured delays; LB checks /health/ready
- Level 4 – StartupProbe for slow services; deep health checks with dependency status
- Level 5 – Synthetic monitoring validates health check endpoints externally