Health Checks and Zero-Downtime Deployments with Docker Compose
Implement robust health checks and rolling updates for production environments without downtime. Practical examples with NGINX, PostgreSQL, and application containers.
Jean-Pierre Broeders
Freelance DevOps Engineer
Production environments run 24/7. Containers can crash, databases can lock up, and updates need to happen without users noticing anything. Health checks and zero-downtime deployments form the foundation for reliable systems.
Why Health Checks Are Essential
A container can be running without the application inside actually working. The database might be unreachable. The API could be throwing timeout errors. A simple docker ps only shows whether the process is active — not whether it's healthy.
Health checks detect these issues automatically. When a health check fails, the container can restart, or the load balancer routes traffic to healthy instances.
Health Check Configuration
A basic health check for a web application looks like this:
```yaml
services:
  web:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
What happens?
- Every 30 seconds the endpoint gets checked
- The check can take a maximum of 10 seconds
- After 3 failed attempts the container is marked "unhealthy"
- Failed checks during the first 40 seconds are ignored (startup time)
For databases, a different approach works better:
```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s
```
pg_isready checks whether PostgreSQL actually accepts connections. This is more reliable than just checking if the process is running.
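The same check can be run by hand to verify it behaves as expected before relying on it (assuming the service is named postgres, as above):

```shell
# Run the health check command manually inside the running container.
docker compose exec postgres pg_isready -U postgres

# Exit code 0 means the server accepts connections; 1 means it is rejecting
# them (e.g. still starting up); 2 means it did not respond at all.
```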
Dependencies Between Services
Applications often start too quickly, before the database is ready. This causes startup crashes. The depends_on option combines well with health checks:
```yaml
services:
  web:
    image: myapp:latest
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/ready"]
      interval: 15s
      timeout: 5s
      retries: 3

  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      interval: 10s

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
```
The web application only starts when both PostgreSQL and Redis are healthy. This prevents race conditions and unnecessary crashes.
Zero-Downtime Deployments
Updates without downtime require a rolling update strategy. Docker Compose doesn't support this by default, but there are two practical solutions.
Option 1: Blue-Green with Proxy
Run two identical stacks behind a load balancer. Update one while the other handles traffic:
```yaml
services:
  web-blue:
    image: myapp:v1
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s

  web-green:
    image: myapp:v2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s

  nginx:
    image: nginx:alpine
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "80:80"
    depends_on:
      web-blue:
        condition: service_healthy
      web-green:
        condition: service_healthy
```
NGINX routes traffic to healthy containers. A simple configuration:
```nginx
upstream backend {
    server web-blue:8080 max_fails=3 fail_timeout=30s;
    server web-green:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_500 http_502 http_503;
    }

    location /health {
        access_log off;
        return 200 "OK";
    }
}
```
During deployment, first web-green is stopped and replaced with the new version. NGINX automatically routes everything to web-blue. Once web-green is healthy, web-blue gets replaced.
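The rollout described above can be scripted. A sketch, assuming the service names from the compose file and Docker Compose v2; the `wait_healthy` helper is illustrative, not a built-in command:

```shell
#!/bin/sh
# Blue-green rollout sketch. Assumes services web-blue and web-green as above.
set -e

# Poll Docker's health status for a compose service until it reports "healthy".
wait_healthy() {
  service="$1"
  tries=30
  while [ "$tries" -gt 0 ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' \
      "$(docker compose ps -q "$service")")
    [ "$status" = "healthy" ] && return 0
    tries=$((tries - 1))
    sleep 2
  done
  echo "$service did not become healthy in time" >&2
  return 1
}

# Replace green first; blue keeps serving traffic via NGINX.
docker compose up -d --no-deps web-green
wait_healthy web-green

# Green is healthy again, so blue can be replaced the same way.
docker compose up -d --no-deps web-blue
wait_healthy web-blue
```

The script assumes the compose file already references the new image tags; `--no-deps` prevents dependent services from being restarted along the way.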
Option 2: Rolling Update with Replicas
For larger deployments, Docker Swarm works better. A simple migration:
```yaml
version: "3.8"

services:
  web:
    image: myapp:latest
    deploy:
      replicas: 4
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback  # roll back automatically when an update fails
      rollback_config:
        parallelism: 1
        delay: 10s
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
```
The configuration updates one replica at a time with 10 seconds delay. The start-first option starts new containers before stopping old ones. If problems occur, Swarm automatically rolls back.
Deploy with: docker stack deploy -c docker-compose.yml myapp
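The rollout can be followed, and reverted if needed, with standard Swarm commands (the service name myapp_web follows from the stack name used above):

```shell
# Watch old and new replicas side by side during the rolling update.
docker service ps myapp_web

# Revert to the previous image manually if something slips past the health checks.
docker service update --rollback myapp_web
```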
Monitoring Health Checks
Health checks are worthless without monitoring. A simple solution with a health check aggregator:
```yaml
services:
  healthcheck-monitor:
    image: docker:cli  # ships the docker CLI; plain alpine does not
    command: |
      sh -c '
      while true; do
        echo "=== Health Check Status ==="
        docker ps --format "table {{.Names}}\t{{.Status}}"
        sleep 60
      done
      '
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
```
For production this is too basic. Better solutions integrate with Prometheus or Grafana. An example with metrics export:
```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8081:8080"
```
cAdvisor exposes per-container metrics in a format Prometheus can scrape. This provides real-time insight into the resource usage and state of all services.
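A minimal Prometheus scrape configuration for this setup could look like the fragment below. The job name is arbitrary, and it assumes Prometheus runs on the same compose network so the cadvisor service name resolves:

```yaml
# prometheus.yml fragment: scrape cAdvisor over the shared compose network.
scrape_configs:
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
```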
Practical Tips
Test health checks locally. A failing health check in production means downtime. Verify that the endpoint responds quickly and reliably:
```shell
docker compose up -d
docker inspect --format='{{.State.Health.Status}}' container_name
```
Distinguish between liveness and readiness. A liveness check detects crashes. A readiness check determines whether the container can handle traffic. Both are important:
| Check Type | Purpose | Action on failure |
|---|---|---|
| Liveness | Has the app crashed? | Restart container |
| Readiness | Can the app handle traffic? | Don't send new requests |
Docker Compose only supports liveness checks. For readiness checks, a load balancer is needed that checks both endpoints.
Avoid overly aggressive timeouts. A health check running every 5 seconds can overload the database. Start with slow intervals and tighten only when necessary.
Log failed health checks. This helps with debugging:
```yaml
healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:8080/health || (echo \"health check failed at $(date)\" && exit 1)"]
```
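Docker stores the output of the most recent health checks, so anything the check command prints can be read back when a container turns unhealthy:

```shell
# Show the stored output of the last few health checks for a container.
docker inspect \
  --format '{{range .State.Health.Log}}{{.Start}}: {{.Output}}{{end}}' \
  container_name
```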
Conclusion
Health checks and zero-downtime deployments aren't a luxury. They form the foundation for reliable production environments. Start with simple health checks on critical services. Add dependencies to prevent race conditions. Implement a deployment strategy that prevents downtime.
Without these fundamentals, a production environment remains vulnerable to unexpected crashes and downtime during updates. With the right configuration, systems run stable and updates can happen without users noticing anything.
