Building a Complete DevOps Monitoring Stack
Learn how to build a robust monitoring stack with Prometheus, Grafana, and alerting for production workloads.
Jean-Pierre Broeders
Freelance DevOps Engineer
Monitoring is the nervous system of your infrastructure. Without proper observability, you're flying blind and discovering issues only when users start complaining. In this post, I'll show you how to set up a professional monitoring stack that gives you real-time insights into your systems.
Why Monitoring is Essential
Production systems are complex. Applications crash, disks fill up, databases slow down, and network connections time out. Good monitoring helps you:
- Detect problems early before they impact users
- Perform root cause analysis faster
- Identify trends pointing to future issues
- Maintain SLAs with concrete metrics
The Stack: Prometheus + Grafana + Alertmanager
The most popular open-source monitoring stack consists of:
- Prometheus: Time-series database and metrics collector
- Grafana: Dashboarding and visualization
- Alertmanager: Intelligent alert routing and grouping
- Node Exporter: System metrics (CPU, RAM, disk)
- cAdvisor: Container metrics
Setup with Docker Compose
Here's a production-ready setup:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
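Once the configuration files referenced above exist (prometheus.yml and alertmanager.yml are covered below), start everything with docker compose up -d. Prometheus then listens on port 9090, Grafana on 3000, and Alertmanager on 9093; the Status > Targets page in the Prometheus UI (http://localhost:9090/targets) shows whether all exporters are being scraped.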
Prometheus Configuration
Create prometheus.yml with your scrape targets:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
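When you add your own applications to scrape_configs, you can also attach static labels to each target, which ties in with the labeling advice under Best Practices below. As an illustration (the job name myapp and port 8000 are placeholders, not part of the stack above), an application job with environment and service labels could be added like this:

  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:8000']
        labels:
          environment: 'production'
          service: 'myapp'

These labels are attached to every metric and alert coming from that job, so you can filter dashboards and route alerts on them later.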
Configuring Alert Rules
Create alerts.yml for critical situations:
groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for 5 minutes on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low"
          description: "Only {{ $value }}% disk space remaining on {{ $labels.instance }}"

      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for 2 minutes"
Alertmanager Setup
Configure alertmanager.yml for notifications:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
        send_resolved: true
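A single catch-all webhook is enough to get started, but in practice you usually want to route on the severity label that the alert rules already set. A sketch of the additions (the Slack webhook URL and channel are placeholders, and the existing group and timing settings stay as above): a child route under route: sends critical alerts to a Slack receiver, while everything else keeps going to default:

route:
  # existing group_by / timing settings stay as above
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
        send_resolved: true
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'
        send_resolved: true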
Grafana Dashboards
After starting the stack:
- Open http://localhost:3000
- Log in with admin/changeme
- Add Prometheus as a data source (http://prometheus:9090), or provision it automatically as shown after this list
- Import community dashboards:
- Node Exporter Full (ID: 1860)
- Docker Container & Host Metrics (ID: 179)
Best Practices
Set retention: Don't keep metrics forever. 30 days is often sufficient for troubleshooting.
Labeling strategy: Use consistent labels such as environment, service, and instance so filtering and aggregation stay predictable across dashboards and alerts.
Prevent alert fatigue: Too many alerts = no alerts. Focus on actionable warnings.
Dashboard per service: Create dedicated dashboards for each application with relevant metrics.
Back up your configuration: Prometheus data is replaceable, but your dashboards and alerts aren't.
Conclusion
A good monitoring stack is not optional for production workloads. With Prometheus and Grafana, you can build a professional system in an afternoon that warns you about problems and provides insights into your infrastructure. Start simple, add metrics incrementally, and refine your alerts based on experience.
Summary: This article explains how to build a complete DevOps monitoring stack with Prometheus, Grafana, and Alertmanager. It covers the Docker Compose setup, alert configuration, and best practices for production observability.
