Building a Complete DevOps Monitoring Stack
Learn how to build a robust monitoring stack with Prometheus, Grafana, and alerting for production workloads.
Jean-Pierre Broeders
Freelance DevOps Engineer
Monitoring is the nervous system of your infrastructure. Without proper observability, you're flying blind and discovering issues only when users start complaining. In this post, I'll show you how to set up a professional monitoring stack that gives you real-time insights into your systems.
Why Monitoring is Essential
Production systems are complex. Applications crash, disks fill up, databases slow down, and network connections time out. Good monitoring helps you:
- Detect problems early before they impact users
- Perform root cause analysis faster
- Identify trends pointing to future issues
- Maintain SLAs with concrete metrics
The Stack: Prometheus + Grafana + Alertmanager
The most popular open-source monitoring stack consists of:
- Prometheus: Time-series database and metrics collector
- Grafana: Dashboarding and visualization
- Alertmanager: Intelligent alert routing and grouping
- Node Exporter: System metrics (CPU, RAM, disk)
- cAdvisor: Container metrics
Setup with Docker Compose
Here's a production-ready setup:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
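Once the configuration files referenced above exist (prometheus.yml and alertmanager.yml are covered below), start everything with docker compose up -d. Prometheus then listens on port 9090, Grafana on 3000, and Alertmanager on 9093; the Status > Targets page in the Prometheus UI (http://localhost:9090/targets) shows whether all exporters are being scraped.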
Prometheus Configuration
Create prometheus.yml with your scrape targets:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
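When you add your own applications to scrape_configs, you can also attach static labels to each target, which ties in with the labeling advice under Best Practices below. As an illustration (the job name myapp and port 8000 are placeholders, not part of the stack above), an application job with environment and service labels could be added like this:

  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:8000']
        labels:
          environment: 'production'
          service: 'myapp'

These labels are attached to every metric and alert coming from that job, so you can filter dashboards and route alerts on them later.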
Configuring Alert Rules
Create alerts.yml for critical situations:
groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for 5 minutes on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low"
          description: "Only {{ $value }}% disk space remaining on {{ $labels.instance }}"

      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for 2 minutes"
Alertmanager Setup
Configure alertmanager.yml for notifications:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
        send_resolved: true
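A single catch-all webhook is enough to get started, but in practice you usually want to route on the severity label that the alert rules already set. A sketch of the additions (the Slack webhook URL and channel are placeholders, and the existing group and timing settings stay as above): a child route under route: sends critical alerts to a Slack receiver, while everything else keeps going to default:

route:
  # existing group_by / timing settings stay as above
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
        send_resolved: true
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'
        send_resolved: true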
Grafana Dashboards
After starting the stack:
- Open http://localhost:3000
- Log in with admin/changeme
- Add Prometheus as a data source (http://prometheus:9090), or provision it automatically as shown after this list
- Import community dashboards:
- Node Exporter Full (ID: 1860)
- Docker Container & Host Metrics (ID: 179)
Best Practices
Set retention: Don't keep metrics forever. 30 days is often sufficient for troubleshooting.
Labeling strategy: Use consistent labels such as environment, service, and instance so filtering and aggregation stay predictable across dashboards and alerts.
Prevent alert fatigue: Too many alerts = no alerts. Focus on actionable warnings.
Dashboard per service: Create dedicated dashboards for each application with relevant metrics.
Back up your configuration: Prometheus data is replaceable, but your dashboards and alerts aren't.
Conclusion
A good monitoring stack is not optional for production workloads. With Prometheus and Grafana, you can build a professional system in an afternoon that warns you about problems and provides insights into your infrastructure. Start simple, add metrics incrementally, and refine your alerts based on experience.
Summary: This article explains how to build a complete DevOps monitoring stack with Prometheus, Grafana, and Alertmanager. It covers the Docker Compose setup, alert configuration, and best practices for production observability.
