Building a Complete DevOps Monitoring Stack

Learn how to build a robust monitoring stack with Prometheus, Grafana, and alerting for production workloads.

Jean-Pierre Broeders

Freelance DevOps Engineer

February 16, 2026 · 6 min. read

Monitoring is the nervous system of your infrastructure. Without proper observability, you're flying blind and discovering issues only when users start complaining. In this post, I'll show you how to set up a professional monitoring stack that gives you real-time insights into your systems.

Why Monitoring is Essential

Production systems are complex. Applications crash, disks fill up, databases slow down, and network connections time out. Good monitoring helps you:

  • Detect problems early before they impact users
  • Perform root cause analysis faster
  • Identify trends pointing to future issues
  • Maintain SLAs with concrete metrics

The Stack: Prometheus + Grafana + Alertmanager

The most popular open-source monitoring stack consists of:

  • Prometheus: Time-series database and metrics collector
  • Grafana: Dashboarding and visualization
  • Alertmanager: Intelligent alert routing and grouping
  • Node Exporter: System metrics (CPU, RAM, disk)
  • cAdvisor: Container metrics

Setup with Docker Compose

Here's a production-ready setup:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:

Prometheus Configuration

Create prometheus.yml with your scrape targets:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
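
Beyond the built-in exporters, you'll usually want Prometheus to scrape your own applications as well. As a sketch (the service name myapp and port 8000 are placeholders for whatever your app actually exposes), an extra job under scrape_configs could look like this:

  # Hypothetical application target on the same Docker network
  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['myapp:8000']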

Configuring Alert Rules

Create alerts.yml for critical situations:

groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for 5 minutes on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low"
          description: "Only {{ $value }}% disk space remaining on {{ $labels.instance }}"

      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for 2 minutes"

Alertmanager Setup

Configure alertmanager.yml for notifications:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
        send_resolved: true
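
In practice you'll often want critical alerts to reach a different channel than warnings. Alertmanager supports this with child routes. Here's a minimal sketch, assuming a separate 'oncall' receiver with a placeholder webhook URL (the matchers syntax requires Alertmanager 0.22 or newer):

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts go to the on-call receiver; everything else falls through to 'default'
    - matchers:
        - severity = critical
      receiver: 'oncall'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
        send_resolved: true
  - name: 'oncall'
    webhook_configs:
      - url: 'http://your-oncall-webhook'
        send_resolved: true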

Grafana Dashboards

After starting the stack:

  1. Open http://localhost:3000
  2. Log in with admin/changeme
  3. Add Prometheus as a datasource (http://prometheus:9090), or let the mounted datasources directory provision it automatically (see the sketch after this list)
  4. Import community dashboards:
    • Node Exporter Full (ID: 1860)
    • Docker Container & Host Metrics (ID: 179)
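
Since the Compose file mounts ./grafana/datasources into Grafana's provisioning directory, you can skip the manual datasource step entirely. A minimal provisioning file (the filename, e.g. datasource.yml, is just a convention) would be:

apiVersion: 1

datasources:
  # Points Grafana at the Prometheus container on the shared Docker network
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true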

Best Practices

Set retention: Don't keep metrics forever. 30 days is often sufficient for troubleshooting.

Labeling strategy: Use consistent labels like environment, service, instance for better filtering.
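
For example, Prometheus lets you attach static labels to scrape targets; the environment and service values below are purely illustrative:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        # These labels are attached to every metric scraped from this target
        labels:
          environment: 'production'
          service: 'infrastructure'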

Prevent alert fatigue: Too many alerts = no alerts. Focus on actionable warnings.
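
One concrete way to cut noise is Alertmanager's inhibition: suppress warning-level alerts for an instance while a critical alert is already firing for it. A minimal sketch (matchers syntax, Alertmanager 0.22+):

inhibit_rules:
  # While a critical alert fires for an instance, mute warnings for that same instance
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['instance']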

Dashboard per service: Create dedicated dashboards for each application with relevant metrics.
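
The Compose file above already mounts ./grafana/dashboards into Grafana's provisioning directory, so per-service dashboard JSON files can be loaded automatically. A provider definition for that directory might look like this (the provider and folder names are arbitrary):

apiVersion: 1

providers:
  # Loads every dashboard JSON found in the mounted provisioning directory
  - name: 'service-dashboards'
    folder: 'Services'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards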

Backup your configuration: Prometheus data is replaceable, but your dashboards and alerts aren't.

Conclusion

A good monitoring stack is not optional for production workloads. With Prometheus and Grafana, you can build a professional system in an afternoon that warns you about problems and provides insights into your infrastructure. Start simple, add metrics incrementally, and refine your alerts based on experience.

Summary: This article explains how to build a complete DevOps monitoring stack with Prometheus, Grafana, and Alertmanager. It covers the Docker Compose setup, alert configuration, and best practices for production observability.

Want to stay updated?

Subscribe to my newsletter or get in touch for freelance projects.
