Fighting Alert Fatigue: Effective Alerting Strategies for DevOps Teams
How to prevent your monitoring system from becoming a noise machine. Practical strategies to keep alerting effective and eliminate alert fatigue.
Jean-Pierre Broeders
Freelance DevOps Engineer
The monitoring stack is running. Prometheus is collecting metrics, Grafana dashboards look great, and alerts are flowing in. Dozens per day. After a week, nobody bothers opening them. After a month, notifications are muted. Sound familiar?
That's alert fatigue. It's one of the most dangerous patterns in operations — not because the tooling fails, but because trust in alerts evaporates. When everything screams, nobody listens anymore.
The "Alert on Everything" Problem
The natural instinct when setting up monitoring is to slap alerts on everything. CPU above 80%? Alert. Disk above 70%? Alert. Response time above 200ms? Alert. Seems reasonable, but the result is a system that produces constant noise.
The core issue: most of these alerts don't require action. CPU at 85% for three minutes is often perfectly normal — a deployment running, a cron job executing, a short traffic burst. Only when it sits above 95% for twenty minutes straight is there a real problem.
Two Simple Rules
Every alert should pass two checks:
- Does it require human action? If the answer is no, it shouldn't be an alert. Put it on a dashboard instead.
- Is it urgent? If it can wait until tomorrow, it belongs in a daily review — not a PagerDuty notification at 3 AM.
Alerts that fail either check get removed or demoted to an informational log entry.
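As a concrete example of that demotion: a noisy disk-usage alert can become a recording rule that feeds a Grafana panel and the daily review instead. A sketch using standard node_exporter metrics (the group and rule names are illustrative):

```yaml
groups:
  - name: capacity
    rules:
      # Recorded for dashboards and daily review - no alert attached.
      # node_filesystem_* metrics come from node_exporter.
      - record: instance:root_disk_usage:percent
        expr: 100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
```

The signal is still there when someone wants it; it just no longer interrupts anyone.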
Severity Levels That Actually Work
A common mistake is having too many severity levels. Three is enough:
| Level | Meaning | Response |
|---|---|---|
| Critical | Customers are impacted, right now | Immediate action, even at night |
| Warning | Will become a problem if nobody acts | Handle during business hours |
| Info | Nice to know | Daily review, no notification |
Critical alerts should fire a few times per week at most. If they go off more often, something's wrong with the thresholds — or with the architecture.
Practical Prometheus Alerting Tips
Here's an example that shows the difference. This alert is too sensitive:
```yaml
# Bad: reacts to every short spike.
# (node_cpu_seconds_total is a counter of CPU seconds, so it has to go
# through rate() before it says anything about CPU percentage.)
- alert: HighCPU
  expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > 80
  for: 1m
```
This works much better:
```yaml
# Good: filters out short bursts
- alert: HighCPUSustained
  expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[15m]))) > 90
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Sustained high CPU on {{ $labels.instance }}"
    description: "CPU has averaged above 90% for roughly 25 minutes"
```
The difference is the 15-minute rate window, which averages out brief spikes, combined with a longer for period. Short spikes disappear, sustained problems don't.
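Expressions like these get long fast. A common refinement is to factor the percentage calculation into a recording rule so alert expressions stay short and readable; a sketch, assuming the illustrative rule name instance:cpu_usage:percent:

```yaml
groups:
  - name: cpu-recording
    rules:
      # Pre-compute CPU usage as a 0-100 gauge per instance.
      # The name follows the level:metric:operation naming convention.
      - record: instance:cpu_usage:percent
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

An alert can then be written as avg_over_time(instance:cpu_usage:percent[15m]) > 90, which is much easier to review.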
Symptoms vs. Causes
Alert on symptoms, not causes. Sounds counterintuitive, but it works.
Symptom alert (good): "Error rate on /api/orders is above 5% for 10 minutes." This immediately tells the story: customers are affected. Action needed.
Cause alert (less useful): "Database connection pool is full." Maybe this causes problems, maybe it doesn't. The application might have retry logic. Without context on actual impact, prioritization is hard.
The symptom alert fires whenever it matters. The cause alert sometimes fires for nothing.
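The symptom alert can be expressed directly in Prometheus. A sketch, assuming a standard http_requests_total counter with path and status labels (metric and label names will differ per instrumentation library):

```yaml
- alert: HighErrorRateOrders
  # Ratio of 5xx responses to all responses on the orders endpoint.
  expr: |
    sum(rate(http_requests_total{path="/api/orders", status=~"5.."}[10m]))
      /
    sum(rate(http_requests_total{path="/api/orders"}[10m])) > 0.05
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Error rate on /api/orders above 5%"
```

Because it measures what customers actually experience, this rule stays valid even when the underlying cause changes.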
Alert Routing and Ownership
An alert without a clear owner is an alert that gets ignored. Every alert rule needs a team or person attached. In Alertmanager, that looks like this:
```yaml
route:
  receiver: 'default-slack'
  routes:
    - match:
        team: platform
      receiver: 'platform-pagerduty'
    - match:
        team: backend
      receiver: 'backend-slack'
      routes:
        - match:
            severity: critical
          receiver: 'backend-pagerduty'
```
Platform alerts go to the platform team. Backend alerts go to backend. Critical routes to PagerDuty, everything else to Slack. Simple, but it prevents alerts from drowning in a shared channel where nobody feels responsible.
Weekly Alert Review
Setting up alerts isn't a one-time activity. Block half an hour per week to review:
- Which alerts fired?
- How many were actionable?
- Which ones got ignored or immediately dismissed?
Alerts that consistently get ignored need to go or get adjusted. An alert dismissed without action for three weeks straight adds nothing but noise. Remove it or raise the threshold.
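Prometheus can answer the first question itself through its built-in ALERTS metric. A sketch of a review query (the metric and its alertstate label are standard; the one-week window is a choice, not a requirement):

```promql
# Rank alerts by how long they spent firing over the past week.
# Each firing series contributes one sample per evaluation interval,
# so the count is a proxy for total firing time.
sort_desc(
  sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))
)
```

Alerts at the top of this list that never led to action are the first candidates for removal.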
Runbooks: The Forgotten Link
Every critical alert deserves a runbook. Not a vague wiki page, but concrete steps: what to check first, which commands to run, when to escalate.
```yaml
annotations:
  runbook_url: "https://wiki.internal/runbooks/high-error-rate-orders"
```
A runbook saves minutes during a stressful incident. And it enables less experienced team members to respond effectively to a middle-of-the-night page.
Wrapping Up
Good alerting isn't about more alerts. It's about fewer, better ones. Every alert should be urgent, actionable, and owned. Anything that doesn't meet those criteria is noise — and noise is the enemy of reliable operations.
Start with an audit of current alerts. Cut everything that doesn't require action. Tighten the thresholds. Schedule that weekly review. After a month, the difference is noticeable: fewer notifications, faster response times, and a team that actually trusts the alerting system again.
