Monitoring on a Budget: Cost Control Without Blind Spots
Observability can get expensive fast. Learn how to use smart sampling, retention policies, and open-source tools to keep monitoring affordable without sacrificing quality.
Jean-Pierre Broeders
Freelance DevOps Engineer
Monitoring is essential. But it can also be a money pit. Datadog, New Relic, Splunk — enterprise observability bills quickly run into thousands of dollars per month, especially when you scale. Yet there's no good reason to fly blind just because your budget is limited.
With the right strategies, you can build a robust monitoring stack that costs almost nothing, or keep costs predictable if you choose paid tools. These are the tactics that work.
The Cost Drivers in Monitoring
Before you optimize, understand where the money goes:
Data volume — The more metrics, logs, and traces you collect, the more you pay. Many vendors charge per GB of ingested data.
Retention — Storing data costs money. Some platforms keep everything for 90 days by default, while two weeks of history is usually all you ever look at.
Queries — Some platforms charge per search query or dashboard refresh.
Hosts & containers — Per-agent pricing gets expensive when you run many small services.
Alerting & integrations — Premium features like PagerDuty integrations, custom webhooks, or ML-based anomaly detection quickly increase the bill.
Open-Source First: Prometheus + Grafana
The biggest cost savings come from open-source tooling. Prometheus and Grafana run perfectly fine on a small VPS or in your Kubernetes cluster, without license fees.
Prometheus collects metrics via scraping. No agents that cost money per host — just expose an HTTP endpoint. For system metrics use Node Exporter, for containers cAdvisor. Docker Compose setup:
```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
```
Cost? A €10/month VPS with 4GB RAM and 80GB storage runs this effortlessly for 20-30 services.
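The compose file above expects a prometheus.yml next to it. A minimal sketch, assuming Node Exporter and cAdvisor run as containers on their default ports (9100 and 8080); adjust the targets to your environment:

```yaml
global:
  scrape_interval: 30s   # 30s instead of the common 15s halves sample volume

scrape_configs:
  - job_name: 'prometheus'       # Prometheus monitoring itself
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'             # system metrics via Node Exporter
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'         # container metrics via cAdvisor
    static_configs:
      - targets: ['cadvisor:8080']
```

A longer scrape interval is one of the cheapest knobs you have: it cuts both storage and query load without losing the shape of your graphs.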
Retention Policies: Not Everything Needs to Last Forever
By default, many tools keep everything for months. But after two weeks, most incidents are already resolved. Why keep paying for old data?
Setting Prometheus retention:
```yaml
command:
  - '--storage.tsdb.retention.time=15d'
```
Log rotation with Loki: Instead of keeping all logs indefinitely, use different retention tiers:
| Log Type | Retention | Reason |
|---|---|---|
| ERROR logs | 30 days | Compliance & debugging |
| WARN logs | 14 days | Troubleshooting |
| INFO logs | 7 days | Recent context |
| DEBUG logs | 3 days | Development only |
In Loki, set the default retention in limits_config:
```yaml
limits_config:
  retention_period: 168h          # 7-day default
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB
```
Per-stream overrides go under retention_stream in limits_config, and the compactor (with retention_enabled: true) performs the actual deletion. Errors stay longer, debug logs disappear quickly. This saves gigabytes per week.
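The tiers from the table can be sketched as retention_stream overrides, assuming your log streams carry a `level` label (selectors and periods here are illustrative):

```yaml
compactor:
  retention_enabled: true         # compactor performs the actual deletion

limits_config:
  retention_period: 168h          # 7-day default, covers INFO
  retention_stream:
    - selector: '{level="error"}'
      priority: 3
      period: 720h                # 30 days
    - selector: '{level="warn"}'
      priority: 2
      period: 336h                # 14 days
    - selector: '{level="debug"}'
      priority: 1
      period: 72h                 # 3 days
```

The highest-priority matching rule wins, so the specific tiers override the 168h default.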
Sampling: Not Every Request Needs Tracing
Distributed tracing can generate enormous data volumes. If you have 10,000 requests per minute and trace each one, you quickly hit terabytes per month. That's not necessary.
Tail-based sampling is the smart approach: trace everything temporarily, but only keep interesting requests — errors, slow calls, specific endpoints.
With OpenTelemetry Collector:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```
This keeps:
- All errors (100%)
- All requests > 500ms (100%)
- 5% of normal requests (for baseline)
Result? 95% less trace data, without missing important information.
Cardinality Under Control: When Metrics Explode
High-cardinality labels make metric sets exponentially larger. A simple example:
```
http_requests_total{endpoint="/api/users/12345"}
```
If endpoint contains the user ID, you'll soon have millions of unique time series. Prometheus crashes, storage grows explosively.
Fix: Use template endpoints:
```
http_requests_total{endpoint="/api/users/:id"}
```
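If you can't change the instrumentation itself, Prometheus can rewrite the label at scrape time with metric_relabel_configs. A sketch assuming numeric user IDs (job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:8000']
    metric_relabel_configs:
      # Collapse /api/users/<numeric id> into a single template endpoint
      - source_labels: [endpoint]
        regex: '/api/users/[0-9]+'
        target_label: endpoint
        replacement: '/api/users/:id'
```

This drops the cardinality before the samples ever hit storage, which is where the cost lives.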
Check your metrics regularly for cardinality issues:
```bash
# Top 10 metrics by number of time series
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))' \
  | jq -r '.data.result[] | "\(.value[1])  \(.metric.__name__)"'
```
Remove labels that are too detailed (like user IDs, session tokens, timestamps). Those belong in traces or logs, not metrics.
Free Alternatives for Paid Features
Some features seem premium, but can be replicated for free:
Uptime monitoring → Use Uptime Kuma (self-hosted) instead of Pingdom.
Alerting → Prometheus Alertmanager + a webhook to Discord/Slack costs nothing, while PagerDuty charges per user per month.
Log aggregation → Loki (Grafana's log stack) is free; Splunk's ingest-based pricing quickly runs into thousands of dollars per month.
Synthetics / End-to-end tests → Playwright in a cron job + a simple dashboard replaces Datadog Synthetics, which bills per test run.
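The free alerting path is a few lines of Alertmanager config. A sketch using the Slack webhook receiver; the webhook URL and channel are placeholders:

```yaml
route:
  receiver: 'slack'
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # your webhook URL
        channel: '#alerts'
        send_resolved: true      # also post when the alert clears
```

Discord accepts the same pattern via its Slack-compatible webhook endpoint, so one config style covers both.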
Cloud-Hosted: Choose Wisely
Sometimes managed monitoring is the better choice: less overhead, better integrations. Even then, there are smarter options than the big names:
Grafana Cloud — Free tier: 10K series, 50GB logs, 50GB traces per month. Enough for smaller projects.
Sentry (errors only) — 5K events/month free. Often sufficient for error tracking.
Axiom (logs) — 500GB ingested per month free. Scalable and simple.
Compare that to Datadog (from $15 per host per month) and New Relic (around $99 per full-platform user per month). The savings add up quickly.
Dashboards: Less is More
Many teams build massive dashboards with hundreds of panels. That's not only visually overwhelming, it also slows query performance and increases the cloud bill if you pay per query.
Best practice: One dashboard per service, max 8-12 panels. Focus on:
- Golden signals (latency, traffic, errors, saturation)
- Resource usage (CPU, memory, disk)
- Business metrics (orders, signups, etc.)
Everything beyond that belongs in ad-hoc queries, not in auto-refresh dashboards.
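The golden signals boil down to a handful of PromQL expressions. A sketch as Prometheus recording rules, assuming the conventional http_requests_total counter and http_request_duration_seconds histogram; adjust the metric and label names to your instrumentation:

```yaml
groups:
  - name: golden-signals
    rules:
      - record: job:http_requests:rate5m         # traffic
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio_rate5m     # errors
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_latency:p95_5m          # latency
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Precomputed rules like these keep dashboard panels fast and cheap: the heavy aggregation runs once per evaluation interval instead of on every refresh.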
Conclusion
Monitoring doesn't have to be a budget killer. With open-source tools, smart sampling, retention policies, and cardinality management, you get full observability for a fraction of what enterprise vendors charge. Start with Prometheus + Grafana for metrics, Loki for logs, and OpenTelemetry for traces. Scale where needed, but never pay for data you don't use.
Dutch summary (translated): This article covers practical cost-optimization strategies for monitoring infrastructure: from open-source tools like Prometheus and Grafana to smart retention policies, tail-based sampling for traces, and avoiding high-cardinality metrics. It shows how to build production-grade observability on a budget without blind spots.
