Monitoring on a Budget: Cost Control Without Blind Spots

Observability can get expensive fast. Learn how to use smart sampling, retention policies, and open-source tools to keep monitoring affordable without sacrificing quality.

Jean-Pierre Broeders

Freelance DevOps Engineer

April 3, 2026 · 6 min. read

Monitoring is essential. But it can also be a money pit. Datadog, New Relic, Splunk — enterprise observability bills quickly run into thousands of dollars per month, especially when you scale. Yet there's no good reason to fly blind just because your budget is limited.

With the right strategies, you can build a robust monitoring stack that costs almost nothing, or keep costs predictable if you choose paid tools. These are the tactics that work.

The Cost Drivers in Monitoring

Before you optimize, understand where the money goes:

Data volume — The more metrics, logs, and traces you collect, the more you pay. Many vendors charge per GB of ingested data.

Retention — Storing data costs money. Some platforms keep everything for 90 days by default, even though most investigations only need the last two weeks.

Queries — Some platforms charge per search query or dashboard refresh.

Hosts & containers — Per-agent pricing gets expensive when you run many small services.

Alerting & integrations — Premium features like PagerDuty integrations, custom webhooks, or ML-based anomaly detection quickly increase the bill.
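Ingest volume is usually the dominant term, and the arithmetic is worth doing explicitly. A back-of-envelope sketch (the per-GB rate and daily volumes are illustrative assumptions, not any vendor's actual pricing):

```python
# Back-of-envelope monthly ingest cost. All numbers are
# illustrative assumptions, not real vendor pricing.

def monthly_ingest_cost(gb_per_day, price_per_gb, days=30):
    """Cost of ingesting `gb_per_day` for `days` at `price_per_gb`."""
    return gb_per_day * days * price_per_gb

# A modest 5 GB/day of logs at a hypothetical $0.50/GB ingested:
cost = monthly_ingest_cost(gb_per_day=5, price_per_gb=0.50)
print(f"${cost:.2f}/month")  # $75.00/month

# The same stream after dropping DEBUG logs (say 60% of the volume):
reduced = monthly_ingest_cost(gb_per_day=5 * 0.4, price_per_gb=0.50)
print(f"${reduced:.2f}/month")  # $30.00/month
```

Even this crude model makes the lever obvious: halving ingest volume halves the bill, which is why most of the tactics below attack volume first.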

Open-Source First: Prometheus + Grafana

The biggest cost savings come from open-source tooling. Prometheus and Grafana run perfectly fine on a small VPS or in your Kubernetes cluster, without license fees.

Prometheus collects metrics via scraping. No agents that cost money per host — just expose an HTTP endpoint. For system metrics use Node Exporter, for containers cAdvisor. Docker Compose setup:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:

Cost? A €10/month VPS with 4GB RAM and 80GB storage runs this effortlessly for 20-30 services.
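That scrape model is also why exporters stay cheap: an exporter is nothing more than an HTTP endpoint returning plain text. A minimal sketch using only Python's standard library (the metric names are made up; in practice you'd use an official client library such as prometheus_client):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy counters; a real app would increment these from its request handlers.
REQUESTS_TOTAL = 1234
ERRORS_TOTAL = 7

def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total HTTP requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUESTS_TOTAL}\n"
        "# HELP app_errors_total Total HTTP errors.\n"
        "# TYPE app_errors_total counter\n"
        f"app_errors_total {ERRORS_TOTAL}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Point a Prometheus scrape_config at :8000/metrics and you're done.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```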

Retention Policies: Not Everything Needs to Last Forever

By default, many tools keep everything for months. But after two weeks, most incidents are already resolved. Why keep paying for old data?

Setting Prometheus retention:

command:
  - '--storage.tsdb.retention.time=15d'

Log rotation with Loki: Instead of keeping all logs indefinitely, use different retention tiers:

Log Type     Retention   Reason
ERROR logs   30 days     Compliance & debugging
WARN logs    14 days     Troubleshooting
INFO logs    7 days      Recent context
DEBUG logs   3 days      Development only

In Loki, set the default retention in limits_config and override it per stream with retention_stream (deletion itself is carried out by the compactor, with retention_enabled: true):

limits_config:
  retention_period: 168h  # 7 days default
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB
  retention_stream:
    - selector: '{level="ERROR"}'
      period: 720h  # 30 days for errors

Errors stay longer, debug logs disappear quickly. This saves gigabytes per week.
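A quick sanity check on what the tiers above buy you, assuming an illustrative daily volume per log level:

```python
# Storage footprint of tiered retention vs. keeping everything 30 days.
# Daily volumes per level are illustrative assumptions.

TIERS = {            # level: (GB/day, retention days)
    "ERROR": (0.2, 30),
    "WARN":  (0.5, 14),
    "INFO":  (2.0, 7),
    "DEBUG": (5.0, 3),
}

tiered_gb = sum(gb * days for gb, days in TIERS.values())
flat_gb = sum(gb for gb, _ in TIERS.values()) * 30

print(f"tiered: {tiered_gb:.0f} GB, flat 30d: {flat_gb:.0f} GB")
# tiered: 42 GB, flat 30d: 231 GB
```

In this (made-up) workload, tiering cuts the steady-state log footprint by more than 80%, because the noisiest levels are exactly the ones you keep shortest.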

Sampling: Not Every Request Needs Tracing

Distributed tracing can generate enormous data volumes. If you have 10,000 requests per minute and trace each one, you quickly hit terabytes per month. That's not necessary.

Tail-based sampling is the smart approach: trace everything temporarily, but only keep interesting requests — errors, slow calls, specific endpoints.

With OpenTelemetry Collector:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

This keeps:

  • All errors (100%)
  • All requests > 500ms (100%)
  • 5% of normal requests (for baseline)

Result? 95% less trace data, without missing important information.
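The decision logic behind those three policies fits in a few lines. A sketch in plain Python (the thresholds mirror the collector config above; the function name is made up):

```python
import random

def keep_trace(status, latency_ms, sample_pct=5.0, rng=None):
    """Tail-sampling decision: keep every error, every slow request,
    and a small probabilistic baseline of everything else."""
    if status == "ERROR":
        return True                         # errors: 100%
    if latency_ms > 500:
        return True                         # slow requests: 100%
    rng = rng or random
    return rng.random() * 100 < sample_pct  # baseline: ~5%

# Errors and slow calls always survive:
assert keep_trace("ERROR", latency_ms=20)
assert keep_trace("OK", latency_ms=900)
```

Note the ordering: the cheap deterministic checks run first, so the random baseline only ever applies to unremarkable traces.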

Cardinality Under Control: When Metrics Explode

High-cardinality labels make metric sets exponentially larger. A simple example:

http_requests_total{endpoint="/api/users/12345"}

If endpoint contains the user ID, you'll soon have millions of unique time series. Prometheus crashes, storage grows explosively.

Fix: Use template endpoints:

http_requests_total{endpoint="/api/users/:id"}

Check your metrics regularly for cardinality issues:

# Top 10 metrics by series count (TSDB stats endpoint)
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'

Remove labels that are too detailed (like user IDs, session tokens, timestamps). Those belong in traces or logs, not metrics.
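If your framework doesn't hand you the route template, you can normalize paths before using them as labels. A sketch with regexes (the patterns are assumptions about your URL scheme; adjust them to your routes):

```python
import re

# Most specific patterns first, so UUIDs aren't half-eaten by the
# numeric rule. These patterns are assumptions about the URL scheme.
NORMALIZERS = [
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}"
                r"-[0-9a-f]{4}-[0-9a-f]{12}"), "/:uuid"),
    (re.compile(r"/\d+"), "/:id"),
]

def normalize_endpoint(path):
    """Collapse ID-like path segments so the label stays low-cardinality."""
    for pattern, replacement in NORMALIZERS:
        path = pattern.sub(replacement, path)
    return path

print(normalize_endpoint("/api/users/12345"))             # /api/users/:id
print(normalize_endpoint("/api/users/12345/orders/678"))  # /api/users/:id/orders/:id
```

Whatever the raw paths look like, the label set stays bounded by the number of routes, not the number of users.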

Free Alternatives for Paid Features

Some features seem premium, but can be replicated for free:

Uptime monitoring → Use Uptime Kuma (self-hosted) instead of Pingdom.

Alerting → Prometheus Alertmanager + webhook to Discord/Slack costs nothing. PagerDuty is $29/month per user.

Log aggregation → Loki (Grafana's log stack) is free. Splunk's ingest-based pricing runs into thousands per month at any serious volume.

Synthetics / End-to-end tests → Playwright in a cron job + simple dashboard. Datadog Synthetics bills per test run, which adds up fast at any useful check frequency.
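The cron-job synthetics idea reduces to: request the endpoint, compare against expectations, fire a webhook on failure. A minimal sketch of that loop (the URL is a placeholder; alert delivery is left as a print):

```python
import time
import urllib.request

def evaluate(status_code, elapsed_ms, expect_status=200, max_ms=2000):
    """Return an alert message, or None when the check passes."""
    if status_code != expect_status:
        return f"unexpected status {status_code} (wanted {expect_status})"
    if elapsed_ms > max_ms:
        return f"slow response: {elapsed_ms:.0f}ms > {max_ms:.0f}ms"
    return None

def run_check(url):
    """One synthetic check; wire the result into any webhook you like."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            elapsed = (time.monotonic() - start) * 1000
            return evaluate(resp.status, elapsed)
    except OSError as exc:
        return f"request failed: {exc}"

if __name__ == "__main__":
    # Placeholder URL; in a crontab this would run every minute.
    alert = run_check("https://example.com/health")
    if alert:
        print(alert)  # or POST it to a Slack/Discord webhook
```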

Cloud-Hosted: Choose Wisely

Sometimes managed monitoring is the better choice — less overhead, better integrations. Even then, there are smarter options than the big names:

Grafana Cloud — Free tier: 10K series, 50GB logs, 50GB traces per month. Enough for smaller projects.

Sentry (errors only) — 5K events/month free. Often sufficient for error tracking.

Axiom (logs) — 500GB ingested per month free. Scalable and simple.

Compare that to Datadog ($15/host/month) and New Relic ($99/user/month). The cost savings quickly add up.

Dashboards: Less is More

Many teams build massive dashboards with hundreds of panels. That's not only visually overwhelming, it also slows query performance and increases the cloud bill if you pay per query.

Best practice: One dashboard per service, max 8-12 panels. Focus on:

  • Golden signals (latency, traffic, errors, saturation)
  • Resource usage (CPU, memory, disk)
  • Business metrics (orders, signups, etc.)

Everything beyond that belongs in ad-hoc queries, not in auto-refresh dashboards.

Conclusion

Monitoring doesn't have to be a budget killer. With open-source tools, smart sampling, retention policies, and cardinality management, you get full observability for a fraction of what enterprise vendors charge. Start with Prometheus + Grafana for metrics, Loki for logs, and OpenTelemetry for traces. Scale where needed, but never pay for data you don't use.

Summary: This article covers practical cost-optimization strategies for monitoring infrastructure, from open-source tools such as Prometheus and Grafana to smart retention policies, tail-based sampling for traces, and avoiding high-cardinality metrics. It shows how to build production-grade observability on a budget, without blind spots.

Want to stay updated?

Subscribe to my newsletter or get in touch for freelance projects.
