Distributed Tracing with OpenTelemetry: Visibility into Microservices
How OpenTelemetry and Jaeger make distributed tracing practical for microservice architectures — from instrumentation to finding bottlenecks.
Jean-Pierre Broeders
Freelance DevOps Engineer
Metrics tell you something is slow. Logs tell you what went wrong. But neither answers the question: where exactly, in a chain of twelve microservices, is the problem? That's where distributed tracing makes the difference.
The Microservices Debugging Problem
A user clicks "place order". Behind the scenes, that request hits the API gateway, the order service, the payment service, a fraud check, inventory management, and a notification service. Response time: 4.2 seconds. Where's the delay?
Without tracing, guesswork begins. Each team lead points at a different service. "Everything runs fine on our end." Sound familiar? With distributed tracing, one click on a trace reveals that the fraud check spent 3.1 seconds waiting on an external API.
OpenTelemetry: The Standard
OpenTelemetry (OTel) resolved the fragmentation in the tracing landscape. Where Jaeger, Zipkin, and various vendor-specific SDKs previously coexisted, OTel now provides a single standardized way to collect telemetry data. The project is backed by the CNCF and has broad support from cloud providers and tooling vendors.
The architecture consists of three components:
- SDK — instruments application code and generates traces
- Collector — receives, processes, and exports telemetry data
- Backend — stores traces and makes them searchable (Jaeger, Tempo, etc.)
Instrumentation in .NET
For a .NET application, the setup requires surprisingly little effort. The OpenTelemetry SDK integrates cleanly with the dependency injection system.
```csharp
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddEntityFrameworkCoreInstrumentation()
            .AddSource("OrderService")
            .AddOtlpExporter(opts =>
            {
                opts.Endpoint = new Uri("http://otel-collector:4317");
            });
    });
```
This automatically adds spans for incoming HTTP requests, outgoing HTTP calls, and database queries. Without a single line of custom code, the request flow through the application becomes visible.
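The auto-instrumentation does assume the corresponding NuGet packages are installed. A typical set looks like the following (versions omitted; at the time of writing the Entity Framework Core instrumentation was still prerelease):

```shell
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.EntityFrameworkCore --prerelease
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
```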
For custom spans — around a complex business operation, for example:
```csharp
private static readonly ActivitySource Source = new("OrderService");

public async Task<Order> ProcessOrder(OrderRequest request)
{
    using var activity = Source.StartActivity("ProcessOrder");
    activity?.SetTag("order.customer_id", request.CustomerId);
    activity?.SetTag("order.item_count", request.Items.Count);

    var validated = await ValidateInventory(request);
    var payment = await ProcessPayment(validated);

    activity?.SetTag("order.total", payment.Amount);
    return await FinalizeOrder(payment);
}
```
Each StartActivity call creates a span that automatically links to the parent trace. The tags enable filtering later: all orders from a specific customer, or all orders with more than five items.
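Tags cover the happy path; failures deserve the same treatment, because a span marked as failed is exactly what tail sampling later filters on. A sketch using only plain System.Diagnostics (the ActivityListener stands in for the OTel SDK so the example runs standalone; the exception is simulated):

```csharp
using System.Diagnostics;

var source = new ActivitySource("OrderService");

// Without a registered listener, StartActivity returns null. The OTel SDK
// normally provides one; here we sample everything ourselves.
ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = s => s.Name == "OrderService",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllData
});

var activity = source.StartActivity("ProcessOrder");
try
{
    throw new InvalidOperationException("payment gateway timeout"); // simulated failure
}
catch (Exception ex)
{
    // Mark the span as failed so an errors-only sampling policy keeps it
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    activity?.SetTag("exception.type", ex.GetType().Name);
}

Console.WriteLine(activity?.Status); // Error
activity?.Dispose();
```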
The OTel Collector
The Collector sits between applications and the backend. Sounds like unnecessary complexity, but it decouples the application from the storage choice. Switch from Jaeger to Grafana Tempo without touching a single line of application code.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/jaeger]
```
The tail_sampling processor matters here. In production, a busy application generates thousands of traces per minute. Storing everything is expensive and unnecessary. Tail sampling keeps only traces that matter: errors and slow requests. The rest gets discarded.
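Keeping only errors and slow requests has a downside: there is no healthy baseline left to compare against. A common addition is a probabilistic policy that retains a small slice of everything else (the percentage here is an arbitrary starting point, not a recommendation):

```yaml
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```

Tail sampling keeps a trace if any policy matches, so this sits alongside the error and latency policies rather than replacing them.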
Docker Compose Setup
A working tracing stack fits in a compact Compose file:
```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
  jaeger:
    image: jaegertracing/all-in-one:1.54
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
    ports:
      - "16686:16686" # UI
      - "4317"        # OTLP gRPC
  order-service:
    build: ./src/OrderService
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_SERVICE_NAME: order-service
```
After `docker compose up`, the Jaeger UI is available on port 16686. Search by service name, filter by duration or status, and click a trace to see the full request flow.
Context Propagation: The Secret Sauce
Tracing works across service boundaries because context is automatically passed along in HTTP headers. The W3C Trace Context format (traceparent header) is the standard. Most HTTP clients and frameworks propagate this automatically when the OTel SDK is active.
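The traceparent header itself is plain text: version, trace id, parent span id, and flags, separated by dashes. A quick hand-rolled parse for illustration (the SDK does this for you; the sample value is the example from the W3C spec):

```csharp
// W3C Trace Context format: version-traceid-spanid-flags
var traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

var parts = traceparent.Split('-');
var traceId = parts[1];         // 32 hex chars, shared by every span in the trace
var spanId  = parts[2];         // 16 hex chars, identifies the parent span
var sampled = parts[3] == "01"; // sampled flag, set by the caller

Console.WriteLine($"{traceId.Length} {spanId.Length} {sampled}");
// 32 16 True
```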
Where things break: message queues. With asynchronous communication via RabbitMQ or Kafka, trace context must be explicitly passed in message headers.
```csharp
// Producer: inject the current trace context into the outgoing message headers
var propagator = Propagators.DefaultTextMapPropagator;
propagator.Inject(
    new PropagationContext(Activity.Current!.Context, Baggage.Current),
    message.Headers,
    (headers, key, value) => headers[key] = value
);

// Consumer: extract the parent context and start the span under it
var parentContext = propagator.Extract(
    default,
    message.Headers,
    (headers, key) => headers.TryGetValue(key, out var val) ? [val] : []
);

using var activity = Source.StartActivity("ProcessMessage",
    ActivityKind.Consumer, parentContext.ActivityContext);
```
Without this piece, the trace stops at the queue producer and a new, unrelated trace starts at the consumer. End-to-end visibility is gone.
What Tracing Delivers
| Scenario | Without Tracing | With Tracing |
|---|---|---|
| Finding slow endpoints | Hours of log file searching | Filter on latency > 2s and open the trace |
| Debugging cascade failures | Teams pointing fingers | Trace shows exactly which downstream call fails |
| Performance regression | Only visible after complaints | Span duration dashboards show trends |
| Dependency mapping | Manually maintained wiki | Automatically generated from traces |
Common Pitfalls
A few things that go wrong in practice. First: too many custom spans. Every span has overhead. A span around every method call makes traces unreadable and costs performance. Instrument at the level of business operations and external calls, not at method level.
Second: forgetting to configure the sampling rate. By default, the SDK sends everything. At 10,000 requests per second, that's a firehose of data hammering the network and storage.
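The standard OTel environment variables fix that without code changes. Head sampling by trace id ratio is the simplest option (the 10% here is an illustrative starting point, not a recommendation):

```shell
# Keep ~10% of new traces, but always honor the parent span's decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```

The parent-based variant matters: once an upstream service decides to sample a trace, downstream services follow along instead of rolling their own dice, so traces stay complete.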
And finally: deploying tracing but never looking at it. Sounds obvious. Yet many teams run Jaeger for months without anyone opening the UI. Build tracing into the incident response workflow: first step during a production issue is always checking recent traces.
Wrapping Up
Distributed tracing is the third pillar of observability, alongside metrics and logging. With OpenTelemetry as a standardized SDK and a lightweight backend like Jaeger, the barrier to entry is low. The investment in instrumentation pays for itself at the first serious production incident where a trace reveals in minutes what would otherwise have taken hours of debugging.
