Distributed Tracing with OpenTelemetry: Visibility into Microservices
How OpenTelemetry and Jaeger make distributed tracing practical for microservice architectures — from instrumentation to finding bottlenecks.
Jean-Pierre Broeders
Freelance DevOps Engineer
Metrics tell you something is slow. Logs tell you what went wrong. But neither answers the question: where exactly, in a chain of twelve microservices, is the problem? That's where distributed tracing makes the difference.
The Microservices Debugging Problem
A user clicks "place order". Behind the scenes, that request hits the API gateway, the order service, the payment service, a fraud check, inventory management, and a notification service. Response time: 4.2 seconds. Where's the delay?
Without tracing, guesswork begins. Each team lead points at a different service. "Everything runs fine on our end." Sound familiar? With distributed tracing, one click on a trace reveals that the fraud check spent 3.1 seconds waiting on an external API.
OpenTelemetry: The Standard
OpenTelemetry (OTel) resolved the fragmentation in the tracing landscape. Where Jaeger, Zipkin, and various vendor-specific SDKs previously coexisted, OTel now provides a single standardized way to collect telemetry data. The project is backed by the CNCF and has broad support from cloud providers and tooling vendors.
The architecture consists of three components:
- SDK — instruments application code and generates traces
- Collector — receives, processes, and exports telemetry data
- Backend — stores traces and makes them searchable (Jaeger, Tempo, etc.)
Instrumentation in .NET
For a .NET application, the setup requires surprisingly little effort. The OpenTelemetry SDK integrates cleanly with the dependency injection system.
```csharp
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddEntityFrameworkCoreInstrumentation()
            .AddSource("OrderService")
            .AddOtlpExporter(opts =>
            {
                opts.Endpoint = new Uri("http://otel-collector:4317");
            });
    });
```
This automatically adds spans for incoming HTTP requests, outgoing HTTP calls, and database queries. Without a single line of custom code, the request flow through the application becomes visible.
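The auto-instrumentation does assume the corresponding NuGet packages are installed. A typical set looks like the following (versions omitted; at the time of writing the Entity Framework Core instrumentation was still prerelease):

```shell
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.EntityFrameworkCore --prerelease
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
```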
For custom spans — around a complex business operation, for example:
```csharp
private static readonly ActivitySource Source = new("OrderService");

public async Task<Order> ProcessOrder(OrderRequest request)
{
    using var activity = Source.StartActivity("ProcessOrder");
    activity?.SetTag("order.customer_id", request.CustomerId);
    activity?.SetTag("order.item_count", request.Items.Count);

    var validated = await ValidateInventory(request);
    var payment = await ProcessPayment(validated);

    activity?.SetTag("order.total", payment.Amount);
    return await FinalizeOrder(payment);
}
```
Each StartActivity call creates a span that automatically links to the parent trace. The tags enable filtering later: all orders from a specific customer, or all orders with more than five items.
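Tags cover the happy path; failures deserve the same treatment, because a span marked as failed is exactly what tail sampling later filters on. A sketch using only plain System.Diagnostics (the ActivityListener stands in for the OTel SDK so the example runs standalone; the exception is simulated):

```csharp
using System.Diagnostics;

var source = new ActivitySource("OrderService");

// Without a registered listener, StartActivity returns null. The OTel SDK
// normally provides one; here we sample everything ourselves.
ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = s => s.Name == "OrderService",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllData
});

var activity = source.StartActivity("ProcessOrder");
try
{
    throw new InvalidOperationException("payment gateway timeout"); // simulated failure
}
catch (Exception ex)
{
    // Mark the span as failed so an errors-only sampling policy keeps it
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    activity?.SetTag("exception.type", ex.GetType().Name);
}

Console.WriteLine(activity?.Status); // Error
activity?.Dispose();
```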
The OTel Collector
The Collector sits between applications and the backend. Sounds like unnecessary complexity, but it decouples the application from the storage choice. Switch from Jaeger to Grafana Tempo without touching a single line of application code.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/jaeger]
```
The tail_sampling processor matters here. In production, a busy application generates thousands of traces per minute. Storing everything is expensive and unnecessary. Tail sampling keeps only traces that matter: errors and slow requests. The rest gets discarded.
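Keeping only errors and slow requests has a downside: there is no healthy baseline left to compare against. A common addition is a probabilistic policy that retains a small slice of everything else (the percentage here is an arbitrary starting point, not a recommendation):

```yaml
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```

Tail sampling keeps a trace if any policy matches, so this sits alongside the error and latency policies rather than replacing them.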
Docker Compose Setup
A working tracing stack fits in a compact Compose file:
```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
  jaeger:
    image: jaegertracing/all-in-one:1.54
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
    ports:
      - "16686:16686" # UI
      - "4317"        # OTLP gRPC
  order-service:
    build: ./src/OrderService
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_SERVICE_NAME: order-service
```
After `docker compose up`, the Jaeger UI is available on port 16686. Search by service name, filter by duration or status, and click a trace to see the full request flow.
Context Propagation: The Secret Sauce
Tracing works across service boundaries because context is automatically passed along in HTTP headers. The W3C Trace Context format (traceparent header) is the standard. Most HTTP clients and frameworks propagate this automatically when the OTel SDK is active.
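The traceparent header itself is plain text: version, trace id, parent span id, and flags, separated by dashes. A quick hand-rolled parse for illustration (the SDK does this for you; the sample value is the example from the W3C spec):

```csharp
// W3C Trace Context format: version-traceid-spanid-flags
var traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

var parts = traceparent.Split('-');
var traceId = parts[1];         // 32 hex chars, shared by every span in the trace
var spanId  = parts[2];         // 16 hex chars, identifies the parent span
var sampled = parts[3] == "01"; // sampled flag, set by the caller

Console.WriteLine($"{traceId.Length} {spanId.Length} {sampled}");
// 32 16 True
```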
Where things break: message queues. With asynchronous communication via RabbitMQ or Kafka, trace context must be explicitly passed in message headers.
```csharp
// Producer: inject the current trace context into the outgoing message headers
var propagator = Propagators.DefaultTextMapPropagator;
propagator.Inject(
    new PropagationContext(Activity.Current!.Context, Baggage.Current),
    message.Headers,
    (headers, key, value) => headers[key] = value
);

// Consumer: extract the parent context and start the span under it
var parentContext = propagator.Extract(
    default,
    message.Headers,
    (headers, key) => headers.TryGetValue(key, out var val) ? [val] : []
);

using var activity = Source.StartActivity("ProcessMessage",
    ActivityKind.Consumer, parentContext.ActivityContext);
```
Without this piece, the trace stops at the queue producer and a new, unrelated trace starts at the consumer. End-to-end visibility is gone.
What Tracing Delivers
| Scenario | Without Tracing | With Tracing |
|---|---|---|
| Finding slow endpoints | Hours of log file searching | Filter on latency > 2s and open the trace |
| Debugging cascade failures | Teams pointing fingers | Trace shows exactly which downstream call fails |
| Performance regression | Only visible after complaints | Span duration dashboards show trends |
| Dependency mapping | Manually maintained wiki | Automatically generated from traces |
Common Pitfalls
A few things that go wrong in practice. First: too many custom spans. Every span has overhead. A span around every method call makes traces unreadable and costs performance. Instrument at the level of business operations and external calls, not at method level.
Second: forgetting to configure the sampling rate. By default, the SDK sends everything. At 10,000 requests per second, that's a firehose of data hammering the network and storage.
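The standard OTel environment variables fix that without code changes. Head sampling by trace id ratio is the simplest option (the 10% here is an illustrative starting point, not a recommendation):

```shell
# Keep ~10% of new traces, but always honor the parent span's decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```

The parent-based variant matters: once an upstream service decides to sample a trace, downstream services follow along instead of rolling their own dice, so traces stay complete.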
And finally: deploying tracing but never looking at it. Sounds obvious. Yet many teams run Jaeger for months without anyone opening the UI. Build tracing into the incident response workflow: first step during a production issue is always checking recent traces.
Wrapping Up
Distributed tracing is the third pillar of observability, alongside metrics and logging. With OpenTelemetry as a standardized SDK and a lightweight backend like Jaeger, the barrier to entry is low. The investment in instrumentation pays for itself at the first serious production incident where a trace reveals in minutes what would otherwise have taken hours of debugging.
