How distributed tracing works within your application

Why tracing is its own problem

In a system of any size, a single user request crosses many services. The order arrives at the front end, the front end calls billing, billing calls invoicing, the order also goes to fulfillment, fulfillment talks to the warehouse, and the warehouse hands off to packing and shipping. Each of these services may itself call others. When something goes wrong, perhaps a slow response, an error or an incorrect total, logs alone are not enough. You can find the failure in one service’s log, but reconstructing the full chain that led there means correlating timestamps across machines that don’t share a clock, across log formats that don’t share a schema, and across processes that don’t share a request identifier.

Tracing solves this by recording a small structured event whenever a service starts or finishes a unit of work and by carrying a shared identifier through every call in the chain. Conceptually it is simple. In practice it has one immediate problem: cost. The most obvious implementation would emit a separate message every time a service receives a request and every time it sends one. The result is an immediate tripling of message volume: the original request, an event from the sender when it goes out, and an event from the receiver when it comes in. Two thirds of that traffic ends up concentrated on a single collector, and the situation gets worse if you trace internal calls as well as the ones that cross service boundaries.
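The tripling is easy to check with a few lines of arithmetic. The sketch below is a back-of-the-envelope model of the naive design just described; the function name and the returned keys are illustrative, not part of any library.

```python
def naive_tracing_volume(business_messages: int) -> dict:
    """Message counts when every send and every receive emits its own trace event."""
    sender_events = business_messages      # one event each time a message goes out
    receiver_events = business_messages    # one event each time a message comes in
    to_collector = sender_events + receiver_events
    total = business_messages + to_collector
    return {
        "business": business_messages,
        "to_collector": to_collector,
        "total": total,
        "collector_share": to_collector / total,
    }


volume = naive_tracing_volume(1_000)
print(volume)
```

For 1,000 business messages the model yields 3,000 messages in total, 2,000 of which land on the collector.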

A tracing library exists to make this volume manageable, and to spare you from re-deriving the basic patterns yourself. Understanding those patterns helps you choose a library wisely and instrument an application well.

The conceptual picture

Three things have to be in place before tracing produces anything useful:

Instrumentation in each service that emits a span whenever the service handles a unit of work, and that propagates correlation identifiers on every outbound call.
A transport that carries spans from the services to a central location without blocking the services that produced them.
A collector that receives spans, optionally processes them, and forwards them to storage. A query interface sits on top of the stored spans and lets you reconstruct individual traces.

Figure 1 sketches the arrangement.

flowchart TB
  subgraph services [Business call chain]
    direction LR
    Order --> Customer --> Fulfillment --> Billing --> Sales
  end
  subgraph otel [OpenTelemetry Collector]
    HTTP
    gRPC
    Kafka["Kafka (contrib)"]
  end
  Order -.->|OTLP| HTTP
  Customer -.->|OTLP| HTTP
  Fulfillment -.->|OTLP| gRPC
  Billing -.-> Kafka
  Sales -.-> Kafka
  otel -.-> Q[(Storage and query UI)]

Figure 1. Services emit spans for the work they do. The transport carries those spans, in batches, to a collector that gathers them centrally. A query interface on top of the stored spans lets you reconstruct individual traces. HTTP and gRPC are the two transports defined by the OTLP specification. Others, such as Kafka, are provided by contrib receivers in the Collector and lie outside the spec.

Each service is responsible for its own spans, not for the rest of the pipeline. The transport handles batching and back-pressure. The collector is where you apply enrichment, sampling decisions, and routing to one or more storage backends. Treating these three as distinct concerns is what makes the architecture survive contact with production traffic.

Correlating events without leaking application data

The first instinct when correlating events across services is to use something the application already knows and to require that it be logged everywhere: a customer identifier, a session identifier, an order number, a SKU, an invoice number. In a small test environment that approach works. In production it falls apart quickly. The data volume becomes overwhelming, queries grow complex and brittle, and developers have to remember to log the right field in the right place and make sure the value is available when the log message is written. The log levels have to be right too. A debug-level log line is no help if production runs with debug turned off.

A dedicated correlation scheme avoids these problems by being separate from the application’s domain data. The scheme generates its own identifiers, carries them through every call, and records them in a uniform format that the collector can index without having to understand what an “order” is. The most widely adopted scheme is the Dapper pattern.

The Dapper pattern

The Dapper pattern was described in a 2010 Google paper titled Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. The mechanics are simple enough to fit on a single diagram.

When a request enters the system from the outside, it triggers a tree of additional calls. The first service to handle the request assigns it a trace ID, a globally unique identifier that will travel with every message in the tree. Every call made from that point on is also assigned its own span ID, and each outbound call carries the span ID of the calling work as a parent ID. The combination of trace, span, and parent is enough to reconstruct the entire tree from the spans alone.

In figure 2, the order service assigns trace t1 to a request that arrived without one. It then makes two outbound calls: one to billing with span s1, and one to fulfillment with span s3. Billing in turn calls invoicing, which receives span s2 along with the trace ID and the parent span s1. The fulfillment branch fans out further, with the warehouse making two parallel calls of its own. Every message in the tree shares the trace ID t1; the span and parent IDs encode the shape of the tree.

Figure 2. The initial request is assigned a trace ID that’s passed along with every resulting message. Each service generates a new span ID for the calls it makes, and carries the calling span as the parent. Symbolic IDs are shown for readability; real trace IDs are 128-bit hex values and real span IDs are 64-bit hex values.
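The ID assignment can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the API of any tracing library: `SpanRef`, `start_root`, and `start_child` are hypothetical names, and the sketch gives the order service a root span of its own so that its outbound calls have a parent to point at.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


def new_trace_id() -> str:
    """128-bit identifier rendered as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """64-bit identifier rendered as 16 lowercase hex characters."""
    return secrets.token_hex(8)


@dataclass(frozen=True)
class SpanRef:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root of the tree


def start_root() -> SpanRef:
    # The request arrived without a trace ID, so a new one is generated.
    return SpanRef(new_trace_id(), new_span_id(), None)


def start_child(parent: SpanRef) -> SpanRef:
    # Same trace ID, fresh span ID, the caller's span ID recorded as the parent.
    return SpanRef(parent.trace_id, new_span_id(), parent.span_id)


order = start_root()
billing = start_child(order)
invoicing = start_child(billing)
print(invoicing)
```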

Two queries become trivial once spans are correlated this way. Given a trace ID, you can pull every span in the tree and render it as a timeline. Given any single span, you can walk back through its parent chain to the root span and recover the original request that triggered the work, using nothing but the span records themselves and without knowing anything about the application’s domain.
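Both queries fit in a short Python sketch over a flat list of span records. The records reuse the symbolic IDs t1, s1, s2, and s3 from figure 2; the root span s0, the IDs s4 to s7, and the second trace t2 are assumptions added here so that the example has something to filter out.

```python
SPANS = [
    {"trace": "t1", "span": "s0", "parent": None, "service": "order"},
    {"trace": "t1", "span": "s1", "parent": "s0", "service": "billing"},
    {"trace": "t1", "span": "s2", "parent": "s1", "service": "invoicing"},
    {"trace": "t1", "span": "s3", "parent": "s0", "service": "fulfillment"},
    {"trace": "t1", "span": "s4", "parent": "s3", "service": "warehouse"},
    {"trace": "t1", "span": "s5", "parent": "s4", "service": "packing"},
    {"trace": "t1", "span": "s6", "parent": "s4", "service": "shipping"},
    {"trace": "t2", "span": "s7", "parent": None, "service": "order"},
]


def spans_in_trace(spans, trace_id):
    """Query 1: every span that belongs to one trace."""
    return [s for s in spans if s["trace"] == trace_id]


def path_to_root(spans, span_id):
    """Query 2: follow parent IDs from one span back to the root."""
    by_id = {s["span"]: s for s in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current["service"])
        current = by_id.get(current["parent"])  # a parent of None ends the walk
    return path


print(len(spans_in_trace(SPANS, "t1")))
print(path_to_root(SPANS, "s5"))
```

Neither function knows what an order or an invoice is; the three identifiers carry all the structure.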

Where Dapper breaks down: fan-in

The Dapper model has one weakness. It assumes every span has at most one parent, which means it can’t describe the case where multiple inbound messages combine to trigger a single outbound message. Figure 3 shows a small example: the estimate shipping service requests quotes from FedEx and DHL in parallel, and the select carrier service has to wait for both replies before choosing one. The outbound message from select carrier genuinely has two parents, the FedEx reply and the DHL reply, but Dapper has no way to record both.

Figure 3. The Dapper pattern assumes each span has a single parent. When a service assembles multiple inbound messages before sending one outbound message, the choice of which inbound span to record as the parent is essentially arbitrary. Implementations usually pick the last arriving message, but information is lost either way.

In practice this has long been a known limitation, and the pragmatic answer was to pick the last message to arrive, accept that the trace is slightly wrong, and move on. Modern tracing models handle the case directly. We come back to it in the next section.

From OpenTracing to OpenTelemetry

Through the mid-2010s, two open source projects converged on the problem of decoupling instrumentation from any particular tracing backend. OpenTracing, started by Ben Sigelman, who is one of the authors of the original Dapper paper, focused on a portable instrumentation API. OpenCensus, originating at Google, took on a broader scope that included metrics. In 2019 the two projects merged into OpenTelemetry, which is hosted by the Cloud Native Computing Foundation. OpenTracing has been archived since 2022, so OpenTelemetry is now the answer to the question of which API to instrument against.

OpenTelemetry settles four things at once. There is a single API and SDK in each supported language. There is a single wire format, OTLP, that every conforming exporter and collector implements. There is a reference OpenTelemetry Collector that receives, processes, and exports spans to one or more backends. And there is a published specification that governs all of the above, so a Java application instrumented with the OpenTelemetry SDK can send spans to Jaeger, Tempo, or a commercial observability vendor without changes to the instrumentation code. The choice of backend becomes a configuration detail.

The fan-in case from figure 3 is handled directly. OpenTelemetry spans support span links. A link is a pointer to another span that contributed to a span’s work but isn’t the causal parent. The select carrier span in figure 3 records one of the inbound replies as its parent and the other as a link. Both relationships are preserved in the trace, and a UI built on OpenTelemetry data can render them distinctly.
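A toy model shows how a parent and a link can coexist on one span. The `Span` class and `join_replies` function below are hypothetical and deliberately simplified: a real OpenTelemetry link refers to the other span by both trace ID and span ID and can carry attributes, whereas this sketch stores only a span ID. Which reply becomes the parent is a policy choice; here it is simply the first one in the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str] = None
    links: List[str] = field(default_factory=list)  # span IDs that contributed


def join_replies(name: str, span_id: str, reply_span_ids: List[str]) -> Span:
    """Start a span that was triggered by several inbound replies."""
    parent, *others = reply_span_ids
    # One reply is recorded as the parent, the rest are preserved as links.
    return Span(name=name, span_id=span_id, parent_id=parent, links=list(others))


select_carrier = join_replies("select carrier", "s9", ["fedex-reply", "dhl-reply"])
print(select_carrier)
```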

The OpenTelemetry data model

A trace in OpenTelemetry is a collection of spans that share the same trace ID. Each span describes a single unit of work and carries a small, well-defined set of fields:

A name, which is a short string describing the operation (for example, HTTP GET /orders/{id} or db.query).
A trace ID, a span ID, and an optional parent span ID, exactly as in Dapper.
A kind, which is one of SERVER, CLIENT, PRODUCER, CONSUMER, or INTERNAL. The kind tells the backend how to interpret the span’s place in a call.
Start and end timestamps, recorded by the SDK.
Attributes, which are key-value pairs that describe the operation. The OpenTelemetry semantic conventions define names for common attributes (http.request.method, db.system, messaging.system) so that backends can render and query them consistently.
Events, which are timestamped annotations that record something noteworthy happening inside the span. Errors and exceptions are recorded as events.
A status, which is OK, ERROR, or UNSET.
Zero or more links to other spans, used for the fan-in case described above and for joining spans across asynchronous boundaries.
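The fields listed above map naturally onto a record type. The Python dataclasses below are an illustrative model of that list, not the classes of the OpenTelemetry SDK; the type and field names are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class SpanKind(Enum):
    SERVER = "SERVER"
    CLIENT = "CLIENT"
    PRODUCER = "PRODUCER"
    CONSUMER = "CONSUMER"
    INTERNAL = "INTERNAL"


class Status(Enum):
    UNSET = "UNSET"
    OK = "OK"
    ERROR = "ERROR"


@dataclass
class Event:
    name: str
    timestamp_ns: int
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Link:
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    kind: SpanKind
    start_ns: int
    end_ns: int
    parent_span_id: Optional[str] = None  # absent on a root span
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)
    status: Status = Status.UNSET
    links: List[Link] = field(default_factory=list)


example = Span(
    name="HTTP GET /orders/{id}",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    kind=SpanKind.SERVER,
    start_ns=1_000_000_000,
    end_ns=1_250_000_000,
    attributes={"http.request.method": "GET"},
)
print(example.end_ns - example.start_ns)
```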

The shape of this model is what makes OpenTelemetry usable for things the original Dapper paper didn’t anticipate. Messaging systems use producer and consumer span kinds to describe the asymmetric handoff between sender and receiver. Span links describe relationships that don’t fit a tree. The semantic conventions give every backend a common vocabulary to query against.

Context propagation

A trace only forms if every service in the chain knows the trace ID and parent span ID of the caller. Getting those values from one service to the next is called context propagation, and for it to work across services written in different languages by different teams, the propagation format has to be standard.

The standard for passing context via HTTP is the W3C Trace Context specification, which defines two headers:

traceparent carries the trace ID, the span ID of the calling service’s active span (which becomes the parent on the receiving end), and a small set of flags including the sampling decision. The format is fixed: a version byte, the 16-byte trace ID, the 8-byte span ID, and a 1-byte flags field, all rendered as lowercase hex and separated by hyphens.
tracestate carries vendor-specific extensions in a comma-separated list. Most applications never need to read or write it directly.

The fixed layout of the traceparent header is shown below.

---
title: "traceparent"
config:
  packet:
    bitsPerRow: 208
    bitWidth: 5
---
packet
0-7: "version (8 bits)"
8-135: "trace-id (128 bits)"
136-199: "parent-id (64 bits)"
200-207: "trace-flags (8 bits)"

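The traceparent value carries 26 bytes of data laid out as four fixed-width fields. Because each byte is rendered as two hex characters and the fields are joined by three hyphens, the header value on the wire is 55 characters long. The trace-flags byte has eight bits, but only the least significant, the sampled flag, is defined by the current specification. The other seven bits are reserved and must be set to zero by senders.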
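Parsing and formatting the header takes only a regular expression. The sketch below is a simplified reading of the format that accepts just the fixed four-field layout described above; `TraceParent`, `parse_traceparent`, and `format_traceparent` are hypothetical helper names, and the sample value is the example used in the W3C specification.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: int
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the four fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # least significant bit of trace-flags
    return TraceParent(int(version, 16), trace_id, parent_id, sampled)


def format_traceparent(trace_id: str, parent_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
print(parsed)
```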

The OpenTelemetry SDK and its instrumentation libraries extract these headers from inbound requests and inject them into outbound requests automatically, so propagation is not something most application code has to think about. For non-HTTP transports, such as gRPC metadata, Kafka headers, or the message headers of other queues, the same text-based propagation is reused: the instrumentation for each transport supplies the small adapter that reads and writes the fields in that transport’s own header mechanism.

The practical consequence is that a polyglot system gets coherent end-to-end traces as long as every service uses an OpenTelemetry SDK (or any library that emits W3C Trace Context headers). A request that flows from a Go front end through a Java service through a Python worker produces a single connected trace, because every service uses the same propagation format.
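The idea that one propagation format can ride on different transports can be sketched with a pair of functions that take the carrier-specific access as a parameter. Everything here is a stand-in written for illustration: `inject`, `extract`, and the getter and setter helpers are hypothetical, the dict models HTTP headers, and the list of key and bytes pairs models message headers of the kind Kafka uses.

```python
from typing import Callable, Dict, List, Optional, Tuple

HEADER = "traceparent"


def inject(carrier, traceparent: str, setter: Callable) -> None:
    """Write the context into whatever the transport uses for headers."""
    setter(carrier, HEADER, traceparent)


def extract(carrier, getter: Callable) -> Optional[str]:
    """Read the context back out, or None if the caller sent none."""
    return getter(carrier, HEADER)


# HTTP-style carrier: header names map to string values.
def dict_set(carrier: Dict[str, str], key: str, value: str) -> None:
    carrier[key] = value


def dict_get(carrier: Dict[str, str], key: str) -> Optional[str]:
    for name, value in carrier.items():
        if name.lower() == key:  # HTTP header names are case-insensitive
            return value
    return None


# Message-style carrier: a list of (key, bytes) pairs.
def record_set(carrier: List[Tuple[str, bytes]], key: str, value: str) -> None:
    carrier.append((key, value.encode("utf-8")))


def record_get(carrier: List[Tuple[str, bytes]], key: str) -> Optional[str]:
    for name, raw in carrier:
        if name == key:
            return raw.decode("utf-8")
    return None


context = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
http_headers: Dict[str, str] = {}
message_headers: List[Tuple[str, bytes]] = []
inject(http_headers, context, dict_set)
inject(message_headers, context, record_set)
print(extract(http_headers, dict_get) == extract(message_headers, record_get))
```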

Instrumentation in practice

OpenTelemetry instrumentation comes in three forms, and a real application typically uses all three at the same time.

Auto-instrumentation agents attach to the application at startup and instrument supported libraries without any code changes. The OpenTelemetry Java agent is the best-known example: a single -javaagent: flag turns on tracing for HTTP servers, HTTP clients, JDBC, JMS, Kafka, gRPC, and dozens of other libraries. Equivalent agents and zero-code distributions exist for Node, Python, .NET, Ruby, and others. Agents are how most teams get their first useful traces; they cover the boundaries of the application without anyone having to touch the code.

Library instrumentation is built into the libraries themselves. Many frameworks now ship OpenTelemetry support directly, either by calling the OpenTelemetry API in their own code or by depending on the contrib instrumentation packages. This produces richer spans than what an agent can generate from the outside, because the library has access to its own internal state.

Manual spans are what application code adds to mark units of work that the agent and the libraries can’t see. Candidates for a manual span might include a business operation that spans several internal method calls, a background job that doesn’t correspond to a single HTTP request, or a piece of logic worth measuring on its own. Manual spans are also the right place to attach business attributes such as a tenant identifier or a feature-flag value, so that traces can be queried later by application-specific dimensions.

Whatever the source, the lifecycle of a span at runtime follows the same six steps. Figure 4 illustrates them.

Figure 4. The lifecycle of a span: (1) the initial request arrives without trace context. (2) The first service starts a root span and generates a new trace ID. (3) Outbound calls inject the trace context into request headers using the W3C Trace Context format. (4) Each downstream service extracts the context from the inbound headers. (5) The downstream service starts a new span as a child of the extracted parent. (6) The pattern repeats at every hop, building the tree.
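The six steps can be traced through a small simulation in which services are plain function calls and headers are dicts. It is a sketch under those assumptions: the `handle` function, the `CALL_TREE` layout, and the choice to mark every root as sampled are illustrative, and no real network or SDK is involved.

```python
import secrets


def handle(service, inbound_headers, downstream, recorded):
    """Handle one request in `service`, then call each downstream service."""
    header = inbound_headers.get("traceparent")
    if header is None:
        # Steps 1 and 2: no context arrived, so this is a root span with a new trace ID.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    else:
        # Step 4: extract the caller's context from the inbound header.
        _, trace_id, parent_id, flags = header.split("-")
    # Steps 2 and 5: start this service's own span, as a root or as a child.
    span_id = secrets.token_hex(8)
    recorded.append(
        {"service": service, "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id}
    )
    # Step 3: inject this span's context into the headers of outbound calls.
    outbound = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    # Step 6: the same pattern repeats at every hop.
    for child, grandchildren in downstream.items():
        handle(child, outbound, grandchildren, recorded)


CALL_TREE = {"billing": {"invoicing": {}}, "fulfillment": {"warehouse": {}}}
spans = []
handle("order", {}, CALL_TREE, spans)
for span in spans:
    print(span["service"], span["span_id"], span["parent_id"])
```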

Most of this happens without application code. The agent or library extracts and injects context at the boundaries; the SDK starts and ends spans around library calls. Manual spans add structure that the automated tiers can’t infer.

Auto-instrumentation has a significant limitation. The agent needs a deep and thorough understanding of every library or framework it intercepts, and it has to identify them at runtime rather than at compile time. The practical consequence is that the exact library versions used by an application must be on the agent’s supported list before its traces can be trusted. The Java Agent Supported Libraries page gives a sense of the scope, and the list extends past the libraries themselves to application servers and to the underlying JVM. Even within the supported ranges, coverage tends to lag: instrumentation authors cannot always anticipate how the next release of a framework will change its internals.

Sampling and volume

The cost argument from the start of the article doesn’t go away just because the SDK does the work. A high-traffic service that records every span produces enormous volumes of data. Sampling is what makes tracing affordable at production scale, and OpenTelemetry distinguishes two places where the decision can be made.

Head-based sampling happens at the start of a trace, in the SDK of the first service that handles a request. The decision is encoded in the traceparent flags and propagated to every downstream service, so an entire trace is either kept or dropped, and the system won’t waste resources on partial recordings. Head-based sampling is cheap and easy to reason about, but the decision has to be made before the SDK knows whether the trace will turn out to be interesting.
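A ratio-based head sampler can be sketched as a pure function of the trace ID, which makes the decision repeatable for a given trace. The comparison below is loosely modelled on ratio-based sampling and is an assumption of this sketch, not the exact algorithm of any SDK; `head_sample`, `flags_for`, and `downstream_should_record` are hypothetical names.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding from the trace ID alone."""
    threshold = int(ratio * (1 << 64))
    # Compare the low 64 bits of the trace ID against the threshold.
    return int(trace_id[-16:], 16) < threshold


def flags_for(sampled: bool) -> str:
    """Encode the decision in the trace-flags field of traceparent."""
    return "01" if sampled else "00"


def downstream_should_record(traceparent: str) -> bool:
    """Downstream services follow the decision the first service made."""
    flags = traceparent.rsplit("-", 1)[1]
    return (int(flags, 16) & 0x01) == 1


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
sampled = head_sample(trace_id, 1.0)
header = f"00-{trace_id}-00f067aa0ba902b7-{flags_for(sampled)}"
print(downstream_should_record(header))
```

Nothing in the function can see errors or latency, which is exactly the limitation described above.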

Tail-based sampling happens at the OpenTelemetry Collector, after the spans for a trace have been gathered. The collector can decide to keep a trace because it took too long, because it returned an error, because it touched a particular service, or because of any other predicate it can evaluate over the assembled spans. The cost is that the collector has to buffer spans long enough to gather every span in a trace before deciding.
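The buffering and the late decision can be sketched as a small class. It is a simplified model: `TailSampler` and its span dict layout are hypothetical, the caller decides when a trace is complete, and a real collector would instead wait for a configured period and bound the memory it uses.

```python
from collections import defaultdict


class TailSampler:
    def __init__(self, slow_ms: float):
        self.slow_ms = slow_ms
        self._buffer = defaultdict(list)  # trace_id -> spans seen so far

    def add(self, span: dict) -> None:
        self._buffer[span["trace_id"]].append(span)

    def decide(self, trace_id: str) -> list:
        """Call once the trace is believed complete; returns the kept spans or []."""
        spans = self._buffer.pop(trace_id, [])
        has_error = any(s["status"] == "ERROR" for s in spans)
        is_slow = any(s["duration_ms"] > self.slow_ms for s in spans)
        return spans if (has_error or is_slow) else []


sampler = TailSampler(slow_ms=500)
sampler.add({"trace_id": "a", "status": "OK", "duration_ms": 20})
sampler.add({"trace_id": "b", "status": "OK", "duration_ms": 35})
sampler.add({"trace_id": "b", "status": "ERROR", "duration_ms": 12})
sampler.add({"trace_id": "c", "status": "OK", "duration_ms": 900})

kept = {trace_id: sampler.decide(trace_id) for trace_id in ("a", "b", "c")}
print({trace_id: len(spans) for trace_id, spans in kept.items()})
```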

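The conventional setup combines both. The SDK samples at the head, keeping a fixed percentage of traffic to cap the volume sent to the collector, and the collector samples at the tail so that errors and slow requests among the traces that reach it are retained. Because a head sampler decides before any error has occurred, a trace it drops cannot be recovered at the tail, so the head ratio has to be generous enough that the interesting traces still arrive. Span batching, which the SDK does automatically, takes care of the rest of the volume problem: spans accumulate in memory and are flushed to the collector in groups, not one at a time.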
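Batching itself is a small mechanism. The sketch below flushes by size only and runs synchronously; `BatchingBuffer` and its parameters are hypothetical, and a production batch processor would also flush on a timer from a background thread and drop spans when its queue is full.

```python
from collections import deque


class BatchingBuffer:
    def __init__(self, export, max_batch: int = 512):
        self._export = export        # callable that ships one list of spans
        self._max_batch = max_batch
        self._queue = deque()

    def on_end(self, span) -> None:
        """Called when a span finishes; exports only when a full batch is ready."""
        self._queue.append(span)
        if len(self._queue) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Export everything that is queued, in groups of at most max_batch."""
        while self._queue:
            size = min(self._max_batch, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            self._export(batch)


batches = []
buffer = BatchingBuffer(batches.append, max_batch=3)
for span_number in range(7):
    buffer.on_end(span_number)
buffer.flush()  # ship the remainder, as a shutdown hook would
print(batches)
```

Seven spans leave the process in three exports instead of seven.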

What to do next

The shortest path to working traces is to pick a backend, start the auto-instrumentation agent, and look at what comes out. Open source backends include Jaeger (the established choice with a UI that maps onto the OpenTelemetry model directly) and Grafana Tempo (a good fit if you already run a Grafana stack). A number of commercial observability vendors also accept OTLP, so you can start sending spans to a hosted backend without changing the instrumentation in your application. Be aware that commercial backends often layer proprietary extensions on top of the standard model, whether as additional attributes, custom query syntax, or vendor-specific SDK wrappers. Those extensions can be convenient, but relying on them ties your instrumentation to a single vendor. That is precisely the lock-in OpenTelemetry was designed to avoid.

Once spans are flowing, the next step is to add manual spans at the boundaries that matter to your application: the entry points the agent doesn’t recognise, the long-running jobs, the business operations whose duration you care about. The instrumentation that your application owns is where most of the value is, because it’s the instrumentation that knows what the application is actually trying to do.

Tracing is one of four signal types in the OpenTelemetry model, and the others are worth a follow-up read. Metrics aggregate numeric measurements across many requests and answer questions about rates, durations, and error counts that no single trace can. Logs are being unified with traces under a shared context and resource model, so that a log line can be tied back to the span that produced it. Profiles, a newer signal that is still under development in the specification, record continuous CPU and memory sampling alongside the trace and let a slow span be drilled into for the function calls that consumed the time. A related concept worth knowing is baggage. It rides on the same context propagation as traceparent and lets the application carry arbitrary key-value pairs through the call tree, which gives downstream services information they would not otherwise have, such as a tenant identifier or a feature-flag value.
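Baggage travels in its own header as a comma-separated list of key=value members with percent-encoded values, as defined by the W3C Baggage specification. The helpers below are a minimal sketch of that encoding which ignores size limits and discards the optional per-member properties; `format_baggage`, `parse_baggage`, and the keys `tenant.id` and `feature.flag` are illustrative.

```python
from urllib.parse import quote, unquote


def format_baggage(entries: dict) -> str:
    """Render entries as the value of a baggage header."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )


def parse_baggage(header: str) -> dict:
    """Read a baggage header value back into a dict."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue  # skip empty or malformed members
        key, _, value = member.partition("=")
        # Anything after ';' is optional metadata, which this sketch drops.
        entries[key.strip()] = unquote(value.split(";", 1)[0].strip())
    return entries


header_value = format_baggage({"tenant.id": "acme", "feature.flag": "new pricing"})
print(header_value)
print(parse_baggage(header_value))
```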

The basic mechanism hasn’t changed since the Dapper paper: a trace ID, a span ID, a parent, propagated through every call. What OpenTelemetry adds is agreement on one API, one wire format, and one collector. That agreement lets the mechanism work across an entire system without anyone having to choose between portability and capability.