Tracing Temporal Services with OTEL

The Temporal server supports configuring OTEL trace exporters to emit spans and traces for observability. More specifically, the server uses the Go OpenTelemetry library for instrumentation and for multi-protocol, multi-model telemetry exporting. This document is intended to help developers understand how to configure exporters and instrument their code. A full exploration of tracing and telemetry is out of scope for this document; the reader is referred to external reference material, third-party descriptions, and the specification itself.

Quickstart

  1. Run make start-dependencies (which starts Grafana Tempo)
  2. Start the server using make OTEL=true start (or any other start-x command)
  3. Visit http://localhost:3000/explore and select "Tempo" from the datasource dropdown.

Tip: use the TraceQL query { .temporalWorkflowID =~ "<WF-ID>.*" } to find the traces for your workflow.

Configuring

No trace exporters are configured by default and thus trace data is neither collected nor emitted without additional configuration.

In OpenTelemetry, the concept of an "exporter" is abstract. The concrete implementation of an exporter is determined by a 3-tuple of values: the exporter signal, model, and protocol:

  • a "signal" is one of traces, metrics, or logs (in this document we will only deal with traces),
  • "model" indicates the abstract data model for the span and trace data being exported,
  • and the "protocol" specifies the concrete application protocol binding for the indicated model.

Temporal is known to support exporting trace data as defined by otlp over grpc.

Configuration File

The server supports an otel YAML stanza which is used to configure a set of process-wide exporters.

A common configuration is to emit tracing data to an agent such as the otel-collector running locally. To configure such a system add the stanza below to your configuration yaml file(s).

otel:
  exporters:
    - kind:
        signal: traces
        model: otlp
        protocol: grpc
      spec:
        connection:
          insecure: true
          endpoint: localhost:4317

Another example is pointing Temporal directly at the Honeycomb hosted OTLP collection service. To achieve such a configuration you will need an API key from the upstream Honeycomb service and the stanza below.

otel:
  exporters:
    - kind:
        signal: traces
        model: otlp
        protocol: grpc
      spec:
        connection:
          endpoint: api.honeycomb.io:443
        headers:
          x-honeycomb-team: <a honeycomb API key>

Note that the configuration parser supports defining multiple exporters by supplying additional kind and spec declarations. Additional configuration fields can be found in config_test.go; most of them relate to the underlying gRPC client configuration (retries, timeouts, etc.).

Environment Variables

Creating Exporter

An OTEL span exporter can also be configured via environment variables: OTEL_TRACES_EXPORTER creates a span exporter.

OTEL_TRACES_EXPORTER=otlp

Note that if the configuration file already defines a traces exporter, no additional exporter will be created.

Configuring Exporter

The Go OTEL SDK will also read a well-known set of environment variables for the configuration of the exporter. So if you prefer setting environment variables to writing YAML then you can use the variables defined in the OTEL spec.

For example:

OTEL_SERVICE_NAME=my-service OTEL_EXPORTER_OTLP_TRACES_INSECURE=true

NOTE: If an environment variable conflicts with YAML-provided configuration then the environment variable takes precedence.

Instrumenting

While the exporter configuration described above is executed and set up at process startup time, instrumentation code (the creation and termination of spans) is inserted inline, like logging statements, into normal server processing code. Spans are created by go.opentelemetry.io/otel/trace.Tracer objects, which are themselves created by go.opentelemetry.io/otel/trace.TracerProvider instances. Each TracerProvider instance is bound to a single logical service, so a single Temporal process will have up to four such instances (for the worker, matching, history, and frontend services respectively). A Tracer object is bound to a single logical library, which is different from a service. Consider that a history service instance might run code from the temporal common library, the gRPC library, and the gocql library.

Tracer and TracerProvider object management has been added to the server's fx DI configuration, so these objects can be added as parameters to any fx-enabled object constructor. Due to the possibility of multiple services being co-resident within a single process, we do not use the OTEL library's capability to host and access a single global TracerProvider.

By default, gRPC clients and servers are instrumented via the open source otelgrpc library.

Instrumentation Tips

Follow the OTEL attribute naming guidelines

The OpenTelemetry project has published a non-normative set of guidelines for attribute naming.

If nothing else, please:

  1. Always check for an appropriate attribute in semconv before creating your own
  2. Always prefix Temporal attributes with io.temporal

Create shared package-appropriate attribute keys

Do not create a single file in common for all attributes

Do not create packages just for OTEL attributes

Do create a set of attribute.Keys in the semantically appropriate package and re-use those to create attribute.KeyValues as needed.

Do create a set of utility functions that can transform frequently used aggregate types (Tasks, WorkflowExecutions, TaskQueues, etc.) into an []attribute.KeyValue. Attaching attribute.KeyValues to a trace.Span can take many lines of code, so any reduction in that noise is worthwhile. Sharing a single mapping function also keeps attribute names consistent across call sites.

Start a span in common or other non-service-specific code

Q: Given that common code can be called from any service, how can I start a span in common library code that is bound to the appropriate service (frontend/history/matching/worker)?

A: The TracerProvider that created the currently active Span can be retrieved from that Span itself, and the currently active Span can be retrieved from the context.Context.

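// DoFoo is a function in the common package
func DoFoo(ctx context.Context, x int, y string) string {
   // Reuse the TracerProvider of the span already active in ctx so that the
   // new span is attributed to the calling service.
   tp := trace.SpanFromContext(ctx).TracerProvider()
   // Tracer.Start takes the parent context as its first argument.
   _, span := tp.Tracer("go.temporal.io/server/common").Start(ctx, "DoFoo")
   defer span.End()
   return fmt.Sprintf("%v-%v", y, x)
}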

RecordError does not imply Span failure

Using Span.RecordError is a good idea but not all errors imply failure. Thus, if you want to capture an error and also capture that a span failed, you must additionally call Span.SetStatus(codes.Error, err.Error()). A FailSpanWithError utility function might be a good idea.

Propagate TraceContext across things other than function calls

This is taken care of by default for gRPC calls via the otelgrpc interceptors. However, you may want to propagate tracing information between goroutines or other places where the context.Context is not passed such as handoffs through a Go channel or an external datastore. There are two broad approaches that are applicable in different situations:

  1. If the object being transferred is not externally durable (e.g. an object put into a Go channel but not spooled to a database) then you can pull the trace.SpanContext out of the current trace.Span with trace.SpanContextFromContext(context.Context) or Span.SpanContext() and pass that object along with the data being transferred. The consuming side can restore the tracing state with trace.ContextWithSpanContext(context.Context, trace.SpanContext).
  2. If the tracing state needs to be serialized, the OTEL library provides the propagation package to convert trace state into a more serialization-friendly type such as a map[string]string. The propagation.TraceContext type can be used to inject and extract trace state into a key-value-ish object.
carrier := propagation.MapCarrier(map[string]string{})
propagation.TraceContext{}.Inject(ctx, carrier)
// write the carrier object to a durable store

Trace individual tasks that are processed together in batches

OpenTelemetry Spans can be linked together to form a non-parent-child relationship. One of the main use cases for linking is so that a batch process (e.g. a database read that fills a large buffer of work items) can create Spans for each of the individual work items it creates and those Spans can be linked back to the parent batch Span without that span becoming their logical parent.

Still want to log things?

Use Span.AddEvent to write messages that will be associated with that Span. From the OTEL manual:

An event is a human-readable message on a span that represents “something happening” during its lifetime